Semantic Engine - Documentation (O11)

Stable version 1.0.0 (Compatible with OutSystems 11)

Uploaded on 18 Jun

DB Results Labs

Installation Guide

To ensure a clean environment setup, install and deploy the suite components in the following modular order:

1. Install the Core Extension (SemanticEngineV2): Download the asset via OutSystems Service Studio or Integration Studio and publish it to compile the .NET library on your application server.
2. Install the Wrapper Library (SemanticEngine_Lib): Deploy this module to handle your foundational integrations and public mapping structures.
3. Install Core Services (SemanticEngine_CS): Publish this module to create your data schema and activate the background BPT processing engine.
4. Publish Your Consuming Application: Open your end-user application, open the Manage Dependencies window, find SemanticEngine_CS, check the required service actions, and refresh your references.

Configuration Instructions

Before triggering your first document ingestion, you must establish your AI provider credentials:

Obtain API Keys: Set up an active developer account with OpenAI to generate an API key, or provision an Azure OpenAI resource with an active embedding model deployment.
Configure Settings: Use the internal settings mechanism (or map the values to your application's site properties) to supply the following on each call:
- Provider: Specify whether you are using native OpenAI or Azure OpenAI.
- Model: Provide the specific deployment name or model string (e.g., text-embedding-3-small).
- Endpoint: Define your API base path (e.g., https://api.openai.com for OpenAI, or your specific resource URL for Azure).
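
To illustrate how Provider, Model, and Endpoint combine, the sketch below builds the HTTP request an embedding call would need for each provider. It follows the public OpenAI and Azure OpenAI REST conventions; the helper name build_embedding_request and the api_version default are assumptions for illustration, and the extension's internal request construction may differ.

```python
import json

def build_embedding_request(endpoint, api_key, model, is_azure, text,
                            api_version="2023-05-15"):
    """Return (url, headers, body) for one embedding call.

    Hypothetical helper: mirrors the public REST conventions of each
    provider, not the extension's actual internals.
    """
    base = endpoint.rstrip("/")
    if is_azure:
        # Azure addresses a deployment in the URL and uses an api-key header.
        url = (f"{base}/openai/deployments/{model}/embeddings"
               f"?api-version={api_version}")
        headers = {"api-key": api_key, "Content-Type": "application/json"}
        body = {"input": text}
    else:
        # OpenAI-compatible endpoints take the model name in the request body.
        url = f"{base}/v1/embeddings"
        headers = {"Authorization": f"Bearer {api_key}",
                   "Content-Type": "application/json"}
        body = {"input": text, "model": model}
    return url, headers, json.dumps(body)
```

The practical consequence is that for Azure the Model setting must hold the deployment name, while for OpenAI it holds the model string.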

General Usage Instructions

Implementing semantic search in your application follows a simple two-phase workflow:

Phase 1: Asynchronous Ingestion (One-time per document)

Pass the uploaded document information into the SaveFile service action inside SemanticEngine_CS.
The core services module automatically creates the required database records and starts the background PDFIngestion BPT process.
In the background, the file is parsed into clean text, segmented using your configured chunkSize and charsOverlap, embedded by your AI provider, and automatically stored in the DocumentChunk and ChunkEmbedding tables.

Phase 2: Real-time Retrieval & Searching (Per user query)

Capture the search text query entered by your end-user.
Fetch the candidate vector record list from your database for the documents you want to search against.
Pass both the text query and the candidate vector list into the SearchVector service action.
Use the returned, ranked SearchResult collection (ordered descending by relevance score) to construct your RAG context window for your LLM interface.
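
The last step, turning ranked results into an LLM context window, could look roughly like the sketch below. The Score, PageNumber, and ChunkText fields come from the SearchResult description in this document; the function name build_context and the max_chars budget are assumptions for illustration.

```python
def build_context(results, max_chars=4000):
    """Concatenate ranked chunks into one context string within a budget.

    `results` is a list of dicts shaped like SearchResult records.
    `max_chars` is a hypothetical character budget for the prompt.
    """
    parts, used = [], 0
    # Re-sort defensively so the most relevant chunks are kept first.
    for r in sorted(results, key=lambda r: r["Score"], reverse=True):
        block = f"[page {r['PageNumber']}, score {r['Score']:.2f}]\n{r['ChunkText']}"
        if used + len(block) > max_chars:
            break  # stop once the next chunk would exceed the budget
        parts.append(block)
        used += len(block) + 2  # account for the blank-line separator
    return "\n\n".join(parts)
```

Prefixing each chunk with its page number lets the LLM cite where an answer came from.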

1. Core Extension Reference: SemanticEngineV2

This extension exposes three foundational server actions that run directly on the application server. It works out of the box with OpenAI (text-embedding-3-small, text-embedding-ada-002, etc.) and Azure OpenAI embedding deployments.

Actions

PrepareVectorsFromPdfJson

Extracts text from a PDF, splits it into overlapping chunks, generates embeddings via an OpenAI-compatible endpoint, and returns a JSON array of vector records ready for storage.

Parameter	Type	Description
pdfBytes	Binary	Raw PDF binary content.
endpoint	Text	Base URL of the embedding API (e.g. `https://api.openai.com` or your Azure OpenAI resource URL).
apiKey	Text	API key (OpenAI) or subscription key (Azure).
model	Text	Model name (OpenAI) or deployment name (Azure), e.g. `text-embedding-3-small`.
isAzure	Boolean	True to use Azure OpenAI authentication and URL format; False for OpenAI-compatible endpoints.
maxPages	Integer	Maximum pages to process. 0 = all pages (default cap: 50).
maxCharsPerPage	Integer	Maximum characters to read per page. 0 = unlimited.
chunkSize	Integer	Target chunk size in characters (default: 1000).
overlap	Integer	Character overlap between consecutive chunks (default: 150). Must be less than chunkSize.
maxChunksTotal	Integer	Maximum total chunks across all pages. 0 = unlimited (default cap: 200).
PrepareVectorsFromPdfJson (out)	Text	JSON array of `IngestionVectorRecord` objects.
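
The interaction between chunkSize, overlap, and maxChunksTotal can be sketched as a sliding window over the extracted text. This is a minimal character-based illustration under the parameter semantics listed above; the extension's real chunker may also respect page or word boundaries, which this sketch does not attempt.

```python
def chunk_text(text, chunk_size=1000, overlap=150, max_chunks=0):
    """Split text into overlapping character windows.

    max_chunks=0 means no limit, matching the 0 = unlimited convention.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be less than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if max_chunks and len(chunks) >= max_chunks:
            break  # hit the total-chunk limit
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

The requirement that overlap be less than chunkSize follows directly from the step calculation: an equal or larger overlap would leave the window stationary.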

EmbedAndSearchTopKJson

Embeds a query string and performs in-memory cosine similarity search over a list of pre-indexed vector candidates, returning the top matches ranked by score.

Parameter	Type	Description
queryText	Text	Plain-language query to search for.
candidates	Text	JSON array of `VectorCandidateDto` objects. Pass the stored output from PrepareVectorsFromPdfJson.
endpoint	Text	Base URL of the embedding API.
apiKey	Text	API key (OpenAI) or subscription key (Azure).
model	Text	Model name or Azure deployment name. Must match the model used during ingestion.
isAzure	Boolean	True for Azure OpenAI; False for OpenAI-compatible endpoints.
topK	Integer	Number of top results to return. 0 = return all results meeting minScore.
minScore	Decimal	Minimum cosine similarity threshold (0.0 through 1.0). Results below this value are excluded.
EmbedAndSearchTopKJson (out)	Text	JSON array of `VectorSearchResultDto` objects ranked by score descending.
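
The ranking step described above (cosine similarity, minScore filter, topK cut) can be sketched in a few lines. The query vector is assumed to be already embedded, and the "vector" key on each candidate is a hypothetical field name, since the exact shape of VectorCandidateDto is not specified here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_top_k(query_vec, candidates, top_k=5, min_score=0.0):
    """Score every candidate, drop those below min_score, return the best top_k.

    top_k=0 returns every result that meets min_score.
    """
    scored = []
    for c in candidates:
        score = cosine(query_vec, c["vector"])
        if score >= min_score:
            scored.append({**c, "score": score})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored if top_k == 0 else scored[:top_k]
```

Because scores are only comparable between vectors from the same embedding space, this also shows why the query model must match the model used during ingestion.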

ExtractTextFromPdfWithPageMarkers

Extracts and normalizes text from a PDF, returning all pages concatenated with ===PAGE:N=== dividers. Useful for auditing ingestion quality before running embeddings.

Parameter	Type	Description
pdfBytes	Binary	Raw PDF binary content.
maxPages	Integer	Maximum pages to extract. 0 = all pages.
maxCharsPerPage	Integer	Maximum characters per page. 0 = unlimited.
ExtractTextFromPdfWithPageMarkers (out)	Text	Plain text with `===PAGE:N===` markers separating each page.
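
For an ingestion audit, the marked-up output can be split back into pages. The sketch below assumes each ===PAGE:N=== marker precedes the text of page N, which the description above does not state explicitly; split_pages is a hypothetical helper name.

```python
import re

PAGE_MARKER = re.compile(r"===PAGE:(\d+)===")

def split_pages(marked_text):
    """Map page number -> stripped page text.

    Assumes every marker comes immediately before its page's text.
    """
    # With one capture group, split() yields:
    # [preamble, "1", text1, "2", text2, ...]
    parts = PAGE_MARKER.split(marked_text)
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages
```

Pages that come back empty are worth checking by hand before embedding, since they often indicate scanned images without a text layer.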

2. Wrapper Module Reference: SemanticEngine_Lib

The library layer wraps the raw extension, standardizes its logic, and returns a unified output structure so that callers can handle errors consistently.

Unified Result Structure

Most actions in this library return a standard Result structure:

IsSuccess (Boolean): True if the action completed without issues.
Message (Text): Contains error details or success confirmation messages.
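
One way a caller might consume this structure is to check IsSuccess once, in a single place, and surface Message on failure. The sketch below models that pattern; the unwrap helper is hypothetical and not part of the library.

```python
from dataclasses import dataclass

@dataclass
class Result:
    """Mirror of the library's Result structure."""
    IsSuccess: bool
    Message: str = ""

def unwrap(value, result):
    """Return `value` on success; raise with the library's Message otherwise."""
    if not result.IsSuccess:
        raise RuntimeError(result.Message or "Unknown error")
    return value
```

In an OutSystems flow the equivalent is an If on Result.IsSuccess right after the action call, with the False branch showing or logging Result.Message.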

Actions

GetSettings

A utility action used to retrieve the active AI provider configurations.

Setting (out Structure): Returns the active configuration profile containing Model and Provider details.

GetPDFText

Extracts plain text from a document to audit ingestion quality.

fileContent (in Binary): Raw PDF binary content.
Content (out Text): The extracted plain text.
Result (out Structure): Standard success or error structure.

IngestPdfToVectorRecords

Wraps the core ingestion pipeline. It automatically pulls your active settings, runs the extraction, handles chunking, calls your embedding provider, and returns structured vector records instead of a raw JSON string.

fileContent (in Binary): Raw PDF binary content.
Inputs include maxPages, maxCharsPerPage, chunkSize, charsOverlap, and maxChunksTotal.
Vector (out Record List): Structured collection of vector chunks ready for your database.
Result (out Structure): Standard success or error structure.

SearchVector

Evaluates a text query against a structured collection of candidate vectors using in-memory cosine similarity.

queryText (in Text): Plain-language question or query.
candidateVector (in Record List): The collection of vector records to search across.
topK (in Integer): Number of top results to return.
mininimumScore (in Decimal): Similarity threshold between 0.0 and 1.0.
SearchResult (out Record List): Ranked search results ordered by relevance score descending.
Result (out Structure): Standard success or error structure.

3. Core Services Module Reference: SemanticEngine_CS

This module orchestrates the data: it holds the physical tables and runs an asynchronous background pipeline so that file processing does not block the user interface.

Data Model Reference

The schema separates metadata from heavy binary and text objects to optimize database responsiveness:

DocumentStatus (Static Entity)

Manages the system processing states.

Id (Identifier)
Label (Text)
Order (Integer)
Is_Active (Boolean)

Document

The master tracking record for file metadata.

Id (Identifier)
DocumentKey (Text)
Title (Text)
OriginalFileName (Text)
FileSizeBytes (Integer)
MimeType (Text)
StatusId (DocumentStatus Identifier)
ErrorMessage (Text)
UploadOn (DateTime)
UploadedBy (User Identifier)

DocumentFile

Holds the raw binary contents isolated from the main metadata table.

Id (Document Identifier)
FileContent (Binary Data)
CreatedOn (DateTime)

DocumentText

Houses the complete, plain text extract after the file is parsed.

Id (Document Identifier)
FileContentText (Text, large capacity)
CreatedOn (DateTime)

DocumentChunk

Stores individual segmented text pieces after chunking.

Id (Identifier)
DocumentId (Document Identifier)
PageNumber (Integer)
ChunkIndex (Integer)
ChunkText (Text)
ChunkHash (Text)
CreatedOn (DateTime)

ChunkEmbedding

Holds the generated embedding vectors (floating-point arrays), stored in the VectorJson attribute.

Id (Identifier)
DocumentChunkId (DocumentChunk Identifier)
Provider (Text)
Model (Text)
VectorJson (Text)
CreatedOn (DateTime)
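
To show how DocumentChunk and ChunkEmbedding relate at query time, the sketch below joins the two into a candidate list for one embedding model. The attribute names come from the entities above; the output key "Vector" and the assumption that VectorJson holds a JSON array of floats are illustrative, since the exact candidateVector record shape is not documented here.

```python
import json

def build_candidates(chunks, embeddings, model):
    """Join DocumentChunk rows with their ChunkEmbedding rows for one model.

    `chunks` and `embeddings` are lists of dicts keyed by attribute name.
    """
    # Index embeddings by chunk, keeping only the requested model.
    by_chunk = {e["DocumentChunkId"]: e
                for e in embeddings if e["Model"] == model}
    candidates = []
    for c in chunks:
        e = by_chunk.get(c["Id"])
        if e is None:
            continue  # chunk has no embedding for this model
        candidates.append({
            "PageNumber": c["PageNumber"],
            "ChunkIndex": c["ChunkIndex"],
            "ChunkText": c["ChunkText"],
            "ChunkHash": c["ChunkHash"],
            "Vector": json.loads(e["VectorJson"]),  # assumed: JSON float array
        })
    return candidates
```

Filtering on Model matters because vectors produced by different embedding models are not comparable with each other.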

Service Actions

SaveFile

Handles document uploads by creating the base metadata records and storing the file in the database, which automatically queues it for background processing.

File (in Record): Compound structure capturing DocumentKey, Title, FileName, FileContent, and MimeType.
Id (out Identifier): The unique identifier assigned to the new Document record.
Result (out Structure): Standard success or error structure.

SearchVector

An exposed service action providing cross-module access to search your indexed document store.

Parameters match the library wrapper, accepting a queryText, a structured candidateVector record list, topK, and mininimumScore.
SearchResult (out Record List): Returns the top ranked matches with their Rank, Score, PageNumber, ChunkIndex, ChunkText, and ChunkHash.
Result (out Structure): Standard success or error structure.

Asynchronous Background Processing (BPT)

PDFIngestion Process

To ensure a fast, responsive user experience, document parsing and embedding happen entirely in the background.

Trigger: Launches automatically when a new record is created in the DocumentFile table.
EmbedFile Activity: An automatic activity that triggers the wrapper logic to extract the text, split it into chunks, and fetch the embeddings from your configured AI provider.
Decision: Evaluates the outcome using the IsSuccess flag.
- Yes Route: Moves to the Success milestone and terminates cleanly.
- No Route: Moves to the Fail milestone, recording the error details on the Document master record so you can audit what went wrong.
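
The control flow of the process can be modelled as a small function. This is only an analogy for the BPT flow, not OutSystems code: the status labels "Processed" and "Failed" are hypothetical (the actual DocumentStatus records are not listed here), and embed_file stands in for the EmbedFile activity.

```python
def run_pdf_ingestion(document, embed_file):
    """Mimic the EmbedFile activity followed by the IsSuccess decision.

    `embed_file` is a callable returning (is_success, message).
    Returns the name of the milestone that was reached.
    """
    try:
        is_success, message = embed_file(document)
    except Exception as exc:  # treat unexpected failures as the No route
        is_success, message = False, str(exc)
    if is_success:
        document["Status"] = "Processed"  # hypothetical status label
        document["ErrorMessage"] = ""
        return "Success"
    document["Status"] = "Failed"  # hypothetical status label
    document["ErrorMessage"] = message  # keep the reason for later auditing
    return "Fail"
```

A consuming screen can then poll the Document record's status and show its error text once the process has finished.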

Recommended Parameter Values

Scenario	chunkSize	overlap	topK	minScore
General documents	1000	150	5	0.75
Dense technical content	600	100	8	0.70
Large documents (50+ pages)	1200	200	5	0.78
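
To check a preset against the maxChunksTotal limit before ingesting, the table can be expressed as a lookup with a rough chunk-count estimate. The preset keys and the estimate_chunks helper are hypothetical, and the estimate assumes a plain sliding window over the whole text, so real counts may differ if the extension chunks page by page.

```python
import math

# Values copied from the table above; the keys are illustrative names.
PRESETS = {
    "general":   {"chunkSize": 1000, "overlap": 150, "topK": 5, "minScore": 0.75},
    "technical": {"chunkSize": 600,  "overlap": 100, "topK": 8, "minScore": 0.70},
    "large":     {"chunkSize": 1200, "overlap": 200, "topK": 5, "minScore": 0.78},
}

def estimate_chunks(total_chars, chunk_size, overlap):
    """Approximate chunk count for a sliding window with the given overlap."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # net new characters covered per chunk
    return 1 + math.ceil((total_chars - chunk_size) / step)
```

For example, a 100,000-character document comes to about 118 chunks with the general preset and about 100 with the large-document preset, both within a 200-chunk limit.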