Semantic Engine - Overview (O11)

Stable version 1.0.0 (Compatible with OutSystems 11)

Uploaded on 18 Jun by DB Results Labs


Details

Add semantic search to any OutSystems 11 application. This extension handles the full RAG ingestion pipeline, including PDF extraction, text chunking, embedding generation, and in-memory vector search. Everything runs entirely server-side, so you do not need any additional infrastructure.

Background Note: This engine was originally designed and built as an architectural feasibility study for OutSystems Developer Cloud (ODC) to prove that native, zero-dependency vector storage and retrieval could be managed entirely within platform trust boundaries. You can read the original architectural deep-dive here: Proving Vector Storage & Retrieval Inside OutSystems Developer Cloud.

Architecture Overview

The suite is split into three clean layers to separate raw computation, logic wrappers, and database orchestration:

SemanticEngineV2 (Extension): The underlying .NET asset handling binary parsing, character-based chunking, and raw API communication.
SemanticEngine_Lib (Library): A developer-friendly wrapper that handles configurations and abstracts away raw JSON serialization.
SemanticEngine_CS (Core Services): The core database, service layer, and asynchronous background processing pipeline (BPT).

1. Core Extension: SemanticEngineV2

This extension exposes three foundational server actions that run directly on the application server. It works out of the box with OpenAI (text-embedding-3-small, text-embedding-ada-002, etc.) and Azure OpenAI embedding deployments.

Core Actions

Action	Purpose
PrepareVectorsFromPdfJson	Ingest a PDF: extract -> chunk -> embed -> return vector records as JSON
EmbedAndSearchTopKJson	Search: embed a query -> cosine similarity search over stored vectors -> return ranked results
ExtractTextFromPdfWithPageMarkers	Debug: extract raw PDF text with page markers for ingestion auditing

Important Core Parameters

chunkSize: Target chunk size in characters, defaulting to 1000.
overlap: Character overlap between consecutive chunks, defaulting to 150.
minScore: Minimum cosine similarity threshold between 0.0 and 1.0 to weed out irrelevant search results.
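
To illustrate how chunkSize and overlap interact, here is a minimal Python sketch of character-based chunking with overlap. It is not the extension's .NET code: the function name chunk_text and the exact boundary handling are assumptions made for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    shares `overlap` characters with the previous one (assumed behaviour)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each window advances by 850 characters, so the last 150 characters of one chunk reappear at the start of the next. This keeps a sentence that straddles a boundary intact in at least one chunk.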

2. Wrapper Module: SemanticEngine_Lib

The library layer abstracts the raw extension, standardizes logic, and introduces a unified output structure to make your error handling a breeze.

Unified Result Structure

Most actions in this library return a standard Result structure:

IsSuccess (Boolean): True if the action completed without issues.
Message (Text): Contains error details or success confirmation messages.
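
The calling pattern this structure encourages can be sketched in Python as follows. The names below are hypothetical stand-ins that mirror IsSuccess and Message; get_pdf_text is a placeholder and not the library's GetPDFText action.

```python
from dataclasses import dataclass


@dataclass
class Result:
    is_success: bool  # mirrors IsSuccess
    message: str      # mirrors Message


def get_pdf_text(file_content: bytes) -> tuple[str, Result]:
    """Hypothetical stand-in for a wrapper action returning (Content, Result)."""
    if not file_content.startswith(b"%PDF"):
        return "", Result(False, "Input is not a PDF")
    return "extracted text placeholder", Result(True, "OK")


# Check the Result before trusting any other output of the action.
content, result = get_pdf_text(b"not a pdf")
if not result.is_success:
    print(f"Extraction failed: {result.message}")
```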

Wrapper Actions Reference

GetSettings

A quick utility to check or fetch the active AI configurations.

Setting (out Structure): Returns the active configuration profile containing Model and Provider details.

GetPDFText

Extracts plain text from a document to audit ingestion quality.

fileContent (in Binary): Raw PDF binary content.
Content (out Text): The extracted plain text.
Result (out Structure): Standard success or error structure.

IngestPdfToVectorRecords

Wraps the core ingestion pipeline. It automatically pulls your active settings, runs the extraction, handles chunking, calls your embedding provider, and gives you back cleanly structured vector data.

fileContent (in Binary): Raw PDF binary content.
maxPages, maxCharsPerPage, chunkSize, charsOverlap, maxChunksTotal (in): Page, character, and chunk limits plus chunking controls for the ingestion run.
Vector (out Record List): Structured collection of vector chunks ready for your database.
Result (out Structure): Standard success or error structure.

SearchVector

Evaluates a text query against a structured collection of candidate vectors using in-memory cosine similarity.

queryText (in Text): Plain-language question or query.
candidateVector (in Record List): The collection of vector records to search across.
topK (in Integer): Number of top results to return.
mininimumScore (in Decimal): Similarity threshold between 0.0 and 1.0.
SearchResult (out Record List): Ranked search results ordered by relevance score descending.
Result (out Structure): Standard success or error structure.
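
The ranking step can be sketched in Python as below. This is an illustration under stated assumptions, not the extension's implementation: the query is assumed to be embedded already, and the candidate shape (dicts with "chunk_text" and "vector" keys) and function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def search_top_k(query_vector, candidates, top_k=5, minimum_score=0.0):
    """Score every candidate, drop those below minimum_score, and return
    the top_k matches ordered by score descending, each with a 1-based rank."""
    scored = []
    for candidate in candidates:
        score = cosine_similarity(query_vector, candidate["vector"])
        if score >= minimum_score:
            scored.append({"score": score, "chunk_text": candidate["chunk_text"]})
    scored.sort(key=lambda r: r["score"], reverse=True)
    results = scored[:top_k]
    for rank, row in enumerate(results, start=1):
        row["rank"] = rank
    return results
```

Every candidate is compared against the query on each call, so the cost grows linearly with the number of stored chunks.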

3. Core Services Module: SemanticEngine_CS

This module handles the actual data orchestration, holding the physical tables and managing an asynchronous background pipeline to process files smoothly without locking up user interfaces.

Data Model

The schema separates metadata from heavy binary and text objects, so queries on document metadata do not have to load large file or text columns:

DocumentStatus: A lookup entity managing processing states using Label, Order, and Is_Active.
Document: The master record for file metadata, with attributes such as DocumentKey, Title, OriginalFileName, FileSizeBytes, MimeType, UploadOn, UploadedBy, and StatusId, plus an ErrorMessage field for pipeline auditing.
DocumentFile: Holds the raw binary contents (FileContent) isolated from the main metadata table.
DocumentText: Houses the complete, plain text extract (FileContentText) after the file is parsed.
DocumentChunk: Stores individual text segments along with their PageNumber, ChunkIndex, ChunkText, and a ChunkHash for easy deduplication.
ChunkEmbedding: Holds the generated embedding as a floating-point array serialized in VectorJson, linked to its source chunk via DocumentChunkId, and records the Provider and Model used to produce it.
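
As a rough illustration of how ChunkHash and VectorJson might be used, the Python sketch below deduplicates chunks by hash and round-trips a vector through JSON. The hash algorithm (SHA-256), the whitespace normalization, and the function names are assumptions; the module may compute these values differently.

```python
import hashlib
import json


def chunk_hash(chunk_text: str) -> str:
    """Hash of the chunk text after collapsing whitespace (assumed scheme)."""
    normalized = " ".join(chunk_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[str]) -> list[dict]:
    """Keep the first occurrence of each distinct chunk, preserving its index."""
    seen = set()
    records = []
    for index, text in enumerate(chunks):
        digest = chunk_hash(text)
        if digest in seen:
            continue  # identical content was already stored
        seen.add(digest)
        records.append({"ChunkIndex": index, "ChunkText": text, "ChunkHash": digest})
    return records


# A vector stored as JSON text can be restored to a list of floats for searching.
vector_json = json.dumps([0.12, -0.5, 0.33])
vector = json.loads(vector_json)
```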

Service Actions Reference

SaveFile

Handles document uploads by creating the base metadata records and dropping the file into the queue for background processing.

File (in Record): Compound structure capturing DocumentKey, Title, FileName, FileContent, and MimeType.
Id (out Identifier): The unique identifier assigned to the new Document record.
Result (out Structure): Standard success or error structure.

SearchVector

An exposed service action providing cross-module access to search your indexed document store.

Parameters match the library wrapper, accepting a queryText, candidateVector record list, topK, and mininimumScore.
SearchResult (out Record List): Returns the top ranked matches with their Rank, Score, PageNumber, ChunkIndex, ChunkText, and ChunkHash.

Asynchronous Background Processing (BPT)

PDFIngestion Process

To ensure a fast, responsive user experience, document parsing and embedding happen entirely in the background.

Trigger: Launches automatically as soon as a new record is created in the DocumentFile entity.
EmbedFile Activity: An automatic activity that triggers the wrapper logic to extract the text, split it into chunks, and fetch the embeddings from your configured AI provider.
Decision: Evaluates the outcome using the IsSuccess flag.
- Yes Route: Moves to the Success milestone and terminates cleanly.
- No Route: Moves to the Fail milestone, logging the error reasons straight to the document master record so you can audit what went wrong.
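
The control flow of this process can be sketched in Python as follows. It is a simplified model, not the BPT definition itself: the class and function names are hypothetical, and catching exceptions inside the activity is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    is_success: bool
    message: str


@dataclass
class Document:
    id: int
    error_message: str = ""  # mirrors the ErrorMessage field on Document


def run_pdf_ingestion(document: Document, embed_file: Callable[[int], Result]) -> str:
    """Run the EmbedFile step, then branch on IsSuccess like the Decision node.
    Returns the name of the milestone that was reached."""
    try:
        result = embed_file(document.id)
    except Exception as exc:  # assumption: unexpected errors count as failures
        result = Result(False, str(exc))

    if result.is_success:
        return "Success"
    # Record the reason on the master record so the failure can be audited.
    document.error_message = result.message
    return "Fail"
```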

Known Limitations

No Scanned PDFs: Text extraction requires selectable text in the document. Run OCR on your files before passing them to the ingestion pipeline.
In-Memory Scale: Vector comparisons occur completely in-process per search request. This approach is highly performant for up to a few thousand chunks, but if you are looking to scale to massive document vaults, you will eventually want to couple this with a dedicated vector database.
Model Consistency: Use the same embedding model for both ingestion and search. Vectors produced by different models lie in different embedding spaces (and often have different dimensions), so cosine similarity scores between them are not meaningful.
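
One way to guard against mixed models is to compare the Provider and Model logged on each ChunkEmbedding with the active configuration before searching. The Python sketch below assumes the records are available as dicts with those two keys; the function name and the "OpenAI" provider value are hypothetical.

```python
def find_model_mismatches(provider: str, model: str, embeddings: list[dict]) -> list[dict]:
    """Return the stored embeddings that were not produced by the
    provider and model currently configured for searching."""
    return [
        e for e in embeddings
        if (e["Provider"], e["Model"]) != (provider, model)
    ]


stored = [
    {"DocumentChunkId": 1, "Provider": "OpenAI", "Model": "text-embedding-3-small"},
    {"DocumentChunkId": 2, "Provider": "OpenAI", "Model": "text-embedding-ada-002"},
]
stale = find_model_mismatches("OpenAI", "text-embedding-3-small", stored)
if stale:
    print(f"{len(stale)} chunk(s) need re-embedding before they can be searched")
```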

License (1.0.0)

BSD-3 license (https://opensource.org/licenses/BSD-3-Clause)
