The suite is split into three clean layers to separate raw computation, logic wrappers, and database orchestration:
SemanticEngineV2 (Extension): The underlying .NET asset handling binary parsing, character-based chunking, and raw API communication.
SemanticEngine_Lib (Library): A developer-friendly wrapper that handles configurations and abstracts away raw JSON serialization.
SemanticEngine_CS (Core Services): The core database, service layer, and asynchronous background processing pipeline, built on Business Process Technology (BPT).
This extension exposes three foundational server actions that run directly on the application server. It works out of the box with OpenAI (text-embedding-3-small, text-embedding-ada-002, etc.) and Azure OpenAI embedding deployments.
chunkSize: Target chunk size in characters, defaulting to 1000.
overlap: Character overlap between consecutive chunks, defaulting to 150.
minScore: Minimum cosine similarity threshold between 0.0 and 1.0 to weed out irrelevant search results.
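To make the chunkSize and overlap parameters concrete, here is a minimal sketch of character-based chunking with a sliding window. It is an illustration only, not the extension's actual .NET code: the function name chunk_text and its validation rules are assumptions, and the real extension may treat page boundaries or trailing fragments differently.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper: mirrors the documented defaults (1000 / 150),
    not the extension's real implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no chunk is emitted that lies entirely inside the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2000))
    pieces = chunk_text(sample)
    print([len(p) for p in pieces])
```

With the defaults, each chunk starts 850 characters after the previous one, so the last 150 characters of one chunk are repeated at the start of the next. The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk.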
The library layer abstracts the raw extension, standardizes logic, and introduces a unified output structure to make your error handling a breeze.
Most actions in this library return a standard Result structure:
IsSuccess (Boolean): True if the action completed without issues.
Message (Text): Contains error details or success confirmation messages.
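As an illustration of how a caller might consume this structure, the sketch below models Result as a small Python dataclass. Only the IsSuccess and Message fields come from the description above; the extract_text function is a hypothetical stand-in for a library action and does not parse real PDFs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    IsSuccess: bool  # True if the action completed without issues
    Message: str     # error details or a success confirmation


def extract_text(file_content: bytes) -> tuple[str, Result]:
    """Hypothetical stand-in for a library action that returns content plus a Result."""
    if not file_content:
        return "", Result(False, "fileContent is empty")
    # Assumption for the sketch: treat the bytes as UTF-8 text instead of parsing a PDF.
    return file_content.decode("utf-8", errors="replace"), Result(True, "Text extracted")


if __name__ == "__main__":
    content, result = extract_text(b"hello")
    if result.IsSuccess:
        print(content)
    else:
        print("Extraction failed:", result.Message)
```

Because every action reports through the same two fields, the caller needs only one branching pattern: check IsSuccess first, then either use the output or surface Message.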
A quick utility to check or fetch the active AI configurations.
Setting (out Structure): Returns the active configuration profile containing Model and Provider details.
Extracts plain text from a document to audit ingestion quality.
fileContent (in Binary): Raw PDF binary content.
Content (out Text): The extracted plain text.
Result (out Structure): Standard success or error structure.
Wraps the core ingestion pipeline. It automatically pulls your active settings, runs the extraction, handles chunking, calls your embedding provider, and gives you back cleanly structured vector data.
Inputs include maxPages, maxCharsPerPage, chunkSize, charsOverlap, and maxChunksTotal.
Vector (out Record List): Structured collection of vector chunks ready for your database.
Evaluates a text query against a structured collection of candidate vectors using in-memory cosine similarity.
queryText (in Text): Plain-language question or query.
candidateVector (in Record List): The collection of vector records to search across.
topK (in Integer): Number of top results to return.
mininimumScore (in Decimal): Similarity threshold between 0.0 and 1.0.
SearchResult (out Record List): Ranked search results ordered by relevance score descending.
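The ranking step described above can be sketched as follows. This is an assumption-laden illustration, not the library's code: the Candidate type and the function names are hypothetical, and the query is passed in as an already-embedded vector, whereas the real action accepts queryText and obtains its embedding from the configured provider.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_text: str
    vector: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector: list[float], candidates: list[Candidate],
           top_k: int = 3, minimum_score: float = 0.0) -> list[tuple[int, float, str]]:
    """Return (rank, score, chunk_text) tuples, best match first."""
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in candidates]
    # Drop weak matches before ranking, then keep only the top_k strongest.
    kept = [(score, c) for score, c in scored if score >= minimum_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [(rank, score, c.chunk_text)
            for rank, (score, c) in enumerate(kept[:top_k], start=1)]


if __name__ == "__main__":
    store = [Candidate("a", [1.0, 0.0]), Candidate("b", [0.0, 1.0]), Candidate("c", [1.0, 1.0])]
    for rank, score, text in search([1.0, 0.0], store, top_k=2, minimum_score=0.5):
        print(rank, round(score, 3), text)
```

Every candidate is scored on each call, which is why this brute-force approach suits small in-memory collections.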
This module handles the actual data orchestration, holding the physical tables and managing an asynchronous background pipeline to process files smoothly without locking up user interfaces.
The schema separates metadata from heavy binary and text objects to keep things snappy:
DocumentStatus: A lookup entity managing processing states using Label, Order, and Is_Active.
Document: The master record for file metadata, with properties such as DocumentKey, Title, OriginalFileName, FileSizeBytes, MimeType, UploadOn, UploadedBy, StatusId, and an ErrorMessage field for pipeline auditing.
DocumentFile: Holds the raw binary contents (FileContent) isolated from the main metadata table.
DocumentText: Houses the complete, plain text extract (FileContentText) after the file is parsed.
DocumentChunk: Stores individual text segments along with their PageNumber, ChunkIndex, ChunkText, and a ChunkHash for easy deduplication.
ChunkEmbedding: Holds each generated floating-point array in VectorJson, links it to its source chunk via DocumentChunkId, and records the specific Provider and Model used.
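To illustrate how ChunkHash and VectorJson might be populated, here is a small sketch. Both choices are assumptions made for the example: the source does not say which hash algorithm backs ChunkHash (SHA-256 over whitespace-normalized text is used here) or how VectorJson is formatted (a plain JSON array of numbers is assumed), and all function names are hypothetical.

```python
import hashlib
import json


def chunk_hash(chunk_text: str) -> str:
    """Hash a chunk after collapsing whitespace, so trivial spacing changes do not defeat dedup.

    Assumption: SHA-256; the actual algorithm behind ChunkHash is not documented here.
    """
    normalized = " ".join(chunk_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def to_vector_json(vector: list[float]) -> str:
    """Serialize an embedding as a JSON array (assumed VectorJson format)."""
    return json.dumps(vector)


def from_vector_json(vector_json: str) -> list[float]:
    """Parse a stored JSON array back into a list of floats."""
    return [float(x) for x in json.loads(vector_json)]


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct chunk, compared by hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        digest = chunk_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


if __name__ == "__main__":
    print(dedupe_chunks(["x y", "x  y", "z"]))
    print(from_vector_json(to_vector_json([0.1, -0.25, 3.0])))
```

Comparing hashes rather than full chunk text keeps the deduplication check cheap, and it lets already-embedded chunks be recognized before another provider call is made.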
Handles document uploads by creating the base metadata records and dropping the file into the queue for background processing.
File (in Record): Compound structure capturing DocumentKey, Title, FileName, FileContent, and MimeType.
Id (out Identifier): The unique identifier assigned to the new Document record.
An exposed service action providing cross-module access to search your indexed document store.
Parameters match the library wrapper, accepting a queryText, candidateVector record list, topK, and mininimumScore.
SearchResult (out Record List): Returns the top ranked matches with their Rank, Score, PageNumber, ChunkIndex, ChunkText, and ChunkHash.
To ensure a fast, responsive user experience, document parsing and embedding happen entirely in the background.
Trigger: Automatically launches the second a new record hits the DocumentFile table.
EmbedFile Activity: An automatic activity that triggers the wrapper logic to extract the text, split it into chunks, and fetch the embeddings from your configured AI provider.
Decision: Evaluates the outcome using the IsSuccess flag.
Yes Route: Moves to the Success milestone and terminates cleanly.
No Route: Moves to the Fail milestone, logging the error reasons straight to the document master record so you can audit what went wrong.
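The flow above can be sketched in plain Python to show the branching logic. This is only an analogy for the background process, not how it is actually implemented: the Document and Result classes are simplified, the status labels are assumptions based on the milestone names, and embed_file is a hypothetical callable standing in for the EmbedFile activity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    IsSuccess: bool
    Message: str


@dataclass
class Document:
    Id: int
    Status: str = "Queued"   # assumed initial label
    ErrorMessage: str = ""


def run_pipeline(document: Document, embed_file: Callable[[Document], Result]) -> str:
    """Run the embed step, then route to the Success or Fail milestone."""
    try:
        result = embed_file(document)
    except Exception as exc:  # treat an unexpected crash as a failed run
        result = Result(False, str(exc))

    if result.IsSuccess:      # Decision: Yes route
        document.Status = "Success"
        return "Success"

    # Decision: No route. Record the reason on the master record for auditing.
    document.Status = "Fail"
    document.ErrorMessage = result.Message
    return "Fail"


if __name__ == "__main__":
    doc = Document(Id=1)
    outcome = run_pipeline(doc, lambda d: Result(False, "No selectable text found"))
    print(outcome, "-", doc.ErrorMessage)
```

Writing the failure reason to the document record, rather than only to a log, means the upload screen can show why a file was rejected without any extra lookup.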
No Scanned PDFs: Text extraction requires selectable text in the document. Run OCR on your files before passing them to the ingestion pipeline.
In-Memory Scale: Vector comparisons run entirely in-process on each search request, so search cost grows with the number of candidate chunks. This performs well for up to a few thousand chunks, but to scale to very large document stores you will eventually want to pair it with a dedicated vector database.
Model Consistency: Use exactly the same model for both ingestion and searching. Vectors produced by different models live in different embedding spaces, so cosine similarity scores between them are meaningless even when the calculation itself runs without error.
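One way to guard against mixed models is to compare the Provider and Model logged with each stored embedding against the active configuration before searching. The sketch below is a hypothetical check, not a feature of the component: the EmbeddingMeta type and find_model_mismatches function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingMeta:
    Provider: str
    Model: str


def find_model_mismatches(active: EmbeddingMeta,
                          stored: list[EmbeddingMeta]) -> list[EmbeddingMeta]:
    """Return the distinct provider/model pairs in `stored` that differ from `active`."""
    mismatches: list[EmbeddingMeta] = []
    for meta in stored:
        if meta != active and meta not in mismatches:
            mismatches.append(meta)
    return mismatches


if __name__ == "__main__":
    active = EmbeddingMeta("OpenAI", "text-embedding-3-small")
    stored = [
        EmbeddingMeta("OpenAI", "text-embedding-3-small"),
        EmbeddingMeta("OpenAI", "text-embedding-ada-002"),
    ]
    problems = find_model_mismatches(active, stored)
    if problems:
        print("Re-embed these before searching:", problems)
```

A non-empty result signals that some chunks need to be re-embedded with the active model before their scores can be trusted.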