Overview
SemanticChunkingLibrary is an ODC External Logic library that groups pre-split text units into semantically coherent chunks using embedding-based cosine distance analysis. It sits between the ChunkingLibrary SplitBySentence action and a vector store in a RAG pipeline.
The library exposes a single action, GroupBySemantic. It accepts a flat list of sentences or paragraphs, embeds them in a single batched call to any OpenAI-compatible /v1/embeddings endpoint, and returns them grouped into semantic chunks.
Three chunking patterns are available: Consecutive, Cumulative, and Statistical.
Each output chunk carries a SHA-256 hash, token estimate, cosine boundary score, and full provenance metadata.
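For readers who want to see what a single batched call to an OpenAI-compatible /v1/embeddings endpoint looks like, here is a minimal Python sketch. It only builds the request and reorders a response; it does not reproduce the library's internal code. The function names are hypothetical, and the payload shape (a `model` string plus a list under `input`, with vectors returned under `data[].embedding`) follows the public OpenAI embeddings API.

```python
import json
import urllib.request


def build_embeddings_request(endpoint: str, api_key: str, model: str,
                             text_units: list[str]) -> urllib.request.Request:
    """Build one POST request that embeds every text unit in a single batch."""
    payload = {"model": model, "input": text_units}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embeddings_in_input_order(response_body: dict) -> list[list[float]]:
    """Sort returned vectors by their index so they line up with the input list."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sending the request (for example with `urllib.request.urlopen`) and parsing the JSON body is left out so the sketch stays free of network calls.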
Prerequisites
https://api.openai.com/v1/embeddings
Installation
Option 1: Install from the OutSystems Forge (recommended)
Option 2: Install from GitHub Releases
SemanticChunkingLibrary.zip
Configuration
Store your credentials using Settings
Do not hardcode your embedding API key or endpoint in ODC logic. Use ODC Settings to manage these values per stage.
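Create one Setting for each value:
EmbeddingApiKey
EmbeddingApiEndpoint
EmbeddingModel
When calling GroupBySemantic, pass the Settings as inputs:
ApiKey: Setting.EmbeddingApiKey
ApiEndpoint: Setting.EmbeddingApiEndpoint
EmbeddingModel: Setting.EmbeddingModel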
Add the library to your ODC app
Basic Usage
Step 1: Split your document into text units
Call ChunkingLibrary\SplitBySentence with your document text. Capture SplitBySentence.Chunks[].Text as the input list for the next step.
Step 2: Apply input defaults
Add an Assign node before GroupBySemantic to set safe defaults when parameters are left at their ODC initial values:
If ChunkingPattern = "" Then ChunkingPattern = "Cumulative"
If ThresholdPct = 0 Then ThresholdPct = 80
If SensitivityK = 0 Then SensitivityK = 1.5
If WindowSize = 0 Then WindowSize = 1
If MaxChunkSize = 0 Then MaxChunkSize = 1500
If OverlapSentences < 0 Then OverlapSentences = 1
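The same defaulting rules can be expressed in Python for anyone prototyping the pipeline outside ODC. This is an illustrative sketch only: the `GroupBySemanticInputs` dataclass and `apply_defaults` function are hypothetical names, and the zero and empty-string initial values mirror how ODC initialises unset Integer, Decimal, and Text parameters.

```python
from dataclasses import dataclass


@dataclass
class GroupBySemanticInputs:
    # Fields start at the ODC initial values (empty text, zero numbers).
    chunking_pattern: str = ""
    threshold_pct: int = 0
    sensitivity_k: float = 0.0
    window_size: int = 0
    max_chunk_size: int = 0
    overlap_sentences: int = 0


def apply_defaults(params: GroupBySemanticInputs) -> GroupBySemanticInputs:
    """Replace unset values with the documented defaults."""
    if params.chunking_pattern == "":
        params.chunking_pattern = "Cumulative"
    if params.threshold_pct == 0:
        params.threshold_pct = 80
    if params.sensitivity_k == 0:
        params.sensitivity_k = 1.5
    if params.window_size == 0:
        params.window_size = 1
    if params.max_chunk_size == 0:
        params.max_chunk_size = 1500
    # 0 is a valid overlap, so only negative values are replaced.
    if params.overlap_sentences < 0:
        params.overlap_sentences = 1
    return params
```

Note that OverlapSentences is the one parameter where 0 is meaningful, which is why its guard tests for a negative value rather than zero.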
Step 3: Call GroupBySemantic
Set the following input parameters on the action node:
DocumentId: POLICY-2024-001
ChunkingPattern: Cumulative
ThresholdPct: 80
SensitivityK: 1.5
WindowSize: 1
MaxChunkSize: 1500
Step 4: Check the result
Add an If node after the action call:
GroupBySemantic.Chunks.Length > 0
Step 5: Upsert chunks into your vector store
Iterate GroupBySemantic.Chunks and store the following fields for each chunk:
Metadata.ChunkId, for example POLICY-2024-001-0001
Metadata.EmbeddingReady, which is always true
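As a rough illustration of the upsert step, the Python sketch below flattens chunks into records and skips any chunk whose Sha256 value has already been seen. The record keys (`id`, `text`, `document_id`, `sha256`) and the function name are hypothetical; adapt them to whatever schema your vector store expects. The chunk fields it reads (Text, Metadata.ChunkId, Metadata.DocumentId, Metadata.Sha256) are the ones listed in the Output Reference.

```python
def build_upsert_records(chunks: list[dict]) -> list[dict]:
    """Flatten chunks into vector-store records, skipping repeated Sha256 values."""
    seen_hashes = set()
    records = []
    for chunk in chunks:
        meta = chunk["Metadata"]
        if meta["Sha256"] in seen_hashes:
            continue  # identical chunk text already queued for upsert
        seen_hashes.add(meta["Sha256"])
        records.append({
            "id": meta["ChunkId"],
            "text": chunk["Text"],
            "document_id": meta["DocumentId"],
            "sha256": meta["Sha256"],
        })
    return records
```

In a real pipeline the hash check would usually also consult the vector store itself, so that re-running ingestion for an unchanged document does not create duplicates.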
Choosing a Chunking Pattern
Cumulative is the recommended default for most plain text documents. Use it for PDFs, policy documents, knowledge base articles, and any prose content where topics shift gradually. Start with ThresholdPct=80 and WindowSize=1. If chunks are too large, lower ThresholdPct to 75. If chunks are too fragmented, raise it to 85.
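To make the effect of ThresholdPct concrete, the sketch below turns a list of cosine distances into split points using a percentile cutoff. It is an assumption-laden illustration, not the library's code: the linear-interpolation percentile and the strict "greater than" comparison are choices made here for the example, and the function names are hypothetical.

```python
def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (an assumed method, for illustration)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def split_points(distances: list[float], threshold_pct: float) -> list[int]:
    """Return the indices of gaps whose distance exceeds the percentile cutoff."""
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]
```

With the same distances, a lower percentile yields a lower cutoff and therefore more split points, which is why lowering ThresholdPct produces smaller chunks.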
Consecutive suits structured documents with clear section breaks such as FAQs, technical specifications, and numbered procedure documents. Use ThresholdPct=80 and WindowSize=1 as the starting point. Raise WindowSize to 2 or 3 if the content has multi-sentence topic transitions that produce false splits.
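The role of WindowSize in the Consecutive pattern can be sketched as follows: for each gap between adjacent units, average up to WindowSize embeddings on each side and measure the cosine distance between the two averages. This is a minimal Python illustration under that reading of the parameter; the function names are hypothetical and the embeddings are assumed to be non-zero vectors of equal length.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise average of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def windowed_gap_distances(embeddings: list[list[float]],
                           window_size: int) -> list[float]:
    """Distance across each gap, averaging up to window_size units per side."""
    distances = []
    for gap in range(1, len(embeddings)):
        left = embeddings[max(0, gap - window_size):gap]
        right = embeddings[gap:gap + window_size]
        distances.append(cosine_distance(mean_vector(left), mean_vector(right)))
    return distances
```

A larger window smooths the distance curve, so a single off-topic sentence in the middle of a section is less likely to register as a boundary.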
Statistical is best when the document has an uneven similarity distribution across sections, for example a report that mixes dense technical content with short executive summaries. It adapts the split threshold to the document's own distance profile rather than using a fixed percentile. Start with SensitivityK=1.5. Lower it to 1.0 for more splits, or raise it to 3.0 to split only at extreme topic changes.
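One plausible reading of the Statistical threshold is "mean distance plus SensitivityK standard deviations". The Python sketch below implements that reading to show how the three suggested values behave. The formula, the use of population standard deviation, and the function name are assumptions made for illustration; the library may compute its threshold differently.

```python
import statistics


def statistical_split_points(distances: list[float],
                             sensitivity_k: float) -> list[int]:
    """Split where a distance exceeds mean + k * population standard deviation."""
    cutoff = (statistics.mean(distances)
              + sensitivity_k * statistics.pstdev(distances))
    return [i for i, d in enumerate(distances) if d > cutoff]


# One large jump (0.9) and one moderate jump (0.5) among small distances.
example = [0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.5, 0.1]
aggressive = statistical_split_points(example, 1.0)    # both jumps
balanced = statistical_split_points(example, 1.5)      # only the large jump
conservative = statistical_split_points(example, 3.0)  # no splits
```

Because the cutoff is derived from the document's own mean and spread, the same SensitivityK adapts to documents with very different baseline distances.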
Parameter Reference
Required parameters
TextUnits (List of Text): Pre-split strings to group. Accepts sentences, paragraphs, or any upstream splitter output. Must not be empty.
DocumentId (Text): Stamped on every ChunkId in the output. Use a stable identifier for the source document.
ApiEndpoint (Text): OpenAI-compatible /v1/embeddings URL. Set via Setting.EmbeddingApiEndpoint.
ApiKey (Text): Bearer token for the embedding endpoint. Set via Setting.EmbeddingApiKey.
EmbeddingModel (Text): Model name passed to the endpoint. Set via Setting.EmbeddingModel. Common values are text-embedding-3-small and text-embedding-3-large.
Optional parameters
ChunkingPattern (Text, default: Cumulative): Selects the grouping algorithm. Accepted values are Consecutive, Cumulative, and Statistical. Case-insensitive. Blank defaults to Cumulative.
ThresholdPct (Integer, default: 80, range: 1 to 99): Percentile cutoff for distance boundary detection. Used by Consecutive and Cumulative. Silently ignored by Statistical. Lower values produce more splits. Higher values produce fewer, larger chunks.
SensitivityK (Decimal, default: 1.5, range: greater than 0): Standard deviation multiplier for the Statistical threshold. Used by Statistical only. Silently ignored by Consecutive and Cumulative. A value of 1.0 is aggressive, 1.5 is balanced, and 3.0 is conservative.
WindowSize (Integer, default: 1, range: 1 to 5): For Consecutive, the number of text units averaged on each side of a comparison gap. For Cumulative, the minimum number of seed units accumulated before split evaluation begins. Silently ignored by Statistical.
MaxChunkSize (Integer, default: 1500, range: greater than 0): Maximum character length per assembled chunk. This guard fires independently of the distance threshold. A text unit that exceeds this value on its own is emitted as a single chunk.
OverlapSentences (Integer, default: 1, range: 0 to unit count minus 1): Number of text units carried from the end of each chunk into the start of the next. Increases context continuity at chunk boundaries. Used by all patterns.
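The interaction between MaxChunkSize and OverlapSentences is easier to see in code. The sketch below assembles chunks from a set of split positions, closes a chunk early when adding the next unit would exceed the size limit, and carries the last few units into the next chunk. It is a simplified illustration under stated assumptions: units are joined with a single space, the function name and `split_after` parameter are hypothetical, and carried-over units are not re-checked against the size limit.

```python
def assemble_chunks(units: list[str], split_after: set[int],
                    max_chunk_size: int, overlap: int) -> list[str]:
    """Group units into chunks using split positions, a size guard, and overlap."""
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # units in `current` that are not carried-over overlap

    def close() -> None:
        nonlocal current, fresh
        chunks.append(" ".join(current))
        # Carry the trailing units into the next chunk for continuity.
        current = current[-overlap:] if overlap > 0 else []
        fresh = 0

    for index, unit in enumerate(units):
        too_long = len(" ".join(current + [unit])) > max_chunk_size
        # Size guard: close early, but never emit a chunk made only of overlap.
        if fresh > 0 and too_long:
            close()
        current.append(unit)
        fresh += 1
        # Distance-based boundary; the final unit is closed by document end.
        if index in split_after and index < len(units) - 1:
            close()

    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks
```

For example, `assemble_chunks(["A.", "B.", "C.", "D."], {1}, 1500, 1)` yields two chunks in which "B." appears at the end of the first and the start of the second. A unit longer than `max_chunk_size` still comes out as a chunk of its own, matching the behaviour described for MaxChunkSize.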
Output Reference
GroupBySemantic returns a SemanticChunkingResponse with two parts: Chunks and Stats.
Chunks
Each item in the Chunks list contains the following fields:
Text (Text): Joined text of all grouped units in this chunk.
Metadata.ChunkId (Text): Unique chunk identifier in the format {DocumentId}-{N:D4}, e.g. POLICY-2024-001-0001.
Metadata.DocumentId (Text): Echo of the caller-supplied DocumentId.
Metadata.Sha256 (Text): SHA-256 fingerprint of the chunk text, prefixed with sha256-. Use for deduplication before upsert.
Metadata.TokenEstimate (Integer): Approximate token count calculated as character length divided by 4.
Metadata.Strategy (Text): Always Semantic.
Metadata.ChunkingPattern (Text): The pattern used to produce this chunk: Consecutive, Cumulative, or Statistical.
Metadata.UnitCount (Integer): Number of source text units grouped into this chunk.
Metadata.BoundaryScore (Decimal): Cosine distance at the split boundary that closed this chunk. Returns 0 for the final chunk, which is closed by document end rather than a distance anomaly.
Metadata.WindowSize (Integer): Echo of the WindowSize parameter used. Returns 0 when the pattern is Statistical.
Metadata.EmbeddingReady (Boolean): Always true. Safe to pass directly to an embedding or vector upsert action.
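If you need to reproduce or verify these metadata values outside ODC, the following Python sketch shows one way to derive ChunkId, Sha256, and TokenEstimate from a chunk's text. The function names are hypothetical, and three details are assumptions: UTF-8 encoding before hashing, a lowercase hexadecimal digest after the sha256- prefix, and integer division for the token estimate.

```python
import hashlib


def chunk_id(document_id: str, n: int) -> str:
    """Format {DocumentId}-{N:D4}: the chunk number zero-padded to four digits."""
    return f"{document_id}-{n:04d}"


def sha256_fingerprint(text: str) -> str:
    """SHA-256 of the UTF-8 text as hex, with the sha256- prefix (assumed encoding)."""
    return "sha256-" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_estimate(text: str) -> int:
    """Approximate token count: character length divided by 4, rounded down."""
    return len(text) // 4
```

Calling `chunk_id("POLICY-2024-001", 1)` gives `POLICY-2024-001-0001`, matching the example format above.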
Stats
The Stats object summarises the grouping run:
TotalChunks (Integer): total number of chunks produced.
TotalUnits (Integer): total number of input text units processed.
AverageUnitsPerChunk (Decimal): average number of units per chunk.
AverageBoundaryScore (Decimal): average cosine distance across all split boundaries.
ChunkingPattern (Text): the pattern used.
WindowSize (Integer): echo of WindowSize. Returns 0 when Statistical.
ThresholdPct (Integer): echo of ThresholdPct. Returns 0 when Statistical.
SensitivityK (Decimal): echo of SensitivityK. Returns 0 when Consecutive or Cumulative.