SemanticChunking

SemanticChunking (ODC)

Stable version 0.1.5 (Compatible with ODC)
Uploaded on 18 Jun (2 weeks ago) by Michael Guzman
Documentation
0.1.5

Overview

SemanticChunkingLibrary is an ODC External Logic library that groups pre-split text units into semantically coherent chunks using embedding-based cosine distance analysis. It sits between the ChunkingLibrary SplitBySentence action and a vector store in a RAG pipeline.

The library exposes a single action, GroupBySemantic, which accepts a flat list of sentences or paragraphs and returns them grouped into semantic chunks via a single batched call to any OpenAI-compatible /v1/embeddings endpoint.
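For readers who want to see what a single batched call looks like on the wire, here is a minimal Python sketch. It only builds the request and does not send it. The function name build_embedding_request and the placeholder key are hypothetical; the JSON body (model plus an input array) and the Bearer header follow the public OpenAI embeddings format, and how the library itself constructs the call is not documented here.

```python
import json
import urllib.request


def build_embedding_request(api_endpoint, api_key, model, text_units):
    """Build (but do not send) one POST that carries every text unit at once."""
    body = json.dumps({"model": model, "input": list(text_units)}).encode("utf-8")
    return urllib.request.Request(
        api_endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
req = build_embedding_request(
    "https://api.openai.com/v1/embeddings",
    "sk-placeholder",
    "text-embedding-3-small",
    ["First sentence.", "Second sentence."],
)
payload = json.loads(req.data)
print(len(payload["input"]), "units in one request")
```

Sending all units in one request keeps the number of network round trips at one regardless of document length.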


Three chunking patterns are available:

  • Consecutive. Compares adjacent unit pairs. Best for structured documents with clear section breaks.
  • Cumulative. Builds a running chunk vector from all units accumulated so far. Best for prose-heavy documents such as PDFs and policy articles. This is the recommended default for plain text.
  • Statistical. Sets the split threshold at mean plus k times standard deviation, adapting to the document's own distance distribution. Best for documents with uneven similarity across sections.

Each output chunk carries a SHA-256 hash, token estimate, cosine boundary score, and full provenance metadata.



Prerequisites

  • An ODC Portal account with permission to upload External Libraries.
  • An OpenAI-compatible embedding endpoint and API key. The library works with OpenAI (https://api.openai.com/v1/embeddings), Azure OpenAI, or any compatible third-party provider.
  • The ChunkingLibrary Forge component, or any text splitter that returns a list of strings, to produce the text units this library groups.


Installation

Option 1: Install from the OutSystems Forge (recommended)

  1. In ODC Portal, go to Forge and search for SemanticChunkingLibrary.
  2. Click Install and select the target stage.
  3. Wait for the installation to complete. The library will appear as SemanticChunking in your External Libraries list.

Option 2: Install from GitHub Releases

  1. Go to the Releases page of the SemanticChunkingLibrary GitHub repository.
  2. Download SemanticChunkingLibrary.zip from the latest release.
  3. In ODC Portal, navigate to External Libraries.
  4. Click Upload and select the downloaded ZIP file.
  5. Wait for the upload to complete. The library will appear as SemanticChunking in your External Libraries list.


Configuration


Store your credentials using Settings

Do not hardcode your embedding API key or endpoint in ODC logic. Use ODC Settings to manage these values per stage.

  1. In ODC Studio, open your consuming app.
  2. Go to Data > Settings and create the following Settings with type Text:
    • EmbeddingApiKey
    • EmbeddingApiEndpoint
    • EmbeddingModel
  3. Publish the app.
  4. In ODC Portal, go to Apps > [Your App] > Configuration.
  5. Set the values for each Setting per stage (Development, QA, Production).
  6. In your ODC logic, pass the Settings directly to GroupBySemantic:
    • ApiKey <- Setting.EmbeddingApiKey
    • ApiEndpoint <- Setting.EmbeddingApiEndpoint
    • EmbeddingModel <- Setting.EmbeddingModel

Add the library to your ODC app

  1. In ODC Studio, open your consuming app.
  2. Open Manage Dependencies (Ctrl+Q).
  3. Search for SemanticChunking.
  4. Select the GroupBySemantic action and click Apply.


Basic Usage


Step 1: Split your document into text units

Call ChunkingLibrary\SplitBySentence with your document text. Capture SplitBySentence.Chunks[].Text as the input list for the next step.

Step 2: Apply input defaults

Add an Assign node before GroupBySemantic to set safe defaults when parameters are left at their ODC initial values. Because 0 is a valid value for OverlapSentences, only negative values are replaced:

If ChunkingPattern = "" Then ChunkingPattern = "Cumulative"
If ThresholdPct = 0 Then ThresholdPct = 80
If SensitivityK = 0 Then SensitivityK = 1.5
If WindowSize = 0 Then WindowSize = 1
If MaxChunkSize = 0 Then MaxChunkSize = 1500
If OverlapSentences < 0 Then OverlapSentences = 1
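The same defaulting rules can be checked outside ODC with a small Python sketch. The function name apply_defaults and the dictionary representation are hypothetical; only the parameter names and default values come from the rules above. Missing keys are treated as the ODC initial values (empty text or 0).

```python
def apply_defaults(params):
    """Return a copy of params with the documented defaults filled in."""
    # Start from ODC initial values, then overlay whatever the caller supplied.
    p = {
        "ChunkingPattern": "",
        "ThresholdPct": 0,
        "SensitivityK": 0,
        "WindowSize": 0,
        "MaxChunkSize": 0,
        "OverlapSentences": 0,
    }
    p.update(params)

    if p["ChunkingPattern"] == "":
        p["ChunkingPattern"] = "Cumulative"
    if p["ThresholdPct"] == 0:
        p["ThresholdPct"] = 80
    if p["SensitivityK"] == 0:
        p["SensitivityK"] = 1.5
    if p["WindowSize"] == 0:
        p["WindowSize"] = 1
    if p["MaxChunkSize"] == 0:
        p["MaxChunkSize"] = 1500
    # 0 means "no overlap" and is kept; only negative values are replaced.
    if p["OverlapSentences"] < 0:
        p["OverlapSentences"] = 1
    return p


print(apply_defaults({"ThresholdPct": 75}))
```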

Step 3: Call GroupBySemantic

Set the following input parameters on the action node:

  • TextUnits <- SplitBySentence.Chunks[].Text
  • DocumentId <- your document identifier, e.g. POLICY-2024-001
  • ApiEndpoint <- Setting.EmbeddingApiEndpoint
  • ApiKey <- Setting.EmbeddingApiKey
  • EmbeddingModel <- Setting.EmbeddingModel
  • ChunkingPattern <- Cumulative (default)
  • ThresholdPct <- 80 (default)
  • SensitivityK <- 1.5 (default)
  • WindowSize <- 1 (default)
  • MaxChunkSize <- 1500 (default)
  • OverlapSentences <- 1 (default)

Step 4: Check the result

Add an If node after the action call:

  • Condition: GroupBySemantic.Chunks.Length > 0
  • True branch: proceed to vector upsert
  • False branch: handle the empty result by logging, skipping, or raising an exception

Step 5: Upsert chunks into your vector store

Iterate GroupBySemantic.Chunks and store the following fields for each chunk:

  • Chunk.Text: the chunk text content
  • Chunk.Metadata.ChunkId: unique identifier, e.g. POLICY-2024-001-0001
  • Chunk.Metadata.Sha256: SHA-256 fingerprint for deduplication
  • Chunk.Metadata.TokenEstimate: estimated token count for cost planning
  • Chunk.Metadata.EmbeddingReady: always true, safe to embed immediately
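One way to use the Sha256 field for deduplication is to skip chunks whose fingerprint is already stored. The Python sketch below assumes chunks are represented as dictionaries mirroring the fields above and that the set of stored fingerprints has already been read from the vector store; select_new_chunks is a hypothetical helper, not part of the library.

```python
def select_new_chunks(chunks, known_hashes):
    """Return only the chunks whose Sha256 is not already known."""
    seen = set(known_hashes)
    fresh = []
    for chunk in chunks:
        digest = chunk["Metadata"]["Sha256"]
        if digest in seen:
            continue  # identical text already stored or already queued
        seen.add(digest)
        fresh.append(chunk)
    return fresh


# Illustrative data only; real fingerprints are full SHA-256 digests.
chunks = [
    {"Text": "Alpha.", "Metadata": {"ChunkId": "DOC-0001", "Sha256": "sha256-aaa"}},
    {"Text": "Beta.", "Metadata": {"ChunkId": "DOC-0002", "Sha256": "sha256-bbb"}},
    {"Text": "Alpha.", "Metadata": {"ChunkId": "DOC-0003", "Sha256": "sha256-aaa"}},
]
to_upsert = select_new_chunks(chunks, known_hashes={"sha256-bbb"})
print([c["Metadata"]["ChunkId"] for c in to_upsert])
```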



Choosing a Chunking Pattern

Cumulative is the recommended default for most plain text documents. Use it for PDFs, policy documents, knowledge base articles, and any prose content where topics shift gradually. Start with ThresholdPct=80 and WindowSize=1. If chunks are too large, lower ThresholdPct to 75. If chunks are too fragmented, raise it to 85.
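To make the running-vector idea concrete, here is a minimal Python sketch of a cumulative grouping loop. It is an illustration under stated assumptions, not the library's implementation: the threshold is passed as a plain cosine distance (how ThresholdPct is converted to a distance is not shown), embeddings are assumed to be non-zero vectors, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cumulative_groups(vectors, threshold, window_size=1):
    """Group unit indices by comparing each unit with the running chunk vector."""
    groups = []
    current = [0]
    # Running sum of the vectors in the current chunk. Cosine distance ignores
    # scale, so the sum behaves the same as the mean here.
    running = list(vectors[0])
    for i in range(1, len(vectors)):
        seeded = len(current) >= window_size
        if seeded and cosine_distance(running, vectors[i]) > threshold:
            groups.append(current)
            current = [i]
            running = list(vectors[i])
        else:
            current.append(i)
            running = [r + v for r, v in zip(running, vectors[i])]
    groups.append(current)
    return groups


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cumulative_groups(vectors, threshold=0.5))
```

Because each new unit is compared with everything accumulated so far, a single off-topic sentence has less influence than it would in a pairwise comparison, which suits prose where topics drift gradually.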

Consecutive suits structured documents with clear section breaks such as FAQs, technical specifications, and numbered procedure documents. Use ThresholdPct=80 and WindowSize=1 as the starting point. Raise WindowSize to 2 or 3 if the content has multi-sentence topic transitions that produce false splits.
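The following Python sketch illustrates how a windowed adjacent comparison and a percentile cutoff could fit together. It rests on assumptions: the percentile uses the nearest-rank method, a split is placed where the gap distance is strictly above the cutoff, and all function names are hypothetical. The library's exact percentile and tie handling are not documented here.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gap_distances(vectors, window_size=1):
    """Distance across each gap, averaging window_size units on each side."""
    distances = []
    for gap in range(1, len(vectors)):
        left = vectors[max(0, gap - window_size):gap]
        right = vectors[gap:gap + window_size]
        distances.append(cosine_distance(mean_vector(left), mean_vector(right)))
    return distances


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; other definitions exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def consecutive_split_points(vectors, threshold_pct=80, window_size=1):
    """Return the unit indices at which a new chunk would start."""
    distances = gap_distances(vectors, window_size)
    cutoff = percentile(distances, threshold_pct)
    return [gap for gap, d in enumerate(distances, start=1) if d > cutoff]


vectors = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]]
print(consecutive_split_points(vectors))
```

A larger window smooths each side of the gap, so one unusual sentence in the middle of a section is less likely to trigger a split.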


Statistical is best when the document has an uneven similarity distribution across sections, for example a report that mixes dense technical content with short executive summaries. It adapts the split threshold to the document's own distance profile rather than using a fixed percentile. Start with SensitivityK=1.5. Lower it to 1.0 for more splits, or raise it to 3.0 to split only at extreme topic changes.
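The mean plus k times standard deviation rule can be sketched in a few lines of Python. This is an illustration only: it assumes the population standard deviation and a strictly-greater comparison, neither of which is confirmed for the library, and the function names are hypothetical.

```python
import statistics


def statistical_cutoff(distances, sensitivity_k=1.5):
    """Threshold = mean + k * standard deviation of the gap distances."""
    mean = statistics.fmean(distances)
    spread = statistics.pstdev(distances)  # population std dev (assumption)
    return mean + sensitivity_k * spread


def statistical_split_points(distances, sensitivity_k=1.5):
    """Return the unit indices whose preceding gap exceeds the cutoff."""
    cutoff = statistical_cutoff(distances, sensitivity_k)
    return [gap for gap, d in enumerate(distances, start=1) if d > cutoff]


# Illustrative cosine distances between adjacent units.
distances = [0.1, 0.1, 0.1, 0.1, 0.9]
print(statistical_split_points(distances, sensitivity_k=1.5))
print(statistical_split_points(distances, sensitivity_k=3.0))
```

With these sample distances the single large gap is detected at k=1.5 but not at k=3.0, which shows how a higher SensitivityK reserves splits for more extreme changes.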



Parameter Reference

Required parameters


TextUnits (List of Text)
Pre-split strings to group. Accepts sentences, paragraphs, or any upstream splitter output. Must not be empty.

DocumentId (Text)
Stamped on every ChunkId in the output. Use a stable identifier for the source document.

ApiEndpoint (Text)
OpenAI-compatible /v1/embeddings URL. Set via Setting.EmbeddingApiEndpoint.

ApiKey (Text)
Bearer token for the embedding endpoint. Set via Setting.EmbeddingApiKey.

EmbeddingModel (Text)
Model name passed to the endpoint. Set via Setting.EmbeddingModel. Common values are text-embedding-3-small and text-embedding-3-large.

Optional parameters

ChunkingPattern (Text, default: Cumulative)
Selects the grouping algorithm. Accepted values are Consecutive, Cumulative, and Statistical. Case-insensitive. Blank defaults to Cumulative.

ThresholdPct (Integer, default: 80, range: 1 to 99)
Percentile cutoff for distance boundary detection. Used by Consecutive and Cumulative. Silently ignored by Statistical. Lower values produce more splits. Higher values produce fewer, larger chunks.

SensitivityK (Decimal, default: 1.5, range: greater than 0)
Standard deviation multiplier for the Statistical threshold. Used by Statistical only. Silently ignored by Consecutive and Cumulative. A value of 1.0 is aggressive, 1.5 is balanced, and 3.0 is conservative.

WindowSize (Integer, default: 1, range: 1 to 5)
For Consecutive: the number of text units averaged on each side of a comparison gap. For Cumulative: the minimum number of seed units accumulated before split evaluation begins. Silently ignored by Statistical.

MaxChunkSize (Integer, default: 1500, range: greater than 0)
Maximum character length per assembled chunk. This guard fires independently of the distance threshold. A text unit that exceeds this value on its own is emitted as a single chunk.

OverlapSentences (Integer, default: 1, range: 0 to unit count minus 1)
Number of text units carried from the end of each chunk into the start of the next. Increases context continuity at chunk boundaries. Used by all patterns.
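The MaxChunkSize guard and the OverlapSentences carry-over can be pictured with the Python sketch below. Both helpers are hypothetical and simplified: the sketch assumes units are joined with a single space, that overlap is taken from the previous chunk before any overlap was added to it, and it does not model how the two settings interact inside the library.

```python
def pack_by_max_size(units, max_chunk_size=1500, joiner=" "):
    """Greedily pack units so the joined text stays within max_chunk_size."""
    groups, current, length = [], [], 0
    for unit in units:
        added = len(unit) + (len(joiner) if current else 0)
        if current and length + added > max_chunk_size:
            groups.append(current)
            current, length = [], 0
            added = len(unit)
        # A unit longer than the limit still lands in a chunk of its own.
        current.append(unit)
        length += added
    if current:
        groups.append(current)
    return groups


def apply_overlap(groups, overlap=1):
    """Prefix each chunk after the first with the last units of the previous one."""
    result = []
    for i, group in enumerate(groups):
        if i == 0 or overlap <= 0:
            result.append(list(group))
        else:
            result.append(groups[i - 1][-overlap:] + list(group))
    return result


print(pack_by_max_size(["aaaa", "bbbb", "cccccccccccc", "dd"], max_chunk_size=9))
print(apply_overlap([["A", "B", "C"], ["D", "E"], ["F"]], overlap=1))
```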



Output Reference

GroupBySemantic returns a SemanticChunkingResponse with two parts: Chunks and Stats.


Chunks


Each item in the Chunks list contains the following fields:

Text (Text)
Joined text of all grouped units in this chunk.

Metadata.ChunkId (Text)
Unique chunk identifier in the format {DocumentId}-{N:D4}, e.g. POLICY-2024-001-0001.

Metadata.DocumentId (Text)
Echo of the caller-supplied DocumentId.

Metadata.Sha256 (Text)
SHA-256 fingerprint of the chunk text, prefixed with sha256-. Use for deduplication before upsert.

Metadata.TokenEstimate (Integer)
Approximate token count calculated as character length divided by 4.

Metadata.Strategy (Text)
Always Semantic.

Metadata.ChunkingPattern (Text)
The pattern used to produce this chunk: Consecutive, Cumulative, or Statistical.

Metadata.UnitCount (Integer)
Number of source text units grouped into this chunk.

Metadata.BoundaryScore (Decimal)
Cosine distance at the split boundary that closed this chunk. Returns 0 for the final chunk, which is closed by document end rather than a distance anomaly.

Metadata.WindowSize (Integer)
Echo of the WindowSize parameter used. Returns 0 when the pattern is Statistical.

Metadata.EmbeddingReady (Boolean)
Always true. Safe to pass directly to an embedding or vector upsert action.
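To reproduce or verify some of these metadata values in a downstream system, a Python sketch such as the following may help. The helper names are hypothetical. The sketch assumes the hash is computed over the UTF-8 bytes of the chunk text and rendered as lowercase hex, and that the token estimate uses integer division; the library's exact encoding and rounding are not stated above.

```python
import hashlib


def chunk_id(document_id, n):
    """Format {DocumentId}-{N:D4}: the sequence number zero-padded to 4 digits."""
    return f"{document_id}-{n:04d}"


def sha256_fingerprint(text):
    """SHA-256 of the text with the documented sha256- prefix."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # UTF-8 is an assumption
    return "sha256-" + digest


def token_estimate(text):
    """Approximate token count: character length divided by 4 (rounded down here)."""
    return len(text) // 4


text = "Employees must submit expense reports within 30 days."
print(chunk_id("POLICY-2024-001", 1))
print(sha256_fingerprint(text))
print(token_estimate(text))
```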


Stats


The Stats object summarises the grouping run:

TotalChunks (Integer): total number of chunks produced.

TotalUnits (Integer): total number of input text units processed.

AverageUnitsPerChunk (Decimal): average number of units per chunk.

AverageBoundaryScore (Decimal): average cosine distance across all split boundaries.

ChunkingPattern (Text): the pattern used.

WindowSize (Integer): echo of WindowSize. Returns 0 when Statistical.

ThresholdPct (Integer): echo of ThresholdPct. Returns 0 when Statistical.

SensitivityK (Decimal): echo of SensitivityK. Returns 0 when Consecutive or Cumulative.
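As a cross-check on the Stats values, the Python sketch below recomputes the averages from a list of chunks. It is a sketch under assumptions: AverageUnitsPerChunk is taken as TotalUnits divided by TotalChunks, and AverageBoundaryScore excludes the final chunk because it is closed by document end rather than by a split. The summarise helper and the dictionary shape are hypothetical.

```python
def summarise(chunks, total_units):
    """Recompute the headline Stats figures from chunk metadata."""
    total_chunks = len(chunks)
    # Every chunk except the last one was closed by a split boundary.
    scores = [c["Metadata"]["BoundaryScore"] for c in chunks[:-1]]
    return {
        "TotalChunks": total_chunks,
        "TotalUnits": total_units,
        "AverageUnitsPerChunk": total_units / total_chunks if total_chunks else 0.0,
        "AverageBoundaryScore": sum(scores) / len(scores) if scores else 0.0,
    }


# Illustrative chunk metadata only.
chunks = [
    {"Metadata": {"UnitCount": 4, "BoundaryScore": 0.4}},
    {"Metadata": {"UnitCount": 3, "BoundaryScore": 0.6}},
    {"Metadata": {"UnitCount": 2, "BoundaryScore": 0.0}},
]
stats = summarise(chunks, total_units=9)
print(stats)
```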