SemanticChunking

SemanticChunking (ODC)

Stable version 0.1.5 (Compatible with ODC)
Uploaded on 18 Jun (2 weeks ago) by Michael Guzman

Detailed Description

Semantic Chunking Library is an ODC External Logic library that groups pre-split text units into semantically coherent chunks using embedding-based cosine distance analysis. In a RAG pipeline, it sits between the ChunkingLibrary SplitBySentence action and your vector store.

The library exposes a single action called GroupBySemantic. This action accepts a flat list of sentences or paragraphs and returns them grouped into semantic chunks via a single batched call to any OpenAI-compatible /v1/embeddings endpoint.
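As a rough illustration of what a single batched call to an OpenAI-compatible /v1/embeddings endpoint looks like, the Python sketch below builds the request and reads the vectors back out of a response. It is not the library's own code: the helper names build_embeddings_request and vectors_from_response are hypothetical, and the base URL, key, and model name are placeholders you would supply. Only the request shape (a "model" plus an "input" list) and the response shape (a "data" list of items carrying "index" and "embedding") come from the OpenAI-compatible API.

```python
import json


def build_embeddings_request(base_url, api_key, model, units):
    """Build (url, headers, body) for one batched embeddings call.

    Hypothetical helper: every text unit goes into a single "input" list,
    so the endpoint is called once for the whole document.
    """
    url = base_url.rstrip("/") + "/v1/embeddings"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "input": list(units)})
    return url, headers, body


def vectors_from_response(payload):
    """Return the embedding vectors ordered by each item's "index" field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sorting by "index" keeps each vector aligned with the unit it was computed for, which matters because the grouping step compares neighbouring units.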

Three chunking patterns are available:

  • Consecutive compares adjacent unit pairs, which suits structured documents with clear section breaks.

  • Cumulative compares each unit against a running chunk vector, which suits prose-heavy documents such as PDFs and policy articles.

  • Statistical sets the split threshold at the mean plus k times the standard deviation of the distances, so it adapts automatically to the document's own distance distribution.
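The three patterns above can be sketched in plain Python. This is an interpretation under stated assumptions, not the library's implementation: the function names are hypothetical, the running chunk vector is assumed to be the mean of the vectors already in the chunk, the statistical threshold is assumed to use the population standard deviation of adjacent-pair distances, and chunks are returned as lists of unit indices.

```python
import math
from statistics import mean, pstdev


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def consecutive_distances(vectors):
    """Distance between each adjacent pair of units."""
    return [cosine_distance(vectors[i], vectors[i + 1])
            for i in range(len(vectors) - 1)]


def statistical_threshold(distances, k=1.0):
    """Mean + k * standard deviation (population form is an assumption)."""
    return mean(distances) + k * pstdev(distances)


def split_consecutive(vectors, threshold):
    """Start a new chunk wherever an adjacent-pair distance exceeds threshold."""
    if not vectors:
        return []
    chunks = [[0]]
    for i, dist in enumerate(consecutive_distances(vectors)):
        if dist > threshold:
            chunks.append([i + 1])
        else:
            chunks[-1].append(i + 1)
    return chunks


def split_cumulative(vectors, threshold):
    """Compare each unit with the running mean vector of the current chunk."""
    if not vectors:
        return []
    chunks = [[0]]
    running = list(vectors[0])
    for i in range(1, len(vectors)):
        if cosine_distance(running, vectors[i]) > threshold:
            # Too far from the chunk so far: start a new chunk here.
            chunks.append([i])
            running = list(vectors[i])
        else:
            # Fold the unit into the running mean of the current chunk.
            n = len(chunks[-1])
            running = [(r * n + v) / (n + 1) for r, v in zip(running, vectors[i])]
            chunks[-1].append(i)
    return chunks
```

With four toy vectors where the first two and last two point in nearly the same direction, both split_consecutive and split_cumulative with a threshold of 0.5 yield two chunks, and so does split_consecutive when the threshold comes from statistical_threshold with k = 1.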


Limitations
  • It requires an external OpenAI-compatible embedding endpoint and API key; there is no embedded model or offline fallback.

  • All text units are sent in a single HTTP batch, so very large inputs might exceed the token limits of your chosen model.

  • ODC enforces a hard execution limit of 95 seconds per action invocation, meaning large batches on slow endpoints could time out.

  • A full pairwise similarity matrix is not computed because it does not fit within ODC's serverless execution budget.

  • There is no document-level size guard, so the upstream splitter needs to enforce character limits before calling GroupBySemantic.

  • OverlapSentences must be less than the total unit count; otherwise the action throws an error.

  • Heading paths and character offsets are absent since positional metadata isn't recoverable after upstream splitting.

  • If a single text unit exceeds MaxChunkSize, it is emitted as its own chunk, because keeping units atomic takes priority over size constraints.
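Several of these limitations can be handled with small checks before and after the call. The Python sketch below is illustrative only: the helper names are hypothetical, the error message is not taken from the library, and sizes are assumed to be measured in characters, which the description does not confirm for MaxChunkSize. It shows an overlap check, an upstream size guard, and a packing rule that keeps an oversized unit whole.

```python
def validate_overlap(units, overlap_sentences):
    """Reject an overlap that is not smaller than the number of units."""
    if overlap_sentences >= len(units):
        raise ValueError("OverlapSentences must be less than the total unit count")


def oversized_units(units, max_unit_chars):
    """Indices of units the upstream splitter should split further."""
    return [i for i, unit in enumerate(units) if len(unit) > max_unit_chars]


def pack_group(units, max_chunk_size):
    """Pack one semantic group into chunks of at most max_chunk_size characters.

    A unit is never split: one that is larger than max_chunk_size on its own
    ends up alone in a chunk that exceeds the limit.
    """
    chunks, current, size = [], [], 0
    for unit in units:
        # Close the current chunk if adding this unit would overflow it.
        if current and size + len(unit) > max_chunk_size:
            chunks.append(current)
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        chunks.append(current)
    return chunks
```

Running oversized_units before the call gives the upstream splitter a chance to break long units down, so the atomic-unit rule in pack_group is only a last resort.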