ChunkingLibrary - Documentation (ODC)

Stable version 0.1.10 (Compatible with ODC)

Uploaded on 13 Jun (3 weeks ago) by Michael Guzman

When building enterprise Retrieval-Augmented Generation (RAG) pipelines, the quality of your LLM responses depends heavily on how your input text is sliced. Splitters that cut purely by character count break words, sever context mid-sentence, and ruin the formatting of structured content.

The ChunkingLibrary brings structure-aware chunking into your low-code workflow. It runs as a stateless C# External Logic component within your ODC environment and provides four distinct text-splitting strategies as ready-to-use Service Actions. Each action returns the same output contract: the chunk text, structural metadata for each chunk, and summary metrics.

Why use this component?

The ChunkingLibrary is built specifically for documents that combine standard prose with code blocks, tables, and hierarchical headings, which makes it a natural downstream companion for document-to-Markdown converters.

Advanced Structure Awareness: It respects paragraph and sentence boundaries while splitting text, so words and sentences remain whole.
Syntax and Data Layout Protection: It keeps each code block and each table together in a single chunk instead of cutting through it, protecting code syntax and table structure.
Heading and Section Context Inheritance: It automatically attaches the parent section headings to each child text fragment as breadcrumbs, so isolated chunks never lose their place in the document.

Project Ecosystem & Resources

Official GitHub Repository: https://github.com/michaeldeguzman/odc-chunking-library-forge/

EXPOSED SERVICE ACTIONS & PARAMETER REFERENCE

All four exposed chunking strategies yield a consistent, highly structured response contract.

Unified Input Parameter Guidelines

text (Text, Mandatory): The raw content payload string to be sliced.
chunkSize (Integer, Optional - Default: 1000): Must be 1 or greater. Target maximum number of characters per chunk.
overlapSize (Integer, Optional - Default: 200): Must be 0 or greater and strictly less than chunkSize. Number of characters of sliding-window overlap between consecutive chunks.
maxTotalChars (Integer, Optional - Default: 200000): Safety limit on input size. Must be at least the total length of your input text. Raise this threshold for large enterprise documentation.
documentId (Text, Mandatory): Unique parent document tracking prefix utilized to build well-formed chunk IDs. Leaving this blank will yield fallback unassigned tracking indices.
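
The constraints above can be checked before calling a Service Action. The following is a minimal Python sketch of those rules only; the function name is hypothetical, and raising an error is an assumption, since this page does not describe how the library itself reports invalid input.

```python
def validate_chunk_inputs(text: str,
                          chunk_size: int = 1000,
                          overlap_size: int = 200,
                          max_total_chars: int = 200_000) -> None:
    """Raise ValueError if the arguments break the documented constraints.

    Hypothetical helper: it mirrors the parameter rules listed above and is
    not part of the ChunkingLibrary itself.
    """
    if chunk_size < 1:
        raise ValueError("chunkSize must be 1 or greater")
    if overlap_size < 0 or overlap_size >= chunk_size:
        raise ValueError("overlapSize must be >= 0 and strictly less than chunkSize")
    if len(text) > max_total_chars:
        raise ValueError("maxTotalChars must be at least the length of the input text")
```

A caller would run this check first and only invoke the chunking action when it passes.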

Common Output Structures

Structure: Chunk

ChunkId (Text): Unique compound identifier combining the parent document ID and a sequential zero-padded index (e.g., DOC-001-0003).
Text (Text): The actual text content belonging to the isolated chunk.
ParentHeadings (Text List): The chain of parent document headings leading down to this chunk (Markdown strategy only).
StartCharIndex (Integer): The absolute zero-based character index where this chunk starts in the original source string.
EndCharIndex (Integer): The absolute zero-based character index where this chunk ends in the original source string.
TokenCountEstimate (Integer): Estimated token count, calculated as the chunk's character count divided by 4.
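
As an illustration of how these fields relate to each other, here is a small Python sketch. Several details are assumptions inferred from the descriptions above rather than confirmed behaviour: the hyphen and four-digit padding in the ID (taken from the DOC-001-0003 example), an exclusive end index, and rounding the token estimate down. The names Chunk and make_chunk are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    chunk_id: str
    text: str
    start_char_index: int
    end_char_index: int
    token_count_estimate: int
    parent_headings: List[str] = field(default_factory=list)


def make_chunk(document_id: str, index: int, text: str, start: int) -> Chunk:
    """Build a chunk record from its text and start offset (illustrative only)."""
    return Chunk(
        # Assumption: "<documentId>-<4-digit zero-padded index>", as in DOC-001-0003.
        chunk_id=f"{document_id}-{index:04d}",
        text=text,
        start_char_index=start,
        # Assumption: the end index is exclusive (start + length).
        end_char_index=start + len(text),
        # Assumption: characters divided by 4, rounded down.
        token_count_estimate=len(text) // 4,
    )
```

For example, make_chunk("DOC-001", 3, "x" * 10, 0) yields the ID DOC-001-0003 and a token estimate of 2 under these assumptions.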

Structure: Summary

StrategyUsed (Text): The name of the splitting action that was invoked.
TotalChunks (Integer): The total count of fragments produced by the splitting sequence.
TotalTokensEstimate (Integer): Aggregated token budget estimation for the entire document.

1. SplitByCharacter

Slices text strings strictly by absolute character count boundaries. Best for raw, continuous data streams like machine data logs, system transcripts, or flat cryptographic files.
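
The fixed-size, sliding-window technique behind SplitByCharacter can be sketched in a few lines of Python. This is a generic illustration of the approach, not the library's C# implementation; the function name and the (start, text) tuple output are assumptions made for the example.

```python
from typing import List, Tuple


def split_by_character(text: str, chunk_size: int = 1000,
                       overlap_size: int = 200) -> List[Tuple[int, str]]:
    """Return (start_index, chunk_text) pairs using a fixed-size sliding window."""
    if chunk_size < 1 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap_size < chunk_size")
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_by_character("abcdefghij", chunk_size=4, overlap_size=1))
# prints [(0, 'abcd'), (3, 'defg'), (6, 'ghij')]
```

Each chunk repeats the last overlap_size characters of the previous one, which is why overlapSize has to be strictly smaller than chunkSize: otherwise the window would never advance.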

2. SplitRecursively

Splits strings dynamically by cascading through a prioritized list of standard prose separators ("\n\n", "\n", " ", ""), falling back to the next separator only when a piece is still too large. It keeps paragraphs and sentences together where possible and avoids splitting words in the middle.
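
The separator cascade can be illustrated with a simplified Python sketch. It shows the general recursive-splitting technique with the separator list quoted above; it leaves out overlap handling and is not a reproduction of the library's C# code. The function name is hypothetical.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", " ", "")


def split_recursively(text: str, chunk_size: int,
                      separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return split_recursively(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(split_recursively(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(split_recursively("aaa bbb ccc", 7))  # prints ['aaa bbb', 'ccc']
```

Words are only cut when no coarser separator is available, for example split_recursively("abcdefgh", 3) falls through to the hard cut and returns ['abc', 'def', 'gh'].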

3. SplitBySentence

Evaluates terminal punctuation boundaries (., !, ?) so that each chunk contains only complete, well-formed sentences. Highly precise for short, fact-driven items like FAQs, itemized dictionaries, or short definition lists.
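
A sentence-boundary splitter of this kind can be approximated in Python with a regular expression. The sketch below is an assumption about the general technique, not the library's actual rules: it treats ., ! and ? followed by whitespace as a boundary, packs whole sentences into a chunk until chunk_size would be exceeded, and leaves an overlong single sentence intact.

```python
import re
from typing import List


def split_by_sentence(text: str, chunk_size: int = 1000) -> List[str]:
    """Group complete sentences into chunks of roughly chunk_size characters."""
    # Split after ., ! or ? when followed by whitespace; punctuation stays attached.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_by_sentence("What is ODC? A platform. It runs apps!", chunk_size=25))
# prints ['What is ODC? A platform.', 'It runs apps!']
```

A simple pattern like this misreads abbreviations such as "e.g." as sentence ends, which is one reason sentence-based splitting works best on short, cleanly punctuated content.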

4. SplitMarkdown

An advanced, syntax-aware structural layout parser. It actively parses your document layouts to keep highly complex, related technical segments indivisible. It isolates headers, encapsulates code fences, maintains markdown grid rows, and dynamically embeds ancestor section headers directly into child elements.
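
Two of the behaviours described for SplitMarkdown, heading breadcrumbs and atomic code fences, can be illustrated with a short Python sketch. It is a simplified stand-in rather than the library's parser: it produces one section per heading, ignores chunkSize and tables, and assumes that ParentHeadings contains the section's own heading together with its ancestors.

```python
import re
from typing import Dict, List, Tuple

FENCE = "`" * 3  # a Markdown code fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(text: str) -> List[Dict[str, object]]:
    """Return one record per section with its heading breadcrumb."""
    sections: List[Dict[str, object]] = []
    stack: List[Tuple[int, str]] = []  # (heading level, title) of open headings
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"ParentHeadings": [t for _, t in stack],
                             "Text": content})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
            body.append(line)
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own breadcrumb
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # drop siblings and deeper headings
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Because lines inside a fence are never tested against the heading pattern, a line such as "# comment" inside a code block stays part of that block instead of starting a new section.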

COMPONENT LIMITATIONS

Approximate Markdown Indices: The SplitMarkdown action prioritizes structural hierarchy over exact string calculations. It defaults to reporting StartCharIndex = 0. Character offsets across structural splits are approximate; the chunk text content itself is always the absolute authority.
Index Mapping Collisions: For SplitRecursively and SplitBySentence, absolute character offsets are identified using localized substring searches. If identical text phrasing appears repeatedly inside the exact same text block, indices can map onto the earlier occurrence. (Note: The chunk text fragment content remains completely unaffected).
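
The collision described above is easy to reproduce with any substring search. The Python sketch below shows how a plain search maps a repeated chunk onto its first occurrence, and one possible caller-side way to recompute offsets by searching forward from the previous match. The second function is a hypothetical workaround, not something the library is documented to do.

```python
from typing import List

source = "Reset the device. Wait ten seconds. Reset the device."
chunks = ["Reset the device.", "Wait ten seconds.", "Reset the device."]

# A plain substring search always reports the first occurrence.
naive_offsets = [source.find(chunk) for chunk in chunks]
print(naive_offsets)  # prints [0, 18, 0]


def locate_in_order(source: str, chunks: List[str]) -> List[int]:
    """Find each chunk at or after the position following the previous match."""
    offsets = []
    cursor = 0
    for chunk in chunks:
        position = source.find(chunk, cursor)
        offsets.append(position)  # -1 if the chunk was not found
        if position >= 0:
            cursor = position + 1  # still allows overlapping chunks
    return offsets


print(locate_in_order(source, chunks))  # prints [0, 18, 36]
```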
Heading Run-On Seams: During recursive character text splitting (SplitRecursively), if an overlap region snaps exactly to the trailing termination of one section heading while the next chunk opens immediately into a new header, both headings can appear on the same line, flattening the visual hierarchy at that seam. Use SplitMarkdown if preserving strict heading separation is required.
Token Budget Estimates Only: Token counts are estimated as the character count divided by 4. They are adequate for rough database budgeting or LLM API cost estimation, but they are not produced by a real tokenizer and should not be treated as a strict token limit.
