Stable version 0.1.0 (Compatible with ODC)

Uploaded on 03 October 2025 by OutSystems Labs

Details

Detailed Description

High‑precision PDF text extraction and intelligent chunking utility for building Retrieval-Augmented Generation (RAG), semantic search, summarization, and embedding pipelines in OutSystems Developer Cloud.

Overview

Transforms raw PDF binaries into normalized UTF‑8 text and structured, overlap-aware chunks with page mapping, hash fingerprinting, and lightweight token estimation. Eliminates boilerplate parsing and prepares content for vector storage or AI enrichment.
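
To make the chunk metadata concrete, here is a minimal Python sketch of what one chunk record might carry. The field names, the use of SHA-256 for the fingerprint, and the "about 4 characters per token" heuristic are all assumptions for illustration; the component's actual output structure and estimation method are not documented here.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative only."""
    index: int        # position of the chunk in the output sequence
    text: str         # normalized chunk text
    start_page: int   # first source page the chunk touches
    end_page: int     # last source page the chunk touches
    sha256: str       # content fingerprint (SHA-256 is an assumption)
    est_tokens: int   # rough token estimate


def make_chunk(index: int, text: str, start_page: int, end_page: int) -> Chunk:
    # Fingerprint the UTF-8 bytes so identical text always hashes the same.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Assumed heuristic: roughly 4 characters per token for English text.
    est_tokens = max(1, len(text) // 4)
    return Chunk(index, text, start_page, end_page, digest, est_tokens)
```

A fingerprint like this lets a pipeline skip re-embedding chunks whose text has not changed between runs.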

Parameter Guidance

chunkSizeChars: 800–1500 (balance context vs. token cost)
overlapSizeChars: 10–20% of chunk size
normalizeWhitespace: True for embedding scenarios
collapseRepeatedNewlines: True to collapse runs of blank lines and reduce sparsity in the chunk text
includePageNumberPrefix: Enable if cross-referencing back to the source is needed
maxTotalChars: Set >0 for guardrails on very large PDFs
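
The interaction between these parameters can be sketched in Python as a fixed-size sliding window. This is an illustration under stated assumptions, not the component's implementation: the function names are hypothetical, the window is measured in characters with no sentence-boundary awareness, and maxTotalChars is assumed to truncate the input rather than raise an error.

```python
import re


def normalize(text: str, normalize_whitespace: bool = True,
              collapse_repeated_newlines: bool = True) -> str:
    """Hypothetical normalization step mirroring the two boolean options."""
    if normalize_whitespace:
        # Collapse runs of spaces and tabs; newlines are left alone here.
        text = re.sub(r"[ \t]+", " ", text)
    if collapse_repeated_newlines:
        # Reduce three or more newlines to a single blank line.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chunk_text(text: str, chunk_size_chars: int = 1000,
               overlap_size_chars: int = 150,
               max_total_chars: int = 0) -> list[str]:
    """Split text into fixed-size chunks that share an overlap region."""
    if overlap_size_chars >= chunk_size_chars:
        raise ValueError("overlapSizeChars must be < chunkSizeChars")
    if max_total_chars > 0:
        # Assumption: the guardrail truncates instead of failing.
        text = text[:max_total_chars]
    step = chunk_size_chars - overlap_size_chars
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size_chars])
        if start + chunk_size_chars >= len(text):
            break  # the chunk just added already reaches the end
        start += step
    return chunks
```

With the defaults above (1000 characters, 15% overlap), each chunk repeats the last 150 characters of its predecessor, which helps keep sentences that straddle a boundary retrievable from either side.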

Limitations

- PDF size max: 25 MB
- Extraction hard cap: 2,000,000 chars
- Output soft cap: ~5.5 MB aggregated chunk text (throws if exceeded)
- overlapSizeChars must be < chunkSizeChars
- Large attachment artifacts (>1 MB) skipped
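
A caller may want to check these limits before invoking the component. The Python sketch below is one way to do that under several assumptions: "MB" is read as binary megabytes, one character is counted as one byte (true only for ASCII text), and chunking is assumed to be a fixed-size sliding window, so the aggregated output grows by one overlap per additional chunk. The function and constant names are hypothetical.

```python
import math

MAX_PDF_BYTES = 25 * 1024 * 1024           # "25 MB"; binary MB is an assumption
MAX_EXTRACTED_CHARS = 2_000_000            # extraction hard cap
MAX_OUTPUT_BYTES = int(5.5 * 1024 * 1024)  # approximate output cap


def estimate_output_chars(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Total characters across all chunks for a sliding-window split."""
    if total_chars <= chunk_size:
        return total_chars
    step = chunk_size - overlap
    n_chunks = 1 + math.ceil((total_chars - chunk_size) / step)
    # Every chunk after the first repeats `overlap` characters.
    return total_chars + (n_chunks - 1) * overlap


def preflight(pdf_size_bytes: int, chunk_size: int, overlap: int,
              expected_chars: int = MAX_EXTRACTED_CHARS) -> list[str]:
    """Return a list of problems; an empty list means the request looks safe."""
    if overlap >= chunk_size:
        return ["overlapSizeChars must be < chunkSizeChars"]
    problems = []
    if pdf_size_bytes > MAX_PDF_BYTES:
        problems.append("PDF exceeds the 25 MB size limit")
    chars = min(expected_chars, MAX_EXTRACTED_CHARS)
    # Assumption: 1 byte per character when comparing against the output cap.
    if estimate_output_chars(chars, chunk_size, overlap) > MAX_OUTPUT_BYTES:
        problems.append("aggregated chunk text may exceed the ~5.5 MB output cap")
    return problems
```

Under these assumptions, a high overlap ratio is the most likely way to hit the output cap: at 1000-character chunks with 900 characters of overlap, a document at the extraction cap would produce roughly ten times its own length in chunk text.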

License

BSD-3 license (https://opensource.org/licenses/BSD-3-Clause)

PdfContentChunker (ODC)
