High‑precision PDF text extraction and intelligent chunking utility for building Retrieval-Augmented Generation (RAG), semantic search, summarization, and embedding pipelines in OutSystems Developer Cloud.
## Overview
Transforms raw PDF binaries into normalized UTF-8 text and structured, overlap-aware chunks with page mapping, hash fingerprinting, and lightweight token estimation. This removes the need for boilerplate parsing code and prepares content for vector storage or AI enrichment.
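To illustrate what "overlap-aware chunks with page mapping, hash fingerprinting, and lightweight token estimation" can look like, here is a minimal Python sketch. It is not the component's actual implementation: the `Chunk` type, the `chunk_pages` function, the use of SHA-256 for the fingerprint, and the "about 4 characters per token" heuristic are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_page: int   # 1-based page where the chunk begins
    end_page: int     # 1-based page where the chunk ends
    sha256: str       # fingerprint of the chunk text (assumed hash choice)
    est_tokens: int   # rough estimate, assuming ~4 characters per token


def chunk_pages(pages, chunk_size_chars, overlap_size_chars):
    """Split a list of page texts into overlapping chunks (hypothetical helper)."""
    if not 0 <= overlap_size_chars < chunk_size_chars:
        raise ValueError("overlap_size_chars must be >= 0 and < chunk_size_chars")

    # Concatenate pages and remember the offset at which each page starts.
    text, page_starts = "", []
    for page in pages:
        page_starts.append(len(text))
        text += page

    def page_of(offset):
        # Last page whose start offset is <= offset, as a 1-based number.
        return max(i for i, s in enumerate(page_starts) if s <= offset) + 1

    chunks, pos = [], 0
    step = chunk_size_chars - overlap_size_chars
    while pos < len(text):
        piece = text[pos:pos + chunk_size_chars]
        chunks.append(Chunk(
            text=piece,
            start_page=page_of(pos),
            end_page=page_of(pos + len(piece) - 1),
            sha256=hashlib.sha256(piece.encode("utf-8")).hexdigest(),
            est_tokens=max(1, len(piece) // 4),
        ))
        if pos + chunk_size_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        pos += step
    return chunks
```

Each chunk starts `chunk_size_chars - overlap_size_chars` characters after the previous one, so consecutive chunks share `overlap_size_chars` characters of context. The fingerprint makes it cheap to detect unchanged chunks before re-embedding them.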
## Parameter Guidance
## Limits & Safeguards
- Maximum PDF size: 25 MB
- Extraction hard cap: 2,000,000 characters
- Output soft cap: ~5.5 MB of aggregated chunk text (an error is thrown if exceeded)
- `overlapSizeChars` must be less than `chunkSizeChars`
- Large attachment artifacts (>1 MB) are skipped
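The limits above can be pre-checked on the caller side before a PDF is submitted. The following Python sketch is illustrative only: every function and constant name is hypothetical, "MB" is assumed to mean MiB (1,048,576 bytes), and the extraction cap is assumed to truncate rather than fail, which the list above does not specify.

```python
# Limits taken from the list above; the byte interpretation of "MB" is an assumption.
MAX_PDF_BYTES = 25 * 1024 * 1024
MAX_EXTRACTED_CHARS = 2_000_000
MAX_OUTPUT_BYTES = int(5.5 * 1024 * 1024)   # the documented value is approximate
MAX_ATTACHMENT_BYTES = 1024 * 1024


def validate_request(pdf_bytes, chunk_size_chars, overlap_size_chars):
    """Raise ValueError if the input would violate a documented limit."""
    if len(pdf_bytes) > MAX_PDF_BYTES:
        raise ValueError("PDF exceeds the 25 MB size limit")
    if chunk_size_chars <= 0:
        raise ValueError("chunk_size_chars must be positive")
    if not 0 <= overlap_size_chars < chunk_size_chars:
        raise ValueError("overlap_size_chars must be >= 0 and < chunk_size_chars")


def cap_extracted_text(text):
    """Apply the extraction cap (assumed here to truncate, not raise)."""
    return text[:MAX_EXTRACTED_CHARS]


def keep_attachment(size_bytes):
    """Attachments larger than 1 MB are skipped."""
    return size_bytes <= MAX_ATTACHMENT_BYTES


def check_output_size(chunk_texts):
    """Return the total UTF-8 size of all chunks, raising if it exceeds the cap."""
    total = sum(len(t.encode("utf-8")) for t in chunk_texts)
    if total > MAX_OUTPUT_BYTES:
        raise ValueError("aggregated chunk text exceeds the ~5.5 MB output cap")
    return total
```

Note that a large overlap multiplies the aggregated output size, since overlapping characters are counted once per chunk they appear in. A document close to the extraction cap can therefore exceed the output cap when `overlapSizeChars` is a large fraction of `chunkSizeChars`.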
License: BSD 3-Clause (https://opensource.org/licenses/BSD-3-Clause)