PdfContentChunker

PdfContentChunker (ODC)

Stable version 0.1.0 (Compatible with ODC)
Uploaded on 03 October 2025 by OutSystems Labs

Detailed Description

High‑precision PDF text extraction and intelligent chunking utility for building Retrieval-Augmented Generation (RAG), semantic search, summarization, and embedding pipelines in OutSystems Developer Cloud.

Overview

Transforms raw PDF binaries into normalized UTF‑8 text and structured, overlap-aware chunks with page mapping, hash fingerprinting, and lightweight token estimation. Eliminates boilerplate parsing and prepares content for vector storage or AI enrichment.
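The idea of overlap-aware chunks with page mapping, hash fingerprints, and token estimates can be illustrated with a minimal Python sketch. This is not the component's implementation: the `Chunk` fields, the `chunk_pages` function, the use of SHA-256, and the rule of roughly four characters per token are all assumptions chosen for illustration.

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    text: str
    start_page: int  # 1-based page on which the chunk begins
    end_page: int    # 1-based page on which the chunk ends
    sha256: str      # fingerprint of the chunk text (hash choice is an assumption)
    est_tokens: int  # rough estimate, assuming about 4 characters per token


def chunk_pages(pages, chunk_size_chars=1000, overlap_size_chars=150):
    """Split per-page text into fixed-size character chunks that overlap."""
    if chunk_size_chars <= 0:
        raise ValueError("chunk_size_chars must be positive")
    if not 0 <= overlap_size_chars < chunk_size_chars:
        raise ValueError("overlap_size_chars must be >= 0 and < chunk_size_chars")

    # Remember the offset at which each page starts in the joined text.
    starts, pos = [], 0
    for page in pages:
        starts.append(pos)
        pos += len(page)
    text = "".join(pages)

    def page_of(offset):
        # Number of pages starting at or before this offset = 1-based page.
        return bisect_right(starts, offset)

    chunks = []
    step = chunk_size_chars - overlap_size_chars
    for index, begin in enumerate(range(0, len(text), step)):
        piece = text[begin:begin + chunk_size_chars]
        last = begin + len(piece) - 1
        chunks.append(Chunk(
            index=index,
            text=piece,
            start_page=page_of(begin),
            end_page=page_of(last),
            sha256=hashlib.sha256(piece.encode("utf-8")).hexdigest(),
            est_tokens=(len(piece) + 3) // 4,  # ceiling of len / 4
        ))
        # Stop once the text is fully covered, so no chunk is pure overlap.
        if begin + chunk_size_chars >= len(text):
            break
    return chunks
```

Each chunk after the first repeats the last `overlap_size_chars` characters of its predecessor, which keeps sentences that straddle a boundary retrievable from either side. The fingerprint makes it cheap to detect unchanged chunks when a document is re-ingested.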

Parameter Guidance

  • chunkSizeChars: 800–1500 characters (balances context per chunk against token cost)
  • overlapSizeChars: 10–20% of chunkSizeChars
  • normalizeWhitespace: Set to True for embedding scenarios
  • collapseRepeatedNewlines: Enable to reduce sparsity caused by runs of blank lines
  • includePageNumberPrefix: Enable if chunks need to be cross-referenced back to their source pages
  • maxTotalChars: Set to a value greater than 0 as a guardrail on very large PDFs
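To make the guidance above concrete, here is a small Python sketch of what whitespace normalization and newline collapsing could look like, plus a helper that derives an overlap from a chunk size. The functions `normalize_text` and `suggested_overlap` are hypothetical; the exact rules the component applies are not documented here, so treat the regular expressions as one plausible interpretation.

```python
import re


def normalize_text(text, normalize_whitespace=True, collapse_repeated_newlines=True):
    """Clean extracted text before chunking (illustrative rules only)."""
    if normalize_whitespace:
        # Unify line endings, then squeeze runs of horizontal whitespace.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        text = re.sub(r"[ \t\f\v\u00a0]+", " ", text)
        # Drop single spaces left hanging around line breaks.
        text = re.sub(r" ?\n ?", "\n", text)
    if collapse_repeated_newlines:
        # Three or more newlines become one blank line.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def suggested_overlap(chunk_size_chars, percent=15):
    """Pick an overlap inside the recommended 10-20% band of the chunk size."""
    if not 10 <= percent <= 20:
        raise ValueError("percent should stay within 10-20")
    return chunk_size_chars * percent // 100


raw = "Hello   world \r\n\r\n\r\n\r\nNext\tline  "
print(repr(normalize_text(raw)))
print(suggested_overlap(1000))
```

Normalizing before chunking matters because the character budgets are spent on whatever is in the text: padding and blank lines consume chunk space without adding meaning for an embedding model.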




Limitations

## Limits & Safeguards


- Maximum PDF size: 25 MB
- Extraction hard cap: 2,000,000 characters
- Output soft cap: ~5.5 MB of aggregated chunk text (an error is thrown if exceeded)
- overlapSizeChars must be less than chunkSizeChars
- Large attachment artifacts (>1 MB) are skipped
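
A caller can check most of these limits before invoking the component. The Python sketch below is an assumption-laden illustration: `preflight` and `estimated_output_chars` are hypothetical helpers, "25 MB" is interpreted as 25 × 1024 × 1024 bytes, and the output estimate assumes fixed-size chunks that each repeat the full overlap.

```python
MAX_PDF_BYTES = 25 * 1024 * 1024   # assumes "MB" means 1024 * 1024 bytes
MAX_EXTRACTED_CHARS = 2_000_000    # extraction hard cap from the list above


def preflight(pdf_size_bytes, chunk_size_chars, overlap_size_chars):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if pdf_size_bytes > MAX_PDF_BYTES:
        problems.append("PDF exceeds the 25 MB size limit")
    if chunk_size_chars <= 0:
        problems.append("chunkSizeChars must be positive")
    if overlap_size_chars >= chunk_size_chars:
        problems.append("overlapSizeChars must be less than chunkSizeChars")
    return problems


def estimated_output_chars(text_chars, chunk_size_chars, overlap_size_chars):
    """Estimate total characters across all chunks, counting repeated overlap."""
    if text_chars <= chunk_size_chars:
        return text_chars
    step = chunk_size_chars - overlap_size_chars
    # Ceiling division: number of chunks needed to cover the text.
    chunk_count = -(-(text_chars - overlap_size_chars) // step)
    return text_chars + (chunk_count - 1) * overlap_size_chars


print(preflight(26 * 1024 * 1024, 1000, 1000))
print(estimated_output_chars(MAX_EXTRACTED_CHARS, 1000, 150))
```

Because every chunk after the first repeats its overlap, the aggregated output is larger than the extracted text. A generous overlap on a document near the extraction cap is therefore the most likely way to approach the output cap.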