1. Overview
OmniDoc2MD converts multiple document formats (PDF, DOCX, PPTX, XLSX, HTML, TXT) into a unified, semantically structured Markdown string plus rich metadata. It standardizes ingestion for AI/RAG, indexing, summarization, compliance archiving, and knowledge base preparation.
2. Key Features
- DOCX style → Markdown: Headings H1–H6, bold/italic, hyperlinks, ordered lists (indented), tables (cell-capped)
- PPTX: Slide text aggregation with slide indexing
- XLSX: Sheet → Markdown table (row/column traversal with cell cap)
- PDF: Plain text page extraction (PdfPig path; simple line grouping)
- HTML: ReverseMarkdown (GitHub-flavored) cleanup
- TXT: Plain text passthrough; unknown binary formats get a hex preview
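
The cell cap applied to DOCX and XLSX tables can be pictured with a minimal sketch. The function name `rows_to_markdown_table` is hypothetical, and the truncation rule (stop emitting whole rows once the cap would be exceeded) is an assumption; the component's actual rule may differ.

```python
def rows_to_markdown_table(rows, max_table_cells=0):
    """Render rows (lists of cell values) as a Markdown table.

    max_table_cells=0 means unlimited, mirroring the maxTableCells input.
    Assumption: a row that would push the cell count past the cap is
    dropped whole, together with every row after it.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = []
    used = 0
    for index, row in enumerate(rows):
        if max_table_cells and used + width > max_table_cells:
            break  # cap reached: stop emitting rows
        # Escape pipes and flatten newlines so each cell stays on one line.
        cells = [
            "" if cell is None else str(cell).replace("|", "\\|").replace("\n", " ")
            for cell in row
        ]
        cells += [""] * (width - len(cells))  # pad short rows
        lines.append("| " + " | ".join(cells) + " |")
        used += width
        if index == 0:
            # The first row is treated as the header row.
            lines.append("|" + " --- |" * width)
    return "\n".join(lines)
```

With a cap of 4 and two-column rows, only the header and one data row would be emitted.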
3. Architecture Summary
1. Detect format by extension
2. Dispatch extractor (PdfPig, OpenXML, ClosedXML, ReverseMarkdown, direct read)
3. Build Markdown with semantic enrichers (DOCX pipeline)
4. Apply table/cell caps (safety)
5. Enforce oversize policy (Trim|Fail)
6. Collect metrics + optional artifacts
7. Return Markdown + metadata + optional logs zip
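
Steps 1, 2, and 5 above can be sketched as follows. This is an illustration under stated assumptions, not the component's code: the extension-to-extractor mapping, the function names `detect_extractor` and `enforce_oversize`, and the handling of unrecognized policy values are all hypothetical.

```python
import os

# Assumed mapping from file extension to the extractors named in step 2.
EXTRACTORS = {
    ".pdf": "PdfPig",
    ".docx": "OpenXML",
    ".pptx": "OpenXML",
    ".xlsx": "ClosedXML",
    ".html": "ReverseMarkdown",
    ".txt": "direct read",
}


def detect_extractor(file_name):
    """Step 1 and 2: pick an extractor from the file extension."""
    extension = os.path.splitext(file_name)[1].lower()
    # Unknown formats fall back to the hex preview listed under Key Features.
    return EXTRACTORS.get(extension, "hex preview")


def enforce_oversize(markdown, policy="", soft_limit_chars=0):
    """Step 5: apply the oversize policy to the generated Markdown.

    soft_limit_chars=0 means unbounded; a blank policy means Trim.
    Assumption: any value other than "Fail" is treated as Trim.
    """
    if soft_limit_chars <= 0 or len(markdown) <= soft_limit_chars:
        return markdown
    if policy.strip().lower() == "fail":
        raise ValueError(
            f"Markdown is {len(markdown)} chars; limit is {soft_limit_chars}"
        )
    return markdown[:soft_limit_chars]
```

Keeping the policy check separate from extraction means the same limit applies regardless of which extractor produced the text.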
4. Public Actions
`ConvertDocumentToMarkdown`\
Inputs: fileName (Text), fileBinary (Binary), includeImagesReferences (Boolean; placeholder, not yet implemented), maxTableCells (Integer, 0=unlimited), oversizePolicy (Text: Trim|Fail; blank=Trim), oversizeSoftLimitChars (Integer, 0=unbounded), collectLogs (Boolean), attachArtifacts (Boolean).\
Outputs: Markdown (Text), metadata (Structure), logsZipFile (Binary).
`DetectDocumentMetadataOnly`\
Takes a subset of the same inputs and generates no Markdown body; returns metadata plus optional logs.
5. Usage Flow (Typical)
1. Upload binary from OutSystems file upload or external source
2. Call `ConvertDocumentToMarkdown` with a suitable `maxTableCells` (e.g., 5000) and `oversizeSoftLimitChars` (e.g., 120000)
3. Store Markdown & metadata in a data entity / external store
4. Optional: Apply downstream chunking or embedding generation
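
For the optional chunking in step 4, one possible approach is to split the returned Markdown at heading lines and pack sections into size-bounded chunks. The sketch below is not part of OmniDoc2MD; `chunk_markdown` and `max_chars` are hypothetical names, and the heading detection is naive (a `#` line inside a fenced code block would also be treated as a heading).

```python
import re


def chunk_markdown(markdown, max_chars=2000):
    """Split Markdown into chunks that start at headings where possible."""
    # Zero-width split before every ATX heading line (# through ######).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    current = ""
    for section in sections:
        if not section:
            continue
        # Start a new chunk if adding this section would exceed the limit.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
        # A single section longer than the limit is cut at the limit.
        while len(current) > max_chars:
            chunks.append(current[:max_chars])
            current = current[max_chars:]
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks in order reproduces the original string, so nothing is lost before embedding.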
6. Suggested Default Parameter Set