OmniDoc2MD

OmniDoc2MD (ODC)

Stable version 0.1.4 (Compatible with ODC)
Uploaded on 8 May by OutSystems Labs

Detailed Description

OmniDoc2MD – Multi‑Format Document → Markdown & Metadata Converter


Convert PDF, DOCX (with headings H1–H6, lists, bold/italic, hyperlinks), PPTX (slide text), XLSX (sheet → Markdown tables with cell caps), HTML (cleaned → Markdown) and plain text into unified, semantically structured Markdown plus rich metadata (pages, slides, sheets, tables, headings, list items, word/char counts, version, timing, truncation flags).
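As an illustration of the "sheet → Markdown tables with cell caps" idea, here is a minimal Python sketch. It is not the component's implementation: the function name `rows_to_markdown`, the `max_cells` parameter, and the choice to stop at whole-row boundaries are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def _cell_text(value: object) -> str:
    """Render one cell: empty for None, pipes escaped, newlines flattened."""
    if value is None:
        return ""
    return str(value).replace("|", "\\|").replace("\n", " ")


def rows_to_markdown(
    rows: Sequence[Sequence[object]], max_cells: int = 1000
) -> Tuple[str, bool]:
    """Render rows as a Markdown table without exceeding a cell budget.

    The first row is treated as the header. Returns (markdown, truncated),
    where truncated is True if rows were dropped to respect max_cells.
    Hypothetical helper; names and semantics are assumptions.
    """
    if not rows:
        return "", False
    width = max(len(row) for row in rows)
    if width == 0:
        return "", False

    lines: List[str] = []
    used = 0
    truncated = False
    for index, row in enumerate(rows):
        # Stop before emitting a row that would push us past the cap.
        if used + width > max_cells:
            truncated = True
            break
        cells = [_cell_text(c) for c in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(cells) + " |")
        if index == 0:
            # Separator row that Markdown tables require after the header.
            lines.append("| " + " | ".join(["---"] * width) + " |")
        used += width
    return "\n".join(lines), truncated


sheet = [["Name", "Qty"], ["Apple", 3], ["Pear", None]]
table, was_truncated = rows_to_markdown(sheet, max_cells=4)
print(table)
print("truncated:", was_truncated)
```

With a cap of 4 cells, only the header and the first data row fit, and the returned flag plays the role of the truncation flags mentioned in the metadata.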

Key Features


- Style‑aware DOCX rendering (headings, emphasis, links, ordered lists with indentation)

- Table extraction (DOCX/XLSX) with safety cell limits

- Oversize policy (Trim or Fail) + soft char limit

- Deterministic output & embedded version resource

- Optional logging + zipped diagnostic artifacts (metrics + samples)

- Image extraction from DOCX and PPTX
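
The oversize policy (Trim or Fail) combined with a soft character limit can be pictured with the following Python sketch. It only illustrates the concept: `OversizePolicy`, `OversizeError`, and `apply_char_limit` are hypothetical names, and reading "soft" as "trim back to the last line break at or under the limit" is an assumption, not documented behaviour.

```python
from enum import Enum
from typing import Tuple


class OversizePolicy(Enum):
    TRIM = "trim"
    FAIL = "fail"


class OversizeError(ValueError):
    """Raised when output exceeds the limit and the policy is FAIL."""


def apply_char_limit(
    markdown: str, soft_limit: int, policy: OversizePolicy
) -> Tuple[str, bool]:
    """Enforce a character limit on converted Markdown.

    Returns (text, truncated). Hypothetical helper; the real component's
    inputs and outputs may differ.
    """
    if soft_limit < 0:
        raise ValueError("soft_limit must be non-negative")
    if len(markdown) <= soft_limit:
        return markdown, False
    if policy is OversizePolicy.FAIL:
        raise OversizeError(
            f"output has {len(markdown)} chars, limit is {soft_limit}"
        )
    # TRIM: prefer cutting at the last line break inside the limit so a
    # heading or table row is not split in half; otherwise cut hard.
    cut = markdown.rfind("\n", 0, soft_limit)
    if cut <= 0:
        cut = soft_limit
    return markdown[:cut], True


text = "# Title\n\nfirst paragraph\nsecond paragraph"
trimmed, truncated = apply_char_limit(text, 20, OversizePolicy.TRIM)
print(repr(trimmed), truncated)
```

Returning the flag alongside the text lets a caller record the truncation in metadata instead of silently losing content, while the Fail policy suits pipelines where partial documents are worse than none.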

Ideal For

Ingestion, indexing, RAG pipelines, summarization, pre‑embedding normalization.


Pairs with PDF chunking components so that content from multiple source formats is prepared in a consistent way.

Limitations
  • No OCR: image‑only / scanned PDFs inside the multi‑format path yield no text.
  • Image extraction is limited to DOCX and PPTX.
  • DOCX lists: bullet styles not distinguished yet (all numbering treated as ordered; nested bullet fidelity pending).
  • No language detection or locale metadata.
  • Unsupported formats: RTF, ODT, password/encrypted office documents.
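
Given these limits, a caller might screen files before sending them for conversion. The Python sketch below does this by file extension only. The extension sets are assumptions derived from the formats listed above (for example, `.txt` for plain text and `.htm` for HTML), `screen_input` is a hypothetical helper, and a check like this cannot detect password-protected documents or scanned PDFs.

```python
from pathlib import PurePath

# Assumed extension mapping for the formats listed on this page.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm", ".txt"}
KNOWN_UNSUPPORTED = {".rtf", ".odt"}


def screen_input(filename: str) -> str:
    """Classify a file name as 'supported', 'unsupported', or 'unknown'."""
    ext = PurePath(filename).suffix.lower()
    if ext in SUPPORTED:
        return "supported"
    if ext in KNOWN_UNSUPPORTED:
        return "unsupported"
    return "unknown"


for name in ["report.DOCX", "notes.rtf", "archive"]:
    print(name, "->", screen_input(name))
```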


Release notes 

Added support for converting a document from a URL.