OmniDoc2MD – Multi‑Format Document → Markdown & Metadata Converter
Converts PDF, DOCX (headings H1–H6, lists, bold/italic, hyperlinks), PPTX (slide text), XLSX (each sheet rendered as a Markdown table, subject to a cell cap), HTML (cleaned, then converted to Markdown), and plain text into unified, semantically structured Markdown. Each conversion also returns rich metadata (pages, slides, sheets, tables, headings, list items, word/char counts, version, timing, truncation flags).
Key Features
- Style‑aware DOCX rendering (headings, emphasis, links, ordered lists with indentation)
- Table extraction (DOCX/XLSX) with a safety cap on the number of cells
- Oversize policy (Trim or Fail) plus a soft character limit
- Deterministic output & embedded version resource
- Optional logging + zipped diagnostic artifacts (metrics + samples)
- Image extraction from DOCX and PPTX
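
The cell cap and the Trim/Fail oversize policy listed above can be illustrated with a minimal Python sketch. This is not OmniDoc2MD's actual code or API. The names `OversizePolicy`, `apply_oversize_policy`, `rows_to_markdown_table`, `soft_char_limit`, and `max_cells` are hypothetical, and the sketch assumes that Trim cuts the output at the limit and sets a truncation flag, while Fail raises an error.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class OversizePolicy(Enum):
    """Hypothetical names for the two policies mentioned in the feature list."""
    TRIM = "trim"
    FAIL = "fail"


def apply_oversize_policy(
    text: str, soft_char_limit: int, policy: OversizePolicy
) -> Tuple[str, bool]:
    """Return (text, truncated). Assumed behavior, not the component's API."""
    if len(text) <= soft_char_limit:
        return text, False
    if policy is OversizePolicy.FAIL:
        raise ValueError(
            f"output has {len(text)} chars, limit is {soft_char_limit}"
        )
    # TRIM: keep the first soft_char_limit characters and flag the truncation.
    return text[:soft_char_limit], True


def rows_to_markdown_table(
    rows: Sequence[Sequence[object]], max_cells: int
) -> Tuple[str, bool]:
    """Render rows as a Markdown table, stopping before max_cells is exceeded.

    The first row is treated as the header. Returns (markdown, truncated).
    """
    if not rows or not rows[0]:
        return "", False
    width = len(rows[0])
    lines: List[str] = []
    used = 0
    truncated = False
    for index, row in enumerate(rows):
        if used + width > max_cells:
            truncated = True
            break
        # Cut or pad each row to the header width; escape pipes inside cells.
        cells = [
            "" if cell is None else str(cell).replace("|", "\\|")
            for cell in list(row)[:width]
        ]
        cells += [""] * (width - len(cells))
        lines.append("| " + " | ".join(cells) + " |")
        if index == 0:
            # Separator row required by Markdown table syntax.
            lines.append("| " + " | ".join(["---"] * width) + " |")
        used += width
    return "\n".join(lines), truncated
```

In this sketch, a caller would render each sheet with `rows_to_markdown_table`, join the results, and pass the combined Markdown through `apply_oversize_policy`, recording both truncation flags in the metadata.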
Ideal For
Ingestion, indexing, RAG pipelines, summarization, pre‑embedding normalization.
Provides consistent content preparation across source formats and works alongside PDF chunking components.
Adds support for converting a document from a URL.
License: BSD 3-Clause (https://opensource.org/licenses/BSD-3-Clause)