TEXTractor

TEXTractor (ODC)

Stable version 2.4.0 (Compatible with ODC)
Uploaded on 5 Apr by Bruno Gonçalves

Detailed Description

TEXTractor extracts text and/or metadata from 76 file types (PDF, Office, images, email, and more).

Please find the full list of supported file types here.

Built using a modified version of the Toxy library (https://github.com/bmlpg/toxy).


Try now: link

(or install "TEXTractor Demo" from Forge)

Release notes 

Structured Document Extraction Enhancements:

  • New Table Support: Introduced a dedicated Table element type alongside the existing Paragraph type, so tabular data in DOC and DOCX files can be retrieved in structured form.
  • Human-Readable Styles: Improved DOC extraction to return actual Paragraph Style Names (e.g., "Heading 1") instead of internal style indexes.
  • Page Referencing: Added a PageNumber attribute to all document elements extracted from PDFs, enabling easier navigation and source tracking.
  • Enhanced PDF Segmentation: Upgraded the PDF page segmentation algorithm from "Recursive XY Cut" to "Docstrum". This change significantly improves the reliability of element detection, especially in complex layouts with tight margins or overlapping content.
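
To illustrate how the Paragraph and Table element types, style names, and PageNumber attribute described above might be consumed, here is a minimal Python sketch. It is not TEXTractor's actual API: the class layout, the field names (text, style_name, rows, page_number), and the helper functions are assumptions made for illustration only. Only the element type names, the "Heading 1" style name, and the PageNumber concept come from the release notes.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


# Hypothetical model of the extracted elements; field names are assumptions.
@dataclass
class Paragraph:
    text: str
    style_name: Optional[str] = None   # e.g. "Heading 1" for DOC/DOCX sources
    page_number: Optional[int] = None  # assumed to be set for PDF sources


@dataclass
class Table:
    rows: list[list[str]] = field(default_factory=list)  # assumed cell layout
    page_number: Optional[int] = None


Element = Union[Paragraph, Table]


def headings(elements: list[Element]) -> list[str]:
    """Return the text of paragraphs whose style name starts with 'Heading'."""
    return [
        e.text
        for e in elements
        if isinstance(e, Paragraph) and (e.style_name or "").startswith("Heading")
    ]


def group_by_page(elements: list[Element]) -> dict[Optional[int], list[Element]]:
    """Group elements by page number, preserving their original order."""
    pages: dict[Optional[int], list[Element]] = {}
    for e in elements:
        pages.setdefault(e.page_number, []).append(e)
    return pages
```

With a model along these lines, a consumer could build an outline from heading styles or trace any extracted element back to its source page, which is the kind of navigation the PageNumber attribute is meant to support.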