TEXTractor

Stable version 2.4.0 (Compatible with OutSystems 11)
Uploaded on 5 Apr (2 days ago)
Rating: 5.0 (1 rating)

Details

TEXTractor extracts text and/or metadata from 76 file types (PDF, Office, Images, Email, and more).

Please find the full list of supported file types here.

Built using a modified version of the Toxy library (https://github.com/bmlpg/toxy).

Release notes (2.4.0)

Document Structured Extraction Enhancements:

  • New Table Support: Introduced a dedicated Table element type alongside the existing Paragraph type. This allows for structured tabular data retrieval from DOC and DOCX files.
  • Human-Readable Styles: Improved DOC extraction to return actual Paragraph Style Names (e.g., "Heading 1") instead of internal style indexes.
  • Page Referencing: Added a PageNumber attribute to all document elements extracted from PDFs, enabling easier navigation and source tracking.
  • Enhanced PDF Segmentation: Upgraded the PDF page segmentation algorithm from "Recursive XY Cut" to "Docstrum". This change significantly improves the reliability of element detection, especially in complex layouts with tight margins or overlapping content.
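
To make the element model in these notes more concrete, here is a minimal Python sketch of how a consumer might represent and filter the extracted elements. It is not the component's actual API: the class names, field names (`style_name`, `page_number`, `rows`) and helper functions are hypothetical, chosen only to mirror the Paragraph and Table types, style names, and PageNumber attribute described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Paragraph:
    """Hypothetical stand-in for an extracted Paragraph element."""
    text: str
    style_name: Optional[str] = None   # e.g. "Heading 1"
    page_number: Optional[int] = None  # assumed to be set only for PDF sources


@dataclass
class Table:
    """Hypothetical stand-in for an extracted Table element."""
    rows: List[List[str]] = field(default_factory=list)
    page_number: Optional[int] = None


Element = Union[Paragraph, Table]


def headings(elements: List[Element]) -> List[str]:
    """Return the text of paragraphs whose style name starts with "Heading"."""
    return [
        e.text
        for e in elements
        if isinstance(e, Paragraph)
        and e.style_name is not None
        and e.style_name.startswith("Heading")
    ]


def elements_on_page(elements: List[Element], page: int) -> List[Element]:
    """Return the elements whose page number matches the given page."""
    return [e for e in elements if e.page_number == page]


# Illustrative data only; not real extraction output.
sample: List[Element] = [
    Paragraph("Overview", style_name="Heading 1", page_number=1),
    Paragraph("Some body text.", style_name="Normal", page_number=1),
    Table(rows=[["Name", "Qty"], ["Widget", "3"]], page_number=2),
]
```

With a model along these lines, `headings(sample)` collects the heading paragraphs by their readable style names, and `elements_on_page(sample, 2)` returns the elements attributed to page 2.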
Reviews (1)
2025-11-03, in version 1.0.0
Useful tool for getting metadata for a file, making the job easy. This came just in time.