TEXTractor

TEXTractor (ODC)

Stable version 2.4.0 (Compatible with ODC)
Uploaded on 5 Apr (6 days ago) by Bruno Gonçalves

Documentation
2.4.0

Available Actions

  • GetText - Get file content in plain text.
  • GetMetadata - Get file metadata in a structured format.
  • GetDocument - Get document content in a structured format.
  • GetDom - Get DOM content in a structured format.
  • GetEmail - Get email content in a structured format.
  • GetSlideshow - Get slideshow content in a structured format.
  • GetSpreadsheet - Get spreadsheet content in a structured format.
  • GetVCard - Get vCard content in a structured format.


Alternative action sets with the suffixes "_FromREST" and "_WithREST" are also available. They retrieve the input file content and post the result via REST APIs. Use them when you need to avoid the 5.5MB payload limit on external library input/output.
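As a rough illustration of the size check behind that choice, the sketch below tests whether a file would exceed the limit. It assumes "5.5MB" means 5.5 * 1024 * 1024 bytes, which the platform may count differently, and the function name is hypothetical rather than part of TEXTractor.

```python
# Assumption: "5.5MB" is interpreted as 5.5 * 1024 * 1024 bytes.
# The platform may count the limit differently (e.g. decimal megabytes).
PAYLOAD_LIMIT_BYTES = int(5.5 * 1024 * 1024)


def exceeds_payload_limit(size_bytes: int, limit: int = PAYLOAD_LIMIT_BYTES) -> bool:
    """Return True when a file of size_bytes is larger than the limit.

    Hypothetical helper for illustration; not part of TEXTractor.
    """
    return size_bytes > limit


if __name__ == "__main__":
    # A 6 MiB file is over the assumed limit, so a REST-based action
    # would be the safer choice; a 1 KiB file is well under it.
    print(exceeds_payload_limit(6 * 1024 * 1024))
    print(exceeds_payload_limit(1024))
```

Files close to the limit are worth treating as over it, since the sketch only measures the raw file size.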



OCR Capabilities (Tesseract 5)

TEXTractor supports text extraction from scanned PDFs and from the following image formats: bmp, gif, jpeg, pbm, png, tiff, webp.


English is the default language, but you can use any of the Tesseract-supported languages (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).


The TEXTractor installation includes English, Spanish, Portuguese and German trained data. Any other Tesseract-supported language can be used by including the corresponding trained data file as an ODC App resource and passing its URL to TEXTractor. The expected trained data files can be found in the GitHub tessdata_fast repository (https://github.com/tesseract-ocr/tessdata_fast).


When adding .traineddata files as resources, you must set the Deploy Action to "Deploy to Target Directory", and rename the file to include a .bin extension (e.g., fra.traineddata.bin). This bypasses OutSystems platform restrictions on non-whitelisted file extensions and ensures TEXTractor can fetch the resource.
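The renaming step can be prepared locally before the file is added as a resource. The sketch below downloads a trained data file and saves it with the .bin extension. It assumes the files are served from the main branch of the tessdata_fast repository under a raw URL of the form shown; the helper names are hypothetical and not part of TEXTractor.

```python
import urllib.request
from pathlib import Path

# Assumption: trained data files are reachable under this raw URL prefix
# on the main branch of the tessdata_fast repository.
TESSDATA_FAST_BASE = "https://github.com/tesseract-ocr/tessdata_fast/raw/main"


def source_url(lang_code: str) -> str:
    """Build the assumed download URL for a language code such as 'fra'."""
    return f"{TESSDATA_FAST_BASE}/{lang_code}.traineddata"


def resource_name(lang_code: str) -> str:
    """File name to use for the ODC App resource, with the .bin extension."""
    return f"{lang_code}.traineddata.bin"


def fetch_as_resource(lang_code: str, dest_dir: str = ".") -> Path:
    """Download the trained data and save it under the .bin resource name."""
    target = Path(dest_dir) / resource_name(lang_code)
    urllib.request.urlretrieve(source_url(lang_code), str(target))
    return target


if __name__ == "__main__":
    # Only prints the names; call fetch_as_resource("fra") to download.
    print(source_url("fra"))
    print(resource_name("fra"))
```

The resulting file (for example fra.traineddata.bin) is the one to add as a resource with the Deploy Action set to "Deploy to Target Directory".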



Security & Privacy

All processing is performed entirely in-memory within the server context. No data is persisted, and no data ever leaves your environment.