Extracting text from any type of file on the server side in ODC

Rahul sisodiya

Discussion

Hi Team

We need your suggestions. We want to extract the contents of all types of files as text. There is no limit on file size, and the file format can be anything. How can we achieve this on the server side in ODC?

We do have one Forge component, TEXTractor, but it is limited to files of up to 5.5 MB and does not work with scanned PDFs.

26 Dec 2025

Ozan Can Çali

Champion

Hi Rahul,

I don't think you can easily find a "do-it-all" solution for this, neither in OutSystems nor in any other platform. There are usually different libraries that specialize in parsing different kinds of files; e.g. there is the Excel Library or PDF2Text.

Working with web technologies also brings its own limitations such as the timeout thresholds and data size limits that you have to take into consideration.

Such requirements usually call for some limits on the user side; e.g. you should restrict what kinds of files users can upload and how big those files can be. This is also a best practice in terms of security and performance.
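To illustrate the kind of user-side limits meant here, below is a minimal sketch of an upload check that runs before any extraction is attempted. The allowed extensions, the size cap, and the function name `validate_upload` are assumptions chosen for the example, not values from OutSystems or from any Forge component.

```python
from pathlib import Path

# Assumed policy values, for illustration only; adjust to your own requirements.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".txt"}
MAX_SIZE_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB cap


def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (accepted, reason) for an uploaded file (hypothetical helper)."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {extension or '(none)'}"
    if size_bytes > MAX_SIZE_BYTES:
        return False, "file too large"
    return True, "ok"
```

Checking the extension alone does not prove what the file really contains, so a production check would probably also inspect the file header.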

Using AI agents might be a solution for you, but it can be too expensive and overengineered for what you want to achieve.

26 Dec 2025

Rahul sisodiya

Thanks Ozan,

The AI agent is not able to do this, which is why we need such functionality.

26 Dec 2025

Kilian Hekhuis

MVP

Hi Rahul,

There are literally tens of thousands of file formats that contain text - you really need to set a realistic limit. Your users very likely do not want to extract text from a WordPerfect DOS 5 file or text inside a 3D Studio Max file. I'd say you should limit the extraction to a number of predefined types, like PDF, DOCX or ODF. As for extracting text from (embedded) images, that's a different ballgame altogether.

The tl;dr is: there is no easy one-size-fits-all solution. You need to find libraries that can do this, per file type.
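A rough sketch of what "one library per file type" can look like in practice: a small dispatch table that maps a file extension to an extractor function. Only two formats are handled here, using nothing but the Python standard library; the function names are made up for the example, and the DOCX extractor reads only the paragraphs in `word/document.xml` (no headers, footers or embedded images).

```python
import io
import zipfile
from pathlib import Path
from xml.etree import ElementTree

# XML namespace used by WordprocessingML elements inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_txt(data: bytes) -> str:
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return data.decode("utf-8", errors="replace")


def extract_docx(data: bytes) -> str:
    # A .docx file is a ZIP archive; the main text lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ElementTree.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{W_NS}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{W_NS}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# One extractor per supported type; anything else is rejected explicitly.
EXTRACTORS = {".txt": extract_txt, ".docx": extract_docx}


def extract_text(filename: str, data: bytes) -> str:
    extension = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        raise ValueError(f"no extractor registered for {extension!r}")
    return extractor(data)
```

Supporting another format then means registering one more function in `EXTRACTORS`, which keeps the list of accepted types explicit.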

27 Dec 2025

Rahul sisodiya

Thanks Kilian. The file size limit would be 50 MB, and the file formats would be PDF, Word, Excel, EML, MSG, text, PPT and RAR. As far as we know, the agent currently accepts PDF (not scanned PDF) and text.

28 Dec 2025

Bruno Gonçalves

Hi Rahul,

I'm the creator of TEXTractor, and I just want to share that while you were making this post I was releasing a new version of the component that can deal with input files larger than 5.5MB.

As an alternative to passing the file binary, you can now pass an API endpoint from which the external logic can fetch the input file. I have also changed the component demo (TEXTractor Demo) to illustrate how to use this new feature.
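For readers unfamiliar with the "fetch from an endpoint" pattern, here is a generic sketch of the idea: the file is streamed from a URL to a temporary file in chunks, so it never has to travel as one large binary parameter. This is not TEXTractor's actual implementation; the function name, the timeout and the size cap are assumptions made for the example.

```python
import os
import tempfile
import urllib.request


def fetch_to_tempfile(url: str, max_bytes: int, chunk_size: int = 64 * 1024) -> str:
    """Stream `url` to a temporary file and return its path (hypothetical helper).

    Raises ValueError as soon as more than `max_bytes` have been received.
    """
    target = tempfile.NamedTemporaryFile(delete=False)
    try:
        # Assumed 30-second network timeout.
        with target, urllib.request.urlopen(url, timeout=30) as response:
            total = 0
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
                if total > max_bytes:
                    raise ValueError("download exceeds size limit")
                target.write(chunk)
    except Exception:
        # Do not leave a partial download behind.
        os.unlink(target.name)
        raise
    return target.name
```

The caller is responsible for deleting the returned file once the text has been extracted.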

Coincidentally, I recently came across a use case of text extraction from scanned PDF documents (complex insurance forms), and it might be helpful to share my conclusions:

I tested OpenAI for extracting the document text and it proved to be unreliable. I also tested Tesseract OCR, which had trouble dealing with the complexity of the documents I was working with.

Next to this, I also tested a bunch of online OCR services, and while most proved not to be good enough, there was one that proved to be quite good: the OCR PDF API part of the Adobe PDF Services API (https://developer.adobe.com/document-services/docs/overview/pdf-services-api/howtos/ocr-pdf/). This API can be used to convert a scanned/flattened PDF into a regular one, which you could then pass through TEXTractor (or any other component/library/service of your preference) to get the actual text.

The Adobe PDF Services API requires a subscription, but its free tier includes up to 500 calls per month.

Hope this information helps!

30 Dec 2025

Rahul sisodiya

Thanks Bruno for this wonderful update. We are using your Forge component to extract text from files. Yes, you are right; I have also tried many OCR APIs for scanned PDFs, and I will try the Adobe PDF API as well. I have a question regarding TEXTractor: what file size, in MB, does it support now?

30 Dec 2025

Bruno Gonçalves

Replying to Rahul sisodiya's comment on 30 Dec 2025 11:03:36

It's very challenging to come up with a size limit figure.

I believe that the most relevant limiting factor is the external logic execution timeout (95 seconds), and different file types (and even instances within those types) have different processing complexities.

Just to illustrate: I have just tested the text extraction from two distinct PDF files, one of 50 MB and the other of 100 MB, and I got a timeout for the 50 MB one while it succeeded for the 100 MB one.
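Since the limiting factor is processing time rather than file size, one generic way to cope is to work against a time budget and return partial results instead of failing outright. The sketch below is only an illustration of that idea, not how TEXTractor works; the 95-second figure is the one quoted above, while the safety margin, the function name and the page-by-page callback are assumptions.

```python
import time
from typing import Callable, Iterable, List, Tuple

EXTERNAL_LOGIC_TIMEOUT_S = 95  # execution timeout quoted in this thread
SAFETY_MARGIN_S = 10           # assumed headroom for returning the result


def extract_within_budget(
    pages: Iterable[object],
    extract_page: Callable[[object], str],
    budget_s: float = EXTERNAL_LOGIC_TIMEOUT_S - SAFETY_MARGIN_S,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Extract text page by page until done or until the budget is used up.

    Returns (texts, completed); `completed` is False when extraction stopped
    early. The check happens between pages, so a single very slow page can
    still overrun the budget.
    """
    deadline = clock() + budget_s
    texts: List[str] = []
    for page in pages:
        if clock() >= deadline:
            return texts, False
        texts.append(extract_page(page))
    return texts, True
```

A caller that gets `completed == False` could, for example, split the document and process the remaining pages in a second call.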

30 Dec 2025

Rahul sisodiya

@Bruno Gonçalves I am still getting this error when extracting text from .msg files (file size 5.51 MB) while testing in the demo app. I also tested a standard PDF file of 3.04 MB and I only get the page numbers back, even though it is not a scanned PDF. I have attached screenshots for your reference.

Screenshot (205).png

Screenshot (204).png

30 Dec 2025

Bruno Gonçalves

I get the impression that the content of the PDF pages might be images, but I could be wrong of course. Can you share the file with me using the email address you can find under the TEXTractor component details in ODC Forge?

The .MSG error is a trickier one: OutSystems has been making changes to the external logic DLL loading mechanism. I know this because I have been in touch with OutSystems support as I was experiencing issues. I believe it is related to that, since MSG file processing was working fine a couple of days ago and I have made no changes.

I will wait a day or two as OutSystems might fix this, and if not I will report it. In the meantime you can always use the O11 version demo if you want just to test the functionality: https://brunogoncalves.outsystemscloud.com/TEXTractorDemo/

30 Dec 2025

Rahul sisodiya

@Bruno Gonçalves I have shared files with you over your support email.

30 Dec 2025

Bruno Gonçalves

@Rahul sisodiya, I can't seem to find your files in my mailbox.

In the meantime, I managed to fix the issue you were having with processing ".msg" files. Since it stopped working from one moment to the next, I believe it had to do with changes OutSystems made to the external logic runtime that affected its available encoding types.

I have just released a new version of TEXTractor (1.7.0) that includes this fix. Please give it a try and let me know if it is working as expected.

10 Jan
