In ODC, we want to extract the content of any type of file as text on the server side

Hi Team

We need your suggestions. We want to extract the content of all types of files as text, with no limit on file size and no restriction on file format. How can we achieve this on the server side in ODC?

We already use a Forge component, TEXTractor, but it is limited to files of up to 5.5 MB and does not work with scanned PDFs.

2024-09-14 05-42-00
Ozan Can Çalı
Champion

Hi Rahul,

I don't think you can easily find a "do-it-all" solution for this, either in OutSystems or in any other platform. There are usually different libraries that specialize in parsing different kinds of files; e.g. there is the Excel Library or PDF2Text.

Working with web technologies also brings its own limitations such as the timeout thresholds and data size limits that you have to take into consideration.

Such requirements usually call for some limitations on the user side; e.g. you should restrict which kinds of files users can upload and how big those files can be. This is also a best practice in terms of security and performance.
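
A minimal sketch of such an upload check, in Python purely for illustration. The allowed extensions, the 10 MB cap, and the name `validate_upload` are assumptions made for this example, not values from OutSystems or any component mentioned here.

```python
from pathlib import Path

# Assumed policy values for illustration; pick limits that fit your app.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".txt"}
MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload is accepted."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Checking the extension alone does not prove what a file contains, so a real app would probably also inspect the content before handing it to a parser.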

Using AI agents might be a solution for you, but it can be too expensive and overengineered for what you want to achieve.

2024-06-23 06-21-39
Rahul sisodiya

Thanks Ozan,

The AI agent is not able to do this, which is why we need this functionality.

2020-09-15 13-07-23
Kilian Hekhuis
 
MVP

Hi Rahul,

There are literally tens of thousands of file formats that contain text - you really need to set a realistic limit. Your users very likely do not want to extract text from a WordPerfect DOS 5 file or text inside a 3D Studio Max file. I'd say you should limit the extraction to a number of predefined types, like PDF, DOCX or ODF. As for extracting text from (embedded) images, that's a different ballgame altogether.

The tl;dr is: there is no easy one-size-fits-all solution. You need to find libraries that can do this, per file type.
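
The "one library per file type" idea can be sketched as a small registry that maps an extension to an extractor. This is only an illustration using the Python standard library: it covers plain text and .eml, and the names `EXTRACTORS` and `extract_text` are made up for the example. PDF, DOCX and the like would each need their own library behind the same interface.

```python
import email
from email import policy
from pathlib import Path
from typing import Callable

Extractor = Callable[[bytes], str]


def extract_plain_text(data: bytes) -> str:
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return data.decode("utf-8", errors="replace")


def extract_eml(data: bytes) -> str:
    # Parses an email message and returns its plain-text body, if any.
    message = email.message_from_bytes(data, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""


# One entry per supported type; anything not listed is rejected.
EXTRACTORS: dict[str, Extractor] = {
    ".txt": extract_plain_text,
    ".eml": extract_eml,
}


def extract_text(filename: str, data: bytes) -> str:
    extension = Path(filename).suffix.lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        raise ValueError(f"no extractor registered for {extension!r}")
    return extractor(data)
```

Rejecting unknown types explicitly keeps the supported list visible and makes adding a new format a one-line change to the registry.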

2024-06-23 06-21-39
Rahul sisodiya

Thanks Kilian. The file size limit would be 50 MB, and the file formats would be PDF, Word, Excel, EML, MSG, text, PPT and RAR. As we know, the agent currently accepts only PDF (not scanned PDF) and text.

Bruno Gonçalves

Hi Rahul,

I'm the creator of TEXTractor, and I just want to share that while you were making this post I was releasing a new version of the component that can deal with input files larger than 5.5MB.

As an alternative to passing the file binary, you can now pass an API endpoint from which the input file can be fetched by the external logic. I have also changed the component demo (TEXTractor Demo) to illustrate how to use this new feature.
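
To illustrate the general "fetch by URL instead of passing the binary" pattern, here is a hedged Python sketch. It is not TEXTractor's implementation; `fetch_file`, `read_capped`, and the chunk size are assumptions made for this example. The idea is that the receiving side downloads the file in chunks and stops as soon as a size cap is exceeded.

```python
import urllib.request
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumed chunk size


def read_capped(stream: BinaryIO, max_bytes: int) -> bytes:
    """Read a stream fully, raising ValueError once it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"input exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


def fetch_file(url: str, max_bytes: int, timeout_seconds: float = 30.0) -> bytes:
    # The URL is whatever endpoint your app exposes for the file (hypothetical here).
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return read_capped(response, max_bytes)
```

The endpoint that serves the file would normally need some form of access control, since it exposes the document outside the app.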

Coincidentally, I recently came across a use case of text extraction from scanned PDF documents (complex insurance forms), and it might be helpful to share my conclusions:

I tested OpenAI for extracting the document text, and it proved unreliable. I also tested Tesseract OCR, which had trouble with the complexity of the documents I was dealing with.

I also tested a number of online OCR services, and while most were not good enough, one proved to be quite good: the OCR PDF API, part of the Adobe PDF Services API (https://developer.adobe.com/document-services/docs/overview/pdf-services-api/howtos/ocr-pdf/). This API can be used to convert a scanned/flattened PDF into a regular one, which you could then pass through TEXTractor (or any other component/library/service of your preference) to get the actual text.
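
The two-step flow (OCR first, then regular extraction) could be wired up roughly as below. This is a sketch under assumptions: `extract` and `ocr_to_searchable_pdf` are placeholder callables standing in for whatever extractor and OCR service you use, and the 20-character threshold is arbitrary. No real Adobe or TEXTractor API is called here.

```python
from typing import Callable

MIN_USEFUL_CHARS = 20  # assumed threshold; tune against your own documents


def extract_pdf_text(
    pdf_bytes: bytes,
    extract: Callable[[bytes], str],
    ocr_to_searchable_pdf: Callable[[bytes], bytes],
) -> str:
    """Try direct extraction first; run OCR only if the result looks empty."""
    text = extract(pdf_bytes)
    if len(text.strip()) >= MIN_USEFUL_CHARS:
        return text
    # Likely a scanned/flattened PDF: convert it, then extract again.
    searchable_pdf = ocr_to_searchable_pdf(pdf_bytes)
    return extract(searchable_pdf)
```

Trying direct extraction first means the OCR service is only called for documents that appear to need it, which matters when the service is metered.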

The Adobe PDF Services API requires a subscription, but its free tier includes up to 500 calls per month.

Hope this information helps!



2024-06-23 06-21-39
Rahul sisodiya

Thanks, Bruno, for this wonderful update. We are using your Forge component to extract text from files. Yes, you are right: I have also tried many OCR APIs for scanned PDFs, and I will try the Adobe PDF API as well. I have a question regarding TEXTractor: what file size (in MB) does it support now?

Bruno Gonçalves

It's very challenging to come up with a size limit figure.

I believe that the most relevant limiting factor is the external logic execution timeout (95 seconds), and different file types (and even instances within those types) have different processing complexities.

Just to illustrate: I have just tested the text extraction from two distinct PDF files, one of 50 MB and the other of 100 MB, and I got a timeout for the 50 MB one while the 100 MB one succeeded.
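
Since the limit is time rather than size, one possible approach is to process a document in pieces (pages, attachments) against a time budget and report whether the work finished. The Python sketch below only illustrates that idea: the 10-second safety margin, `extract_within_budget`, and the per-item `extract_one` callable are assumptions, and the check cannot interrupt a single item that is itself slow.

```python
import time
from typing import Callable, Iterable

PLATFORM_TIMEOUT_SECONDS = 95.0  # figure quoted above
SAFETY_MARGIN_SECONDS = 10.0     # assumed margin for returning the response


def extract_within_budget(
    items: Iterable[bytes],
    extract_one: Callable[[bytes], str],
    budget_seconds: float = PLATFORM_TIMEOUT_SECONDS - SAFETY_MARGIN_SECONDS,
) -> tuple[list[str], bool]:
    """Extract items until done or until the time budget is used up.

    Returns (texts, completed); completed is False if work was cut short.
    """
    deadline = time.monotonic() + budget_seconds
    texts = []
    for item in items:
        # Checked between items only; a slow item still runs to completion.
        if time.monotonic() >= deadline:
            return texts, False
        texts.append(extract_one(item))
    return texts, True
```

Returning partial results with a flag lets the caller decide whether to retry the rest in a second call instead of losing everything to a timeout.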

2024-06-23 06-21-39
Rahul sisodiya

@Bruno Gonçalves I am still getting this error when extracting text from a .msg file of 5.51 MB in the demo app. I also tested a standard PDF file of 3.04 MB, which is not a scanned PDF, and I get only the page numbers back. I have attached screenshots for your reference.

Screenshot (205).png
Screenshot (204).png
Bruno Gonçalves

I get the impression that the content of the PDF pages might be images, but I could be wrong of course. Can you share the file with me using the email address you can find under the TEXTractor component details in ODC Forge?
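
A rough way to check that suspicion from the extraction output alone: if every page yields little more than digits and whitespace, the page content is probably images. The Python sketch below is a heuristic with an assumed threshold of 10 letters per page; the function name is made up for the example.

```python
import re


def looks_image_only(page_texts: list[str], min_letters_per_page: int = 10) -> bool:
    """True if no page has meaningful text once digits and punctuation are ignored."""
    for text in page_texts:
        # Drop whitespace, punctuation, digits and underscores; keep letters only.
        letters = re.sub(r"[\W\d_]+", "", text)
        if len(letters) >= min_letters_per_page:
            return False
    return True
```

For example, `looks_image_only(["1", "2", "3"])` flags a document where only page numbers came back, which would be a hint to send it through OCR first.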

The .MSG error is a trickier one: OutSystems has been making changes to the external logic DLL loading mechanism. I know this because I have been in touch with OutSystems support about issues I was experiencing. I believe the error is related to that, since MSG file processing was working fine a couple of days ago and I have made no changes.

I will wait a day or two as OutSystems might fix this, and if not I will report it. In the meantime you can always use the O11 version demo if you just want to test the functionality: https://brunogoncalves.outsystemscloud.com/TEXTractorDemo/

2024-06-23 06-21-39
Rahul sisodiya

@Bruno Gonçalves I have shared files with you over your support email. 

Bruno Gonçalves

@Rahul sisodiya, I can't seem to find your files in my mailbox.

In the meantime, I managed to fix the issue you were having with processing ".msg" files. Since it stopped working from one moment to the next, I believe it had to do with changes OutSystems made to the external logic runtime that affected its available encoding types.

I have just released a new version of TEXTractor (1.7.0) that includes this fix. Please give it a try and let me know if it is working as expected. 
