I need help with a PDF file that contains text and tables. How can I read the data from the tables and extract the heading text?
Dear,
Please try this component: https://www.outsystems.com/forge/component-documentation/1819/pdf-helper-o11/0
Action name: ReadTextFromPDF(). Based on the table position, you can extract the header value. Demo: https://miguel-antunes.outsystemscloud.com/PDFHelperDemoApp/ReadTextFromPDF.aspx?(Not.Licensed.For.Production)=
Hi @Vignesh Sekar
Thanks for your reply. I tried that Forge component, but it only returns the content of the PDF as plain text. I want to read the data inside the tables as well as the text.
Does the PDF follow a standard, fixed layout?
If yes, we can use this component, because it extracts every word. For example, if your table is on the 3rd line, you can read the 3rd line of the extracted text and apply a small workaround to split out the values. (I tried this with the attached sample PDF and was able to extract and read the data by position.)
If not, we can't use this component.
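To make the position-based workaround more concrete, here is a minimal Python sketch of the idea, not the Forge component itself. It assumes the extracted text keeps one table row per line and separates columns with tabs or runs of two or more spaces; that is an assumption about the output, so check it against what ReadTextFromPDF actually returns for your file. The name row_at and the sample text are hypothetical.

```python
import re

def row_at(extracted_text, line_number):
    """Return the cells of the 1-based, non-empty line `line_number`.

    Assumption: columns are separated by a tab or by 2+ whitespace characters.
    """
    lines = [ln for ln in extracted_text.splitlines() if ln.strip()]
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"text has {len(lines)} non-empty lines")
    return re.split(r"\s{2,}|\t", lines[line_number - 1].strip())

# Hypothetical extracted text: a title, a header row, and one data row.
sample = "Invoice Summary\n\nItem    Qty    Price\nWidget    2    9.50\n"

header = row_at(sample, 2)      # header row of the table
first_row = row_at(sample, 3)   # first data row
record = dict(zip(header, first_row))
```

In OutSystems the same logic would be a String_Split on the extracted text followed by an index lookup. It only holds up while the layout of the PDF never changes.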
This is the PDF format. I want to get the headings shown in red and yellow, plus the rest of the table.
Dear,
Can you share the PDF? (If it's confidential, replace the values with dummy data before sharing.)
Hi, can you please take a look at this discussion: https://www.outsystems.com/forums/discussion/75875/extract-data-from-pdf/
Hi @SreenivasuluReddy Lingala
You can build a small Integration Studio extension that uses iText7 to read table data from a PDF.
With this approach you can detect tables and read the text inside the table cells.
It runs on the server and makes no 3rd-party API calls, so the document data never leaves your environment.
Hi,
In practice this depends on how the PDF is generated.
If the PDF has selectable text (not scanned), you can extract text and tables using a PDF parser (e.g. PDFBox / iText) and then identify headings based on layout information such as font size, position, or style.
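As a rough illustration of identifying headings from layout information, the sketch below treats any text span whose font size is clearly larger than the typical (median) size as a heading. The Span structure, the 1.2 ratio and the sample values are all hypothetical; a real parser such as PDFBox or iText exposes font information through its own API, which you would map into something like this first.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Span:
    """Hypothetical text span, as a PDF parser might report it."""
    text: str
    font_size: float

def find_headings(spans, ratio=1.2):
    """Return the text of spans at least `ratio` times the median font size."""
    if not spans:
        return []
    body_size = median(s.font_size for s in spans)
    return [s.text for s in spans if s.font_size >= body_size * ratio]

# Hypothetical spans: two larger headings mixed with 10pt table text.
spans = [
    Span("Quarterly Report", 18.0),
    Span("Region", 10.0),
    Span("North", 10.0),
    Span("Totals", 14.0),
    Span("42", 10.0),
]
headings = find_headings(spans)
```

Font size is only one signal. Position or bold style can be combined with it in the same way if size alone does not separate the headings.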
If the PDF is scanned or you need to detect headings by colour (red/yellow), you’ll need an OCR / Document AI approach. Services like AWS Textract, Azure Form Recognizer or Google Document AI can extract tables and structured text. If colour is a requirement, an additional image-processing step is needed to detect coloured regions before or after OCR.
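For the colour part, the image-processing step can be as simple as classifying the dominant colour of each heading region once you have its RGB value. The sketch below is only an assumption of how that could look: the function name, the hue ranges and the saturation/brightness cut-offs are illustrative and would need tuning against your real PDFs.

```python
import colorsys

def classify_colour(r, g, b):
    """Classify an 8-bit RGB colour as 'red', 'yellow' or 'other'.

    The hue ranges and the 0.4 cut-offs are assumed values, not a standard.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < 0.4 or v < 0.4:
        return "other"          # too grey or too dark to call a colour
    degrees = h * 360
    if degrees < 20 or degrees >= 340:
        return "red"
    if 40 <= degrees <= 70:
        return "yellow"
    return "other"

# Example inputs: pure red, pure yellow, white background.
labels = [classify_colour(255, 0, 0),
          classify_colour(255, 255, 0),
          classify_colour(255, 255, 255)]
```

Working in HSV rather than raw RGB keeps the check tolerant of lighter or darker shades of the same colour.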
From an OutSystems perspective, the usual approach is to:
Integrate one of these services via REST,
Receive structured JSON (headings + table rows/columns),
Map the result into OutSystems structures.
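The last two steps can be sketched as follows. The JSON shape here (a "headings" list and "tables" with "header" and "rows") is hypothetical; Textract, Form Recognizer and Document AI each return their own, more detailed format, so the mapping has to be adapted to whichever service you call.

```python
import json

def table_records(payload):
    """Turn each table row into a dict keyed by the table's header cells."""
    doc = json.loads(payload)
    records = []
    for table in doc.get("tables", []):
        header = table["header"]
        for row in table["rows"]:
            records.append(dict(zip(header, row)))
    return records

# Hypothetical service response with one heading and one table.
payload = json.dumps({
    "headings": [{"text": "Sales", "colour": "red"}],
    "tables": [{
        "header": ["Region", "Amount"],
        "rows": [["North", "120"], ["South", "80"]],
    }],
})

records = table_records(payload)
headings = [h["text"] for h in json.loads(payload)["headings"]]
```

In OutSystems, consuming the REST API generates structures from the response automatically, so this mapping becomes a loop that copies the rows into your own entities or structures.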
Pure OutSystems logic alone is usually not enough for reliable table and heading extraction from PDFs.
Hope this helps.