Hello Everyone,
I need to read data that is laid out as a table in a PDF, but whenever I convert the PDF to text, all the data gets mixed up, i.e. the rows and columns are jumbled together (1st image). I need to read the data row by row or column by column. The data in the PDF looks similar to the 2nd image.
Thank you
Shubham Mishra
Hi Shubham,
Maybe the component below could help you extract the PDF data to text:
https://www.outsystems.com/forge/component-overview/1819/pdf-helper
Thanks,
Siddhant
Thank you, but I can already convert the PDF data to text (image 1 is the converted data). What I need is to get the data in the table in one pass, either row-wise or column-wise.
Try converting the data to pipe-delimited values, then pick out the values you need by splitting on |. That way you should be able to separate the data as required.
Or could you send me the PDF? Maybe I can help with the conversion.
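For illustration, the pipe-delimited idea above could look roughly like the sketch below. It is written in Python only to show the logic; an OutSystems extension would do the same in C#. The function names and the sample text are hypothetical, and it assumes the converted text already has one table row per line with cells separated by |, which depends on how the PDF was converted.

```python
def parse_pipe_table(text):
    """Split pipe-delimited text into rows, each row a list of cell strings."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        # Split on the delimiter and trim whitespace around each cell.
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


def columns_of(rows):
    """Turn a list of rows into a list of columns (column-wise view)."""
    if not rows:
        return []
    # Pad short rows with empty cells so every row has the same width.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    return [list(column) for column in zip(*padded)]


# Hypothetical sample, standing in for the converted PDF text.
sample = "Item | Qty | Price\nPen | 2 | 10\nBook | 1 | 250\n"
rows = parse_pipe_table(sample)   # row-wise access: rows[1] is the first data row
columns = columns_of(rows)        # column-wise access: columns[0] is the first column
```

With the rows in a list, reading by row is plain indexing, and reading by column is the same data transposed. Rows with missing cells are padded with empty strings here; that is a design choice for the sketch, not something the conversion guarantees.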
Sure, I have uploaded the PDF.
Hi Shubham,
If you are willing to create your own extension, you might take a look at tabula-sharp (BobLd/tabula-sharp on github.com, "Extract tables from PDF files (port of tabula-java)"). It detects tables in a document and extracts their rows and columns. I have personally only done some small tests with it, as I have access to a professional document extraction solution.
Best
Stefan
Hello, I'm currently looking at tabula-sharp on GitHub. Can you help me with how to use it in my extension? Thank you!