Extract data from pdf
Question

Hi ,

I have a scenario where I need to extract the data of particular fields from a PDF file and store it in a database entity. The PDF data has its own naming convention (for example, name and period), and my database entity has a different one: the data extracted from the name field in the PDF needs to be stored in the master name field in the database. How can I map this data?

2018-10-29 08-31-03
João Marques
 
MVP

Hi Shivani,


You basically need to perform OCR (Optical Character Recognition) on your document.

There are a few components on the Forge that do this:

 

Some of them need an image as input, but you can convert the PDF to images using other components.


Kind Regards,
João

2024-03-28 06-35-29
Shivani Rajoriya

Hi Joao,

With the help of these components I can only read the data from the PDF, but I need to extract a particular field and then store that field's data in the database.

Regards,
Shivani

2018-10-29 08-31-03
João Marques
 
MVP

Hi Shivani,


That's the part you have to build: a parser that extracts the particular field from the extracted text and saves it in the database.
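
As a rough illustration of such a parser, here is a minimal Python sketch. It assumes the extracted text contains one "label = value" pair per line (as in "name = ABC"); the sample text, the function name parse_fields and the regular expression are hypothetical and would need adjusting to the real layout of your documents.

```python
import re

# Assumed layout: one "label = value" pair per line, e.g. "name = ABC".
FIELD_PATTERN = re.compile(r"^\s*(?P<label>[A-Za-z][\w ]*?)\s*=\s*(?P<value>.+?)\s*$")


def parse_fields(text: str, wanted: set[str]) -> dict[str, str]:
    """Return the wanted fields found in the text, keyed by lowercase label."""
    found: dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # headings and free text are skipped
        label = match.group("label").strip().lower()
        # Keep only the first occurrence of each wanted label.
        if label in wanted and label not in found:
            found[label] = match.group("value")
    return found


# Hypothetical text as it might come back from OCR or text extraction.
sample_text = """Report
name = ABC
period = 2024-Q1
total = 100
"""

fields = parse_fields(sample_text, {"name", "period"})
print(fields)  # {'name': 'ABC', 'period': '2024-Q1'}
```

The resulting dictionary is what you would then write to the database, after renaming the keys to your entity's attribute names.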


Kind Regards,

João

2021-10-09 07-57-44
Stefan Weber
 
MVP

Hi,

This is not really an answer to your question, but you may want to take a look at my PDF annotations component, PDF Annotations - Overview | OutSystems. I am using an external PDF component, iText, to extract PDF annotations. iText can also extract PDF fields and basically everything else. You can also create and modify PDFs with iText.

Note that iText has an AGPL license.

Best

Stefan

2024-03-28 06-35-29
Shivani Rajoriya

Hi Stefan,

I have one PDF that contains the data and one Excel file that describes the naming convention of the PDF data. For example, the Excel file has a column named name, and the PDF contains "name = ABC". What I want is a mapping: my entity has an attribute named "mastername", so the value of the name field in the PDF should be fetched and stored in the mastername attribute in the database.
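
One way to picture this mapping is as a small lookup table. The Python sketch below assumes the Excel sheet is exported as CSV with two hypothetical columns, pdf_field and entity_attribute; only "name" and "mastername" come from the scenario above, while "masterperiod" and the helper names are invented for illustration.

```python
import csv
import io

# Hypothetical mapping sheet, exported from Excel as CSV:
# one row per PDF field, naming the entity attribute it should fill.
MAPPING_CSV = """pdf_field,entity_attribute
name,mastername
period,masterperiod
"""


def load_mapping(csv_text: str) -> dict[str, str]:
    """Read the mapping sheet into {pdf field name: entity attribute name}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["pdf_field"].strip().lower(): row["entity_attribute"].strip()
        for row in reader
    }


def to_entity_record(pdf_fields: dict[str, str], mapping: dict[str, str]):
    """Rename PDF field names to entity attribute names.

    Returns the renamed record plus the list of PDF fields with no mapping,
    so unmapped data is reported instead of silently dropped.
    """
    record: dict[str, str] = {}
    unmapped: list[str] = []
    for field, value in pdf_fields.items():
        attribute = mapping.get(field.strip().lower())
        if attribute is None:
            unmapped.append(field)
        else:
            record[attribute] = value
    return record, unmapped


mapping = load_mapping(MAPPING_CSV)
record, unmapped = to_entity_record(
    {"name": "ABC", "period": "2024-Q1", "page": "1"}, mapping
)
print(record)    # {'mastername': 'ABC', 'masterperiod': '2024-Q1'}
print(unmapped)  # ['page']
```

Keeping the mapping in data rather than in logic means a new PDF field only needs a new row in the sheet.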


2021-10-09 07-57-44
Stefan Weber
 
MVP

Is the data in your PDF stored in PDF fields, or is it just written text? If it is written text, you either have to extract the plain text from the PDF, if the PDF already contains a text layer, or do OCR first, as recommended by João. Retrieving data from plain text can be really challenging, especially when there are OCR misinterpretations or when the locations of data elements change between documents. If you have a larger use case for processing PDF documents, you might be better off with a professional data extraction solution such as Rossum AI (www.rossum.ai).
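
To give an idea of what coping with OCR misinterpretations can look like, here is a small Python sketch that matches a misread label against the labels you expect, using the standard library's difflib. The label list, the function name normalize_label and the 0.7 cutoff are assumptions for illustration; this only softens small misreadings of labels and does nothing for values or for layout changes.

```python
import difflib

# Labels we expect in the document (assumed from the scenario in this thread).
KNOWN_LABELS = ["name", "period"]


def normalize_label(raw_label: str, cutoff: float = 0.7):
    """Map a possibly misread label to a known one, or None if nothing is close.

    The cutoff is a similarity threshold between 0 and 1; 0.7 is an arbitrary
    starting point that should be tuned on real OCR output.
    """
    candidate = raw_label.strip().lower()
    matches = difflib.get_close_matches(candidate, KNOWN_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Typical misreadings: "m" read as "n", "i" read as "l".
for raw in ["nane", "Perlod ", "total"]:
    print(repr(raw), "->", normalize_label(raw))
```

A threshold that is too low will start matching unrelated labels, so unmatched labels are better logged and reviewed than guessed.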
