[Simple OCR Sample] I need help with a bug I am experiencing with this Forge Component (Simple OCR)

Forge Component
Published on 2018-11-30 by Takasi Moriya

I was tasked by my supervisor to study Forge Components that generate text from images of forms. I found one and installed it on my Personal Environment, since the component doesn't seem to be compatible with our Development environment, which currently runs OutSystems 10.

I installed the Forge Component successfully and got it to run. I chose an image to generate text from and got a result. The problem is that the result has messy spacing and at times contains erroneous text. To see what I mean, kindly refer to the images below.

I chose this image to generate the text from:




This is the text that was generated. I have marked the beginning and the end with dashed lines "----".

-----------------------------------------------------------------------------------------------------------------------------------------

COMPANY REGISTRATION FORM

1. Name of the company:
Limited
(insert name of company as reserved)

2. Type of company:
public Clprivate. “C1 Limited by guarantee 1 Unlimited (either public or private)
(select the type of company that applies)
3. The company—
Chas prepared its own articles of association; or
will adopt the model articles of association appropriate to the company; or
C will adopt some of those model articles and has prepared its own articles of association to supplement or modify

those model articles.
(select the option which applies)

“ifthe company has prepared its own articles or articles to supplement or modify the model articles, those articles
hhave been printed, dated and signed by the applicants and are attached to this application.

 

Target business start date:

Target accounting period end mont!

 

Number of employees at target business start date:

4. Physical address

 

 

 

Name of building/Plot No. Floor/Room No.
Street/Road Town’
District County

 

 

 

 

5. Contact address

 

P.O. Box Postal Code

 

Office No. Mobile No.

 

 

 

 

Email address™

 

 

 
----------------------------------------------------------------------------------------------------------------------------------------



As you may have noticed, the spacing is messed up, and the component doesn't seem to generate the correct spacing for text that is laid out in columns. In addition, it generates erroneous text. For example, it generated the letter "C" from the checkboxes.

I would like to know what suggestions you may have for addressing this. I am open to an alternative Forge Component, provided that it is "Free". I am also open to instructions on how to address it code-wise if there are no better alternatives in the Forge. I am having difficulty with this issue because the Server Actions used in this app come from "Extension" modules, and I don't know whether I can tweak or edit them in Integration Studio.

You can reproduce this issue by installing the Forge Components shown in the image below from Service Studio. Publish and run Simple OCR Sample. Then upload an image with text (you can use the one I gave as an example), choose the English language from the dropdown, and click the Recognize Characters button.




The actions of the SimpleOCR extension assume a simple text area.
The SimpleOCR extension is based on the Tesseract OCR library and provides character recognition functions.
If you need to separate characters from non-characters or to recognize complicated layouts, you have to use additional technologies.
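
To illustrate the layout point: Tesseract itself has page segmentation modes (`--psm`) and a `preserve_interword_spaces` variable that influence how blocks, columns, and spacing are treated. The sketch below assumes Tesseract is called directly from Python through the `pytesseract` wrapper, which is separate from SimpleOCR; whether the extension exposes these settings has not been verified here. The names `build_tesseract_config` and `recognize` are hypothetical helpers, and none of this is expected to fix checkboxes being read as letters.

```python
def build_tesseract_config(psm: int = 6, keep_spaces: bool = True) -> str:
    """Build a Tesseract command-line config string.

    psm 6 asks Tesseract to assume a single uniform block of text;
    psm 4 assumes a single column of text of variable sizes.
    """
    # Modes 0-13 exist in Tesseract 4 and later (older releases have fewer).
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if keep_spaces:
        # Keep runs of spaces between words instead of collapsing them,
        # which can help side-by-side form fields stay visually apart.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def recognize(image_path: str, psm: int = 6) -> str:
    """Run OCR on an image file.

    Assumption: pytesseract, Pillow, and the Tesseract binary are installed.
    """
    # Imported lazily so the config helper can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang="eng", config=build_tesseract_config(psm)
        )
```

Trying a few `psm` values against the same form image is usually the quickest way to see whether segmentation settings alone improve the output.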

You can modify the code of the SimpleOCR extension by using both Integration Studio and Visual Studio.
The libraries the code uses can be found on GitHub.
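
If changing the extension is not practical, part of the spacing noise can be removed after recognition. The following is a minimal sketch in Python (not the extension's own .NET code), assuming the recognized text is available as a plain string. `tidy_ocr_text` is a hypothetical helper; it only trims trailing whitespace and collapses runs of blank lines, and it cannot restore column layout or correct misread characters.

```python
def tidy_ocr_text(raw: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines into one."""
    cleaned = []
    # Treat the start of the text as "blank" so leading blank lines are dropped.
    previous_blank = True
    for line in raw.splitlines():
        line = line.rstrip()  # also turns whitespace-only lines into ""
        is_blank = line == ""
        if is_blank and previous_blank:
            continue  # skip repeated blank lines
        cleaned.append(line)
        previous_blank = is_blank
    # Drop a trailing blank line, if one remains.
    while cleaned and cleaned[-1] == "":
        cleaned.pop()
    return "\n".join(cleaned)
```

Applied to the output in the question, this would remove the long runs of empty lines between labels such as "Target business start date:" and "Number of employees at target business start date:" while leaving the recognized words untouched.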

I recommend using a cloud engine like Google Cloud Vision if you can. Its OCR function combines many technologies to recognize complicated document images.
OutSystems can probably use these cloud engines.
Unfortunately, I have no experience using them from OutSystems.
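
For reference, Cloud Vision's OCR is available over plain REST, which is the kind of endpoint an OutSystems REST integration could consume. The sketch below is written in Python rather than OutSystems logic and shows the request shape for the `images:annotate` endpoint with the `DOCUMENT_TEXT_DETECTION` feature. The function names and the API-key authentication are assumptions for illustration, and the call has not been verified from OutSystems.

```python
import base64
import json
import urllib.request

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes: bytes) -> dict:
    """Build the JSON body for a document text detection request."""
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded content.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # DOCUMENT_TEXT_DETECTION is the feature aimed at dense documents.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "imageContext": {"languageHints": ["en"]},
            }
        ]
    }


def recognize_document(image_bytes: bytes, api_key: str) -> str:
    """Send the image to Cloud Vision and return the recognized text.

    Assumption: api_key is a valid key for a project with the Vision API enabled.
    """
    body = json.dumps(build_annotate_request(image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    first = payload["responses"][0]
    # fullTextAnnotation is absent when no text was detected.
    return first.get("fullTextAnnotation", {}).get("text", "")
```

Note that Cloud Vision is a paid service beyond its free usage tier, which matters given the requirement for a free solution.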

Sorry I could not be of more help.


I see. I understand. Thank you for responding Takashi!