Krishna

OCR for extracting Japanese character - PDF to Excel

12 posts in this topic

Hi All,

 

I have attached a PDF for your reference. The task is to convert a PDF file (data from each column, including geometric shapes) to an Excel file. Please let us know whether "PDFElement 6" is suitable for this requirement (reading Japanese kanji characters, data written inside a circle or square, etc.). Does the software depend on a particular MS Excel version?

Is an API or SDK version available for developers to use (C++/C#)?

OS: Windows

It would be great if your tool can support our requirement. We are interested in buying this tool if our requirement can be fulfilled. 

We tried the PDFElement 6 trial version to evaluate the software with a normal PDF file. The sample output of the PDFElement 6 tool is attached. It does not extract some of the symbols, such as circles and squares. We also found alignment problems when a table has merged cells. Is there any way to resolve these problems?

We would also like to evaluate the OCR tool before buying. Is there any way to evaluate that functionality before buying the tool?

 

Regards,

Krishna

Table.pdf

table_pdfelement.xlsx


Hi Krishna,

 

For your information, the OCR function is available in the trial version of PDFelement 6 Pro; in fact, you have already tested it. Since your PDF file is a scanned file, our program used the OCR function to convert it to the Excel file directly. Without OCR, the converted Excel file would be blank when the PDF file is a scanned file.

 

In this case, it seems the OCR does not recognize the data written in a circle or square well. OCR recognizes text according to its glyph, so a square or circle around the glyph affects the recognition result.
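
To illustrate why an enclosing shape interferes with glyph-based recognition, here is a toy sketch in Python. It is not how PDFelement's OCR works internally, and real OCR engines use statistical models rather than exact matching. The templates, the `recognize` and `strip_frame` helpers, and the sample bitmap are all hypothetical and exist only for this illustration. The sketch assumes the cell has been cropped so that the frame touches the image edge and the glyph does not touch the frame.

```python
from collections import deque

# Hypothetical 5x5 glyph templates ('#' = ink), for illustration only.
TEMPLATES = {
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
}

def to_pixels(rows):
    """Return the set of (row, col) ink coordinates in a text bitmap."""
    return {(r, c) for r, row in enumerate(rows)
            for c, ch in enumerate(row) if ch == "#"}

def normalize(pixels):
    """Shift the ink so that its bounding box starts at (0, 0)."""
    if not pixels:
        return frozenset()
    r0 = min(r for r, _ in pixels)
    c0 = min(c for _, c in pixels)
    return frozenset((r - r0, c - c0) for r, c in pixels)

def recognize(pixels):
    """Return the template whose ink matches exactly, else None."""
    shape = normalize(pixels)
    for name, rows in TEMPLATES.items():
        if shape == normalize(to_pixels(rows)):
            return name
    return None

def strip_frame(pixels, height, width):
    """Drop every 4-connected ink component that touches the image edge."""
    remaining = set(pixels)
    kept = set()
    while remaining:
        start = remaining.pop()
        component, queue = {start}, deque([start])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        touches_edge = any(r in (0, height - 1) or c in (0, width - 1)
                           for r, c in component)
        if not touches_edge:
            kept |= component
    return kept

# A "T" drawn inside a square frame, as in a boxed table entry.
FRAMED_T = [
    "#########",
    "#.......#",
    "#.#####.#",
    "#...#...#",
    "#...#...#",
    "#...#...#",
    "#...#...#",
    "#.......#",
    "#########",
]

framed = to_pixels(FRAMED_T)
print(recognize(framed))                     # frame ink spoils the match: None
print(recognize(strip_frame(framed, 9, 9)))  # frame removed: T
```

With the frame present, the ink no longer matches any template; once the edge-touching component is removed, the remaining glyph matches again. Circles or frames that touch the glyph itself would need a more careful separation than this sketch attempts.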

 

Thanks


Hi Daphne,

 

Thanks for your reply.

 

The sample output I attached was generated from a standard PDF file (soft copy), not from a scanned copy. If I input a scanned PDF copy to the PDFElement 6 Pro trial version, even though I can perform the OCR, I am not able to save the file and convert it to an Excel file. Is there any way to achieve this using the trial version?

 

Is there a way to support custom data extraction, such as data written in a circle?

 

Also, does this software depend on a particular MS Excel version?
Is an API or SDK version available for developers to use (C++/C#)?
OS: Windows

 

Regards,

Krishna



Hi Krishna,

 

1) The file Table.pdf is an image-based file, so it also needs the OCR function to be converted. If you open a scanned PDF file in our program, you can also enable the OCR function for conversion. Click the Home>To Others>Convert to Excel button, then click the Settings button in the new dialog window; there you can set the OCR function to be used for all files or for scanned files only.

setting.png

 

If you still fail to convert your scanned PDF file, please take a screenshot showing the situation or the error you get, and attach the failed PDF file to an email to me so that I can test further.

 

2) Whether you use the data extraction function or the conversion function, OCR has to be used as long as your PDF file is a scanned file. So the result will be the same: the text in circles won't be recognized well.

 

3) No, our program does not depend on the MS Excel version. You can use our program to convert your PDF file to both .xls and .xlsx Excel files.

图像 5.png

 

4) I am sorry, but we do not provide an SDK or API version.

 

Thanks


Hi Daphne,

 

Thanks for your reply.

 

I have checked the option in the settings. "Only scanned PDF" is already selected.

I have attached the output file for the scanned PDF.

Please provide your feedback.

 

Regards,

Krishna

pdfelement6_output.xlsx


Hi Krishna,

 

If you have already selected the "only scanned PDF" option, then when you open a scanned PDF file in the program to convert it, the OCR function is used automatically during processing. I have checked the converted Excel file you sent, and it seems you did not select the correct OCR language.

 

After opening your PDF file in our program, please click the File>Preference button and, in the OCR tab, select the languages of your PDF content in the OCR language list; for this file you may need both Japanese and Traditional Chinese. Then close the window and click the Home>To Others>Convert to Excel button to convert your PDF file to an Excel file again. You should get a better result this time.
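
PDFelement exposes the OCR language choice only through the dialog described above. For anyone who wants to script a comparison run, the same idea of combining two OCR languages can be sketched with the open-source Tesseract command line, which is a different tool and not part of PDFelement. The helper name `build_tesseract_command` and the file names `page.png` and `page` are hypothetical; `jpn` and `chi_tra` are Tesseract's language codes for Japanese and Traditional Chinese, and the sketch assumes those language data files are installed. The code only builds the command and does not run it.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Build (but do not run) a Tesseract CLI call for one page image."""
    if not languages:
        raise ValueError("at least one OCR language is required")
    # Tesseract accepts several languages joined with '+', e.g. jpn+chi_tra.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]

# Hypothetical file names: a page image exported from the scanned PDF.
command = build_tesseract_command("page.png", "page", ["jpn", "chi_tra"])
print(" ".join(command))  # tesseract page.png page -l jpn+chi_tra
```

The resulting list could be passed to `subprocess.run` on a machine where Tesseract is installed. Recognition quality for text inside circles or squares is likely to be limited there for the same reason as above.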

 

Thanks


Hi Daphne,

 

Thanks for your reply.

 

As per your suggestion, I have enabled the Japanese and Traditional Chinese options. Please check the output file.

However, the accuracy of this output (data extracted from the scanned PDF) is lower than that of the data extracted from the normal PDF file.

 

Is it possible to improve the accuracy or quality of the PDF to Excel conversion?

Whenever I open the scanned PDF, I get a warning message asking me to perform OCR (as shown in the attached pdfelement6_cap.png).

Please provide your feedback on the same.

 

Regards,

Krishna

Input_Table.pdf

pdfelement6_output2.xlsx

pdfelement6_cap.png


Hi Krishna,

 

I have checked your Excel file and I am sorry about the result. Since the OCR function does not recognize the text in a circle or square, the conversion result is affected as well.

 

Yes, when you open a scanned or image-based PDF file in our program, you will get this notice to perform OCR. Without OCR, your scanned PDF file is not editable and the converted file won't be editable either, so we show this notice as a reminder to use OCR for scanned PDF files.

 

Thanks for your understanding.


Hi Daphne,

 

Thanks for your reply.

 

Is it possible to add a custom user pattern to the tool (such as text in a circle or square), or is any other customization possible to increase the accuracy and support our requirement?

 

Regards,

Krishna


Hi Krishna,

 

I am sorry, but our program does not currently support adding custom user patterns. However, we will keep improving the program.

 

Thanks 


Hi Daphne,

 

We have an urgent requirement. Is it possible to support customization to match our requirement? Or would it be possible for us to customize it ourselves if you provide an SDK?

 

Regards,

Krishna


Hi Krishna,

 

I am sorry, we do not provide an SDK. However, there seems to be another way to add text in a circle: after opening your PDF file in our program, please click the Edit>Add Text button to add the text first. Then click the Comment tab and choose the square or circle shape to add around the added text; you can also change the shape's properties on the right side. Would this function meet your needs?

图像 1.png

 

Thanks

 

