James McEvoy

OCR of previously part-OCRed document

8 posts in this topic

I have downloaded a very old book from Gooreader which had already been OCRed by Google. I inserted some pages into the document which were image only. The print is very faint, so I was curious as to what would happen if the whole document was OCRed through PDFElement. The process took far longer than for a similar-length book consisting only of image pages.

 

The result came out in a darker print, which was an improvement. Google's previous OCR was ignored and the process was repeated. The added images were OCRed, and so were all the other pages, producing a new version of the text. Understandably it was not as good as Google's version, but that was to be expected. I am guessing there is no standard format for holding OCRed text, so PDFElement was not able to access Google's version of the text.

 

I wondered what the program was doing which took so long, since it seemed to treat the document as images only.

 

Is there an alternate procedure which would have improved on the outcome I obtained?

Hi James,

 

To help us figure out the problem, could you send us your original PDF file and the OCRed file so that we can look into it further?

 

Looking forward to your reply.

Heidi

Hi Heidi

 

Thanks for the response. The original file is 36MB and the OCRed version is 94MB, which are too large to email. I could load them on Google Drive and provide a link, I suppose. It's a while since I have done that and I will have to brush up on the process. Alternatively, I can send portions of the files which total less than 10MB.

 

What do you think?

 

James

Hi James,

 

That sounds good to me. Could you send the portions of the files to pdfelement@wondershare.com? I will then forward the case to our development team for a further look.

 

Thanks for your cooperation in advance.

Heidi

Hi Heidi

 

Attached find extracts of the same 55 pages (of 720) from the searchable original and from the version OCRed by PDFElement. Four pages were missing from the original download, near the end of the extract, and these have been added. The original is searchable, but the added pages were image only.

 

I have done this through the forum rather than by email because of my email server's file size restrictions.

 

James

 

Spencer & Gillen OCR pp 29-83.pdf

Spencer & Gillen orig pp 29-83.pdf

Hi James,

 

Thanks for uploading the two files for our reference. I have forwarded them to our development team as source files for improving the OCR quality.

 

For your information, our development team is working with ABBYY on improving the OCR technology. The new OCR engine is expected to arrive in the next updated version, around April.

 

Please look forward to that!

 

Best Regards,

Heidi

Hi Heidi

It is now over one month since you passed the files on to the development team. Was there any feedback?

Regards

James

Hi James, 

 

I am afraid that the result you got is the best that our program can do so far. We also ran OCR on your searchable PDF with another PDF program, and the result was not much better than ours.

 

The OCR quality is actually related to the DPI of the original file, so we appreciate your understanding.

 

It is estimated that our program will integrate a new OCR engine in May this year.

 

Regards,

Heidi
