Jonathan Simmons

Extract from scanned PDF

4 posts in this topic

I am trying to extract data from a large number of scanned PDFs. Each is a "spreadsheet", but no two are identical in layout. Admittedly these are not very good scans, as they are downloaded "as is" from United Nations archive files, but they are the only versions I have access to for these periods of time.

 

I have attempted to use the Data Extraction tool, selecting individual rows, groups of rows, and even an entire portion of the page, in an effort to eliminate unneeded data. None of my attempts has been successful. The .csv file presents the data out of order, does not preserve the column spacing (which would let me use "text to columns" in Excel), and in some cases the grid from the spreadsheet itself is recognized as text and inserted into the data. I have tried running OCR before Data Extraction and the results are even worse. I've attached an example of one file. The first page has a red box that shows the data I'm trying to extract.

 

I have also attempted to use the Export to Excel feature but this too is not working well - I'm still having to spend as much time moving data to the correct column/row as it would take to retype the entire spreadsheet.  

 

Because I have hundreds of pages to go through I was hoping to find a faster solution.  Any assistance would be appreciated.

feb-1993.pdf
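
For readers facing the same symptoms (grid lines recognized as text, and column spacing lost), one possible post-processing step is sketched below in Python. It is only a rough sketch under stated assumptions, not a feature of the Data Extraction tool: it assumes the extracted text keeps at least two spaces between columns, and that grid lines show up as characters such as "|", "_", "=", "~" or "-". The names GRID_RUN, clean_line, clean_text and to_csv are made up for this illustration.

```python
import csv
import io
import re

# Assumption: ruled table lines come through OCR as runs of these characters.
# Inspect the real output and adjust the set before relying on it.
GRID_RUN = re.compile(r"[|_=~\-]{2,}")


def clean_line(line):
    """Split one extracted text line into cells and drop grid artifacts."""
    # Treat vertical rules as column gaps.
    line = line.replace("|", "  ")
    # Columns are assumed to be separated by two or more whitespace characters.
    cells = re.split(r"\s{2,}", line.strip())
    # Drop empty cells and cells that are nothing but a run of rule characters.
    # A single "-" is kept, since tables often use it to mean "nil".
    return [c for c in cells if c and not GRID_RUN.fullmatch(c)]


def clean_text(text):
    """Clean every line of a page and discard lines left with no cells."""
    rows = [clean_line(line) for line in text.splitlines()]
    return [row for row in rows if row]


def to_csv(rows):
    """Serialize rows as CSV text; cells containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample text, not taken from the attached file.
    sample = "Country   | Jan  | Feb\n______|_____|____\nAlpha     | 1,234 |  -"
    print(to_csv(clean_text(sample)))
```

This does nothing about data that comes out in the wrong order, and a value that contains a double space would be split into two cells, so the result still needs a manual check against the scan.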


Hi Jonathan,

 

We downloaded your attached PDF and tested extracting data from the marked area of the scanned file, and the exported .csv file was not good either. This has to do with the quality of the original scan and the limits of our program's capability. I have forwarded this case to our development team, and they will try to improve this feature.

 

On our side, we tried performing OCR before converting the PDF file to Excel, and the result was much better. We have attached the converted Excel file for your reference. We hope this can serve as a workaround. feb-1993 data extraction_OCR.xlsx

 

Thanks for your feedback.

Heidi


Thanks, this is what I've been doing as well.  I've also cropped each page in an effort to reduce what I'd call "noise" in the final result.  The end result, either way, is still not the best solution with so many pages to work through.  I can actually type the data in faster manually - which is disappointing.  I would be interested to know what your development team comes up with.  Thanks for your efforts otherwise - Jonathan 


Hi Jonathan,

 

I understand. I am afraid this is the best we can do at present.

 

For your information, our development team is working with ABBYY to improve the OCR technology. The new OCR engine is expected to be included in the upcoming update, around May.

 

Please stay tuned for that!

Heidi

 

