Jonathan Simmons

Member
  • Content count

    2

Community Reputation

0 Poker-Face

About Jonathan Simmons

  • Rank
    Lurker
  1. Extract from scanned PDF

    Thanks, this is what I've been doing as well. I've also cropped each page to reduce what I'd call "noise" in the final result. Either way, the end result is still not a workable solution with so many pages to get through. I can actually type the data in manually faster than I can clean up the extracted output, which is disappointing. I would be interested to know what your development team comes up with. Thanks for your efforts otherwise. Jonathan
  2. I am trying to extract data from a large number of scanned PDFs. Each is a "spreadsheet", but no two are identical in layout. Arguably these are not very good scans, as they are downloaded "as is" from United Nations archive files, but these are the only versions I have access to for these periods of time. I have attempted to use the Data Extraction tool, selecting individual rows, groups of rows, and even an entire portion of the page, in an effort to eliminate unneeded data. None of my attempts has been successful. The .csv file presents the data out of order, does not preserve the spacing that would let me use "text to columns" in Excel, and in some cases the grid from the spreadsheet itself is recognized as text and inserted into the data. I have tried running OCR before Data Extraction and the results are even worse. I've attached an example of one file. The first page has a red box that shows the data I'm trying to extract. I have also attempted to use the Export to Excel feature, but this is not working well either: I still have to spend as much time moving data to the correct column/row as it would take to retype the entire spreadsheet. Because I have hundreds of pages to go through, I was hoping to find a faster solution. Any assistance would be appreciated. feb-1993.pdf