Zanran Blog

 

Cloud-based PDF to Excel - review

Posted by Zanran News on 16-Mar-2016 14:03:00


If you’ve ever tried to copy a table from a PDF document, you’ll know it’s a pain. You can only do so one cell at a time – which is very laborious. However, there are a lot of companies on the web that claim to be able to do ‘PDF to Excel’ processing – either cheaply or for free.

We’ve reviewed these services to establish the quality of their technologies.

 

As you have most likely discovered by now (and it may indeed be your reason for being on this website), it is notoriously difficult to process and extract information from PDF files. Their fixed-format nature ensures consistent and reliable printing across devices, but it is a tremendous obstacle to anyone looking to extract content from them. Despite this, there are now many free or cheap cloud-based table-extraction and table-conversion services on the internet.

They fall into two groups:

Clean extractors: these separate the tables from the surrounding text and graphics.

Messy extractors: these convert the whole PDF page in question into an Excel worksheet.

You can see a visual comparison of them below:

ZanranExtractorJPG.jpg

As you can see, clean table extraction gives much less ‘noise’. Only the table, with no surrounding text, is extracted from the document and transferred to Excel. A messy extraction, on the other hand, gives the table plus everything else on the page.

So for anyone looking for information extraction at a large scale without manual intervention, clean extractors are the way to go. If you’re looking at the occasional table conversion for personal use, you can be less fussy.

To better understand the competition within the cloud-based PDF processing space, and to determine which extractors are actually useful, we conducted a series of tests to establish the limits of their capabilities.

Disclosure: bear in mind that PDF processing is our area of expertise.

We conducted the tests using real PDF documents to ensure a fair evaluation of each product: an economic statistics report taken from the South African Statistics website, which you can find here, and an engineering document from a supplier, which you can find here.

The ratings we used to assess each cloud-based PDF extractor were as follows:

  • Excellent - the output tables had the right boundaries and all columns and rows were correct, without additional columns or rows. Minor errors were ignored – e.g. font size or background colour. False tables were also ignored – e.g. where a block of text was categorised as a table.
  • Good - similar to above, but where the errors would have needed some form of human intervention – e.g. merged cells or incorrect column headers.
  • Usable - it would be necessary for someone to manually extract the table from the surrounding content – boundaries were not clearly defined.
  • Poor - similar to ‘Usable’, but where additional human editing was required.
  • Useless - too much human effort was required to extract the table, therefore rendering the program useless.
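
If you want to apply the same scale to your own tests, the ratings above can be read as a small decision rule. The Python sketch below is only an illustration of that reading: the names `Observation`, `boundaries_correct`, `cells_need_fixing` and `effort_prohibitive` are assumptions made up for this example, and the code does not reproduce how the ratings in this review were actually assigned.

```python
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    EXCELLENT = "Excellent"
    GOOD = "Good"
    USABLE = "Usable"
    POOR = "Poor"
    USELESS = "Useless"


@dataclass
class Observation:
    """What a reviewer noted about one extracted table (hypothetical fields)."""
    boundaries_correct: bool          # table cleanly separated from surrounding content
    cells_need_fixing: bool           # e.g. merged cells or incorrect column headers
    effort_prohibitive: bool = False  # cleanup judged to take too much human effort


def rate(obs: Observation) -> Rating:
    """Map a reviewer's observation onto the five-level scale."""
    if obs.effort_prohibitive:
        return Rating.USELESS
    if obs.boundaries_correct:
        # Right boundaries: the only question is whether cells need a human fix.
        return Rating.GOOD if obs.cells_need_fixing else Rating.EXCELLENT
    # Wrong boundaries: someone has to cut the table out of the page by hand,
    # and may have to edit it further afterwards.
    return Rating.POOR if obs.cells_need_fixing else Rating.USABLE


if __name__ == "__main__":
    print(rate(Observation(boundaries_correct=True, cells_need_fixing=False)).value)
```

Minor issues such as font size or background colour are deliberately left out of the sketch, since the scale ignores them.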

Zanran_Review_Image.png


As you can see, table separation seems to be a major limitation for most cloud-based PDF extractors: six of the nine failed at it, and two separated tables in a ‘questionable’ fashion.

We’ve also compiled a brief list of the comments made while testing:

  • Adobe – we were impressed that it combined a continuous multi-page PDF into a single table.
  • CometDocs – their results didn’t have background fill, which made some tables more difficult to follow. Their software is also used in ‘able2extract’, ‘pdfconverter.com’ and ‘pdftoexcel.org’.
  • FoxyUtils – all their output files came up with error messages; they didn’t seem to be proper .xls files.
  • Nitro – when the table boundaries were clear (no nearby graphics or other tables), the results were good.
  • PDFconvert – their system kept returning 'Conversion Failed', so we collected no reliable data.
  • smallPDF – creates many false tables. A human can skim through them quickly, but it would not be suitable for automated processing.
  • Zamzar – their other format conversion algorithms are very good.

These services cover an enormous range of capabilities, so the right choice depends on the quality of extraction you’re looking for. Our suggestion: when you come to look for a table extractor, have a look at smallPDF and Zanran first. If you’re not happy with those, try Adobe or Nitro.

If you’re not convinced, click here to try our demo and see our PDF Table Extraction software in action.
