Automatically extracting tables from PDFs – whether native or scanned – is hard. Small-scale extraction of one or two tables is fine, either manually or using one of the free services on the web. When you want to extract dozens – or thousands – of pages per day then the problem is different. That is what we are addressing here.
This testing focused on financial statements - annual reports - (as PDFs), to determine which software was most reliable for the automatic extraction of tables.
The testing was to establish the percentage of extracted tables that were 100% correct. The row and columns had to be correct, their headers have to be correct, and all the data had to be correct.
We processed 4 pages with tables at random from each of 25 published annual reports: 10 from FTSE100 companies, 10 from S&P500, and 5 from DAX-listed firms. The companies were all chosen at random. The randomness meant we tested a wide range of tables, both in terms of content and layout. There were a total of 173 tables in these 100 pages.
Zanran was substantially the most accurate, with almost 80% of extracted tables fully correct.
The Scaffolder produced a separate worksheet for each table, and named the worksheet according to the page number, which made it easy to follow manually.
The Excel also included any notes at the bottom of the tables – as blocks of text in the Excel.
Zanran’s Scaffolder outputs tables in a range of formats, XML, JSON, & HTML, as well as Excel. If you select XML or JSON, you can choose to get the text as well. The text is in blocks – titles, paragraphs, etc. The XML or JSON outputs are particularly suitable for NLP (natural language processing) or intelligent search. All the content of the page is available in a machine-readable form.
Zanran's software was able to process 'locked' PDFs, and could also handle European number formats reliably.
Unlike the other two products tested, FineReader puts all the tables from a page into a single worksheet:
A person would be able to judge where the tables should be separated. But in most cases it would be hard for a computer programme to make that assessment. That in turn makes it hard to write a programme to extract the column headers automatically.
Otherwise, its biggest errors were from incorrect column headers - either partly or fully missing.
The FineReader has one feature that can be both a blessing and a curse. By default it runs OCR even when the characters exist as readable code. The outcome is that it sometimes gets characters (letters, numbers or symbols) right and sometimes wrong. The technical explanation is given below in the Details section.
Finally, the FineReader was not able to process three of the German reports where 'Content Copying' of the PDFs was not allowed. The Zanran software did not have this limitation.
PowerPDF did produce an Excel worksheet for each table – but the worksheets do not indicate which page the table came from. It was the weakest in judging table boundaries and in merging the words in column headers. Here are examples:
PowerPDF was the worst of the three products in terms of actually missing tables altogether. And, like Abbyy, PowerPDF could not process the three 'locked' German reports.
However, the PowerPDF was better than FineReader in handling unusually-coded characters. It only used its OCR capability when the Unicode characters were out of normal range.
Testing was carried out in May 2020.
All the files were ‘native’ PDFs (not scanned) – to remove the effects of OCR.
And the analysis of the number of tables affected by each error type is as follows:
|Table boundary errors|
|Tables needed to be split/separated||70||6||9|
|Tables needed to be combined||0||6||1|
|Tables missing part or all of column heading||44||23||4|
|Row headers missing||3||5||0|
|Internal table structure errors|
|Column headers wrongly merged||4||24||1|
|Column header split across rows||4||5||10|
|Column headers nested wrongly||8||9||5|
|Row header split across 2 or more rows||1||0||5|
|Row headers combined incorrectly||3||3||1|
|Extra column added for $ or % character||22||12||0|
|Data cells merged together||1||5||1|
|Super-script merged with data||1||2||4|
|Data error (OCR problem or data combined)||11||0||1|
|Number format error||4||13||0|
Tables could have two or more errors.
Technical explanation (from FineReader, above) Native PDFs - not scanned ones - have a shape for each character. That is what is printed on the page. Each character also has a Unicode description – for example, the character is described as the letter ‘A’. If the writer of the document doesn’t use standard Unicode descriptions then the PDF code won’t be read correctly by a computer. For example, if the shape of the letter is a delta Δ, and the Unicode description is ‘D’ then then person reading the page would see the Δ, and a computer would see the D. Abbyy takes the OCR character as being correct – so in the above example would correctly extract a delta. However there are other instances where it gets it wrong. For example, in one document, the author had used a larger-than-normal dot as a decimal point. The FineReader decided they were asterisks.