Table extraction review

Abbyy FineReader 15   vs   Kofax PowerPDF   vs.   Zanran Scaffolder

Introduction

Automatically extracting tables from PDFs – whether native or scanned – is hard.  Small-scale extraction of one or two tables is fine, either manually or using one of the free services on the web.  When you want to extract dozens – or thousands – of pages per day then the problem is different.  That is what we are addressing here.

This testing focused on financial statements - annual reports - (as PDFs), to determine which software was most reliable for the automatic extraction of tables. 


The testing was to establish the percentage of extracted tables that were 100% correct.  The row and columns had to be correct, their headers have to be correct, and all the data had to be correct. 

We processed 4 pages with tables at random from each of 25 published annual reports:  10 from FTSE100 companies, 10 from S&P500, and 5 from DAX-listed firms.  The companies were all chosen at random.  The randomness meant we tested a wide range of tables, both in terms of content and layout.    There were a total of 173 tables in these 100 pages.

Conclusions

Zanran was substantially the most accurate, with almost 80% of extracted tables fully correct.  

Results - table showing percentage of totally correct extraction


Observations

Zanran Scaffolder

The Scaffolder produced a separate worksheet for each table, and named the worksheet according to the page number, which made it easy to follow manually.

The Excel also included any notes at the bottom of the tables – as blocks of text in the Excel.

Zanran’s Scaffolder outputs tables in a range of formats, XML, JSON, & HTML, as well as Excel.  If you select XML or JSON, you can choose to get the text as well.  The text is in blocks – titles, paragraphs, etc.  The XML or JSON outputs are particularly suitable for NLP (natural language processing) or intelligent search.  All the content of the page is available in a machine-readable form.

Zanran's software was able to process 'locked' PDFs, and could also handle European number formats reliably.

 

Abbyy FineReader 15

Unlike the other two products tested, FineReader puts all the tables from a page into a single worksheet:

AbbyyCombiningTables


A person would be able to judge where the tables should be separated.  But in most cases it would be hard for a computer programme to make that assessment.  That in turn makes it hard to write a programme to extract the column headers automatically. 

Otherwise, its biggest errors were from incorrect column headers - either partly or fully missing. 

The FineReader has one feature that can be both a blessing and a curse.  By default it runs OCR even when the characters exist as readable code.  The outcome is that it sometimes gets characters (letters, numbers or symbols) right and sometimes wrong.  The technical explanation is given below in the Details section.

Finally, the FineReader was not able to process three of the German reports where 'Content Copying' of the PDFs was not allowed.  The Zanran software did not have this limitation.

Kofax PowerPDF

PowerPDF did produce an Excel worksheet for each table – but the worksheets do not indicate which page the table came from.  It was the weakest in judging table boundaries and in merging the words in column headers.  Here are examples:

KofaxColumnHeaders

 

PowerPDF was the worst of the three products in terms of actually missing tables altogether.  And, like Abbyy, PowerPDF could not process the three 'locked' German reports.

However, the PowerPDF was better than FineReader in handling unusually-coded characters.  It only used its OCR capability when the Unicode characters were out of normal range.

 

Test Details

Testing was carried out in May 2020.

All the files were ‘native’ PDFs (not scanned) – to remove the effects of OCR.

The test pages themselves are available for download, as zip files, from:
FTSE_examples (39MB)
S&P examples (4MB)
DAX examples (53MB)

 

And the analysis of the number of tables affected by each error type is as follows:

  Abbyy Kofax Zanran
Table boundary errors      
Tables missing 7 28 2
Tables needed to be split/separated 70 6 9
Tables needed to be combined 0 6 1
Tables missing part or all of column heading 44 23 4
Row headers missing 3 5 0
Columns missing 3 7 2
Rows missing 2 0 1
       
Internal table structure errors      
Column headers wrongly merged 4 24 1
Column header split across rows 4 5 10
Column headers nested wrongly 8 9 5
Row header split across 2 or more rows 1 0 5
Row headers combined incorrectly 3 3 1
Extra column added for $ or % character 22 12 0
       
Data-point errors      
Data cells merged together 1 5 1
Super-script merged with data 1 2 4
Data error (OCR problem or data combined) 11 0 1
Number format error 4 13 0

 

Tables could have two or more errors.

 

Technical explanation (from FineReader, above) Native PDFs - not scanned ones - have a shape for each character.  That is what is printed on the page.  Each character also has a Unicode description – for example, the character is described as the letter ‘A’.  If the writer of the document doesn’t use standard Unicode descriptions then the PDF code won’t be read correctly by a computer.  For example, if the shape of the letter is a delta Δ, and the Unicode description is ‘D’ then then person reading the page would see the Δ, and a computer would see the D.  Abbyy takes the OCR character as being correct – so in the above example would correctly extract a delta.  However there are other instances where it gets it wrong.  For example, in one document, the author had used a larger-than-normal dot as a decimal point.  The FineReader decided they were asterisks.

 

If you would like to try table extraction to Excel using Zanran's Scaffolder, please try the demo

Convert your PDF: