PDFs are designed for reliable and consistent printing – not for the web.
The problems stem from the unstructured nature of PDFs: a page carries no logical structure that a computer can work with. For example:
Example: a typical vector graphic with isolated graphic elements (legend) and ‘unconnected’ text (title, axis labels, source notes)
Zanran’s built a core engine for ‘scaffolding’ PDFs – adding back the structure. It determines the different blocks on the page: text, graphics (diagrams, graphs, maps...), photographs, and tables. And it gives the layout of those blocks.
This process is also known as ‘decomposition’, ‘layout analysis’, or ‘page segmentation’.
The software is robust and scalable to many hundreds of servers.
Zanran has written its own complex algorithms for ‘segmenting’ the text – grouping letters into words and blocks (titles, paragraphs, etc.). These algorithms are based on visual parameters, not linguistic or semantic ones. As a result, the software works with any language that reads left to right and is built from letters rather than ideograms: Serbian and Icelandic are fine; Hebrew and Chinese are not.
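To make the idea of purely visual segmentation concrete, here is a minimal sketch (not Zanran's actual code – the function name, parameters, and gap rule are all illustrative). It groups character boxes on one text line into words using nothing but a horizontal-gap threshold, with no linguistic knowledge at all:

```python
# Illustrative sketch: grouping character boxes into words using only
# visual parameters -- here, a horizontal-gap threshold. The 0.4 factor
# is an assumption for the example, not a known production value.

def group_into_words(chars, gap_factor=0.4):
    """chars: list of (x0, x1, glyph) boxes on one text line, sorted by x0.
    A new word starts when the gap to the previous glyph exceeds
    gap_factor * average glyph width -- a purely visual rule."""
    if not chars:
        return []
    avg_width = sum(x1 - x0 for x0, x1, _ in chars) / len(chars)
    threshold = gap_factor * avg_width
    words, current = [], [chars[0]]
    for prev, cur in zip(chars, chars[1:]):
        gap = cur[0] - prev[1]  # whitespace between consecutive glyphs
        if gap > threshold:
            words.append("".join(g for _, _, g in current))
            current = [cur]
        else:
            current.append(cur)
    words.append("".join(g for _, _, g in current))
    return words
```

Because the rule looks only at geometry, it works identically for Serbian or Icelandic text – the glyphs themselves are never interpreted.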
For the graphics, we first assess the connected components and group those together. Then we have algorithms for deciding what other text and graphics are associated with that ‘core’.
Example: connected components
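The connected-components step can be sketched as follows. This is a hedged illustration, not Zanran's implementation: elements whose bounding boxes touch (within a tolerance) are merged into one component via union-find, producing the ‘core’ groups to which nearby text is later attached. The tolerance value is invented for the example:

```python
# Sketch of grouping graphic elements into connected components.
# Boxes are (x0, y0, x1, y1); tol is an assumed proximity tolerance.

def boxes_touch(a, b, tol=2.0):
    """True if two bounding boxes overlap or sit within tol of each other."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 + tol < bx0 or bx1 + tol < ax0 or
                ay1 + tol < by0 or by1 + tol < ay0)

def connected_components(boxes, tol=2.0):
    """Union-find over pairwise-touching boxes; returns lists of indices."""
    parent = list(range(len(boxes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j], tol):
                union(i, j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

In practice the pairwise check would be accelerated with a spatial index, but the grouping logic is the same.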
Most PDF-to-Excel software struggles with tables because of the huge range of table styles that exist. The idealised table – several rows and columns of numbers separated by neat lines – does exist, but so do tables that are largely text, tables broken by sub-headers, tables only two rows deep, tables with no row headers on the left, and tables that change the number of columns partway down the page. Almost every rule you could use to define a table is broken by one style or another.
Zanran's algorithms first establish the probability of digits (or a cluster of words) being part of a table. They then work outwards to determine the outer limits of the table – bearing in mind there can be several tables on a page.
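The two-stage idea – score first, then grow outwards – can be sketched like this. Everything here is illustrative: the digit-fraction score is a deliberately crude stand-in for the real probability model, and both thresholds are invented:

```python
# Sketch of seed-and-grow table detection. Thresholds are assumptions
# for the example, not known production values.

def digit_score(text):
    """Fraction of characters that are digits -- a crude 'table-ness' signal."""
    if not text:
        return 0.0
    return sum(c.isdigit() for c in text) / len(text)

def find_table_rows(rows, seed_threshold=0.5, grow_threshold=0.2):
    """rows: list of row strings from one page. Seed on the most
    digit-heavy row, then expand up and down while neighbouring rows
    still look plausibly tabular. Returns (top, bottom) indices or None."""
    scores = [digit_score(r) for r in rows]
    seed = max(range(len(rows)), key=lambda i: scores[i])
    if scores[seed] < seed_threshold:
        return None  # nothing table-like on this page
    top = bottom = seed
    while top > 0 and scores[top - 1] >= grow_threshold:
        top -= 1
    while bottom < len(rows) - 1 and scores[bottom + 1] >= grow_threshold:
        bottom += 1
    return (top, bottom)
```

Running the same seed-and-grow pass on the rows outside a found region is one way to handle several tables on a page.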
Having determined the boundaries of the table, we then employ rules for extracting the content into best-fit rows and columns. The most difficult part is making judgements about the headers, which may be across multiple lines, and may be hierarchical with headers over headers (e.g. ‘Revenue’ above ‘Q1’ and ‘Q2’).
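The hierarchical-header case (‘Revenue’ above ‘Q1’ and ‘Q2’) amounts to flattening a stack of header rows into one composite label per column. A minimal sketch, with an input format invented purely for illustration:

```python
# Sketch of flattening hierarchical headers: an upper-level header that
# spans several columns is combined with each lower-level header beneath it.

def flatten_headers(levels):
    """levels: list of header rows, top first; each row is a list of
    (label, span) pairs, where span counts bottom-level columns covered.
    Returns one composite label per column, e.g. 'Revenue / Q1'."""
    n_cols = sum(span for _, span in levels[-1])
    columns = [[] for _ in range(n_cols)]
    for row in levels:
        col = 0
        for label, span in row:
            for c in range(col, col + span):
                if label:
                    columns[c].append(label)
            col += span
    return [" / ".join(parts) for parts in columns]
```

The hard judgement in practice is deciding which lines are headers at all and what each one spans; once that is settled, the flattening itself is mechanical.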
We use image-classification techniques to remove bar charts, which – to a computer – tend to look similar to tables, since they often contain repeated patterns of digits and words.
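As a very simplified stand-in for that classification step, one geometric cue a classifier can exploit is that bar charts contain several tall, thin filled rectangles of similar width, while tables do not. Real systems use trained classifiers; this hand-written rule and its thresholds are purely illustrative:

```python
# Toy chart-vs-table cue: count tall, thin filled rectangles of similar
# width. All thresholds are assumptions for the example.

def looks_like_bar_chart(rects, min_bars=3, aspect=2.0):
    """rects: (x0, y0, x1, y1) filled rectangles found in a page region.
    Several tall rectangles of near-equal width suggest bars, not a table."""
    bars = [r for r in rects
            if (r[3] - r[1]) > aspect * (r[2] - r[0])]  # height >> width
    if len(bars) < min_bars:
        return False
    widths = [r[2] - r[0] for r in bars]
    return max(widths) - min(widths) <= 0.25 * max(widths)
```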
Finally, after we’ve written the content into Excel, we add back the lines and background colours that were extracted separately from the PDF.
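That final styling pass can be pictured as merging three separately extracted layers – values, rules (cell borders), and fills – back onto one cell grid before the spreadsheet is written. The data structures below are invented for illustration:

```python
# Sketch of re-attaching extracted lines and background colours to cells.
# All structures here are illustrative, not a real spreadsheet API.

def apply_styles(values, rules, fills):
    """values: dict {(row, col): value}.
    rules:  set of (row, col, edge) with edge in {'top','bottom','left','right'}.
    fills:  dict {(row, col): colour}.
    Returns a styled-cell dict ready to be written to a spreadsheet."""
    styled = {}
    for pos, value in values.items():
        styled[pos] = {
            "value": value,
            "borders": sorted(e for r, c, e in rules if (r, c) == pos),
            "fill": fills.get(pos),  # None where the PDF had no background
        }
    return styled
```

In a real pipeline the styled dict would then be written out with a spreadsheet library such as openpyxl, which supports per-cell borders and fills.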
All these approaches are essentially statistical, and we achieve accuracies of around 95%. We can improve the accuracy by focusing on one style of document and optimising the PDF-to-Excel software for it – a credit card statement, for example, has a very different layout and appearance from a broker's report.
For more details, please do not hesitate to contact us.