Zanran has needed to put a huge amount of effort into its PDF-to-Excel software. What seems intuitive to a human – it “looks like a table” – is full of issues, exceptions and special cases for computer software.
In this, the first of a number of articles about content extraction from PDFs, I want to look at some of the fundamental problems.
1. PDFs’ lack of structure
If you’ve read any of our technical pages, you’ll have picked up that a PDF typically carries no structure or tagging of its content: it records what to draw and where, not what the content means. Imagine this (below) is part of a PDF page, with a paragraph on the left and a table on the right:
Then the only difference between the ‘19’ on the left, and the ‘87’ on the right would be the position of the numbers on the page. You’d have no idea that the ‘87’ was part of a table layout.
If you’re familiar with HTML code, you’ll appreciate the difference. With HTML you get tags indicating it’s a table, and tags showing the contents of each cell – quite unambiguously.
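To make the contrast concrete, here is a minimal Python sketch. The fragment list and its coordinates are invented for illustration (they are not the output of any particular PDF library), and the HTML snippet is made up too; only the standard-library `html.parser` module is real.

```python
from html.parser import HTMLParser

# Hypothetical text fragments as a PDF parser might report them: (text, x, y).
# Nothing here says which fragment belongs to a paragraph and which to a table.
pdf_fragments = [
    ("19", 72.0, 640.0),   # a number inside the paragraph on the left
    ("87", 410.0, 640.0),  # a number inside the table on the right
]


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def html_table_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.rows


html = (
    "<table><tr><th>Region</th><th>Sales</th></tr>"
    "<tr><td>North</td><td>87</td></tr></table>"
)
print(html_table_rows(html))
```

With HTML, the tags hand you the rows and cells directly. With the PDF-style fragments, all you have to go on is that one number sits further to the right than the other.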
2. Difficulty of spotting lines
A lot of tables use straight lines to give an idea of layout. So it would be helpful if you knew where the lines were – like the white lines in the example above. This is sadly not straightforward.
In a PDF, lines are normally ‘vectors’ – a graphic going from A to B (i.e. not a series of dots). You might think therefore that it would be easy to judge where the lines are. You might reasonably expect the code in the PDF to state, for example, ‘purple line, 2 pixels wide, going from A to B’.
Not so. In practice the lines are made up of many small segments (each with its own A and B). These segments can overlap. And some can be subsequently overlaid by other graphics or text. Unbelievably, it seems almost impossible to determine reliably where the lines are just by reading a PDF’s code.
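As a rough illustration of the segment problem, the sketch below stitches short horizontal segments back into longer lines. It is a simplified example under stated assumptions, not a description of how our software works: the `(x0, x1, y)` segment format, the function name and both tolerances are invented, and it ignores the harder case of segments hidden under other graphics or text.

```python
def merge_horizontal_segments(segments, y_tol=1.0, gap_tol=1.0):
    """Merge short horizontal segments (x0, x1, y) into longer lines.

    Segments whose heights differ by at most `y_tol` are treated as lying on
    the same line; spans that overlap, or are separated by at most `gap_tol`,
    are joined. Both tolerances are illustrative guesses.
    """
    # Step 1: group segments into bands of (nearly) equal height.
    bands = []  # each entry: [reference_y, [(left, right), ...]]
    for x0, x1, y in sorted(segments, key=lambda s: s[2]):
        left, right = min(x0, x1), max(x0, x1)
        if bands and abs(y - bands[-1][0]) <= y_tol:
            bands[-1][1].append((left, right))
        else:
            bands.append([y, [(left, right)]])

    # Step 2: within each band, sweep left to right and join nearby spans.
    lines = []
    for y, spans in bands:
        spans.sort()
        cur_left, cur_right = spans[0]
        for left, right in spans[1:]:
            if left <= cur_right + gap_tol:
                cur_right = max(cur_right, right)   # overlapping or touching
            else:
                lines.append((cur_left, cur_right, y))
                cur_left, cur_right = left, right
        lines.append((cur_left, cur_right, y))
    return lines


segments = [
    (0, 10, 100.0), (9, 25, 100.3), (25.5, 40, 100.0),  # one rule, in pieces
    (60, 80, 100.0),                                    # a separate rule
    (0, 40, 200.0),
]
print(merge_horizontal_segments(segments))
```

Even this toy version needs two tolerances that have to be guessed, which hints at why reading the lines off reliably is so awkward in practice.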
3. A problem with human beings
Humans are wonderfully creative – and that means that tables are written in any number of sloppy ways. Here are just a few examples where it’s very hard for a computer to figure out what’s going on:
- the numbers in the rows don’t line up horizontally:
- there are two parameters in each cell of the table (biologists & medics are especially guilty):
- two vaguely-related tables are joined together:
- the number of columns in the table changes as you go down:
As an example: see if you can work out what this column of $-signs is all about:
[Answer: the $ signs should be immediately in front of the numbers on the right e.g. $273]
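The repair a human does in their head here can be sketched in a few lines of Python. This is only an illustration under assumptions: the fragment formats, the function name and the `y_tol` tolerance are all hypothetical, and the coordinates are invented.

```python
def attach_currency_signs(signs, numbers, y_tol=3.0):
    """Pair each stray '$' with a number to its right on roughly the same row.

    signs:   list of (x, y) positions of '$' fragments
    numbers: list of (text, x, y) number fragments
    Returns the number texts in their original order, prefixed with '$'
    where a sign was matched and unchanged otherwise.
    """
    unused = list(signs)
    result = []
    for text, nx, ny in numbers:
        # Candidate signs: left of the number and within y_tol vertically.
        candidates = [
            (abs(sy - ny), nx - sx, (sx, sy))
            for sx, sy in unused
            if abs(sy - ny) <= y_tol and sx < nx
        ]
        if candidates:
            # Prefer the closest sign vertically, then horizontally.
            _, _, best = min(candidates)
            unused.remove(best)           # each sign is used at most once
            result.append("$" + text)
        else:
            result.append(text)
    return result


signs = [(300, 500.0), (300, 486.5), (300, 472.0)]
numbers = [("273", 420, 501.0), ("1,045", 410, 487.0), ("12%", 420, 458.0)]
print(attach_currency_signs(signs, numbers))
```

Note that the third sign is left unmatched because no number sits close enough to its row, so the code still has to decide what to do with leftovers, which is exactly the sort of special case that piles up.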
People are amazingly effective at seeing patterns – and have trouble understanding why computers can’t do this simple task.
For example, I bet you can roughly work out the layout of the table in this Japanese document:
But ask yourself: “what is the title of the table?” Is it one line above, or both lines above – or neither?
You can get a long way using visual clues – and that’s what our PDF conversion software does. Then for the next stage, you need to look at vocabulary and meaning – semantics. But that will have to be the subject of another blog post.