Zanran has had to put a huge amount of effort into its PDF-to-Excel software. What seems intuitive to a human – it “looks like a table” – is full of issues, exceptions and special cases for computer software.
In this, the first of a number of articles about content extraction from PDFs, I want to look at some of the fundamental problems.