Zanran Blog


Buried Data

Posted by Zanran News on 14-Oct-2016 17:33:00

This post is about how much data sits where you won’t normally find it: in graphs and charts.

One of the reasons we originally got into PDFs was that the quality of the content – especially numeric content - was so high.  We were finding far less junk than on HTML pages.  I appreciate that this is necessarily a generalisation - no insult is intended to the producers of top-quality content in HTML.

At one stage we decided to try to quantify what was going on, for our own satisfaction.  The numbers involved are very large:  billions of HTML pages and millions of PDF files.  So we had to look at a sampling scheme.

We took a random selection of 13,000 sites from the public internet.  Then, using Bing Image search, we took one image, randomly, per site (so that large sites weren’t over-represented).  We ended up with 5,469 images, of which 67 were graphs (1.2%).  At the time, there were around 26B images on the internet, according to Google.  Scaling the sample up, that suggested there would be about 300M graphs as images on HTML pages.
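The extrapolation above is simple proportion scaling. Here is a minimal Python sketch of that arithmetic, using only the figures quoted in this post; the function and variable names are illustrative and are not taken from any Zanran code.

```python
# Figures quoted in the post.
SAMPLED_IMAGES = 5_469        # images collected, one per site
GRAPHS_IN_SAMPLE = 67         # images in the sample that were graphs
TOTAL_IMAGES = 26e9           # images on the internet at the time, per Google


def extrapolate(hits, sample_size, population):
    """Scale a sample proportion up to the whole population."""
    share = hits / sample_size
    return share, share * population


graph_share, estimated_graphs = extrapolate(
    GRAPHS_IN_SAMPLE, SAMPLED_IMAGES, TOTAL_IMAGES
)

print(f"Share of sampled images that are graphs: {graph_share:.1%}")
print(f"Estimated graphs among HTML images: {estimated_graphs / 1e6:.0f}M")
```

Without rounding, the estimate lands at roughly 320M, which is consistent with the "about 300M" figure once the inputs are treated as approximate.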

Then we estimated that there were about 550M PDFs on the internet, containing about 150M graphs between them.  Overall, then, about 450M graphs, of which 150M are in PDFs.  And that doesn’t even include the graphs and charts in the many publications, like scientific journals, that sit behind paywalls.

In other words, a third of graphs on the internet are present in PDF documents.  These graphs can’t be found using a Google image search.  Google doesn’t/can’t extract graphs & diagrams from PDFs.
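The "a third" figure follows directly from the two estimates. As a quick check, here is a short Python sketch using the rounded numbers from this post; the names are illustrative only.

```python
# Rounded estimates quoted in the post.
GRAPHS_AS_HTML_IMAGES = 300e6   # graphs found as images on HTML pages
GRAPHS_IN_PDFS = 150e6          # graphs embedded in PDF documents


def pdf_share(html_graphs, pdf_graphs):
    """Fraction of all counted graphs that live inside PDFs."""
    return pdf_graphs / (html_graphs + pdf_graphs)


total_graphs = GRAPHS_AS_HTML_IMAGES + GRAPHS_IN_PDFS
share_in_pdfs = pdf_share(GRAPHS_AS_HTML_IMAGES, GRAPHS_IN_PDFS)

# prints "450M total, 33% in PDFs"
print(f"{total_graphs / 1e6:.0f}M total, {share_in_pdfs:.0%} in PDFs")
```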

150M graphs and charts - that’s a lot of buried data (‘buried treasure’, perhaps?). 

Topics: Data Extraction