Cognitive computing and PDFs

As we have mentioned elsewhere, PDFs have virtually no structural information - they have no tags or descriptors to delimit words, paragraphs, diagrams or tables.  The lack of structural information explains why it isn't possible to copy and paste a table into Excel, and why one cannot edit a PDF file directly.

As a result, there is a whole range of issues that affect cognitive computing - especially when the PDFs have tables or images within the text.

Issue 1 - text flow

Sometimes the flow of text on a page is important for the cognitive or NLP processing.  For example, to extract the context for any of the sentences in the example below, the system needs to understand how the text flows on the page. 

AnnualReport01.jpg
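
To make the text-flow problem concrete, here is a minimal sketch. It assumes the text blocks have already been found and carry page coordinates, and that the page has two columns split at the page midpoint. The `Block` class and the `reading_order` function are hypothetical names used only for illustration; real pages need a more careful column detector.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float   # left edge of the block, in points
    top: float  # distance from the top of the page, in points


def reading_order(blocks, page_width):
    """Order blocks column by column (left column first), top to bottom.

    Assumption: exactly two columns, split at the page midpoint.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.top)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.top)
    return left + right


blocks = [
    Block("Right column, first paragraph", 320, 100),
    Block("Left column, first paragraph", 50, 100),
    Block("Left column, second paragraph", 50, 300),
    Block("Right column, second paragraph", 320, 300),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

A naive sort by vertical position alone would interleave the two columns, so a sentence at the foot of the left column would lose its true continuation.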

Issue 2 - interference

If the system doesn’t recognise a table, like the one above, then the table content (headers, data, notes...) will be processed as ordinary text, which can lead to errors.

The same applies to a chart or diagram, where the title, axis labels, notes, etc. can interfere with text extraction and subsequent AI processing.  Look at the large number of words and digits in the charts below:

BrokersReport04.jpg

 

The main text needs to be separated from the text in the charts before cognitive or NLP analysis is applied.
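
As a rough illustration of that separation step, the sketch below assumes that word bounding boxes and chart regions are already known as `(x0, y0, x1, y1)` tuples. The function names `inside` and `split_main_text` are hypothetical, and the sample coordinates are invented for the example.

```python
def inside(word_box, region_box):
    """Return True if the centre of word_box lies within region_box.

    Both boxes are (x0, y0, x1, y1) tuples in page coordinates.
    """
    x0, y0, x1, y1 = word_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx0, ry0, rx1, ry1 = region_box
    return rx0 <= cx <= rx1 and ry0 <= cy <= ry1


def split_main_text(words, chart_regions):
    """Split (text, box) pairs into main-text words and chart words."""
    main, chart = [], []
    for text, box in words:
        if any(inside(box, region) for region in chart_regions):
            chart.append(text)
        else:
            main.append(text)
    return main, chart


words = [
    ("Revenue", (50, 100, 100, 112)),
    ("grew", (104, 100, 130, 112)),
    ("2015", (320, 405, 345, 415)),   # axis label inside the chart
    ("Sales", (310, 300, 340, 312)),  # chart title
]
chart_regions = [(300, 290, 560, 420)]
main_words, chart_words = split_main_text(words, chart_regions)
```

Only `main_words` would then be passed on to the NLP stage; the chart words can be kept separately as metadata for the chart.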

Issue 3 - table extraction

If some of the content of interest is in tabular form, then you need to be able to extract the internal structure of the table - the rows and columns.  Conventional AI or cognitive computing tools, which typically work on a linear stream of text, cannot do this on their own.
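
The idea of recovering rows and columns can be sketched from word coordinates alone. This is a simplified illustration, not a description of any particular product's algorithm: it assumes left-aligned cells, no merged cells, and words supplied as `(text, x0, top)` tuples. The helper names and tolerance values are assumptions.

```python
def cluster(values, tolerance):
    """Return one representative per group of nearby values.

    A value starts a new group when it is more than `tolerance`
    away from the current group's first value.
    """
    centres = []
    for v in sorted(values):
        if not centres or v - centres[-1] > tolerance:
            centres.append(v)
    return centres


def nearest_index(value, centres):
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def words_to_grid(words, row_tol=3.0, col_tol=10.0):
    """Place (text, x0, top) words into a rows-by-columns grid of strings."""
    rows = cluster([top for _, _, top in words], row_tol)
    cols = cluster([x0 for _, x0, _ in words], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    for text, x0, top in words:
        r = nearest_index(top, rows)
        c = nearest_index(x0, cols)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


words = [
    ("Year", 50, 100), ("Sales", 150, 100), ("Profit", 250, 100),
    ("2014", 50, 120), ("1,200", 152, 121), ("310", 251, 120),
    ("2015", 50, 140), ("1,450", 151, 139), ("402", 250, 140),
]
grid = words_to_grid(words)
```

Real tables are harder: numbers are often right-aligned, headers span several columns, and cells wrap over multiple lines, so a production system needs considerably more than this.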

 

Layout analysis

Many of the above problems can be solved, or partially solved, through ‘document layout analysis’.

Document layout analysis first divides up the document into regions or blocks. This ‘dividing up’ step is called ‘physical layout analysis’, and is used to identify the geometric page structure. It’s essentially a visual process – not reliant on the meaning of the content. It’s what Zanran’s ‘Scaffolder’ does (see below).
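
A very simple form of physical layout analysis can be sketched as grouping text lines into blocks wherever the vertical white space is large. This is only an illustration of the geometric idea, under the assumption that lines are given as `(top, bottom, text)` tuples; it is not a description of how Scaffolder works, and the `max_gap` threshold is an arbitrary choice.

```python
def group_lines_into_blocks(lines, max_gap=6.0):
    """Group (top, bottom, text) lines into blocks.

    A new block starts whenever the white space between one line's
    bottom and the next line's top exceeds max_gap. Only geometry is
    used; the meaning of the text plays no part.
    """
    blocks = []
    prev_bottom = None
    for top, bottom, text in sorted(lines):
        if prev_bottom is None or top - prev_bottom > max_gap:
            blocks.append([])
        blocks[-1].append(text)
        prev_bottom = bottom
    return blocks


lines = [
    (100, 110, "Annual results"),
    (130, 140, "Revenue rose in the year,"),
    (143, 153, "driven by strong demand."),
    (180, 190, "Outlook remains positive."),
]
blocks = group_lines_into_blocks(lines)
```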

The next step is to take the regions and label them (as titles, captions, footnotes, sections etc.) - ‘logical layout analysis’.  It is essentially semantic – it relies on analysis of language. Document layout analysis is the combination of geometric and logical labelling.

On top of these two processes one can add a text-flow analysis layer.

All this information can easily be kept in a single XML file - a tree-like representation layer of the document content.
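
As a sketch of what such a tree might look like, the example below builds and reads back a small XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `page`, `block`, `table`, `row`, `cell`, `order`) are invented for illustration and are not Zanran's actual schema.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree: one page holding a paragraph block and a table.
# All tag and attribute names here are hypothetical.
doc = ET.Element("document")
page = ET.SubElement(doc, "page", number="1")

block = ET.SubElement(
    page, "block", type="paragraph",
    x0="50", y0="100", x1="290", y1="160", order="1",
)
block.text = "Revenue rose in the year."

table = ET.SubElement(
    page, "table", x0="50", y0="200", x1="290", y1="260", order="2",
)
row = ET.SubElement(table, "row")
for value in ("Year", "Sales"):
    ET.SubElement(row, "cell").text = value

xml_text = ET.tostring(doc, encoding="unicode")

# Read it back: geometry, labels and text flow all live in the same tree.
parsed = ET.fromstring(xml_text)
cells = [c.text for c in parsed.iter("cell")]
regions = [(el.tag, el.get("order")) for el in parsed.find("page")]
```

Here the coordinates carry the physical layout, the tag names and `type` attribute carry the logical labels, and the `order` attribute carries the text flow.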

stack_new4_transparent.png

Zanran’s ‘Scaffolder’

Zanran’s ‘Scaffolder’ software is good at the first stage of the document analysis – dividing up the page into blocks and separating tables and graphics.  It also gives the rows and columns of tables.  All this is output as XML.

The output XML is very flexible and can include:

  • coordinates of words, lines and blocks of text on the page
  • character formatting (bold, underline, italic)
  • coordinates of table boundaries on the page
  • internal structure of tables (HTML-like presentation)
  • coordinates of image boundaries – both vector graphics (graphs, diagrams, maps...) and images like photographs.

Zanran will also work with clients to develop software for any of the other layout analysis stages – each application will require slightly different algorithms.

Please contact us to discuss your application.

For more detail, please see our paper here.

Your testing

This demo software has been optimised for brokers' reports.  You can process your PDF and see the results as HTML, which makes them easier to visualise.

What is meant by ‘cognitive computing’?

It’s a broad term, and generally understood to be the part of artificial intelligence (AI) that deals with ‘understanding’ written or spoken language.  Some examples of cognitive computing might help:

  • sentiment analysis – e.g. trying to assess attitudes towards a consumer product from thousands of Tweets
  • article summarisation – creating a paragraph that gives an overview of a long document
  • drug discovery – ‘mining’ life-science reports for new connections
  • speech recognition – e.g. giving instructions via Apple’s Siri or Microsoft’s Cortana

In all these, a computer is doing a complex language task.