As we have mentioned elsewhere, PDFs have virtually no structural information: they have no tags or descriptors to delimit words, paragraphs, diagrams or tables. The lack of structural information explains why a table usually cannot be copied and pasted cleanly into Excel, and why one cannot easily edit and change a PDF file directly.
Sometimes the flow of text on a page is important for cognitive or NLP processing. For example, to extract the context for any of the sentences in the example below, the system needs to understand how the text flows on the page.
If the system doesn’t recognise a table – like the one above – then the table content (headers, data, notes...) will be processed as ordinary text, which can lead to errors.
The same applies to a chart or diagram, where the title, axis labels, notes, etc. can interfere with text extraction and subsequent AI processing. Look at the large number of words and digits in the charts below:
The main text needs to be separated from the text in the charts before cognitive or NLP analysis is applied.
If some of the content of interest is in tabular form, then you need to be able to extract the internal structure of the table – the rows and columns. Conventional AI or cognitive computing cannot do this.
Many of the above problems can be solved, or partially solved, through ‘document layout analysis’.
Document layout analysis first divides up the document into regions or blocks. This ‘dividing up’ step is called ‘physical layout analysis’, and is used to identify the geometric page structure. It’s essentially a visual process – not reliant on the meaning of the content. It’s what Zanran’s ‘Scaffolder’ does (see below).
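As a rough illustration of the ‘dividing up’ step, the sketch below groups text-line bounding boxes into blocks wherever the vertical white space between them exceeds a threshold. It is purely geometric and is not how Scaffolder works internally; the function names, the box format and the `max_gap` value are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom), with y growing downwards.
Box = Tuple[float, float, float, float]


def split_into_blocks(boxes: List[Box], max_gap: float) -> List[List[Box]]:
    """Group boxes into blocks separated by vertical gaps larger than max_gap."""
    blocks: List[List[Box]] = []
    block_bottom = 0.0
    for box in sorted(boxes, key=lambda b: b[1]):  # walk the page top to bottom
        if blocks and box[1] - block_bottom <= max_gap:
            # Small gap: the box belongs to the current block.
            blocks[-1].append(box)
            block_bottom = max(block_bottom, box[3])
        else:
            # Large gap (or first box): start a new block.
            blocks.append([box])
            block_bottom = box[3]
    return blocks


def bounding_box(block: List[Box]) -> Box:
    """Return the smallest rectangle that encloses every box in the block."""
    return (min(b[0] for b in block), min(b[1] for b in block),
            max(b[2] for b in block), max(b[3] for b in block))


# Three text lines: the first two are close together, the third is far below.
lines = [(50, 100, 500, 112), (50, 116, 480, 128), (50, 200, 300, 212)]
regions = [bounding_box(b) for b in split_into_blocks(lines, max_gap=10)]
```

A real page would also need horizontal splitting (for columns, tables and charts), but the principle is the same: only positions are used, never the meaning of the text.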
The next step, ‘logical layout analysis’, is to take the regions and label them (as titles, captions, footnotes, sections, etc.). It is essentially semantic – it relies on analysis of language. Document layout analysis is the combination of the geometric division and the logical labelling.
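To make the labelling step concrete, here is a minimal rule-based sketch that assigns a coarse label to a block from simple language cues. The rules, the label names and the `label_block` function are illustrative assumptions only; a production system would need far richer linguistic analysis.

```python
import re


def label_block(text: str) -> str:
    """Assign a coarse logical label to a block of text (heuristic sketch)."""
    stripped = text.strip()
    # Captions usually open with a keyword followed by a number.
    if re.match(r"^(Figure|Fig\.|Table|Chart)\s+\d+", stripped):
        return "caption"
    # Footnotes often open with a marker or a 'Note:' / 'Source:' prefix.
    if re.match(r"^(\*|†|Note:|Source:)", stripped):
        return "footnote"
    # Short text with no closing punctuation is treated as a title.
    if len(stripped.split()) <= 10 and not stripped.endswith((".", ":", ";")):
        return "title"
    return "paragraph"


labels = [label_block(t) for t in (
    "Quarterly results",
    "Table 3: Revenue by region",
    "* Excludes one-off items.",
)]
```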
On top of these two processes one can add a text-flow analysis layer.
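A text-flow layer can be sketched as a sort over the blocks found earlier. The example below assumes the simplest possible case, a page with equal-width columns and no full-width headings, and reads each column top to bottom before moving to the next. The `reading_order` function and the block format are assumptions for illustration.

```python
from typing import List, Tuple

# Hypothetical block format: (text, left, top), with y growing downwards.
Block = Tuple[str, float, float]


def reading_order(blocks: List[Block], page_width: float, columns: int = 2) -> List[Block]:
    """Order blocks column by column, then top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(block: Block) -> Tuple[int, float]:
        _, left, top = block
        # Column index is derived from the block's left edge.
        column = min(int(left // column_width), columns - 1)
        return (column, top)

    return sorted(blocks, key=sort_key)


page_blocks = [("B", 50, 400), ("C", 320, 100), ("A", 50, 100), ("D", 320, 400)]
flow = [text for text, _, _ in reading_order(page_blocks, page_width=600)]
```

Sorting by vertical position alone would interleave the two columns, which is exactly the kind of error that loses the context of a sentence.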
All this information can easily be kept in a single XML file – a tree-like representation of the document content.
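The sketch below shows what such a tree might look like, built with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`page`, `block`, `table`, `row`, `cell`, `type`, `order`) are invented for this example and are not Zanran's actual output schema.

```python
import xml.etree.ElementTree as ET

# One page holding a labelled text block and a table, in reading order.
page = ET.Element("page", number="1")

block = ET.SubElement(page, "block", type="paragraph", order="1")
block.text = "Main text of the page."

table = ET.SubElement(page, "table", order="2")
for cells in (["Year", "Sales"], ["2016", "120"]):
    row = ET.SubElement(table, "row")
    for value in cells:
        ET.SubElement(row, "cell").text = value

# Serialise the whole tree to a single XML string.
xml_text = ET.tostring(page, encoding="unicode")
```

Because the geometry, the labels and the reading order all live in one tree, a downstream process can select just the main text, or just the table cells, with a simple query.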
Zanran’s ‘Scaffolder’ software is good at the first stage of the document analysis – dividing up the page into blocks and separating tables and graphics. It also gives the rows and columns of tables. All this is output as XML.
The output XML is very flexible and can include:
Zanran will also work with clients to develop software for any of the other layout analysis stages – each application will require slightly different algorithms.
Please contact us to discuss your application.
For more detail, please see our paper here.
This demo software has been optimised for brokers' reports. You can process your PDF and see the results as HTML, which makes them easier to visualise.
Cognitive computing is a broad term, generally understood to be the part of artificial intelligence (AI) that deals with ‘understanding’ written or spoken language. Some examples of cognitive computing might help:
In all of these, a computer is doing a complex language task.