It’s very hard to convert an unstructured document like a PDF into the logical structure of an XML file.*
Zanran takes an unusual, visual approach: our 'PDF Scaffolder' uses computer-vision techniques to break each page up into blocks of text (paragraphs), tables, and graphics. The tables and text are then output as XML.
For details of the full XML file output, please download the 'XML Overview'.
Once you have the document in this structured format, you can run:
- data-point extraction from tables
- computation on data in tables
- Natural Language Processing (NLP) on text
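As a rough illustration of these three uses, here is a minimal Python sketch. The element names (`document`, `paragraph`, `table`, `row`, `cell`) and the sample figures are assumptions made up for this example; they are not Zanran's actual XML schema, which is described in the XML Overview.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample output; the real Zanran schema may differ.
SAMPLE = """
<document>
  <paragraph>Revenue grew in 2023.</paragraph>
  <table id="t1">
    <row><cell>Year</cell><cell>Revenue</cell></row>
    <row><cell>2022</cell><cell>1,200</cell></row>
    <row><cell>2023</cell><cell>1,500</cell></row>
  </table>
</document>
"""


def table_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        [(cell.text or "").strip() for cell in row.findall("cell")]
        for row in table.findall("row")
    ]


def column_total(rows, header):
    """Sum a numeric column, treating the first row as the header row."""
    col = rows[0].index(header)
    return sum(float(row[col].replace(",", "")) for row in rows[1:])


root = ET.fromstring(SAMPLE.strip())
rows = table_rows(root.find("table"))

# Data-point extraction: the Revenue cell of the 2023 row.
data_point = rows[2][1]

# Computation on table data: total of the Revenue column.
total = column_total(rows, "Revenue")

# Text blocks, collected as plain strings ready to pass to an NLP tool.
paragraphs = [p.text for p in root.iter("paragraph")]
```

Because tables and paragraphs are separate elements, each kind of processing can address exactly the blocks it needs, without guessing at page layout.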
For a good example of a value-added application built on the XML, please see how Zanran’s software speeds up the checking of annual reports.
Zanran has developed its ‘PDF Workbench’ software for any human interaction with the XML. It visualises the XML and enables:
- editing the XML (especially the tables and their data)
- adding content, tags or descriptions to the text or tables
Please see 'PDF Workbench'.
* XML (Extensible Markup Language) is a format designed for simplicity and general usability across the internet. It is both human-readable and machine-readable, and much modern publishing uses it. Here is a very simple example of a line of text in XML representing the phrase: "Hello high-technology!":
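A minimal Python sketch of such a line, assuming a hypothetical element name `line` (an illustration only, not Zanran's actual tag), shows the phrase being written as XML and then read back by a program:

```python
import xml.etree.ElementTree as ET

# "line" is a made-up element name used only for this illustration.
element = ET.Element("line")
element.text = "Hello high-technology!"

# Human-readable: the XML is plain text that a person can read.
xml_line = ET.tostring(element, encoding="unicode")
print(xml_line)  # <line>Hello high-technology!</line>

# Machine-readable: a standard parser recovers the original phrase.
parsed_text = ET.fromstring(xml_line).text
print(parsed_text)  # Hello high-technology!
```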