PDF to XML



It’s very hard to convert an unstructured document like a PDF into the logical structure of an XML file.*

Zanran takes an unusual, visual approach. Our 'PDF Scaffolder' uses computer-vision techniques to break each page up into blocks: text (paragraphs), tables and graphics. The tables and text are then output as XML.

Text and tables to XML

For details of the full XML file output, please download the XML Overview.


 

Downstream applications

Once you have the document in this structured format, you can run

  - data-point extraction from tables

  - computation on data in tables

  - Natural Language Processing (NLP) on text
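As a rough illustration of the first two items, here is a minimal Python sketch of data-point extraction and a simple computation on table XML. The element and attribute names (`document`, `table`, `row`, `cell`, `id`) and the sample figures are assumptions made up for this example; they are not Zanran's actual output schema, which is described in the XML Overview.

```python
import xml.etree.ElementTree as ET

# Hypothetical table XML: the element names, the "id" attribute and the
# figures are illustrative assumptions, not Zanran's real output format.
SAMPLE_XML = """
<document>
  <table id="t1">
    <row><cell>Year</cell><cell>Revenue</cell></row>
    <row><cell>2021</cell><cell>1,200</cell></row>
    <row><cell>2022</cell><cell>1,500</cell></row>
  </table>
</document>
"""

def table_rows(xml_text, table_id):
    """Return the rows of one table as lists of cell strings."""
    root = ET.fromstring(xml_text)
    table = root.find(f".//table[@id='{table_id}']")
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    return [
        [(cell.text or "").strip() for cell in row.findall("cell")]
        for row in table.findall("row")
    ]

def column_total(rows, header):
    """Sum a numeric column, located by the text of its header cell."""
    col = rows[0].index(header)
    # Strip thousands separators before converting to a number.
    return sum(float(row[col].replace(",", "")) for row in rows[1:])

rows = table_rows(SAMPLE_XML, "t1")
revenue_2022 = rows[2][1]               # data-point extraction
total = column_total(rows, "Revenue")   # computation on table data
```

Because the table is already structured, picking out a single cell or totalling a column takes a few lines; the same rows could equally be passed to a spreadsheet or data-analysis library.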

For a good example of a value-added application built on the XML, please see how Zanran’s software speeds up the checking of annual reports.

Interface with the XML

Zanran has developed its ‘PDF Workbench’ software for any human interaction with the XML.  It visualises the XML and enables:

  - editing the XML (especially the tables and their data)

  - adding content, tags or descriptions to the text or tables

Please see PDF Workbench.


 

Extract the tables from your PDF: you will be able to download them as Excel or XML.


 

* XML (Extensible Markup Language) is a format designed for simplicity and general usability across the internet. It is both human-readable and machine-readable, and much modern publishing uses it. Here is a very simple example of a line of text in XML, representing the phrase "Hello high-technology!":

<text>Hello high-technology!</text>
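To show the machine-readable side, the short Python sketch below reads that same line with the standard library's `xml.etree.ElementTree` module. It is only an illustration of how any XML parser can recover the tag and its content; it is not part of Zanran's software.

```python
import xml.etree.ElementTree as ET

# Parse the one-line example from the text above.
element = ET.fromstring("<text>Hello high-technology!</text>")

print(element.tag)   # text
print(element.text)  # Hello high-technology!
```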
