Automatic Data Point Extraction
- using computer vision

 

PDF data-point extraction can be a very laborious task for any business that receives large volumes of PDF documents. This is especially the case for documents containing important information such as statistical tables, financial results, or time-series data.  (Data-point extraction refers to the extraction of specific, predefined data or ranges of data from within tables in PDFs.)

Zanran enables you to pull out the tables as Excel or XML, then extract the data you're looking for:

You make use of your understanding of the language in the tables:

Having the PDF tables in a structured form makes it much easier to find the right data points.  And remember, the table extraction is done without manual intervention and without needing or using any templates.


RPA:  the process can be managed end-to-end by a Robotic Process Automation (RPA) system, for example UiPath's RPA.

Extraction into XML or Excel


Zanran’s ‘PDF Scaffolder’ is best-of-breed for automatically finding and extracting tables, without using templates, into Excel or XML.  The Scaffolder uses robust algorithms that can locate and identify tables in PDFs with 98% accuracy. To learn more about the PDF Scaffolder, click here.

 

Finding the relevant rows and columns


In general, we first identify the correct table(s) by matching against a set of words within the table.  Then we identify the relevant rows and columns using regular expressions.  Creating these rules is necessarily a manual process – but you only have to do it once.  The actual words and phrases used will obviously depend on the specific application.
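
The two-step approach described above (pick the table by a set of words, then pick the row and column by regular expressions) can be sketched roughly as follows. This is only an illustrative sketch, not Zanran's implementation: the list-of-rows table structure, the sample values, and the helper names `find_table`, `find_index` and `extract_data_point` are all assumptions made up for the example.

```python
import re

# Hypothetical tables, as they might look after extraction to Excel/XML
# and loading into lists of rows. All values are invented for illustration.
TABLES = [
    [
        ["Region", "2021", "2022"],
        ["North", "10", "12"],
        ["South", "7", "9"],
    ],
    [
        ["Consolidated income statement", "FY 2021", "FY 2022"],
        ["Revenue", "1,200", "1,350"],
        ["Operating profit", "210", "260"],
        ["Profit before tax", "190", "245"],
    ],
]


def find_table(tables, required_words):
    """Return the first table whose text contains every required word."""
    for table in tables:
        text = " ".join(cell for row in table for cell in row).lower()
        if all(word.lower() in text for word in required_words):
            return table
    return None


def find_index(labels, pattern):
    """Return the index of the first label matching the regex, or None."""
    regex = re.compile(pattern, re.IGNORECASE)
    for i, label in enumerate(labels):
        if regex.search(label):
            return i
    return None


def extract_data_point(tables, required_words, row_pattern, col_pattern):
    """Locate the table, then the row and column, and return the cell text."""
    table = find_table(tables, required_words)
    if table is None:
        return None
    header = table[0]                       # assume the first row holds column headers
    row_labels = [row[0] for row in table]  # assume the first column holds row labels
    row = find_index(row_labels, row_pattern)
    col = find_index(header, col_pattern)
    if row is None or col is None:
        return None
    return table[row][col]


value = extract_data_point(
    TABLES,
    required_words=["income statement", "revenue"],
    row_pattern=r"^operating\s+profit",
    col_pattern=r"\b2022\b",
)
print(value)
```

In a sketch like this, the word list and the two patterns are the rules that are written once per application; the same rules can then be applied to each new batch of extracted tables.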

Quality Assurance
- 'PDF Workbench'


To ensure a very high degree of quality for your table extraction, Zanran has developed its visual PDF Workbench to facilitate and speed up the checking process. The Workbench allows a human operator to see the original PDF table overlaid with the grid that the software has determined.  If the operator spots an error, he or she can change the layout by merging or splitting cells, columns or rows.

For more about Zanran’s PDF Workbench, click here.

 


Zanran’s PDF data-point extraction technology can save a massive amount of time in any organisation that constantly needs information extracted from hundreds of reports on a frequent basis.  The software is very scalable.

To discuss your data-point extraction, please contact us.

View samples of extracted tables:

   Demo our PDF Table Extraction Software