Data Point Extraction

 

PDF data extraction can be a very tedious task for any business that receives large volumes of PDF documents. This is especially the case for documents containing important information such as statistical tables or time series data, or if the PDF data extraction process needs to be done on a daily, weekly or monthly basis.

Furthermore, as PDF files are unstructured by nature and there’s no built-in way to extract data from the tables within them, this process is often done manually, which can be both time-consuming and error-prone.

However, Zanran’s Data-point technology can help automate this process. Data-point extraction refers to the extraction of specific, predefined data or ranges of data from within tables in PDFs. Zanran’s data-point extraction technology uses a multi-stage process:

    1. Identifying and extracting tables from the PDF.
    2. Optional: quality assurance checking, a manual process using Zanran’s QA Editor.
    3. Writing the content to Excel or another structured format.
    4. Using a manually designed template to extract data points from the structured content.
    5. Optional: final accuracy check.

This multi-stage process makes PDF data extraction much easier, because locating key phrases and terms in structured content is faster and less error-prone than searching the raw PDF. That in turn enables quicker and more efficient data extraction.
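As a rough illustration of step 3, the sketch below writes one extracted table to CSV, a simple structured format, and reads it back. It is not Zanran's code: the `table_to_csv` and `csv_to_table` helpers and the sample figures are assumptions made up for this example, and it uses only Python's standard `csv` module.

```python
import csv
import io


def table_to_csv(rows):
    """Serialise one table (a list of rows of cell strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_table(text):
    """Parse CSV text back into a list of rows of cell strings."""
    return [row for row in csv.reader(io.StringIO(text))]


# Hypothetical table, as it might look after steps 1 and 2.
table = [
    ["Region", "Jan 2020", "Feb 2020"],
    ["North", "1,200", "1,350"],
    ["South", "980", "1,010"],
]

csv_text = table_to_csv(table)
round_tripped = csv_to_table(csv_text)
```

Cells that contain commas, such as "1,200", are quoted by the writer, so the table survives the round trip unchanged.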

Extraction into Excel


To extract tables automatically from a PDF and transfer them to Excel, Zanran’s data-point extraction technology uses Zanran’s ‘Xtractor’, which is best-of-breed for automatic identification and extraction of tables into Excel. Xtractor’s algorithms locate and identify tables in PDFs with 98% accuracy. To learn more about the Xtractor technology, click here.

 

Designing templates 


For the data extraction process, Zanran’s system requires a template for each document, which specifies the data you want to extract. Where data or statistical analyses are published weekly or monthly, the structure and design of the documents tend to change very little. This means that the data labels (titles, headers, dates, etc.) can be considered reasonably constant. Creating the template is a manual process, but once it’s done, it should be valid for that publication for many months or years.
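To make the idea of a template concrete, here is a minimal sketch that assumes a template is just a list of named data points, each identified by a row label and a column header. The `DataPoint` class, the `extract_points` function and the sample table are hypothetical and do not describe Zanran's actual template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One value to pull from a table, located by its labels (hypothetical)."""
    name: str
    row_label: str
    column_header: str


def extract_points(table, template):
    """Look up each templated data point; return None where a label is missing."""
    header = table[0]
    rows_by_label = {row[0]: row for row in table[1:] if row}
    result = {}
    for point in template:
        row = rows_by_label.get(point.row_label)
        if row is None or point.column_header not in header:
            result[point.name] = None
            continue
        column = header.index(point.column_header)
        result[point.name] = row[column] if column < len(row) else None
    return result


# Hypothetical structured table and template for a monthly publication.
table = [
    ["Region", "Jan 2020", "Feb 2020"],
    ["North", "1200", "1350"],
    ["South", "980", "1010"],
]
template = [
    DataPoint("north_feb", "North", "Feb 2020"),
    DataPoint("south_jan", "South", "Jan 2020"),
    DataPoint("east_jan", "East", "Jan 2020"),  # label not present in the table
]

points = extract_points(table, template)
```

Because the lookup goes by label rather than by fixed cell position, a template of this kind would keep working if rows or columns were reordered between issues, as long as the labels stay the same.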

 

Quality Assurance


To ensure a very high degree of quality for your table extraction, Zanran has developed its visual PDF Workbench to facilitate and speed up the checking process. The Workbench allows a human operator to see the original PDF table overlaid with the grid that the software has determined. If the operator spots an error, they can change the layout by merging or splitting cells, columns or rows.
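The kind of correction an operator makes can be pictured as a small operation on the detected grid. The sketch below assumes the grid is a list of rows of cell strings and shows a column merge for a header that was wrongly split in two. The `merge_columns` function and the sample grid are hypothetical and are not part of the Workbench.

```python
def merge_columns(grid, left):
    """Return a new grid with column `left` joined to the column on its right."""
    merged = []
    for row in grid:
        # Join the non-empty cells of the two columns with a single space.
        joined = " ".join(cell for cell in row[left:left + 2] if cell)
        merged.append(row[:left] + [joined] + row[left + 2:])
    return merged


# Hypothetical detected grid: the header "Unit sales" was split into two columns.
detected = [
    ["Region", "Unit", "sales"],
    ["North", "", "1200"],
    ["South", "", "980"],
]

corrected = merge_columns(detected, 1)
```

After the merge, each row has two cells and the header reads "Unit sales". The original grid is left untouched, so a correction of this kind could be undone.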

For more about Zanran’s PDF Workbench, click here.

 


Zanran’s PDF data-point extraction technology can save a massive amount of time for any organisation or institution that constantly needs information to be extracted and assimilated from hundreds of reports on a weekly or monthly basis. Where the documents you are processing share a similar structural design, the technology can be deployed to quickly identify, locate and extract specific data, and it is capable of working with large volumes of PDFs.

View samples of extracted tables:

   Demo our PDF Data Extraction Software