Where data extraction and table extraction make up a large part of an organisation’s daily activity, and many of the documents it works with are PDFs, extracting the data and the tables that contain it manually can be unnecessarily time-consuming.
Automating the process should be the solution, but as PDFs are unstructured by nature, it is notoriously difficult for computers to ‘understand’ what’s happening on a page. Extracting tables from PDFs is particularly troublesome as machines often struggle to recognise PDF tables reliably.
To solve this problem, we have developed Zanran’s ‘Xtractor’, which uses machine learning in conjunction with complex rule-based algorithms to:
This results in a highly accurate extraction. The Xtractor finds the table and gets the boundaries right in over 95% of cases (it typically misses some very small tables and can get the boundaries wrong when there are large gaps in the table). The accuracy can be increased further by ‘tuning’ the software for the style of document and the type of table.
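As an illustration of how a figure like "boundaries right in over 95% of cases" could be measured, here is a minimal Python sketch. It is not Zanran's evaluation method: the `Box` type, the intersection-over-union (IoU) criterion and the 0.9 threshold are all assumptions chosen for the example. A hand-marked table counts as found only if some extracted box overlaps it almost exactly.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical table bounding box in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    # Empty or inverted boxes have zero area.
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, between 0.0 and 1.0."""
    overlap = Box(
        max(a.left, b.left),
        max(a.top, b.top),
        min(a.right, b.right),
        min(a.bottom, b.bottom),
    )
    overlap_area = area(overlap)
    union_area = area(a) + area(b) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0


def detection_rate(
    truth: Sequence[Box], extracted: Sequence[Box], threshold: float = 0.9
) -> float:
    """Share of hand-marked tables matched by an extracted box.

    The threshold is an assumed value, not one published by Zanran.
    """
    if not truth:
        return 1.0
    hits = sum(
        1 for t in truth if any(iou(t, e) >= threshold for e in extracted)
    )
    return hits / len(truth)


# Two hand-marked tables; the second extracted box stops short,
# as might happen when a table contains a large gap.
truth = [Box(0, 0, 100, 50), Box(0, 60, 100, 100)]
extracted = [Box(0, 0, 100, 50), Box(0, 60, 100, 80)]
rate = detection_rate(truth, extracted)
```

In this example the first table is matched exactly, while the second extracted box covers only half of the real table, so `rate` is 0.5. A stricter or looser threshold changes what counts as "getting the boundaries right".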
Zanran’s Xtractor software runs either on in-house servers or as a cloud service. It is enterprise-scale software and can be run on hundreds of servers concurrently.
The software itself is based on computer vision (see 'Our Technology') and is essentially independent of language. It works with any language that is written with letters rather than logograms (so not, for example, Chinese) and that reads left to right.
Companies that use Zanran's Xtractor technology can automate the processes behind jobs that need to extract tables from PDFs at a large scale, saving considerable time, resources and money.
Zanran’s Xtractor can be used in conjunction with other applications, and is an integral component of Zanran’s Data-Point Extraction technology.
For large, data-orientated organisations such as scientific bodies, financial businesses and government departments, Zanran makes it easy to automate table extraction at scale, removing the need for manual copying and pasting.