Extracting Data from PDFs Goes Big

February 12, 2014
FOXITBLOG

Extracting useful information from PDFs can be a challenge when you’re dealing with a very large number of documents, which is why the Sunlight PDF Liberation Hackathon took place. Despite its name, the hackathon was not about breaking into anyone’s private database of PDF documents; it was dedicated to improving tools for PDF extraction.

Why the need? There are many organizations, including public interest groups, that want to search PDF documents en masse.

Everyday Examples of Extracting Data from PDFs

For example, one of the Sunlight Foundation’s challenges centered on the financial performance of the nation’s major cities. Most large US cities publish Comprehensive Annual Financial Reports (CAFRs) as PDFs. These documents contain a large set of audited financial statements with footnotes. The challenge was to extract a single statement, a ten-year history of revenues and expenditures, from the latest CAFR for each of the four cities participating in the hackathon (Chicago, New York, San Francisco, and Washington DC). The results make it possible to compare revenue sources and spending priorities across cities over a number of years, an obvious benefit to local, state and national governmental agencies, not to mention taxpayers.

As another example, Members of the House of Representatives file a yearly report on their personal finances. Though this report is often submitted electronically, it is only made available in PDF form on the Clerk of the House’s website. The challenge was to find a reliable and sustainable way to extract the information entered on the form, whose layout shifts from one download to the next and with the content entered.

In short, Sunlight’s PDF Liberation Hackathon aimed to tackle real-world PDF data extraction problems by bringing coders together to add features, extensions, and plugins to existing PDF extraction frameworks, making them more flexible, useful, and sustainable.
