3 tips on shrinking born digital documents


Here are a few examples of how born digital file sizes can get out of control and some tips to rein them in:

How born digital documents can get bigger than they should:

  1. Hybrid documents
    • Hybrid documents contain not only born digital pages but also images. For example, a scanned page containing signatures might be embedded in a born digital document.  These images are frequently not well compressed and therefore dramatically increase the file size.
  2. Uncompressed Object Streams
    • Internal data structures in PDFs can be stored in compressed form using Flate, a lossless compression method that most modern PDF tools use. However, if your files were created a long time ago, or by an older PDF creation tool that did not use Flate, there might be a compression opportunity.
  3. Unreferenced Metadata
    • Sometimes PDFs contain information that is not used for anything. This can happen when a file is a subset (extract) of a much larger file, and the tool that performed the extraction just transferred all common metadata from the original file, not considering whether it is still needed or even relevant for the extracted pages.
  4. Duplicate Objects
    • Imagine a 100-page PDF file consisting of 100 one-page invoices, each containing the company’s logo graphics. Simple PDF software would just concatenate these 100 invoices and end up with 100 copies of the logo graphics. PDF supports the concept of object references, so one could achieve the same visual quality by including the logo graphics just once and referencing it from each of the 100 pages.
  5. Unwanted Objects
    • Some PDF tools give you the option to remove unwanted objects such as fonts, thumbnails or accessibility tagging in order to reduce the file size.
      • Caution: Be careful when implementing this option to ensure that you won’t miss any of the removed objects later.
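
To make the duplicate-objects case concrete, here is a minimal sketch of the idea in Python. It does not read or write real PDF files: the `deduplicate` function, the `pages` structure and the fake logo bytes are all hypothetical stand-ins, chosen only to illustrate storing identical content once and pointing every page at that single copy.

```python
import hashlib


def deduplicate(pages):
    """Store each distinct payload once and give pages references to it.

    `pages` is a list of pages; each page is a list of bytes objects
    (hypothetical stand-ins for the images or fonts a page uses).
    Returns (store, page_refs): `store` maps an object id to its bytes,
    and `page_refs` mirrors `pages` but holds object ids, not payloads.
    """
    store = {}          # object id -> payload, one entry per distinct payload
    ids_by_digest = {}  # content hash -> object id already assigned
    page_refs = []
    for page in pages:
        refs = []
        for payload in page:
            digest = hashlib.sha256(payload).hexdigest()
            if digest not in ids_by_digest:
                obj_id = len(store) + 1
                ids_by_digest[digest] = obj_id
                store[obj_id] = payload
            refs.append(ids_by_digest[digest])
        page_refs.append(refs)
    return store, page_refs


# 100 one-page "invoices" that all carry the same (fake) logo bytes.
logo = b"\x89LOGO" * 2000
pages = [[logo, f"invoice {n}".encode()] for n in range(100)]

store, page_refs = deduplicate(pages)
naive_size = sum(len(payload) for page in pages for payload in page)
dedup_size = sum(len(payload) for payload in store.values())
print(f"naive: {naive_size} bytes, deduplicated: {dedup_size} bytes")
```

In this sketch every page ends up referencing the same logo entry, so the logo bytes are counted once instead of 100 times. Real PDF optimizers do this on the file's actual objects, and how they detect duplicates may differ from the hash comparison assumed here.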

Tips:

  1. To shrink hybrid documents, consider advanced document compression (MRC compression, JBIG2 and/or JPEG 2000). The born digital content is fully retained, while the images are often reduced by 50% or more without noticeable loss of image quality. In addition, the images can be made searchable to ensure that nothing is missed when searching for content in the document.
  2. To clean up metadata, deduplicate objects, remove unwanted objects or compress data streams, tap into the power of modern PDF document conversion tools. Some tools can efficiently handle thousands or even millions of documents; for individual files, most modern PDF editors can help optimize documents.
  3. Not all PDF solutions are created equal, so be sure to test carefully and with a large document set to ensure that there will be no surprises down the road.
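
As a rough illustration of what compressing a data stream buys you, the sketch below applies Deflate (the method behind Flate) to a block of text using Python's standard `zlib` module. The `raw_stream` content is a made-up stand-in for an uncompressed PDF stream, not data taken from a real file, and actual savings depend heavily on how repetitive the data is.

```python
import zlib

# Hypothetical stand-in for an uncompressed stream inside an older PDF.
raw_stream = b"BT /F1 12 Tf 72 720 Td (Invoice line item) Tj ET\n" * 500

# Compress at the highest level, then decompress to confirm it is lossless.
compressed = zlib.compress(raw_stream, 9)
restored = zlib.decompress(compressed)

ratio = len(compressed) / len(raw_stream)
print(f"{len(raw_stream)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")
print("lossless round trip:", restored == raw_stream)
```

The round-trip check is the important part: because the compression is lossless, the restored bytes are identical to the original, which is why this kind of optimization is safe for born digital content.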
