- November 28, 2019
In biblical history, Noah, his family and many animals survived a great flood. To make this possible Noah brought pairs from every species aboard his ark. But is Noah’s model – gathering the full diversity of file and document formats – genuinely suitable for archiving digital content?
Certainly, many providers of Enterprise Content Management (ECM) systems rely on this method, advising their users to simply treat the original documents in the system as “archival”. All manner of formats are collected this way: Microsoft Word, Excel, PDF, HTML, TIFF, JPG, AFP and PCL are common, but there are always many other types, sometimes even in unknown formats. Examining an ECM system reveals a veritable “format zoo”, with every imaginable format represented. Do you have a large-scale ECM system? Take a look for yourself!
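One rough way to take that look is to count the files in a repository export by extension. The following is only a minimal sketch in Python: it assumes the ECM content is reachable as an ordinary directory tree, and the function name `format_census` is made up for this illustration. A file extension is merely a hint about the format, so the result is a first impression, not a reliable format identification.

```python
from collections import Counter
from pathlib import Path


def format_census(root):
    """Count the files under `root` by lowercase file extension.

    Assumption: the repository is available as a plain directory tree.
    The extension is only a hint; it does not prove the actual format.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under a placeholder label.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the export directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for extension, number in format_census(root).most_common():
        print(f"{number:8d}  {extension}")
```

Run against a long-lived repository, a listing like this tends to show a long tail of rarely used extensions, which are the formats most likely to cause trouble later.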
Problems in the format zoo
Relying on collections of arbitrary files for archival purposes is problematic in practice. Users need access to an ever-expanding range of software in order to view the different formats their institution has collected. Even with the native software, a correct representation is far from guaranteed. Older files in particular are often difficult to open, as many vendors disregard backwards compatibility and format support is sometimes deprecated. Most companies have to keep important documents for a long time, according to a 2017 survey by AIIM (PDF). Over long periods, system changes are very likely, and migration can be very complex in its own right, even without considering the file format zoo.
The result, for most organizations, is that an accurate representation of original content is not guaranteed to be available in a contract dispute or other legal dispute, among other worst-case scenarios. Outdated formats also hinder future business operations generally, often complicating the exchange of documents with external companies.
The ark’s “passengers” may be safe on the boat, but they might not be prepared for the environmental conditions when it comes time to leave the ark-ive and start producing information for new users.
PDF must be on board
The key to success in digital document archives is to pass the content on and secure its future. For this, PDF/A, the ISO-standardized archival subset of the Portable Document Format (PDF), is eminently suitable. The U.S. National Archives and Records Administration (NARA), the U.S. federal government’s authority on archiving records, recommends PDF and PDF/A for long-term preservation, as PDF/A is designed to ensure the presentation and exchange of documents, regardless of original software, hardware or operating system, even after decades.
Keep the originals, embed them in archival-quality PDF/A
Relying on PDF/A does not imply that original formats should be abandoned. For users who want to archive both the original format and a PDF representation, PDF allows authors or ECM systems to embed source documents within each PDF/A document.
PDF facilitates many other archival-related functions, and all the more so as a container for the source document. For example, PDF files may contain institutional metadata that the source document lacks, or may be tagged for accessibility even if the source file is inaccessible. There are many associated features and uses of PDF depending on the country and industry.
If only original formats are stored, the archive’s ability to deliver usable content will always be in doubt. Instead, with every original document, a PDF/A document should also go “on board” so as to ensure functionality in an unknown environment after the “flood”.