Sustainability of Digital Formats: Planning for Library of Congress Collections


PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1, With Embedded Files

Format Description Properties

Identification and description

Full name ISO 19005-3. Document management - Electronic document file format for long-term preservation - Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3)
Description

PDF/A-3 is a constrained form of Adobe PDF version 1.7 (as defined in ISO 32000-1) intended to be suitable for archiving of page-oriented documents for which PDF is already being used in practice. PDF/A-3 adds a single, highly significant feature to its predecessor, the PDF/A-2 (ISO 19005-2) specification: it permits embedding within a PDF/A file one or more files in any other format, not just other PDF/A files (as permitted in PDF/A-2).

See PDF/A_family for more information about the PDF/A family of standards. See PDF/A-2 for information about the version of the PDF/A standard that PDF/A-3 extends.

As in PDF/A-2, the PDF/A-3 standard defines three levels of conformance. Conformance level A satisfies all requirements in the specification. Level B is a lower level of conformance, satisfying requirements intended to be those minimally necessary to ensure that the rendered visual appearance of a conforming file is preservable over the long term. The specification notes that "Level B conforming files might not have sufficiently rich internal information to allow for the preservation of the document's logical structure and content text stream in natural reading order, which is provided by Level A conformance." An intermediate level of conformance, Level U, corresponds to Level B conformance with the additional requirement that all text in the document have Unicode equivalents.
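The ordering of the three levels can be illustrated with a short Python sketch. It is an illustration only, not part of the standard: the names `Conformance` and `meets` are hypothetical, and the sketch assumes that the levels nest (B within U within A), as the description above suggests.

```python
from enum import IntEnum


class Conformance(IntEnum):
    """PDF/A-3 conformance levels, ordered from least to most demanding."""
    B = 1  # rendered visual appearance is preservable
    U = 2  # level B plus Unicode equivalents for all text
    A = 3  # all requirements in the specification


def meets(declared: str, required: str) -> bool:
    """Return True if a file declaring `declared` also satisfies `required`.

    Assumes the levels nest (B within U within A); hypothetical helper.
    """
    return Conformance[declared.upper()] >= Conformance[required.upper()]


print(meets("A", "U"))  # True
print(meets("B", "U"))  # False
```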

PDF/A-3 allows for embedding of files of any type, but imposes requirements beyond those in "regular" PDF 1.7 files as defined by ISO 32000-1. Files that comply with these requirements are termed "associated" files; an explicit association must be made between each embedded file and the containing PDF, or an object or structure (e.g., image, page, or logical section) within the PDF. See Notes below for more detail on the association mechanism. Predefined values for relationships for associated files (in the required AFRelationship key) are Source, Data, Alternative, Supplement, and Unspecified. MIME types must be provided for associated files. The PDF/A-3 specification requires the use of application/octet-stream if a more specific MIME type is not known. The compilers of this resource have not determined whether the more explicit characterization necessary to support long-term preservation (e.g., version) can be indicated. Comments welcome. Human-readable descriptions for the associated files can be provided and are recommended. Conforming readers must provide a mechanism for a user to choose to extract and save (not open) associated files.

See Notes below for use cases and examples that illustrate motivation for adding support for embedded files to the PDF/A-3 standard.

Production phase A final-state format for delivery to end users and long-term preservation of the document as disseminated to users.
Relationship to other formats
    Subtype of PDF_family, Portable Document Format
    Subtype of PDF_1_7, PDF, Version 1.7 (ISO 32000-1:2008)
    Extension of PDF/A_family, PDF for Long-term Preservation
    Extension of PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7)
    Has subtype PDF/A-3a, PDF/A-3u, PDF/A-3b, not separately described at this website.
    Has later version PDF/A-4, PDF for Long-term Preservation, Use of ISO 32000-2

Local use

LC experience or existing holdings LC was represented on the working group for the original PDF/A standard and continues to participate in the development of new versions.
LC preference

One way in which the Library of Congress expresses preferences for formats for content (primarily in physical form) for its collections is through the "Best Edition" specification from the U.S. Copyright Office in Circular 7b. Circular 7b (as revised in September 2017) listed formats acceptable for mandatory deposit of Electronic Serials available only online, in order of preference. For page-oriented renditions, PDF/A appears first on the list. Other forms of PDF are acceptable, preferably with searchable text. The preference for PDF/A was established before PDF/A-3 was published, and the preference should not be interpreted as acceptance for copyright deposit of any files embedded in a PDF/A-3 file. The Library has not expressed a preference regarding PDF/A-3, pending community-wide experience with this version of the PDF/A format.

See PDF/A_family for information about the second way in which the Library of Congress expresses preferences for digital formats, the Recommended Formats Statement.


Sustainability factors

Disclosure

Open standard, published by ISO in October 2012. Developed under the auspices of ISO/TC 171 SC2 WG5, Document Imaging Applications, Application Issues, PDF/A, for which AIIM (The Association for Information and Image Management) was acting as secretariat at the time of publication.

Documentation

ISO 19005-3:2012. Document management -- Electronic document file format for long-term preservation -- Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). The standard cannot be used without ISO 32000-1, Document management -- Portable document format -- Part 1: PDF 1.7, which it uses as a normative reference.

Adoption

See Notes below for use cases presented by proponents of the extension to allow embedding of files in other formats in a PDF/A document. One important motivation was to support the German ZUGFeRD standard for electronic invoices, in which a visible, printable PDF document has a machine-processable version of the invoice based on an XML schema embedded in the PDF file. In March 2020, identical versions of ZUGFeRD (v 2.1) and the French equivalent Factur-X (v 1.0) were published. See Factur-X V1.0 – ZUGFeRD 2.1 common publication (EN). These agencies are now developing a similar standard (Order-X), also based on a PDF document with embedded structured data in XML. See Order-X - A common standard for electronic orders in Germany and France. Because this is an important application, it is widely supported, often in specialized applications. For example, see TX Text Control. An open-source project is available at https://www.mustangproject.org/.

Another application of PDF/A-3 is as one of the component formats in the new publishing framework, known as "V3", for RFCs from the IETF (Internet Engineering Task Force). V3 uses an XML document as the master format from which plain text, HTML, and PDF versions are derived. The PDF is a PDF/A-3u document with the XML master embedded. The first RFC published in the new format was RFC 8650, published in November 2019. For more background on this choice, see RFC 7995: PDF Format for RFCs (December 2016) and additional Useful References below.

Another important use case for PDF/A with embedded files is in manufacturing. See PDF in Manufacturing: The future of 3D documentation, developed and published jointly by the 3D PDF Consortium and the PDF Association in May 2020. A PDF/A-3 file can present an interactive 3D model in the document, and related files can be embedded, creating what is sometimes called a "technical data package" or TDP. Changes introduced with PDF/A-4e, a PDF/A-4 profile intended as a successor to the PDF/E-1 standard, are likely to increase adoption of PDF/A in this domain.

The embedding of arbitrary files in a PDF poses challenges for archival institutions both for ingestion workflows and for long-term preservation management and access. The British Library's PDF Format Preservation Assessment Part 2: PDF/A Profile recommends that "Receipt or deposit of PDF/A is recommended to prefer the PDF/A-1 profile rather than PDF/A-2 and 3 to reduce the risk concerning attached files." Several lists of preferred formats from archival institutions list PDF/A-1 and PDF/A-2 as preferred formats for textual content but explicitly do not list PDF/A-3. These include the U.S. National Archives and Records Administration (NARA); Library and Archives Canada; and the Canadian Government's National Heritage Digitization Strategy -- Digital Preservation File Format Recommendations. A further problem is that the specification for PDF/A-3 provides no straightforward mechanism for quickly identifying that a compliant document has one or more embedded files. The Benefits and Risks of the PDF/A-3 File Format for Archival Institutions, a 2014 report from the National Digital Stewardship Alliance, strongly recommended that tools that create PDF/A-compliant documents use PDF/A-2 rather than PDF/A-3 for documents with no embedded files or with only PDF/A files embedded. The PDF/A-3 Overview from PDF Tools AG also argues that PDF/A-3 should only be used if non-PDF/A documents are embedded. See Useful References below for some links to blog posts that generated extensive and illuminating comments, including All In! Embedded Files in PDF/A, a 2012 post in The Signal, a blog from the Library of Congress.

Most mainstream commercial PDF creation, editing, and conversion applications and libraries support PDF/A-3, for example: pdflib; products from Callas software, particularly pdfaPilot; 3-Heights PDF to PDF/A Converter from PDF-Tools; PhantomPDF and a PDF SDK, which has a PDF/A Compliance add-on, from Foxit Software; iText 7 Core; and DocBridge Conversion Hub from Compart. Products that focus on archiving documents in a corporate or government setting using PDF/A-3 include: PDFreactor. Some products that focus on OCR or scanning to PDF claim to provide support for PDF/A-3b or PDF/A-3u, including: ABBYY FineReader Engine; iText pdfOCR; and PDF Creator from pdfforge.

See PDF/A_family for a discussion of adoption of PDF/A in general, bearing in mind that much written about PDF/A considers primarily PDF/A-1 and PDF/A-2. Comments welcome.

Licensing and patents No concerns for PDF/A_family per se. Licensing or patent concerns may arise for embedded files.
Transparency See PDF/A_family in relation to PDF/A-1 and PDF/A-2. For PDF/A-3, transparency and characterization of embedded files are primary concerns for long-term preservation.
Self-documentation See PDF/A_family.
External dependencies See PDF/A_family.
Technical protection considerations See PDF/A_family.

Quality and functionality factors

Text
Normal rendering See PDF/A_family.
Integrity of document structure See PDF/A_family.
Integrity of layout and display See PDF/A_family.
Support for mathematics, formulae, etc. In PDF/A-3u as an archival format for Accessible mathematics, Ross Moore discusses ways to embed mathematics as MathML or LaTeX source in a PDF/A-3u document. See also PDF/A_family.
Functionality beyond normal rendering See PDF/A_family.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | pdf | The standard does not indicate that a different extension should be used to distinguish PDF from PDF/A.
Internet Media Type | See related format. | See PDF/A_family.
Magic numbers | See related format. | See PDF/A_family.
Indicator for profile, level, version, etc. | See note. | The standard specifies that the PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in the standard. This schema has two mandatory elements: pdfaid:part (integer) and pdfaid:conformance (closed list of text values). A PDF/A-3 file should have the integer value 3 for pdfaid:part.
Pronom PUID | fmt/479; fmt/481; fmt/480 | There is no PRONOM entry specifically for PDF/A-3. See https://www.nationalarchives.gov.uk/PRONOM/fmt/479 for profile PDF/A-3a. See https://www.nationalarchives.gov.uk/PRONOM/fmt/481 for profile PDF/A-3u. See https://www.nationalarchives.gov.uk/PRONOM/fmt/480 for profile PDF/A-3b.
Wikidata Title ID | Q26547917; Q26549229; Q26548590 | There is no Wikidata Title ID specifically for PDF/A-3. See https://www.wikidata.org/wiki/Q26547917 for profile PDF/A-3a. See https://www.wikidata.org/wiki/Q26549229 for profile PDF/A-3u. See https://www.wikidata.org/wiki/Q26548590 for profile PDF/A-3b.
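
The pdfaid:part and pdfaid:conformance elements mentioned in the table can be read from a file's XMP metadata packet. The following Python sketch shows one possible way to do so once the packet has been extracted as text. It rests on several assumptions: the function name `read_pdfa_id` and the sample packet are hypothetical, the sample omits the xpacket wrapper found in real files, and locating the XMP packet inside a PDF is left to a PDF library. A declared value is only a claim; confirming actual conformance requires a validator.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"  # PDF/A Identification schema


def read_pdfa_id(xmp_xml: str):
    """Return (part, conformance) declared in an XMP packet, or None."""
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        values = {}
        for name in ("part", "conformance"):
            qname = f"{{{PDFAID_NS}}}{name}"
            # XMP allows simple properties as attributes or child elements.
            value = desc.get(qname)
            if value is None:
                child = desc.find(qname)
                value = child.text if child is not None else None
            if value is not None:
                values[name] = value.strip()
        if "part" in values and "conformance" in values:
            return int(values["part"]), values["conformance"]
    return None


# Hypothetical, minimal packet declaring PDF/A-3u.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>3</pdfaid:part>
      <pdfaid:conformance>U</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

print(read_pdfa_id(SAMPLE_XMP))  # (3, 'U')
```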

Notes

General

Requirements associated with embedded files: Annex E of ISO 19005-3:2012 specifies that each embedded file in a PDF/A-3 file must be identified in a file specification dictionary (as described in section 7.11.3 of ISO 32000-1). The inclusion (in a Desc key) of a human-readable description of the file is recommended. The associated embedded bitstream dictionary (essentially a header for the embedded bitstream itself) must include a MIME type (in a Subtype key), using application/octet-stream if a more precise MIME type is not known. The embedded bitstream dictionary must also include a Params key that contains at least a ModDate key to indicate the latest modification date of the embedded file. The specification for the Params key (in section 7.11.4.1 of ISO 32000-1) does not seem to allow for embedding file characteristic details more specific than a MIME type. Comments welcome.
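
The requirements in the preceding paragraph can be expressed as a small check. The Python sketch below is illustrative only: it models a file specification dictionary as a plain Python dict rather than parsing a real PDF, the function names `mime_for` and `check_embedded_file` are hypothetical, and the EF/F nesting used to reach the embedded bitstream dictionary follows the layout in ISO 32000-1 but is simplified here. Guessing a MIME type from a filename extension is an assumption of this sketch, not something the standard prescribes.

```python
import mimetypes

FALLBACK_MIME = "application/octet-stream"


def mime_for(filename: str) -> str:
    """Guess a MIME type from the filename, else use the generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or FALLBACK_MIME


def check_embedded_file(filespec: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one modeled file specification dict."""
    errors, warnings = [], []
    if not filespec.get("Desc"):
        warnings.append("Desc (human-readable description) is recommended")
    stream = filespec.get("EF", {}).get("F")
    if stream is None:
        errors.append("no embedded bitstream dictionary found under EF/F")
        return errors, warnings
    if not stream.get("Subtype"):
        errors.append("Subtype (MIME type) is required")
    if "ModDate" not in stream.get("Params", {}):
        errors.append("Params must contain ModDate")
    return errors, warnings


# Hypothetical example: an XML invoice embedded in a PDF/A-3 file.
spec = {
    "F": "invoice.xml",
    "Desc": "Machine-readable invoice data",
    "EF": {"F": {"Subtype": mime_for("invoice.xml"),
                 "Params": {"ModDate": "D:20200324120000Z"}}},
}
print(check_embedded_file(spec))
```

A dictionary that lacks Desc produces a warning rather than an error, reflecting the distinction between "recommended" and "must" in the paragraph above.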

Relationships for associated files must be expressed in a new key introduced for PDF/A-3 (and expected to be in the forthcoming ISO 32000-2 standard for regular PDF). This AFRelationship key is located in the file specification dictionary. It indicates a relationship type of Source, Data, Alternative, Supplement, or Unspecified. Relationship links are established from the document or parts of the document by use of the AF key, which contains an array of file specification dictionaries (as described above). Files associated with the entire document are represented by an AF key in the Catalog for the PDF file. Files associated with a page are represented by an AF key in the relevant Page dictionary. Files associated with an Image XObject or a Form XObject are represented by an AF key in the attributes dictionary for the object. Similarly, a file associated with a logical structure (such as an article or a table) or an annotation is represented by an AF key in the structure dictionary or annotation dictionary. A mechanism also exists to relate an associated file to a marked section of content; however, use of relationships to structures is preferred.
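
The association mechanism can be pictured as AF arrays attached at different levels of the document. The Python sketch below walks a simplified model and lists every association it finds. It is a sketch under stated assumptions: the nested dicts stand in for parsed PDF objects, the "Kids" key is a hypothetical generic container for pages, XObjects, and structure elements (real PDF files reach these objects by different routes), and the function names and file names are invented for illustration.

```python
PREDEFINED_RELATIONSHIPS = {
    "Source", "Data", "Alternative", "Supplement", "Unspecified",
}


def collect_associations(owner: str, obj: dict, children_key: str = "Kids"):
    """Yield (owner, filename, relationship) for each AF entry at or below obj."""
    for filespec in obj.get("AF", []):
        yield owner, filespec.get("F"), filespec.get("AFRelationship")
    for index, child in enumerate(obj.get(children_key, [])):
        yield from collect_associations(f"{owner}/{index}", child, children_key)


def non_predefined(associations):
    """Return associations whose relationship is not a predefined value."""
    return [a for a in associations if a[2] not in PREDEFINED_RELATIONSHIPS]


# Hypothetical document: a source file for the whole document, a MathML
# file for an equation object, and two data files for a chart object.
document = {
    "AF": [{"F": "report.docx", "AFRelationship": "Source"}],
    "Kids": [
        {"AF": [{"F": "equation.mml", "AFRelationship": "Supplement"}]},
        {"AF": [{"F": "chart.xlsx", "AFRelationship": "Data"},
                {"F": "chart.csv", "AFRelationship": "Data"}]},
    ],
}

found = list(collect_associations("Catalog", document))
for owner, name, relationship in found:
    print(owner, name, relationship)
print("has associated files:", bool(found))
```

Because AF keys can appear at many levels, a walk of this kind has to visit the whole object tree before it can report that a document has no associated files.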

Motivation for extension of PDF/A to allow arbitrary embedded files: The specification does not present the motivation for extending the PDF/A-2 format to support the embedding of files of any type. A single illustrative example appears in Annex E. This example is a PDF text document that displays a mathematical equation and includes a chart based on a spreadsheet. Associated files that might be embedded in the PDF/A-3 include a word-processing file (with relationship Source to the whole document), a MathML expression of the equation (with relationship Supplement to the structure or Form XObject that displays the equation), and a spreadsheet file and CSV file (both associated with the same chart with relationship Data). A richer expression of use cases is made in a recording of a webinar, from vendor Luratech, entitled PDF/A-3, All change for document based processes! The webinar makes statements about the intent of the standard that are not found in the standard itself. Most significantly, the presenter indicates that in a PDF/A-3 file, the embedded files should be considered "non-archival." In other words, the source or supplementary material is considered as only of short-term or temporary use. What should be considered as "archived" for the long term is only the primary PDF content with its visible page display. Secondly, the use cases, chosen for what was primarily a marketing presentation with existing customers in mind, indicate workflows and use contexts during the primary lifecycle of a document that could have benefits for entities responsible for longer-term archiving. These use cases included:

  • Hybrid archiving: If every revision of a document is accompanied by storing a PDF/A-3 file with the source word-processing file embedded, the document is archive-ready whenever editing stops.
  • Embedding richer metadata in "native" discipline-specific or application-specific format: Although PDF/A already requires and supports XMP, many metadata schemes have rich XML representations that are not easily converted to RDF, on which XMP is based. However, rich metadata in a well-known application-specific format could be embedded and associated with the document as a whole -- in addition to XMP metadata using built-in schema.
  • German ZUGFeRD standard: This electronic invoice exchange format based on PDF/A-3 was under development at the time of the webinar. The invoice data can be represented in machine-readable form as an XML file embedded in a human-readable PDF/A-3.

What is PDF/A-3 good for?, an April 2018 presentation by Dietrich von Seggern, also described use cases for PDF/A-3. He mentioned the plan by the Ministry of Transport of Quebec to archive engineering documents as PDF/A-3 with embedded CAD files and presented the use case of what he terms a "digital dossier," using PDF/A-3 as a container for project documentation. See also his System-independent archiving of project files with PDF/A-3, also from 2018.

See also PDF/A_family.

History

PDF/A-3 is equivalent to PDF/A-2 except for allowing files in any format to be embedded. The primary difference between PDF/A-1 and PDF/A-2 was the use of a later underlying version of PDF. Added capabilities, all in compliance with ISO 32000-1, included:

  • Improvements to tagged PDF (for enhanced accessibility)
  • Compressed Object and XRef streams (for smaller file sizes)
  • Support for embedding of PDF/A-compliant file attachments, portable collections and PDF packages
  • Support for transparency in images
  • Support for JPEG 2000 compression for images

PDF/A-4, a version of PDF/A based on PDF 2.0 (ISO 32000-2), was published as ISO 19005-4:2020 in November 2020. In early presentations it was referred to as PDF/A-NEXT.

See also PDF/A_family.


Format specifications


Useful references

URLs


Last Updated: 12/30/2020