Sustainability of Digital Formats: Planning for Library of Congress Collections

NCBIArch_1, NCBI/NLM Journal Archiving and Interchange Document Type Definition (NLM DTD), Version 1

Identification and description

Full name Formal name: NCBI/NLM Journal Archiving and Interchange Document Type Definition (DTD), versions 1.0 and 1.1. Common name: NLM DTD, version 1.x
Description

Developed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM). This DTD, intended as a common format in which publishers and archives can exchange final journal content, is described in the November 2003 version of the Archiving and Interchange Tagset (version 1.1) of the modular Journal Archiving and Interchange DTD Suite. For convenience, in this format description, "NLM DTD" is used to refer to all chronological versions of the tag suite developed and published under the auspices of NCBI/NLM. "NLM DTD" may also refer to both the Archiving and Interchange Tagset and the Publishing Tagset, two variant article models derived from the overall tag suite. The Archiving and Interchange variant was intended as a broad target for converting any article content; the Publishing variant was more prescriptive, encouraging good practice among publishers creating new content.

An article marked up in NCBIArch_1, or any version of the NLM DTD, has <article> as its root element. Within this element, the following elements are permitted:

  • <front> -- a mandatory element that holds metadata for the parent journal and metadata for the article, including volume, issue, pages, title, contributors, copyright, publication date, keywords, etc.
  • <body> -- the main textual and graphic content of the article, usually consisting of paragraphs and sections, which may themselves contain figures, tables, sidebars (boxed text), etc.
  • <back> -- an optional element that might contain a list of references, glossary, or appendices
  • <response> -- an optional, and seldom used, element for a commentary on the article itself
  • <sub-article> -- an optional, and seldom used, element for a small article completely contained within the main article
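
The top-level structure listed above can be illustrated with a minimal sketch. It assumes a hypothetical, heavily abbreviated article instance and uses only the Python standard library. It checks the names of the children of <article> and the presence of <front>; it is not DTD validation, and it does not check element order or the contents of any element.

```python
import xml.etree.ElementTree as ET

# Top-level children of <article> named in the description above.
PERMITTED_CHILDREN = {"front", "body", "back", "response", "sub-article"}

# Hypothetical, heavily abbreviated instance; a real article carries
# journal and article metadata inside <front>.
SAMPLE = """<article>
  <front/>
  <body><p>Example paragraph.</p></body>
  <back/>
</article>"""


def check_top_level(xml_text):
    """Return a list of structural problems found at the top level."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "article":
        problems.append(f"root element is <{root.tag}>, expected <article>")
    child_tags = [child.tag for child in root]
    if "front" not in child_tags:
        problems.append("mandatory <front> element is missing")
    for tag in child_tags:
        if tag not in PERMITTED_CHILDREN:
            problems.append(f"unexpected top-level element <{tag}>")
    return problems


print(check_top_level(SAMPLE))
```

A check of this kind is only a quick sanity test; confirming that a file is valid against a given version of the DTD requires a validating parser and the DTD modules themselves.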

Version 1.1 of the Journal Archiving and Interchange DTD was a fully backward-compatible revision of the DTD. That is, all documents that were valid according to version 1.0 of the DTD were also valid according to version 1.1. The changes were based on experience converting articles tagged according to other DTDs and were made largely to prevent the loss of valuable semantic information carried by the source tags. See History under Notes below.

The DTD was designed to hold the entire contents of a published article marked up in XML. However, in practice, it has often been used for recording article metadata, while article content has been stored in PDF format.

Production phase Generally used for exchanging works in their final state. A related Publishing DTD (built from the same tag set) was intended as an initial- or middle-state format for authors and publishers.
Relationship to other formats
    Has later version NCBIArch_2, NCBI/NLM Journal Archiving and Interchange DTD, version 2.x
    Has later version NCBIArch_3, NCBI/NLM Journal Archiving and Interchange DTD, version 3.0
    Has later version JATS_1, JATS, Journal Article Tag Suite, NISO Z39.96, versions 1.x
    Defined via XML_DTD, XML Document Type Definition (DTD)

Local use

LC experience or existing holdings

A pilot activity with journal publishers on archival deposit of journals began in 2006. This pilot brought in content conforming to at least one version of the NLM DTD.

In January 2010, after a period of public comment, a regulation on Mandatory Deposit of Published Electronic Works Available Only Online was published in the Federal Register. eDeposit for eSerials began in late 2010, after the first demands for mandatory deposit of serials published only in electronic form were issued in October. Some publishers are depositing journal content conforming to a version of the NLM DTD format, often using NLM DTD headers (article metadata) with article content in PDF.

LC preference

On April 19, 2006, the Library of Congress and the British Library jointly announced support and advocacy for the NLM DTD. To quote from the press release, the two institutions "will work with the National Library of Medicine to ensure the open and transparent evolution of the NLM DTD standard by encouraging early adoption by an internationally recognized standards body."

Information on the formats accepted for eDeposit was added to the August 2010 edition of Circular 7B: Best Edition of Published Copyrighted Works for the Collections of the Library of Congress, which was still current as of February 2017. Circular 7B lists the NLM journal archiving DTD as first in order of preference.

See the Library of Congress Recommended Formats Statement (RFS) for Textual Works - Electronic Serials for preferences.


Sustainability factors

Disclosure Openly documented. All components of the Journal Archiving and Interchange DTD Suite are in the public domain.
    Documentation
Adoption

This XML-based format for marking up journal articles was the first in a sequence of steadily enhanced and increasingly widely adopted DTDs for marking up articles in scholarly publications.

The interest of NCBI/NLM in the development of the tag suite was to have an XML-based article format as the basis for publishers to contribute journal content to PubMed Central (PMC). A participating journal must provide PMC the full text of articles in an XML format that conforms to an acceptable journal article DTD (Document Type Definition). According to PubMed Central: General Tagging Practice, PMC will accept content tagged in any version of the NLM DTD or NISO JATS 1.0, using the Publishing variant.

As of 2016, many publishers who used versions 1.x of the NLM DTD appear to have updated their workflows to use more recent versions of the NLM DTD or its successor, JATS. According to information provided by Portico on article submissions for its archival service, 10 publishers submitted a total of about 25,000 articles in version 1.x of the Archiving and Interchange DTD, while 40 publishers submitted over 1 million articles in version 3.0.

    Licensing and patents No licensing or patent issues. The tag sets are in the public domain.
Transparency Rates highly for transparency. Text content for articles is in XML, and hence viewable in basic editors, web browsers, etc. Elements have understandable tag-names, and document instances are in natural reading order.
Self-documentation The DTD includes a rich set of elements for metadata at the article and journal level. The <article> element is expected to include the article content and full descriptive metadata.
External dependencies None.
Technical protection considerations None.

Quality and functionality factors

Text
Normal rendering Good support.
Integrity of document structure The logical structure of a document is an essential feature of the NCBI/NLM Archiving DTD.
Integrity of layout and display As stated in the Introduction to the NLM DTD Tag Suite, the intent is to "preserve the intellectual content of journals independent of the form in which that content was originally delivered".
Support for mathematics, formulae, etc. MathML or TeX math markup can be embedded. Integrity of rendering is constrained by the capabilities of MathML and rendering tools.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | xml | For textual content files.
Magic numbers | See note. | As for many XML-based formats, there is no guaranteed magic number or other internal signature to identify the format automatically. See discussion in Notes below.
Pronom PUID | See note. | See XML
Wikidata Title ID | See note. | See XML

Notes

General

The DTD was created from the Journal Archiving and Interchange DTD Suite, which provides a set of XML modules that define elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews. The intent of this DTD Suite is to preserve the intellectual content of journals independent of the form in which that content was originally delivered. The Suite is a set of XML DTD modules, each of which is a separate physical file. No module is an entire DTD by itself, but these modules can be combined into a number of different DTDs.

Although the beginning of a file conforming to one of this family of DTDs will usually make it obvious which chronological version and article model variant the file uses, there is no guaranteed magic number or other signature to identify the format automatically. If the file was generated using the DTD (rather than the W3C XML Schema), it is likely to have the following string, or something similar, near the beginning of the file:

  • <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v1.1 20031101//EN"
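
As a rough illustration of how such a string might be detected, the following sketch scans the start of a file for a DOCTYPE declaration with a PUBLIC identifier. It is a heuristic only, not a reliable identification method. Two things are assumptions rather than facts from the documentation: that the identifiers of interest begin with "-//NLM//DTD" and carry a version token of the form "v1.1" (generalized from the single example above), and the system identifier "archivearticle.dtd" in the sample, which is a placeholder.

```python
import re

# A DOCTYPE declaration with a PUBLIC identifier in single or double quotes.
DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+([\w.:-]+)\s+PUBLIC\s+(["'])(.*?)\2""", re.DOTALL
)
# A version token such as "v1.1" inside the public identifier (assumed form).
VERSION_RE = re.compile(r"\bv(\d+(?:\.\d+)*)\b")

# Sample file head; the system identifier is a placeholder.
SAMPLE_HEAD = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and '
    'Interchange DTD v1.1 20031101//EN" "archivearticle.dtd">\n'
)


def sniff_nlm_doctype(head):
    """Look for an NLM-style PUBLIC identifier in the start of a file.

    Returns (root_name, public_id, version), or None when no DOCTYPE with
    an identifier starting "-//NLM//DTD" is found. version is None when
    the identifier carries no recognizable version token.
    """
    match = DOCTYPE_RE.search(head)
    if match is None:
        return None
    root_name, _quote, public_id = match.groups()
    if not public_id.startswith("-//NLM//DTD"):
        return None
    version = VERSION_RE.search(public_id)
    return (root_name, public_id, version.group(1) if version else None)


print(sniff_nlm_doctype(SAMPLE_HEAD))
```

Files generated against the W3C XML Schema, or files with no DOCTYPE declaration at all, will return None here, which is consistent with the absence of a guaranteed signature.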

According to PubMed Central: Minimum XML Requirements, PubMed Central validates submitted journal content data based on this PUBLIC ID in the DOCTYPE declaration in each source file. However, PubMed Central: General Tagging Practice states, "we strongly prefer the DOCTYPE declaration for associating DTDs and the @schemaLocation or @noNamespaceSchemaLocation attributes for W3C XML Schema association, but we will accept the <?xml-model?> processing instruction for DTD, W3C Schema, RELAX NG, and RELAX NG compact syntax," and provides rules for encoding such processing instructions.
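
The three association mechanisms named in the quotation can be told apart with a short sketch using Python's standard xml.dom.minidom module. This is an illustration under stated assumptions, not a description of how PubMed Central processes submissions: the file names "article.xsd" and "article.rng" are hypothetical, and the raw content of the processing instruction is reported as-is rather than interpreted.

```python
from xml.dom import minidom

# Namespace of the schemaLocation / noNamespaceSchemaLocation attributes.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance using two of the three mechanisms.
SAMPLE = (
    '<?xml-model href="article.rng"?>'
    '<article xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="article.xsd"><front/></article>'
)


def schema_associations(xml_text):
    """List (mechanism, value) pairs for schema associations in a document."""
    doc = minidom.parseString(xml_text)
    found = []
    # 1. DOCTYPE declaration with a PUBLIC identifier.
    if doc.doctype is not None and doc.doctype.publicId:
        found.append(("DOCTYPE", doc.doctype.publicId))
    # 2. W3C XML Schema attributes on the root element.
    root = doc.documentElement
    for attr in ("schemaLocation", "noNamespaceSchemaLocation"):
        if root.hasAttributeNS(XSI_NS, attr):
            found.append((attr, root.getAttributeNS(XSI_NS, attr)))
    # 3. xml-model processing instructions at document level.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            found.append(("xml-model", node.data))
    return found


print(schema_associations(SAMPLE))
```

Unlike a scan of the first bytes of a file, this approach parses the whole document, so it fails on input that is not well-formed XML.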

History

The NLM DTD has its origins in the convergence of work NLM was doing to create an archiving DTD for Life Sciences journals (for PubMed Central (PMC)) and a project at Harvard University funded by the Mellon Foundation to address the problems of archiving scholarly journals in electronic form (E-journals). See Cover Pages on Harvard University E-Journal Archive Project. The second phase of the Harvard E-journal Archiving project was a feasibility study to investigate the development of a common markup formalism that can be used to "reasonably represent the intellectual content (text, tables, formulas, still images, and links) of archived journal articles." The study was carried out by Inera Corporation, using input from ten publishers who were asked to provide their existing DTDs, documentation and sample SGML documents for analysis.

Version 1.0 of the NLM DTD was released in December 2002. NLM contracted with Mulberry Technologies, Inc. of Rockville, MD to act as Archiving and Interchange Tagset Secretariat. To keep the DTD relevant to the publishing and archiving communities, NLM created the XML Interchange Structure Working Group in 2003. This group advised NLM on recommended changes in and/or additions to the tagset that were included in version 1.1, released in November 2003. Some users converting articles tagged according to other DTDs into Archiving and Interchange articles had found that they had to discard information (such as more detailed tag semantics) in the transformation. The changes for version 1.1 were largely to support retention of such details in archival repositories.

Version 2.0, involving substantial structural changes, was released in December 2004. The changes were fully backward compatible, but customizations based on 1.0 or 1.1 were not necessarily compatible with 2.0. In 2005, in version 2.1, a third article model was added: the Article Authoring DTD, to support article creation in XML.

Version 3.0 of the NLM DTDs was published in November 2008. In version 3.0, changes were made that rendered documents that conformed to a prior version invalid if checked against 3.0. The idea was to make the Tag Sets as logical, internally consistent, and complete as possible going forward. Publishers using previous versions would therefore need to change their backfiles to make them compliant. The decision to update to version 3.0 was left up to each publisher and archive.

The transition to standardization through NISO as "Standardized Markup for Journal Articles" took some months, with the newly constituted NISO Working Group starting work in late 2009. The original plan had been to submit version 3.0 as a draft standard to NISO, but comments received on version 3.0 in the interim led the group to update the Tag Suite and the three article models. It was decided that the Journal Article Tag Suite under NISO should start with a new numbering scheme and adopt the JATS acronym more formally. After circulation of a public draft in 2011, JATS version 1.0 was published in 2012.


Last Updated: 03/23/2023