|Introduction | Sustainability Factors | Content Categories | Format Descriptions | Contact|
|Full name||DOCX, (Office Open XML, WordprocessingML) ISO 29500:2008-2012, also ECMA-376, Editions 1-4.|
The Office Open XML-based word processing format using .docx as a file extension has been the default format produced for new documents by versions of Microsoft Word since Word 2007. The format was designed to incorporate the full semantics and functionality of the binary .doc format produced by earlier versions of Microsoft Word. For convenience, this format description uses DOCX to identify the corresponding format. The primary content of a DOCX file is marked up in WordprocessingML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology -- Document description and processing languages -- Office Open XML File Formats (OOXML). This description focuses on the specification in ISO/IEC 29500:2012 and represents the format variant known as "Transitional." Although editions of ISO 29500 were published in 2008 and 2011, the specification in the standard has had very few changes other than clarifications and corrections to match actual usage in documents since WordprocessingML was first standardized in ECMA-376, Part 1 in 2006. Hence, this description should be read as applying to all WordprocessingML versions published by ECMA International and by ISO/IEC through 2012. See Notes below for more detail on the chronological versions and minor differences.
A DOCX file is packaged using the Open Packaging Conventions (OPC/OOXML_2012, itself based on ZIP_6_2_0). The package can be explored, by opening with ZIP software, typically by changing the file extension to .zip. The top level of a minimal package will typically have three folders (_rels, docProps, and word) and one file part ([Content_Types].xml). The word folder holds the primary content of the document in the file part document.xml. The other folders and contained parts support efficient navigation and manipulation of the package:
The word folder contains at a minimum document.xml and files and subsidiary folders that support presentation styles and themes. Headers and footers are stored in separate parts if present. The minimal structure for document.xml will include a nested set of elements:
Optional elements <w:pPr> and <w:rPr> define the formatting properties of a particular paragraph or run.
The standards documents that specify this format run to over six thousand pages. Useful but thorough introductions to the DOCX format can be found at:
|Production phase||Can be used in any production phase. Particularly used for creating documents (initial state) and for editing and review (middle-state). Documents that are formally published are often converted to a format that is designed for final publication and not for convenient editing.|
|Relationship to other formats|
|Subtype of||OOXML_Family, OOXML (ISO/IEC 29500) Format Family|
|Subtype of||OPC/OOXML_2012, Open Packaging Conventions (Office Open XML), ISO 29500-2:2008-2012|
|May contain||MCE/OOXML_2012, Markup Compatibility and Extensibility (Office Open XML), ISO 29500-3:2008-2012, ECMA-376, Editions 1-4|
|Has modified version||DOCX/OOXML_Strict_2012, DOCX Strict (Office Open XML), ISO 29500-1: 2008-2012. The Strict variant of DOCX disallows legacy markup as specified in Part 4 of ISO/IEC 29500. Hence the Strict variant has less support for backwards compatibility when converting documents from older formats.|
|Has modified version||Associated template format using extension .dotx, not described separately on this website. A .dotx template file is a WordprocessingML document based on the same schema and namespaces (specified in ISO/IEC 29500) as a .docx file. The difference is its intended use.|
|Affinity to||Associated format for WordprocessingML documents or templates with embedded macros, using file extensions .docm and .dotm, not described separately at this website. The language used by Microsoft for macros, VBA, is not covered by the ISO/IEC 29500 specification, but is fully documented by Microsoft. Macros are embedded as parts in the OPC package.|
|Defined via||XML, Extensible Markup Language (XML)|
|LC experience or existing holdings||Used by Library of Congress staff. Sometimes used as the master for documents published by the Library of Congress as PDFs, for example for oral history transcripts in the Civil Rights History Project.|
|LC preference||For works acquired for its collections, the list of Library of Congress Recommended Format Specifications for Textual and Musical works, as of June 2014, includes the OOXML family of formats, which includes the DOCX format, as acceptable for textual works and electronic serials. Since the binary (.doc) format is not listed as either preferred or acceptable, the DOCX format is implicitly preferred over the binary equivalent.|
|Disclosure||International open standard. Maintained by ISO/IEC JTC1 SC34/WG4. Originated by Microsoft Corporation and first standardized through ECMA International in 2006. Approval as ISO/IEC 29500 was in 2008.|
ISO/IEC 29500-1, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 1: Fundamentals and Markup Language Reference and ISO/IEC 29500-4, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 4: Transitional Migration Features. Latest version (dated 2012 as of August 2014) is available from ISO/IEC Publicly Available Standards.
The Transitional variant of DOCX is specified by applying the differences described in Part 4 (Transitional Migration Features) to the specification in Part 1. Part 4 cannot be read without detailed reference to subclauses in Part 1.
Annex L of Part 1 is a Primer (informative rather than normative) that introduces key features of WordprocessingML, relating elements and attributes to intended functionality through examples.
Very widely used. DOCX was originally developed by Microsoft as an XML-based format to replace the proprietary binary format that uses the .doc file extension. Since Word 2007, DOCX has been the default format for the Save operation. Although the market share for the Microsoft Office productivity suite is declining, in the enterprise arena, it was still 90% in 2012, according to Gartner, as reported by CNN Money in Nov 2013. That article sees Google Docs as the primary competitor; Google Docs can export in six formats, with DOCX top of the list (as of September 2014). A June 2014 blog post by LifeHacker reported that the Google Docs App for Android could now edit DOCX files natively, without format conversion. A Google Drive blog post from June 25, 2014 confirms this introduction and indicates that the same feature is available online to users of the Chrome browser.
Wikipedia's Office Open XML: Application Support and List of software that supports Office Open XML document support in a wide variety of word-processing applications and file conversion software, including the open source Libre Office (Read and Write support) and Apache OpenOffice (Read support). In June 2014, Microsoft released its Open XML SDK (first released for use in 2007), as open source.
The compilers of this resource are not aware of any word-processing applications other than Word 2013 that can create the Strict variant of DOCX (as defined in Part 1 of the ISO/IEC 29500 standard). Tests in September 2014 indicated that Google Docs and Libre Office both created new documents in the Transitional variant described in this document, as indicated by the namespace declarations, even when the document includes no elements or attributes not present in the Strict versions of the schemas. This corresponds to the default behavior of Microsoft Word 2013.
DOCX is an acceptable format for a number of national archival institutions, including the Library of Congress, the U.S. National Archives, National Archives of Australia, and Library and Archives Canada. Many journal publishers prefer or even mandate DOCX for article submission; some provide associated templates (see examples among Useful References, below).
|Licensing and patents||
The specification originated from Microsoft Corporation. Current and future versions of ISO/IEC 29500 and ECMA-376 are covered by Microsoft's Open Specification Promise, whereby Microsoft "irrevocably promises" not to assert any claims against those making, using, and selling conforming implementations of any specification covered by the promise (so long as those accepting the promise refrain from suing Microsoft for patent infringement in relation to Microsoft's implementation of the covered specification).
Features introduced into DOCX through the MCE mechanism may be subject to patent protection. However, Microsoft's interoperability principles indicate "Microsoft will also make available a list of any of its patents that cover any extensions, and will make available patent licenses on reasonable and non-discriminatory terms."
The structure and text of a DOCX file are all represented in XML and hence viewable without special tools, although XML-aware tools that can show the element hierarchy make viewing and interpretation more convenient. The most commonly used parts, elements, and attributes have recognizable names. Simple documents can be interpreted with very basic tools. However, interpreting the semantics of some elements and the correspondence of some elements and attributes to word-processing functionality will require understanding of both the schema and the textual specification. The specification provides valuable examples, for example of text effects, and not all normative constraints for DOCX can be represented fully in the W3C XML Schema Language (XML_Schema_1_0).
The transparency of embedded image, audio, and video files depends on the formats of those files.
For transparency of the package containing the constituent parts of the DOCX file, see OPC/OOXML_2012.
The property file /docProps/core.xml is usually present for OPC packages, although all elements in this Core Properties part are optional and the part can be omitted if none of its elements are used. For more on self-documentation of the package containing the constituent parts of the DOCX file, see OPC/OOXML_2012.
A single optional part with a pre-defined set of extended properties for the package is permitted. Microsoft uses the part name /docProps/app.xml for this and it is always present in DOCX files created by Microsoft. The extended properties (each optional and non-repeatable) are primarily administrative and are not related to the intellectual or bibliographic nature of the document. Elements include: name of creating application; version of creating application; various size metrics (pages, words, etc.); template used; document security level; and a list of embedded hyperlinks. Judging from tests in October 2014, Libre Office and Google Docs use the same part names for the core and extended properties parts. The extended properties part typically records many fewer properties than in files created by Microsoft; both applications identify themselves as the creating application for non-empty documents.
The nature of the OPC package would permit the addition of a part that included rich XML-based metadata, preferably in a well-known schema, and that was listed in the relationships file associated with the Core Properties part with an appropriate relationship type. However, no part of ISO/IEC 29500:2012 predefines such a relationship. Embedding such a part in an OPC package could be done without affecting the primary document content.
None beyond XML-aware software.
See also OPC/OOXML_2012.
|Technical protection considerations||
|Normal rendering||Editable document, with embedded support for powerful word-processing functionality. Textual content is conveniently extractable for quotation and for indexing. Full support for Unicode character set.|
|Integrity of document structure||Paragraphs and sections are easily recognized, as are headers and footers. Excellent support is available for higher-level constructs through the consistent use of named styles (e.g., for headings), automatically generated tables of contents and indexes, and structured templates. However, use of such styles is not required, and structural semantics may only be reflected through font usage and paragraph indenting.|
|Integrity of layout and display||Excellent support for layout choices. Represents entire layout and formatting as intended by an author who used a word-processor for which DOCX is a native format. Bi-directional and vertical display of text can be specified. Differences in detail can occur on display if the original fonts used are not available in the system used for viewing or due to conversion from another word-processing format with different markup semantics.|
|Support for mathematics, formulae, etc.||
ISO/IEC 29500 defines Office Math Markup Language, a mathematical markup language that can be embedded in WordprocessingML. Microsoft has published XSLT transformations to convert between MathML and Office Math Markup Language. Key reasons given for not using MathML directly in DOCX include:
|Functionality beyond normal rendering||
In contrast to formats designed for documents as publications, word-processing formats such as DOCX typically store much information associated with the process of creating and reviewing documents, including tracked changes, threaded comments, and other annotations. DOCX supports embedding of other OOXML content (including spreadsheetML, presentationML, DrawingML, and Office Mathematical Markup Language), embedding of media objects in binary formats, and links to external media objects, such as images, audio, or video.
DOCX files may include markup to support building an index or bibliography from references entered in the text. DOCX documents may include tables of contents generated automatically from section headings; such files will include elements and attributes to support regeneration of the table of contents using the author's choice of levels to include and of layout style.
DOCX files may include forms designed to be filled in by a reader. The DOCX specification includes markup to support convenient navigation between fields in a form and to constrain information entered in forms (for example, to be a date or a choice from a drop-down menu.
In contrast to the Strict variant of DOCX, the Transitional variant described here may include markup to support backwards compatibility and to preserve visual and functional characteristics of documents originating in earlier word-processing formats.
||Three closely related filetypes have different extensions: .dotx for template files; .docm for document files with embedded macros; and .dotm for template files with embedded macros. All are based on the same WordprocessingML specification and on ISO/IEC 29500.|
|Internet Media Type||application/vnd.openxmlformats-officedocument.wordprocessingml.document
||From IANA registration.|
|XML namespace declaration||http://schemas.openxmlformats.org/wordprocessingml/2006/main
||This namespace declaration is for the Transitional variant of DOCX. It occurs in the mandatory Main Document part of a DOCX file (package), which usually has the name /word/document.xml and is mapped to the prefix w. The use of /word/document.xml as the name of the main part is conventional, rather than mandated in ISO 29500.|
||This signifier assumes the usual name of the main part of an XLSX file. The target declaration will occur in the top-level Relationships part (\_rels\.rels part in an OPC package of a DOCX file, as an attribute of a <Relationship> element within the <Relationships> element. In a Transitional DOCX, it will be the target of a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. See root namespace and source relationship for Main Document Part in ISO/IEC 29500-4:2012, §9.1.10, which refers to ISO/IEC 29500-1:2012, §11.3.10.|
This description uses filenames (e.g., core.xml) that are used by most, if not all, implementations. As parts are defined by their content type in the mandatory [Content_Types].xml file part, use of these names is conventional rather than mandatory.
Relationship between DOCX and binary .doc format: Conversion from the binary .doc format to DOCX using the Save As operation in Microsoft Word is designed to have 100 percent fidelity. For Word 2007, the formats should be equivalent. Features added since Word 2007 will usually not be supported in the binary format; when converting from DOCX to .doc, later versions of Word will attempt to "down-convert" to supported features and will present a compatibility check that indicates which features will be converted or lost.
Conversion between DOCX and ODT: Acknowledging the interest in whether conversion between DOCX and ODT (OpenDocument Format word-processing) files could be reliable, ISO started a work item to explore this issue. ISO/IEC TR 29166:2011 Information technology -- Document description and processing languages -- Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats is the output of that expert working group. The report documents the challenges of translation between OOXML and ODF formats, including the word-processing formats, based on the standards as documented at the time. This report, available from ITTF, describes features and functionality for the three primary types of office document and characterizes the translatability of features and functions as high, medium, or low. The challenges are significant since the two formats use different underlying models. Although simple documents can be effectively converted, a round-trip to an identical document should never be expected. Display differences will be common after conversion, most of no semantic significance, but many resulting in different pagination or spacing. Among the features that are particularly problematic for conversion, and could lead to problems of more substance, are:
Microsoft documents how it handles features that do not correspond when the Save As .odt feature is used in Differences between the OpenDocument Text (.odt) format and the Word (.docx) format.
The Open Source Business Alliance (OSBA) has a crowd-funded project to improve the handling of OOXML files within LibreOffice and Apache OpenOffice. Funding is provided by interested institutions. Phase 1, completed in September 2013, emphasized the visual presentation of documents and covered formatting of borders, tables, lists, and comments and embedding of fonts. A proposed specification for Phase 2 was published in Spring 2014. This includes application enhancements to function more like Microsoft Word, particularly for mail-merge, and production of a revised, more complete specification for change-tracking markup within the ODF format.
When considering tools for conversion from OOXML to ODF, it is important to understand which version of ODF is the target. Significant extensions to the standard have been made since ODF 1.1, but ODF 1.1 is the only version that has completed the ISO/IEC standardization process as of August 2014, with some amendments and corrections. ODF 1.2 is nearing approval as an ISO standard. Office 2013 supports export to ODF 1.2, but without change tracking. ODF 1.3 is already in the works, and LibreOffice offers the option to save as "1.2 Extended." See Wikipedia entry for Open Document Format and ODF Implementer Notes from LibreOffice Development wiki. The compilers of this resource believe that some of the amendments and features added in new versions of ODF are expected to improve the fidelity of conversion when supported in conversion tools but have no direct experience. New editions of ISO/IEC 29500 were published in 2011 and 2012; however, the changes were primarily corrections and clarifications to reflect DOCX documents as produced in practice. Of more relevance in relation to fidelity of conversion is whether a document includes any of the few new features introduced in recent versions of Word and marked up in the Markup Compatibility and Extensibility namespace (MCE/OOXML_2012). Microsoft has documented these extensions in [MS-DOCX] Word Extensions to the Office Open XML (.docx) File Format.
The original DOCX specification was published in ECMA-376, Part 1 in 2006. Between then and 2012, the main change to the specification for WordprocessingML has been the split between WML Strict (as defined in Part 1) and WML Transitional (as defined in Part 4 in conjunction with Part 1). Editions of ISO/IEC 29500 and ECMA 376 between 2008 and 2012 related to wordprocessingML have primarily been corrections and clarifications. The chronology of editions specifying DOCX/OOXML_2012 is:
Another chronology of relevance to digital archivists is the support for DOCX in different versions of the Office software. Files created by the Microsoft Office applications have a /docProps/app.xml part that contains properties for the document as a whole, including <Application> and <AppVersion>. Values for AppVersion are numeric, representing internal version numbers used by Microsoft during development: 12 = Windows Office 2007 or Office for Mac 2008; 14 = Windows Office 2010 or Mac Office 2011; 15 = Office 2013.