Sustainability of Digital Formats: Planning for Library of Congress Collections


Microsoft Compound File Binary File Format, Version 3


Identification and description

Full name Microsoft Compound File Binary File Format, Version 3
Description

Microsoft Compound File Binary (CFB) file format is also known as the Object Linking and Embedding (OLE) or Component Object Model (COM) structured storage compound file implementation binary file format.  CFB implements a simplified file system through a hierarchical collection of storage objects and stream objects. A storage object is comparable to a file system directory in that just as a directory can contain other directories and files, a storage object can contain other storage objects and stream objects. A parent storage object can also track the locations and sizes of the child storage object and stream objects nested beneath it. A stream object is comparable to a file in that a stream contains user-defined data stored as a consecutive sequence of bytes. A compound file consists of the root storage object with optional child storage objects and stream objects in a nested hierarchy.
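
The storage and stream hierarchy described above can be pictured with a small in-memory model. This is only an illustrative sketch: the class names, the `list_streams` helper, and the object names in the example are hypothetical and do not reflect the on-disk layout or any Microsoft API.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """A stream object: a named, consecutive sequence of bytes."""
    name: str
    data: bytes = b""


@dataclass
class Storage:
    """A storage object: may hold child storages and streams."""
    name: str
    storages: list = field(default_factory=list)
    streams: list = field(default_factory=list)


def list_streams(storage, prefix=""):
    """Yield (path, size) for every stream nested beneath `storage`."""
    path = f"{prefix}/{storage.name}"
    for stream in storage.streams:
        yield f"{path}/{stream.name}", len(stream.data)
    for child in storage.storages:
        yield from list_streams(child, path)


# Hypothetical compound file: a root storage with one stream and one
# child storage that holds a second stream.
root = Storage(
    "Root",
    streams=[Stream("stream_a", b"abc")],
    storages=[Storage("Child", streams=[Stream("stream_b", b"12345")])],
)

for path, size in list_streams(root):
    print(path, size)
```

The parent storage reports the location (here, a path) and size of each nested stream, which mirrors the tracking role the description assigns to storage objects.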

The purpose of structured storage is to reduce the performance penalties and overhead associated with storing separate objects in a flat file. Structured storage solves performance problems by eliminating the need to totally rewrite a file whenever a new object is added, or an existing object increases in size. The new data is written to the next available free location in the file, and the storage object updates an internal structure that maintains the locations of its storage objects and stream objects. At the same time, structured storage enables end users to interact and manage a compound file as if it were a single file rather than a nested hierarchy of separate objects.

There are two active versions of CFB, version 3 and version 4. One major distinction between the versions is the sector size: 512 bytes for version 3 and 4096 bytes for version 4.

A compound file is divided into equal-length sectors, the smallest addressable unit of a disk. The first sector contains the compound file header. Subsequent sectors are identified by a 32-bit non-negative integer number, called the sector number. A group of sectors can form a sector chain, which is a linked list of sectors forming a logical byte array, even though the sectors can be in non-consecutive locations in the compound file. The main structure used to manage sector allocation and sector chains is the file allocation table (FAT). The FAT contains an array of 32-bit sector numbers, where the index represents a sector number, and its value represents the next sector in the chain, or a special value. This allows a compound file to contain many sector chains in a single file.
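
As a minimal sketch of how a sector chain can be read out of the FAT, the function below walks from a starting sector until it meets the end-of-chain marker. The function name `follow_chain` and the sample FAT are hypothetical; the two special values are taken from the [MS-CFB] specification rather than from this description.

```python
ENDOFCHAIN = 0xFFFFFFFE  # special FAT value: last sector of a chain
FREESECT = 0xFFFFFFFF    # special FAT value: unallocated sector


def follow_chain(fat, start):
    """Return the sector numbers of the chain that begins at `start`.

    `fat` is a list of 32-bit values where fat[n] names the sector
    that follows sector n, or holds a special value.
    """
    chain = []
    seen = set()
    current = start
    while current != ENDOFCHAIN:
        # Out-of-range values (including FREESECT) and loops both
        # indicate a damaged chain.
        if current >= len(fat) or current in seen:
            raise ValueError(f"corrupt sector chain at sector {current}")
        seen.add(current)
        chain.append(current)
        current = fat[current]
    return chain


# Hypothetical FAT holding two interleaved chains in one file.
fat = [2, ENDOFCHAIN, 4, 5, 1, ENDOFCHAIN]
print(follow_chain(fat, 0))  # sectors 0, 2, 4, 1 in non-consecutive order
print(follow_chain(fat, 3))  # sectors 3, 5
```

The loop check is a defensive choice for this sketch: a FAT entry that points back into its own chain would otherwise never reach the end-of-chain marker.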

The size of every structure within a compound file must be known when the compound file is transmitted or retrieved. For this reason, CFB is not recommended for real-time streaming, progressive rendering, or open-ended data protocols where the size of streams is unknown at the time of transmission.

The minimum size of a compound file is three sectors: one header, one FAT sector and one directory sector.

  • A 512-byte sector compound file MUST be no greater than 2 GB in size for compatibility reasons. This means that every stream, including the directory entry array and mini stream, inside a 512-byte sector compound file must be less than 2 GB in size.
  • The maximum number of directory entries (storage objects, stream objects, and unallocated objects) in a 512-byte sector compound file is limited by the 2 GB file size, resulting in slightly less than 16 million directory entries.
  • The maximum size of the mini stream is in general slightly less than 256 GB, but in a 512-byte sector compound file it is limited by the 2 GB file size.

A file in version 3 of the CFB format begins with a 512-byte header laid out as follows:

  • Header Signature for the CFB format with 8-byte Hex value D0CF11E0A1B11AE1. Gary Kessler notes that the beginning of this string looks like "DOCFILE".
  • 16 bytes of zeros
  • 2-byte Hex value 3E00 indicating CFB minor version 3E. The specification states that the minor version should always be indicated as 3e.
  • 2-byte Hex value 0300 indicating CFB major version 3.
  • 2-byte Hex value FEFF (the value 0xFFFE stored in little-endian order) indicating little-endian byte order for all integer values. This byte order applies to all CFB files.
  • 2-byte Hex value 0900 indicating a sector shift of 9, that is, the 512-byte sector size used for major version 3.
  • 480 bytes for remainder of the 512-byte header, which fills the first sector for a CFB of major version 3
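
The header layout listed above can be checked with a short parser. This is a sketch under stated assumptions: the function name `parse_cfb_header_prefix` and the dictionary keys are hypothetical, and only the first 32 bytes are interpreted; the remaining 480 bytes are ignored here.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header_prefix(data):
    """Decode the first 32 bytes of a CFB header into a dict."""
    if len(data) < 32:
        raise ValueError("need at least 32 bytes of header")
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not a CFB file: signature mismatch")
    # Bytes 8..23 are the 16 bytes of zeros; four little-endian
    # 16-bit integers follow at byte offset 24.
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    return {
        "minor_version": minor,
        "major_version": major,
        "byte_order": byte_order,
        "sector_shift": sector_shift,
        "sector_size": 1 << sector_shift,
    }


# Synthetic version 3 header built from the values listed above.
header = (
    CFB_SIGNATURE
    + bytes(16)
    + bytes.fromhex("3E000300FEFF0900")
    + bytes(480)
)
info = parse_cfb_header_prefix(header)
print(info)
```

For the synthetic header, the decoded values are minor version 0x3E, major version 3, byte order 0xFFFE, and a sector size of 512 bytes.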

The structured storage profile of CFB formed the basis for the AAF specification and for the Microsoft Office Binary Formats that were the default formats for Word, PowerPoint, and Excel from products released in 1997 through 2004 (MS-DOC, MS-PPT, and MS-XLS).

Relationship to other formats
    Has subtype MSG, Microsoft Outlook Item
    Has subtype MS-DOC, Microsoft Office Word 97-2003 Binary File Format (.doc)
    Has subtype MS-PPT, Microsoft Office PowerPoint 97-2003 Binary File Format (.ppt)
    Has subtype MS-XLS, Microsoft Office Excel 97-2003 Binary File Format (.xls, BIFF8)
    Has later version CFB_4, Microsoft Compound File Binary File Format, Version 4
    Affinity to AAF_1_1, Advanced Authoring Format (AAF) Object, Version 1.1. Early versions of the AAF format detailed use of the structured storage systems outlined in CFB to store the objects on disk.

    Affinity to WPD, WordPerfect Document Family. According to WPD from Archiveteam.org, WordPerfect version 7 can also store documents known as "WordPerfect Compound File" using the Microsoft OLE Compound file format with the same WPD extensions. OLE embedded objects are stored inside a storage called PerfectOffice_OBJECT, whereas the real document part is now stored as stream PerfectOffice_MAIN. In principle the format of this internal document part is the same as in previous versions, but one difference is that the minor version number is raised from 1 to 2.

Local use

LC experience or existing holdings See various subtypes for holdings information.
LC preference See the Recommended Formats Statement for the Library of Congress format preferences.

Sustainability factors

Disclosure Fully documented. Proprietary file format developed by Microsoft.
    Documentation Microsoft [MS-CFB]: Compound File Binary File Format specification, available from Microsoft.
Adoption CFB is implemented in a wide range of Microsoft products including Office for Mac 1998 - 2008 and Windows operating systems NT 4.0 through Windows 10. See Exploring the Compound File Binary Format from 2009, which describes CFB as "the bread and butter for the Microsoft Office suite of applications for many years."
    Licensing and patents CFB is covered by Microsoft's Open Specification Promise.
Transparency Depends on subtype or implementation.
Self-documentation None
External dependencies None
Technical protection considerations

Because a compound file is stored as a single file in the file-system, normal file-system security mechanisms can be used to secure the compound file. This includes read/write permissions, Access Control List (ACL), and encryption (NTFS EFS or BitLocker) where appropriate. Some subtypes permit encryption and password protection of specified streams. See, for example, [MS-OFFCRYPTO], used for Microsoft Office Binary File Formats.


Quality and functionality factors


File type signifiers and format identifiers

Tag | Value | Note
Filename extension | Not applicable. | Depends on subtype.
Internet Media Type | See note. | Depends on subtype.
Magic numbers | Hex: D0 CF 11 E0 A1 B1 1A E1 | Documented in the CFB specification, in 2.2 Compound File Header. Applies to all files in CFB format; see GCK'S File Signatures Table entry for Compound Binary File format (aka OLECF).
File signature | Hex: 3E 00 03 00 FE FF 09 00 | At byte offset 24 from beginning of file. Documented in specification at 2.2 Compound File Header. This sequence indicates CFB (Compound File Binary format) major version 3, minor version 3e. The specification states that the minor version should always be indicated as 3e.

Notes

General

In addition to the Major Version field, which declares the version number in the header, the Sector Shift field specifies the sector size, which depends on the declared version. If Major Version is 3, then the Sector Shift must be 0x0009, specifying a sector size of 512 bytes.
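
The relationship between Sector Shift and sector size is a power of two, which a few lines of arithmetic make concrete. The helper names below are hypothetical. The version 4 shift value (0x000C, matching the 4096-byte sector size) and the offset rule, in which the header occupies the first sector so that sector number 0 starts one sector into the file, are taken from the [MS-CFB] specification rather than from this description.

```python
# Sector Shift required for each major version (assumption noted in
# the lead-in: the version 4 value comes from the specification).
EXPECTED_SECTOR_SHIFT = {3: 0x0009, 4: 0x000C}


def sector_size(sector_shift):
    """Sector size in bytes is 2 raised to the Sector Shift."""
    return 1 << sector_shift


def sector_offset(sector_number, sector_shift):
    """Byte offset of a sector; the header fills the first sector."""
    return (sector_number + 1) * sector_size(sector_shift)


def shift_matches_version(major_version, sector_shift):
    """True when the Sector Shift is the one the version requires."""
    return EXPECTED_SECTOR_SHIFT.get(major_version) == sector_shift


print(sector_size(0x0009))       # 512
print(sector_offset(0, 0x0009))  # 512, directly after the header
print(shift_matches_version(3, 0x000C))  # False
```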

In a compound file, all integer fields, including Unicode characters encoded in UTF-16, must be stored in little-endian byte order. The only exception is in user-defined data streams, where the compound file structure does not impose any restrictions.
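
A brief sketch of what the little-endian rule means when reading raw bytes follows. The helper `decode_utf16le_name` and the sample name are hypothetical, and cutting the text at the first null character is an assumption made for this illustration.

```python
import struct

# The two bytes 3E 00 decode to 0x003E when read as little-endian.
raw = bytes.fromhex("3E00")
little = struct.unpack("<H", raw)[0]  # correct reading for CFB
big = struct.unpack(">H", raw)[0]     # what a big-endian reader would see


def decode_utf16le_name(raw_name):
    """Decode UTF-16 little-endian bytes, stopping at the first null."""
    text = raw_name.decode("utf-16-le")
    return text.split("\x00", 1)[0]


# Hypothetical name encoded as UTF-16LE with a terminating null.
encoded = "Example".encode("utf-16-le") + b"\x00\x00"
print(hex(little), hex(big))
print(decode_utf16le_name(encoded))
```

Reading the same two bytes with the wrong byte order yields 0x3E00 instead of 0x003E, which is why a reader should apply little-endian decoding to every structural field.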

History  

Format specifications


Useful references

URLs


Last Updated: 11/28/2023