Sustainability of Digital Formats: Planning for Library of Congress Collections


WARC, Web ARChive file format


Identification and description

Full name WARC (Web ARChive) file format
Description

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content already recorded by ARC, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources.

A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines. The header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded in or linked from HTML pages.
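
The record layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming implementation: the `WARC/1.0` version line, the CRLF line endings, and the header field names (`WARC-Type`, `WARC-Date`, `WARC-Record-ID`, `Content-Length`) are taken from a reading of ISO 28500 rather than from this description, and the function names `write_record` and `read_records` are hypothetical. It handles uncompressed data only.

```python
import io

CRLF = b"\r\n"

def write_record(rec_type, date, record_id, block):
    """Serialize one record: header lines, blank line, content block, two CRLFs."""
    header = [
        b"WARC/1.0",  # version line (assumed from the standard)
        b"WARC-Type: " + rec_type.encode("ascii"),
        b"WARC-Date: " + date.encode("ascii"),
        b"WARC-Record-ID: " + record_id.encode("ascii"),
        b"Content-Length: " + str(len(block)).encode("ascii"),
    ]
    return CRLF.join(header) + CRLF + CRLF + block + CRLF + CRLF

def read_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # tolerate stray blank lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b"\n", b""):
                break  # blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length lets the reader take the block without scanning it,
        # so the block may hold arbitrary binary data.
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the two CRLFs that terminate the record
        yield version.strip().decode("ascii"), headers, block
```

Because a file is just records laid end to end, concatenating the output of two `write_record` calls and passing it to `read_records(io.BytesIO(data))` yields both records back in order.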

Production phase Used for web-accessible content in its archived state, representing the final form disseminated over the web to a user agent (web browser).
Relationship to other formats
    May contain Data of various types; see Notes below
    May have component CDX_Index, CDX Internet Archive Index File
    Has earlier version ARC_IA, Internet Archive ARC file format.

Local use

LC experience or existing holdings LC's web harvesting activities capture web sites in the WARC format. LC also has web archives in the predecessor ARC_IA format.
LC preference

The Library of Congress Recommended Formats Statement (RFS) lists WARC as the Preferred format for web archives.


Sustainability factors

Disclosure Open standard, publicly documented, developed under the auspices of the International Internet Preservation Consortium. Submitted in May 2005 as a work item through ISO TC46/SC4, it was approved as an International Standard in May 2009. ISO TC46/SC4/WG12, convened by the Bibliothèque nationale de France, is the working group responsible for maintenance.
    Documentation ISO 28500:2009, Information and documentation -- WARC file format is available from ISO for purchase. The draft standard that was the basis for approval, ISO/DIS 28500, is at http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
Adoption The file format was designed to support the requirements of members of the International Internet Preservation Consortium.
    Licensing and patents None.
Transparency The WARC wrapper is transparent. Contained data harvested from the Web may be in any format. Transparency varies by format.
Self-documentation In the WARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by basic information about the document.
External dependencies User access depends on large-scale indexing of a corpus.
Technical protection considerations None.

Quality and functionality factors

Web Archive
Normal rendering Supported through Internet Archive's Wayback Machine or equivalent tool.
Documentation of harvesting context Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc.
Efficiency at scale Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting.
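
As a rough illustration of indexing from headers alone, the following Python sketch scans an uncompressed WARC stream, reads only the record headers, skips each content block using its declared length, and builds a sorted list that can be searched by URL and date. The field names `WARC-Target-URI`, `WARC-Date`, and `Content-Length` come from a reading of ISO 28500, not from this description; the function names and the in-memory index layout are hypothetical and are not the CDX format. It also assumes dates are written in a fixed-width UTC form so that they sort correctly as strings.

```python
import bisect
import io

def make_record(url, date, body):
    """Build a small sample 'resource' record for demonstration only."""
    head = ("WARC/1.0\r\nWARC-Type: resource\r\nWARC-Target-URI: %s\r\n"
            "WARC-Date: %s\r\nContent-Length: %d\r\n\r\n" % (url, date, len(body)))
    return head.encode("utf-8") + body + b"\r\n\r\n"

def index_records(stream):
    """Return a sorted list of (url, date, offset) built from record headers only."""
    entries = []
    while True:
        offset = stream.tell()  # byte position where this record would start
        line = stream.readline()
        if not line:
            break
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Jump over the content block without reading it.
        stream.seek(int(headers["content-length"]), io.SEEK_CUR)
        if "warc-target-uri" in headers:
            entries.append((headers["warc-target-uri"], headers["warc-date"], offset))
    entries.sort()
    return entries

def lookup(entries, url, date):
    """Offset of the latest capture of url at or before date, else None."""
    i = bisect.bisect_right(entries, (url, date, float("inf")))
    if i and entries[i - 1][0] == url:
        return entries[i - 1][2]
    return None
```

A caller would keep the index separately from the archive and use the returned offset to seek directly to the wanted record, which is the access pattern that replay tools rely on.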
Support for stewardship WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | warc | WARC files are not typically transmitted to users or used in ways that depend on recognition by file type.
Internet Media Type | application/warc |
Pronom PUID | fmt/289 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/289.
Wikidata Title ID | Q7978505 | See https://www.wikidata.org/wiki/Q7978505.
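
Since the filename extension is not a dependable signal, a tool may prefer to inspect the leading bytes. The Python sketch below is one hedged way to do that. It assumes that a WARC file begins with a version line starting `WARC/` (per a reading of ISO 28500) and that compressed files use gzip, whose magic number is `1f 8b`. Neither detail is stated in the table above, the function name `looks_like_warc` is hypothetical, and this is not the PRONOM signature itself.

```python
import gzip
import io
import zlib

def looks_like_warc(data):
    """Guess whether a byte string starts a WARC file, gzip-compressed or not."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        try:
            # Decompress only enough to see the first five bytes.
            data = gzip.GzipFile(fileobj=io.BytesIO(data)).read(5)
        except (OSError, EOFError, zlib.error):
            return False  # not valid gzip, or truncated too early
    return data[:5] == b"WARC/"
```

A positive result only suggests the format; a validating parser is still needed to confirm that the records are well formed.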

Notes

General The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers.
History An HTML version of WARC File Format (Version 0.9) is at http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html. Subsequent drafts are also available at http://archive-access.sourceforge.net/warc/ in various formats.

Format specifications


Useful references

URLs


Last Updated: 02/09/2024