WARC, Web ARChive file format
Format Description Properties
- ID: fdd000236
- Short name: WARC
- Content categories:
- Format Category:
- Other facets:
- Last significant update:
- Draft status: Partial
Identification and description
||WARC (Web ARChive) file format
||The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
||Used for web-accessible content in archived state, representing the final form disseminated over the web to a user agent (such as a web browser).
|Relationship to other formats
|| Data of various types; see Notes below
||Has earlier version: Internet Archive ARC file format.
|LC experience or existing holdings
||LC's web harvesting activities capture web sites in the WARC format. LC also has web archives in the predecessor ARC_IA format.
||LC's preferred format for Web sites harvested in bulk is WARC.
Open standard, publicly documented, developed under the auspices of the International Internet Preservation Consortium. Submitted in May 2005 as a work item through ISO TC46/SC4, it was approved as an International Standard in May 2009. ISO TC46/SC4/WG12, convened by the Bibliothèque nationale de France, is the working group responsible for maintenance.
||ISO 28500:2009, Information and documentation -- WARC file format is available from ISO for purchase. The draft standard that was the basis for approval, ISO/DIS 28500, is at http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf.
The file format was designed to support the requirements of members of the International Internet Preservation Consortium.
| Licensing and patents
The wrapper is transparent; contained data varies.
In the WARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by basic information about the document.
User access depends on large-scale indexing of a corpus.
|Technical protection considerations
Quality and functionality factors
||Supported through Internet Archive's Wayback Machine or equivalent tool.
|Documentation of harvesting context
||Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc.
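As an illustration of how this harvesting context travels with each document, the following minimal sketch reads one record: a version line, named header fields, and a content block whose size is given by Content-Length. The field names (WARC-Type, WARC-Target-URI, WARC-Date, Content-Type, Content-Length) follow the published WARC specification as understood here; the sample record, the function name parse_warc_record, and the simplified parsing (no folded header lines, no compression) are assumptions made for illustration only.

```python
import io


def parse_warc_record(stream):
    """Read one WARC record: version line, named fields, then the content block.

    Simplified sketch: ignores folded (continuation) header lines and
    assumes the stream is positioned at the start of a record.
    """
    version = stream.readline().strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header section
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers.get("Content-Length", 0)))
    return version, headers, block


# Hypothetical sample: a "response" record wrapping a tiny HTTP response.
# Real records carry further fields (for example WARC-Record-ID).
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.org/\r\n",
    b"WARC-Date: 2013-04-04T10:21:29Z\r\n",
    b"Content-Type: application/http; msgtype=response\r\n",
    b"Content-Length: " + str(len(http_block)).encode("ascii") + b"\r\n",
    b"\r\n",
    http_block,
    b"\r\n\r\n",
])

version, headers, block = parse_warc_record(io.BytesIO(sample))

# The response code of the harvest transaction sits in the HTTP status line
# at the start of the content block.
status = block.split(b"\r\n", 1)[0].split()[1].decode("ascii")
print(headers["WARC-Date"], headers["WARC-Target-URI"], status)
```

In this sketch the harvest time and target come from the WARC header fields, while the response code and the MIME type of the document itself are read from the captured HTTP response inside the block.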
|Efficiency at scale
||Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting.
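The indexing described above can be sketched as follows: scan a WARC file once, keep only the record headers, and store the byte offset of each record keyed by URL and date, so that a later lookup can seek directly to the record. This is a simplified sketch over an uncompressed, in-memory file; the helper names (read_record, build_index, make_record) and the sample records are hypothetical, and production tools use their own index formats.

```python
import io


def read_record(stream):
    """Read the record at the current position; return (headers, block) or None at end of file."""
    version = stream.readline()
    if not version:
        return None
    if not version.startswith(b"WARC/"):
        raise ValueError("not a WARC record: %r" % version)
    headers = {}
    for line in iter(stream.readline, b""):
        if line in (b"\r\n", b"\n"):
            break  # a blank line ends the header section
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    block = stream.read(int(headers["Content-Length"]))
    stream.readline()  # each record ends with two CRLF pairs
    stream.readline()
    return headers, block


def build_index(stream):
    """Map (target URI, capture date) to the byte offset of the record."""
    index = {}
    while True:
        offset = stream.tell()
        record = read_record(stream)
        if record is None:
            break
        headers, _ = record
        uri = headers.get("WARC-Target-URI")
        if uri:
            index[(uri, headers.get("WARC-Date"))] = offset
    return index


def make_record(uri, date, body):
    """Build a minimal sample record (hypothetical; real records carry more fields)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        "WARC-Target-URI: %s\r\n"
        "WARC-Date: %s\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (uri, date, len(body))
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


archive = io.BytesIO(
    make_record("http://example.org/a", "2013-04-01T00:00:00Z", b"first")
    + make_record("http://example.org/b", "2013-04-02T00:00:00Z", b"second")
)
index = build_index(archive)

# Look up one capture by URL and date, then seek straight to it.
archive.seek(index[("http://example.org/b", "2013-04-02T00:00:00Z")])
found_headers, found_block = read_record(archive)
```

Because the index holds only header-derived keys and offsets, it can be stored apart from the much larger WARC files and consulted without reading their content blocks.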
|Support for stewardship
||WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.
File type signifiers
||WARC files are not typically transmitted to users or used in ways that depend on recognition by file type.
|Internet Media Type
Thursday, 04-Apr-2013 06:21:29 EDT