Sustainability of Digital Formats
 Planning for Library of Congress Collections

ARC_IA, Internet Archive ARC file format

Format Description Properties

Identification and description

Full name ARC_IA, Internet Archive ARC file format.
Description Specifies a method for combining multiple digital resources into an aggregate archival file together with related information, used since 1996 by the Internet Archive to store 'web crawls' as sequences of content blocks harvested from the World Wide Web.
Production phase Used for web-accessible content in archived state, representing the final form disseminated over the web to a user agent (web browser).
Relationship to other formats
    May contain Data of various types, for example, HTML pages, images as GIF, JPEG, etc.
    Has later version WARC, Web ARChive file format

Local use

LC experience or existing holdings LC has substantial volumes of captured web sites in the ARC_IA format. See http://www.loc.gov/webarchiving/technical.html
LC preference LC's preferred format for Web sites harvested in bulk is now WARC. ARC is acceptable as a legacy format.

Sustainability factors

Disclosure Developed by the Internet Archive (Brewster Kahle). Documentation and tools for using files in the format are freely available.
    Documentation Described at http://www.archive.org/web/researcher/ArcFileFormat.php
Adoption The file format developed for the Heritrix web crawler, supported by the International Internet Preservation Consortium.
    Licensing and patents None.
Transparency The wrapper is transparent; contained data varies.
Self-documentation In the ARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by some header information about the document: the document file format, the document size, outward links that the document contains, etc. At the Internet Archive, each ARC file has a corresponding DAT file that contains only the header information.
External dependencies User access depends on large-scale indexing of a corpus of ARC files or a separate copy of the record headers (e.g. Internet Archive DAT files). Indexing the DAT files can support user access by URL and date, as in the Wayback Machine.
Technical protection considerations None.
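
As an illustration of the self-documentation described above, the following is a minimal sketch of reading the header line that precedes a document in an ARC file. It assumes the five space-separated fields of the published ARC version 1 URL record (URL, IP address, 14-digit archive date, content type, length in bytes); the class and function names are hypothetical and are not part of any Internet Archive tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Fields of one ARC record header line (assumed version 1 layout)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size of the content block that follows, in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse a single header line into its five fields."""
    # Split from the right so a URL containing stray spaces stays intact.
    url, ip, date, ctype, length = line.strip().rsplit(" ", 4)
    return ArcRecordHeader(
        url=url,
        ip_address=ip,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=ctype,
        length=int(length),
    )


# Illustrative header line; the values are made up for this example.
header = parse_arc_header(
    "http://example.org/ 192.0.2.1 20040315120000 text/html 1234"
)
```

A header-only listing comparable in purpose to a DAT file could be produced by applying such a parser to each record and discarding the content blocks, although the actual DAT layout is not covered here.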

Quality and functionality factors

Web Archive
Normal rendering Supported through Internet Archive's Wayback Machine or equivalent tool.
Documentation of harvesting context Allows for basic information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, etc.
Efficiency at scale Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The use of coordinated ARC and DAT files is one way to support efficient indexing for such access.
Support for stewardship The capabilities in ARC that support long-term management of a corpus of web archive files are basic. WARC was developed as an extension to ARC, in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.
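
The access by URL and date mentioned under Efficiency at scale can be sketched as follows. This is a simplified illustration, not the Wayback Machine's implementation: it assumes an uncompressed stream of records whose header lines follow the ARC version 1 layout (URL, IP address, 14-digit date, content type, length), each followed by a content block of the stated length and a newline. The helper names build_index, lookup, and make_record are hypothetical.

```python
import bisect
import io
from collections import defaultdict


def build_index(stream):
    """Map each URL to a sorted list of (archive_date, byte_offset) pairs."""
    index = defaultdict(list)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.strip():
            continue  # blank separator line between records
        url, _ip, date, _ctype, length = (
            line.decode("latin-1").strip().rsplit(" ", 4)
        )
        index[url].append((date, offset))
        stream.seek(int(length), io.SEEK_CUR)  # skip the content block
    for captures in index.values():
        captures.sort()  # 14-digit dates sort chronologically as strings
    return index


def lookup(index, url, date):
    """Return the offset of the latest capture at or before date, else None."""
    captures = index.get(url, [])
    pos = bisect.bisect_right(captures, (date, float("inf")))
    return captures[pos - 1][1] if pos else None


def make_record(url, date, body):
    """Build one synthetic record for demonstration purposes only."""
    header = f"{url} 192.0.2.1 {date} text/html {len(body)}\n"
    return header.encode("latin-1") + body + b"\n"


sample = io.BytesIO(
    make_record("http://example.org/", "20040101000000", b"<p>first</p>")
    + make_record("http://example.org/", "20050101000000", b"<p>second</p>")
)
index = build_index(sample)
offset = lookup(index, "http://example.org/", "20041231235959")
```

Because the index stores only dates and byte offsets, it stays small relative to the archive itself, which is the same motivation given above for keeping header information in separate DAT files. ARC files in practice are commonly stored compressed, which this sketch does not handle.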

File type signifiers

Tag | Value | Note
Filename extension | arc | ARC files are not typically transmitted to users or used in ways that depend on recognition by file type.

Notes

General  
History  

Format specifications


Useful references

URLs


Last Updated: 11/01/2013