Sustainability of Digital Formats
 Planning for Library of Congress Collections


CSV, Comma Separated Values (RFC 4180)

Format Description Properties Explanation of format description terms

Identification and description

Full name CSV, Comma Separated Values (strict form as described in RFC 4180)
Description

CSV is a simple format for representing a rectangular array (matrix) of numeric and textual values. It is an example of a "flat file" format. It is a delimited data format in which fields/columns are separated by the comma character %x2C (Hex 2C) and records/rows/lines are separated by characters indicating a line break. RFC 4180 stipulates the use of CRLF pairs to denote line breaks, where CR is %x0D (Hex 0D) and LF is %x0A (Hex 0A). Each line should contain the same number of fields. Fields that contain a special character (comma, CR, LF, or double quote) must be "escaped" by enclosing them in double quotes (Hex 22); a double quote inside such a field is itself represented by two consecutive double quotes. An optional header line may appear as the first line of the file, with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. CSV commonly employs US-ASCII as its character set, but other character sets are permitted.
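
As an illustration of the strict form described above, the following is a minimal sketch using Python's standard csv module, configured with a comma separator, CRLF line breaks, and double-quote escaping. The sample rows and variable names are invented for illustration and are not drawn from RFC 4180.

```python
import csv
import io

# Invented sample data: a header line followed by two records.
rows = [
    ["id", "name", "note"],                # optional header line
    ["1", "Smith, Jane", 'said "hello"'],  # embedded comma and double quotes
    ["2", "Lee", "line one\r\nline two"],  # embedded line break
]

# newline="" keeps the CRLF pairs exactly as the writer emits them.
buffer = io.StringIO(newline="")
writer = csv.writer(
    buffer,
    delimiter=",",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,  # quote only fields containing special characters
    lineterminator="\r\n",
)
writer.writerows(rows)
encoded = buffer.getvalue()

# Reading the text back recovers the original field values.
parsed = list(csv.reader(io.StringIO(encoded, newline="")))
```

In this sketch only the fields that contain a comma, a double quote, or a line break are enclosed in double quotes, and the embedded double quotes are written as doubled characters.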

Production phase May be used at any stage in the lifecycle of a dataset.
Relationship to other formats
    Has modified version Variants of the strict form described here exist. See Notes below.

Local use

LC experience or existing holdings None in relation to collection holdings
LC preference None

Sustainability factors

Disclosure

A simple de facto format, for which no single, official specification exists. The strict variant of the format described here was registered with IANA for the text/csv MIME type in RFC 4180.

Any RFC that registers a MIME type must include a section documenting the "Published Specification." In RFC 4180, that section reads: "While numerous private specifications exist for various programs and systems, there is no single 'master' specification for this format. An attempt at a common definition can be found in Section 2 [of RFC 4180]."

Some Useful References provide variant specifications.

    Documentation IETF RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. 2005. Available at http://tools.ietf.org/html/rfc4180 or http://www.ietf.org/rfc/rfc4180.txt
Adoption

Widely used as an exchange format for tabular data. Although very limited in functionality, CSV is adequate for many data exchange or data preservation contexts, particularly when the syntax and semantics of fields are described in ancillary documentation that is also exchanged or preserved. CSV files can be imported and exported by almost any software designed for storing or manipulating data, including relational database systems, spreadsheet software, and statistical analysis software.

CSV is a preferred format for interchange in many contexts because it is so easy to process. Recommended Data Formats for Preservation Purposes in the Florida Digital Archive lists CSV as a format with a high confidence level of providing ongoing access in a usable form. CSV is a recommended format for data deposit with Library and Archives Canada, a supported format in MIT's DSpace implementation, and a recommended format for long-term retention by the State Archives of North Carolina. CSV was one of the primary formats into which the UK National Archives converted datasets that were selected for the National Digital Archive of Datasets between 1997 and 2010 (after which a government initiative promoting open data eliminated the need for such conversion by the National Archives). It is the preferred format for preparing tabular environmental data at the Oak Ridge National Laboratory, and its use for tabular data is a best practice for the DataONE (Data Observation Network for Earth) project. Most government open data initiatives offer CSV as one of the primary formats in which data can be downloaded. For example, CSV is one of the formats in which data from the U.S. Data.gov site can be downloaded. Other formats offered there include XML, ESRI shapefiles (ESRI_shape), and KML; the last two are for geospatial data.

    Licensing and patents None.
Transparency A simple text-based format that is very transparent, being both human-readable and easily machine-processable.
Self-documentation Poor. There is no internal capability to represent metadata, although the optional header row may provide some clues to the semantics of the columns. For preservation, an associated codebook is desirable, listing and describing the fields, and indicating types and ranges for field data values. In some contexts, the relevant information is supplied by documentation for a larger corpus or resource, rather than for each dataset.
External dependencies None.
Technical protection considerations None.
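
To illustrate the kind of codebook mentioned under Self-documentation, the following is a hedged sketch of how a list of fields with types and ranges might be checked against a CSV file using Python's standard csv module. The codebook layout, the field names and ranges, and the function check_against_codebook are all hypothetical; no specification prescribes them.

```python
import csv
import io

# Hypothetical codebook: field name -> (type converter, minimum, maximum).
# None means "no bound". These fields and ranges are invented for illustration.
CODEBOOK = {
    "station_id": (str, None, None),
    "temp_c": (float, -90.0, 60.0),
    "year": (int, 1900, 2100),
}


def check_against_codebook(csv_text, codebook):
    """Return a list of problem descriptions; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    if reader.fieldnames is None or set(reader.fieldnames) != set(codebook):
        return [f"header {reader.fieldnames} does not match the codebook"]

    problems = []
    for number, record in enumerate(reader, start=1):
        for field, (convert, low, high) in codebook.items():
            raw = record[field]
            try:
                value = convert(raw)
            except (TypeError, ValueError):
                problems.append(
                    f"record {number}: {field}={raw!r} is not {convert.__name__}"
                )
                continue
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                problems.append(
                    f"record {number}: {field}={value} is outside {low}..{high}"
                )
    return problems


# Invented sample: the second record has a non-numeric temperature and the
# third has a temperature and a year outside the codebook ranges.
sample = (
    "station_id,temp_c,year\r\n"
    "A1,21.5,2013\r\n"
    "B2,abc,2014\r\n"
    "C3,75.0,1850\r\n"
)
problems = check_against_codebook(sample, CODEBOOK)
```

Because the CSV file itself carries no type or range information, a check of this kind depends entirely on the separately maintained codebook.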

Quality and functionality factors

Dataset
Normal functionality An extremely simple format with limited capabilities. The format does not support strong data typing and is limited to representing a simple tabular structure.
Support for software interfaces (APIs, etc.) The simple nature of the CSV format allows easy programming for parsing and using the data.
Data documentation (quality, provenance, etc.) No support. Most guidelines for use of the format for archiving datasets call for data documentation in separate files in appropriate formats.
Beyond normal functionality None.
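
As a small illustration of the points above on normal functionality and software interfaces, the following sketch reads a file with a header line using Python's standard csv module. The sample data is invented. Every value arrives as a text string because the format has no data typing, so any numeric conversion is an explicit step taken by the consuming program.

```python
import csv
import io

# Invented sample data with a header line.
sample = "item,count\r\napples,3\r\npears,5\r\n"

# DictReader uses the header line to name the fields of each record.
records = list(csv.DictReader(io.StringIO(sample, newline="")))

# Values are plain strings; conversion to integers is done by the caller.
total = sum(int(record["count"]) for record in records)
```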

File type signifiers

Tag | Value | Note
Filename extension | .csv | No particular extension is specified or required, but .csv is often used.
Internet Media Type | text/csv | Registered with IANA via RFC 4180.

Notes

General

Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools such as those listed below as Useful References:

  • In locales where the comma character is used in place of a decimal point in numbers, the separator between fields/columns is often a semicolon.
  • The line break character may be CR or LF, not necessarily CRLF.
  • Some Unix-based applications may use a different escape mechanism for indicating that one of the separator characters occurs within a text value. The individual character is preceded by a backslash character rather than enclosing the entire string in double quotes.
  • Single quotes may be treated as equivalent to double quotes for escaping (also known as "text-qualification").
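
The variations listed above can be sketched with the formatting parameters of Python's standard csv module. This is an illustration under the assumption that the variant in use is known in advance; the helper name parse and the sample strings are invented.

```python
import csv
import io


def parse(text, **format_params):
    """Parse CSV text, passing formatting parameters through to csv.reader."""
    return list(csv.reader(io.StringIO(text, newline=""), **format_params))


# Semicolon separator, leaving the comma free as a decimal mark; LF-only breaks.
semicolon = parse("name;price\nwidget;3,50\n", delimiter=";")

# CR-only line breaks.
cr_only = parse("a,b\rc,d\r")

# Backslash escaping of an embedded separator instead of quoting the field.
backslash = parse("a\\,b,c\n", quoting=csv.QUOTE_NONE, escapechar="\\")

# Single quote used as the text qualifier.
single = parse("'x, y',z\n", quotechar="'")
```

None of these parameters is detected automatically in the sketch; a reader configured for one variant may silently misparse a file written in another.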

Several other caveats are worth noting:

  • The last record in a file may or may not end with a line break character.
  • In some implementations, non-printable characters may be included in text fields by using one of several C-style character escape sequences: \### or \o### Octal; \x## Hex; \d### Decimal; and \u#### Unicode.
  • The treatment of whitespace adjacent to field and record separators varies among applications. If whitespace at the beginning and end of a textual field value is significant, the text string should be text-qualified, i.e. enclosed in quotes.
  • In some uses, there is an assumption of strong data typing, with unquoted fields considered to be numeric, and quoted fields considered to be text data.
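
Several of the caveats above can be observed with Python's standard csv module, as in the following sketch. It shows the behavior of that one library only, and other applications may differ; the helper name parse and the sample strings are invented.

```python
import csv
import io


def parse(text, **format_params):
    """Parse CSV text, passing formatting parameters through to csv.reader."""
    return list(csv.reader(io.StringIO(text, newline=""), **format_params))


# The final line break is optional: both inputs yield the same records.
with_break = parse("a,b\r\nc,d\r\n")
without_break = parse("a,b\r\nc,d")

# Whitespace after a separator is kept by default and dropped on request.
kept = parse("x, y\r\n")
dropped = parse("x, y\r\n", skipinitialspace=True)

# Whitespace inside a quoted (text-qualified) field is preserved.
qualified = parse('x," y "\r\n')

# QUOTE_NONNUMERIC treats unquoted fields as numbers and quoted fields as text.
typed = parse('"abc",1.5\r\n', quoting=csv.QUOTE_NONNUMERIC)
```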
History  

Format specifications


Useful references

URLs


Last Updated: Monday, 03-Mar-2014 14:06:27 EST