From the [Internet Archive document](http://archive.org/web/researcher/ArcFileFormat.php)
Authors: Mike Burner and Brewster Kahle
Date: September 15, 1996, Version 1.0
Internet Archive
The Archive stores the data it collects in large (currently 100MB) aggregate files for ease of storage in a conventional file system. It is the Archive’s experience that it is difficult to manage hundreds of millions of small files in most existing file systems.
This document describes the format of the aggregate files. The file format was designed to meet several requirements:
- The file must be self-contained: it must permit the aggregated objects to be identified and unpacked without the use of a companion index file.
- The format must be extensible to accommodate files retrieved via a variety of network protocols, including http, ftp, news, gopher, and mail.
- The file must be "streamable": it must be possible to concatenate multiple archive files in a data stream.
- Once written, a record must be viable: the integrity of the file must not depend on subsequent creation of an in-file index of the contents.
The reader will quickly recognize, however, that an external index of the contents and object-offsets will greatly enhance the retrievability of objects stored in this format. The Archive maintains such indices, but does not seek to standardize their format.
The description below uses pseudo-BNF to describe the archive file format. By convention, archive files are named with a ".arc" extension (e.g., "IA-000001.arc").
    arc_file         == <version_block><rest_of_arc_file>
    version_block    == See definition below
    rest_of_arc_file == <doc>|<doc><rest_of_arc_file>
    doc              == <nl><URL-record><nl><network_doc>
    URL-record       == See definition below
    network_doc      == whatever the protocol returned
    nl               == Unix-newline-delimiter
    sp               == ' ' (ascii space; a comma is inappropriate as a delimiter because it can appear in a URL)
The version block identifies the original filename, file version, and URL record fields of the archive file.
    version-block   == filedesc://<path><sp><version specific data><sp><length><nl>
                       <version-number><sp><reserved><sp><origin-code><nl>
                       <URL-record-definition><nl>
                       <nl>

    version-1-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp><length><nl>
                       1<sp><reserved><sp><origin-code><nl>
                       URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Archive-length<nl>
                       <nl>

    version-2-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp>200<sp>-<sp>-<sp>0<sp><filename><sp><length><nl>
                       2<sp><reserved><sp><origin-code><nl>
                       URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Result-code<sp>Checksum<sp>Location<sp>Offset<sp>Filename<sp>Archive-length<nl>
                       <nl>
The "filedesc" line is a special-case URL record (see below). The path is the original path name of the archive file. The IP address is the address of the machine that created the archive file. The date is the date the archive file was created. The content type of "text/plain" simply refers to the remainder of the version block. The length specifies the size, in bytes, of the rest of the version block.
    version-number        == integer in ascii
    reserved              == string with no white space
    origin-code           == Name of gathering organization with no white space
    URL-record-definition == names of fields in URL records
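The length rule for the version block can be sketched in a few lines of Python. This is an illustration only: `build_version_block` and `FIELDS` are hypothetical names, and the sketch assumes a single line feed as the newline, a reserved value of "0", and (for version 2) a filename field equal to the path.

```python
# Field names written on the third line of the version block, per version.
FIELDS = {
    1: ["URL", "IP-address", "Archive-date", "Content-type", "Archive-length"],
    2: ["URL", "IP-address", "Archive-date", "Content-type", "Result-code",
        "Checksum", "Location", "Offset", "Filename", "Archive-length"],
}


def build_version_block(path, ip_address, date, origin_code, version=1, reserved="0"):
    """Return the bytes of a version block for format version 1 or 2."""
    # Everything after the filedesc line: version line, field names, blank line.
    rest = (
        f"{version} {reserved} {origin_code}\n"
        + " ".join(FIELDS[version]) + "\n"
        + "\n"
    ).encode("ascii")
    head = [f"filedesc://{path}", ip_address, date, "text/plain"]
    if version == 2:
        # Version 2 adds result code, checksum, location, offset, and filename.
        # Assumption: the filename is the same as the path.
        head += ["200", "-", "-", "0", path]
    # The length counts the bytes of the rest of the version block.
    head.append(str(len(rest)))
    return " ".join(head).encode("ascii") + b"\n" + rest


v1_block = build_version_block("IA-001102.arc", "0", "19960923142103", "Alexa Internet")
v2_block = build_version_block("IA-001102.arc", "0.0.0.0", "19960923142103",
                               "Alexa Internet", version=2)
print(v1_block.decode("ascii"))
```

Computing the length from the encoded bytes, rather than from the string, keeps the count in bytes as the format requires.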
The URL record introduces an object in the archive file. It gives the name and size of the object, as well as several pieces of metadata about its retrieval.
    URL-record-v1 == <url><sp><ip-address><sp><archive-date><sp><content-type><sp><length><nl>

    URL-record-v2 == <url><sp><ip-address><sp><archive-date><sp><content-type><sp><result-code><sp><checksum><sp><location><sp><offset><sp><filename><sp><length><nl>
    url          == ascii URL string (e.g., "http://www.alexa.com:80/")
    ip_address   == dotted-quad (e.g., 192.216.46.98 or 0.0.0.0)
    archive-date == date archived
    content-type == "no-type"|MIME type of data (e.g., "text/html")
    length       == ascii representation of size of network doc in bytes
    date         == YYYYMMDDhhmmss (Greenwich Mean Time)
    result-code  == result code or response code (e.g., 200 or 302)
    checksum     == ascii representation of a checksum of the data. The specifics of the checksum are implementation specific.
    location     == "-"|url of re-direct
    offset       == offset in bytes from beginning of file to beginning of URL-record
    filename     == name of arc file
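As a small illustration of the date field, the YYYYMMDDhhmmss layout maps directly onto a `strptime` pattern. The helper names below are hypothetical, and the sketch assumes that Greenwich Mean Time can be treated as UTC.

```python
from datetime import datetime, timezone


def parse_archive_date(value):
    """Convert a YYYYMMDDhhmmss string into a timezone-aware datetime."""
    # Assumption: GMT in the format description is treated as UTC here.
    return datetime.strptime(value, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def format_archive_date(moment):
    """Convert a timezone-aware datetime back into YYYYMMDDhhmmss."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


parsed = parse_archive_date("19961104142103")
print(parsed.isoformat())
```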
Note that all field values are ascii text. All fields have at least one character. No field value contains a space.
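Because no field value contains a space, a URL record can be split on single spaces and paired with the field names listed in the version block. The following is a minimal sketch, not a reference parser: `parse_url_record` is a hypothetical name, and converting only the length and offset fields to integers is a choice made for this illustration.

```python
def parse_url_record(line, field_names):
    """Split one URL record line into a dict keyed by the given field names."""
    values = line.rstrip("\n").split(" ")
    # Every field must be present and hold at least one character.
    if len(values) != len(field_names) or "" in values:
        raise ValueError(f"expected {len(field_names)} non-empty fields: {line!r}")
    record = dict(zip(field_names, values))
    # The numeric fields are ascii text in the file; convert them for convenience.
    for name in ("Archive-length", "Offset"):
        if name in record:
            record[name] = int(record[name])
    return record


# Field names as they appear on the third line of a version 2 version block.
names_v2 = ("URL IP-address Archive-date Content-type Result-code "
            "Checksum Location Offset Filename Archive-length").split()
record = parse_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html "
    "200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202\n",
    names_v2,
)
print(record["URL"], record["Archive-length"])
```

Taking the field names as a parameter lets the same function handle version 1 and version 2 records.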
In the following examples, remember that length includes carriage returns and line feeds.
Version 1 example:

    filedesc://IA-001102.arc 0 19960923142103 text/plain 76
    1 0 Alexa Internet
    URL IP-address Archive-date Content-type Archive-length

    http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
    HTTP/1.0 200 Document follows
    Date: Mon, 04 Nov 1996 14:21:06 GMT
    Server: NCSA/1.4.1
    Content-type: text/html
    Last-modified: Sat,10 Aug 1996 22:33:11 GMT
    Content-length: 30

    <HTML>
    Hello World!!!
    </HTML>

Version 2 example:

    filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain 200 - - 0 IA-001102.arc 122
    2 0 Alexa Internet
    URL IP-address Archive-date Content-type Result-code Checksum Location Offset Filename Archive-length

    http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
    HTTP/1.0 200 Document follows
    Date: Mon, 04 Nov 1996 14:21:06 GMT
    Server: NCSA/1.4.1
    Content-type: text/html
    Last-modified: Sat,10 Aug 1996 22:33:11 GMT
    Content-length: 30

    <HTML>
    Hello World!!!
    </HTML>
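The offset of 209 in the version 2 example can be checked by hand: it is the size of the filedesc line plus the declared length of the rest of the version block. The short calculation below assumes each line ends with a single line feed and that the first URL record starts immediately after the version block.

```python
# The filedesc line of the version 2 example, with its terminating newline.
filedesc_line = (b"filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain "
                 b"200 - - 0 IA-001102.arc 122\n")

# The last field is the length of the rest of the version block.
declared_length = int(filedesc_line.split()[-1])

# The first URL record begins right after the version block.
first_record_offset = len(filedesc_line) + declared_length
print(len(filedesc_line), declared_length, first_record_offset)  # 87 122 209
```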
As noted above, the best way to retrieve a specific object from an archive file is to maintain an external database of object names, the files they are located in, their offsets within the files, and the sizes of the objects. Then, to retrieve the object, one need only open the file, seek to the offset, and do a single read of <size> bytes.
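The open, seek, and read sequence might look like the following in Python. This is a sketch under stated assumptions: `read_object` is a hypothetical helper, and the offset and size are taken to come from an external index whose format this document does not define. The demonstration uses a throwaway file rather than a real archive.

```python
import os
import tempfile


def read_object(arc_path, offset, size):
    """Return `size` bytes starting at byte `offset` of the file at `arc_path`."""
    with open(arc_path, "rb") as arc:
        arc.seek(offset)
        data = arc.read(size)
    if len(data) != size:
        raise EOFError(f"wanted {size} bytes at offset {offset}, got {len(data)}")
    return data


# Demonstration on a temporary file; a real index would supply these values.
with tempfile.NamedTemporaryFile(suffix=".arc", delete=False) as tmp:
    tmp.write(b"0123456789<HTML>\nHello World!!!\n</HTML>\n")
    demo_path = tmp.name
try:
    demo_object = read_object(demo_path, 10, 30)
finally:
    os.remove(demo_path)
print(demo_object.decode("ascii"))
```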
Programs that need to read the file without an index (such as to unpack the whole file) should use buffered I/O. The URL record can then be read with an fgets(), and the objects can be read with an fread() of <size> bytes.
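The same loop can be sketched in Python, where `readline()` and `read()` on a buffered binary stream play the roles of fgets() and fread(). The name `iter_docs` is hypothetical. The sketch assumes that the length is always the last field of a record line and that empty lines between documents are separators. Because the filedesc line is a special-case URL record, the version block comes out as the first item.

```python
import io


def iter_docs(stream):
    """Yield (fields, body) pairs from a binary ARC stream, version block first."""
    while True:
        line = stream.readline()
        if not line:
            return                  # end of stream
        if line == b"\n":
            continue                # separator newline between documents
        fields = line.rstrip(b"\n").split(b" ")
        size = int(fields[-1])      # assumption: length is the last field
        body = stream.read(size)
        if len(body) != size:
            raise EOFError("truncated record")
        yield fields, body


# A tiny in-memory archive with lengths computed from the data itself.
version_rest = (b"1 0 Alexa Internet\n"
                b"URL IP-address Archive-date Content-type Archive-length\n\n")
page = b"<HTML>\nHello World!!!\n</HTML>\n"
sample = (
    b"filedesc://IA-001102.arc 0 19960923142103 text/plain %d\n" % len(version_rest)
    + version_rest
    + b"http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html %d\n" % len(page)
    + page
    + b"\n"
)
docs = list(iter_docs(io.BytesIO(sample)))
for fields, body in docs:
    print(fields[0].decode("ascii"), len(body))
```

Since each record carries its own length, the same loop also works on several archive files concatenated into one stream.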
Since the Archive format uses the standard URL specification to identify objects, it naturally lends itself to the storage of data retrieved via protocols other than HTTP. For example, a news article might appear as follows:
    news:[email protected] 127.10.100.3 19960929142103 text/plain 328
    Path: news.alexa.com!news1.best.com!news.dryswamp.edu!joebob
    From: [email protected]
    Newsgroups: alt.food
    Subject: Re: I am hungry
    Date: 28 SEP 96 21:02:47 GMT
    Organization: Dry Swamp University
    Lines: 1
    Message-ID: <[email protected]>
    NNTP-Posting-Host: alligator.dryswamp.edu