whyxar
(Written by Rob Braun, originally posted at http://www.opendarwin.org/~bbraun/whyxar.html)
A few years ago, some coworkers and I came up with the idea of a new archive format, which eventually manifested itself as XAR, currently standing for eXtensible ARchive format. On the surface, there would appear to be no shortage of archive formats out there, and no need to create yet another. Xar, however, has a number of benefits over existing formats, including easy extraction of arbitrary data, metadata at the beginning of the file, and of course extensibility.
First, what is xar? Xar is fundamentally a file format that can contain other files. We've built an implementation of an archiver based on the xar format in the form of a library and a command line tool which uses it. When limitations are discussed, it is important to understand where the restriction lies: in the command line tool, the library, or the file format. Basically, xar's format consists of a small binary header, an XML document, and the payload, in that order. The binary header contains information about the XML document, such as whether it is compressed, what its hash is, what hashing algorithm was used, etc. The XML document is the interesting part, as it contains all the information about files contained in the archive. Using XML allows for a level of extensibility and lets xar leverage existing tools. Following the XML document, often loosely referred to as the table of contents, is the heap. The heap contains individually compressed files and other data (such as extended attributes, resource forks, etc.), linearly appended. The offset, length, and compression type used to store each file in the heap are recorded in the XML header.
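The header, XML document, heap ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not spell out the header fields, so the layout used here (a big-endian magic number, header size, version, compressed and uncompressed XML lengths, and a checksum algorithm id) and the use of zlib for the XML document are assumptions, and `build_archive` and `read_toc` are hypothetical helper names, not part of the xar library.

```python
import struct
import zlib

# Assumed header layout, big-endian: magic, header size, version,
# compressed TOC length, uncompressed TOC length, checksum algorithm id.
HEADER = struct.Struct(">IHHQQI")
MAGIC = 0x78617221  # the ASCII bytes "xar!"


def build_archive(toc_xml, heap):
    """Lay out header, compressed XML document, and heap, in that order."""
    compressed_toc = zlib.compress(toc_xml)
    header = HEADER.pack(
        MAGIC, HEADER.size, 1, len(compressed_toc), len(toc_xml), 1
    )
    return header + compressed_toc + heap


def read_toc(archive):
    """Return the XML document and the offset at which the heap begins."""
    magic, size, _version, toc_clen, toc_ulen, _cksum = HEADER.unpack_from(archive, 0)
    if magic != MAGIC:
        raise ValueError("not a xar-style archive")
    toc = zlib.decompress(archive[size:size + toc_clen])
    if len(toc) != toc_ulen:
        raise ValueError("table of contents length mismatch")
    return toc, size + toc_clen
```

Because the header records the length of the XML document, a reader can fetch the table of contents and learn where the heap starts without touching any file data.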
Xar's XML header allows it to contain arbitrary metadata about files contained within the archive. In addition to the standard unix file metadata such as the size of the file and its modification and creation times, xar can store information such as ext2fs and hfs file bits, unix flags, references to extended attributes, Mac OS X Finder information, Mac OS X resource forks, and hashes of the file data. But that is just the beginning of the possibilities. Beyond file preservation, it is possible to store information that is frequently queried by users of the archive. Since the archiver reads files as they are archived, it can introspect file contents and store additional information about them in the XML header without causing extra IO. Examples of this would be storing a script's interpreter, the libraries an executable links against, and even MP3 ID tag information and JPEG EXIF information. Users of the archive have quick and easy access to the XML header, and do not need to expand the entire archive to get at the metadata associated with their files.
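As a rough sketch of what querying that metadata might look like, the snippet below searches a small table of contents for a per-file property using only Python's standard XML parser. The sample document and its element names, in particular `interpreter`, are made up for illustration and are not a statement of xar's actual schema; `files_with` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative table of contents; element names are assumptions.
SAMPLE_TOC = """<xar>
  <toc>
    <file id="1">
      <name>build.sh</name>
      <mode>0755</mode>
      <interpreter>/bin/sh</interpreter>
    </file>
    <file id="2">
      <name>notes.txt</name>
      <mode>0644</mode>
    </file>
  </toc>
</xar>"""


def files_with(toc_xml, tag):
    """Map file name to the text of `tag` for every file entry that has it."""
    root = ET.fromstring(toc_xml)
    found = {}
    for entry in root.iter("file"):
        value = entry.findtext(tag)  # looks at direct children only
        if value is not None:
            found[entry.findtext("name")] = value
    return found


print(files_with(SAMPLE_TOC, "interpreter"))
```

Nothing here reads file data: the question "which files are shell scripts?" is answered from the XML alone.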
In addition to storing information about the contents of files, the XML header can also store arbitrary user data. This makes it a convenient platform to build tools on, such as a packaging format. For example, Darwinbuild, a tool used for building Mac OS X sources outside of Apple, can use xar to store information about the archive as a whole. It uses xar to archive the results of a build, and can store information such as whether or not the contents are symboled, when they were built, what was used to build them, who built them, what is required to successfully run the contents of the archive, the location of the sources used to build the archive, what their hash was at the time the sources were fetched, etc. All of this information can be quickly obtained from the XML header of the archive, and a particular application's additions to the XML header can be quickly and easily extracted and used for something like a receipt.
Files in xar are individually compressed. This allows for quick extraction of individual files without the extra disk space requirements and CPU usage of extracting the entire archive, as is needed with a compressed tar archive. This makes xar useful for quick restores of accidentally deleted or overwritten files from a backup archive. Additionally, this means xar can use a different compression method for each file in the archive. For instance, it might not be a good idea to try to compress an already compressed file, but a large file might benefit greatly from using bzip2, whereas a small text file would be better served by gzip.
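To make the per-file compression idea concrete, here is a minimal sketch in Python. It is not xar's implementation: the in-memory `index` dictionary stands in for the XML header, the codec labels are illustrative rather than xar's encoding names, and `pack` and `extract` are hypothetical helpers.

```python
import bz2
import gzip

# Hypothetical codec table: label -> (compress, decompress).
CODECS = {
    "none": (lambda b: b, lambda b: b),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def pack(files):
    """files: list of (name, data, codec). Returns (index, heap)."""
    index, heap = {}, bytearray()
    for name, data, codec in files:
        stored = CODECS[codec][0](data)
        # Record where this member lives, as the XML header would.
        index[name] = {"offset": len(heap), "length": len(stored), "codec": codec}
        heap += stored  # members are appended linearly
    return index, bytes(heap)


def extract(index, heap, name):
    """Decompress one member using only its own slice of the heap."""
    entry = index[name]
    raw = heap[entry["offset"]:entry["offset"] + entry["length"]]
    return CODECS[entry["codec"]][1](raw)
```

Extracting one member only decompresses that member's bytes, which is the contrast with a compressed tar archive, where the stream has to be decompressed up to the wanted file.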
There are many archive formats out there, but none supports the level of extensibility, the richness of data archived, or the focus on end user extraction that xar has. Plus it's just cool to say.
Feb. 27 2006