HathiTrust

Rupert Gatti edited this page Jan 28, 2021 · 1 revision

Link:

www.hathitrust.org

Summary:

HathiTrust Digital Library is a digital preservation repository and access platform. HathiTrust provides long-term preservation and access services for digitised content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house member institution initiatives. Items in the public domain are in full view for everyone and items held in copyright are searchable. The members ensure the reliability and efficiency of the digital library by relying on community standards and best practices, developing policies and procedures to manage content and services at scale, and maintaining a modular, open infrastructure.

Format types:

TIFF ITU G4, JP2 (JPEG 2000 Part 1), and Unicode OCR with and without coordinates. HathiTrust does not support schemas that describe publication structures (e.g. DocBook, TEI, EPUB) or derivative image formats (JPEG or PNG).

Third-party content support:

No specific processes specified.

Features:

Two synchronised instances of storage with wide geographic separation (located in datacentres in Ann Arbor, MI and Indianapolis, IN), and an encrypted tape backup with 6 months of previous-version retention (located in a third datacentre several miles from the Ann Arbor storage instance). The need for continuous integrity checking is fundamental to HathiTrust’s data management strategy and underlies the choice of online (spinning magnetic disk) media for primary storage. Internally, each storage instance uses N+3 Reed-Solomon parity redundancy, which is analogous to, but more fault-tolerant than, conventional RAID 5 storage because of the additional parity redundancy. The storage system internally performs in-flight data integrity checks as well as periodic integrity checks of all at-rest data, and uses parity redundancy to permanently repair any errors encountered. External to the storage system, HathiTrust also conducts periodic validation of data against stored checksums to ensure that data has been ingested correctly and remains intact. Storage equipment is typically refreshed every 4-5 years. The storage system is modular and virtualised, with files split into blocks that are distributed across the nodes of a cluster and automatically redistributed as needed to balance storage utilisation evenly. Storage nodes that have reached retirement age may be removed from the cluster with an administrative command, and new nodes may be added, with all movement of data managed internally while employing the in-flight integrity checks described earlier.
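
The external checksum validation described above can be illustrated with a minimal Python sketch. It is not HathiTrust's implementation: the function names `file_checksum` and `validate`, the manifest layout (relative file name mapped to a hex digest), and the choice of SHA-256 are all assumptions made for illustration, since this page does not say which algorithm or manifest format HathiTrust uses.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use.

    The default algorithm (SHA-256) is an assumption for this sketch.
    """
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or whose checksum differs.

    `manifest` is a hypothetical structure mapping a file name (relative to
    `root`) to the checksum recorded when the file was ingested.
    """
    failures = []
    for relative_name, expected in manifest.items():
        target = root / relative_name
        if not target.is_file() or file_checksum(target) != expected.lower():
            failures.append(relative_name)
    return failures
```

Running `validate` on a schedule and reporting any non-empty result would approximate a periodic check of this kind. Repairing a damaged file, for example from the second storage instance or the tape backup, is outside the scope of the sketch.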

Costs:

Tier-based fee system calculated on a cost-per-volume basis and total library expenditure, beginning at $7,146 per year.