Synthesizing overlap data from HathiTrust
Physical books in our collection may not be available for checkout during the COVID-19 pandemic, so we provide links to full-text digital copies that patrons can check out from HathiTrust instead. This document explains where the data comes from and how it is processed to create links for checking out a digitized version of an item.
Thanks to Chad Nelson of Temple University Libraries for posting an article on his process, which we cribbed from heavily. The steps below are very similar to Chad's, with a couple of small changes:
Pare the large file down to just the needed data (htid and oclc) with:
- csvcut to limit to just the needed columns (htid and oclc)
- csvgrep to eliminate rows without required fields (compact)
- sort and uniq to eliminate duplicates (sort by htid, delete duplicate rows)
gunzip -c hathi_full_20200501.txt.gz | \
csvcut -t -c 1,8 -z 1310720 | \
csvgrep -c 1,2 -r ".+" | \
sort | uniq > hathi_full_dedupe_may.csv
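For environments without csvkit, roughly the same reduction can be sketched with awk. This is an illustrative alternative, not the pipeline used above: `pare_hathi` and the output file name in the usage comment are hypothetical, and the sketch assumes column 8 may hold several comma-separated OCLC numbers, in which case it emits one pair per number. csvcut, by contrast, keeps such a cell as a single quoted value and treats the first row as a header.

```shell
# Hypothetical helper: read tab-delimited Hathi rows on stdin and print
# "htid,oclc" pairs, skipping rows where either field is empty.
# Assumption: column 8 may list several OCLC numbers separated by commas.
pare_hathi() {
  awk -F'\t' '$1 != "" && $8 != "" {
    n = split($8, ocns, ",")          # one output pair per OCLC number
    for (i = 1; i <= n; i++)
      if (ocns[i] != "") print $1 "," ocns[i]
  }' | LC_ALL=C sort -u               # sort and drop duplicate pairs
}

# Usage (not run here):
# gunzip -c hathi_full_20200501.txt.gz | pare_hathi > hathi_pairs.csv
```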
Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers:
csvgrep -t -c 4 -r ".+" overlap_20200518_psu.tsv | \
csvcut -c 1 | sort | uniq > overlap_all_unique_may.csv
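The same extraction can be approximated without csvkit. A minimal sketch, assuming the overlap report is tab-delimited with a single header row and the OCLC number in column 1; `overlap_oclcs` is a hypothetical name. Unlike the csvkit version, it drops the header line rather than passing it through to the sort.

```shell
# Hypothetical helper: print the unique column-1 values of a tab-delimited
# report, keeping only rows whose column 4 is non-empty.
# Assumption: the first line of the file is a header and is skipped.
overlap_oclcs() {
  awk -F'\t' 'NR > 1 && $4 != "" { print $1 }' "$1" | LC_ALL=C sort -u
}

# Usage (not run here):
# overlap_oclcs overlap_20200518_psu.tsv > overlap_all_unique_may.csv
```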
Then filter the pared-down Hathi data, using the overlap OCLC numbers as the filter input:
csvgrep -c 2 -f overlap_all_unique_may.csv \
hathi_full_dedupe_may.csv > hathi_filtered_by_overlap_may.csv
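The filter is an exact-match lookup of each pair's OCLC number in the overlap list. A sketch of that lookup in awk, assuming plain htid,oclc lines with no quoted or comma-containing fields; `filter_by_overlap` is a hypothetical name, not part of csvkit.

```shell
# Hypothetical helper.
#   $1: file of OCLC numbers, one per line
#   $2: file of "htid,oclc" pairs
# Prints the pairs whose OCLC number appears in the first file.
filter_by_overlap() {
  awk -F, '
    NR == FNR { keep[$1]; next }   # first file: remember each OCLC number
    $2 in keep                     # second file: print pairs that match
  ' "$1" "$2"
}

# Usage (not run here):
# filter_by_overlap overlap_all_unique_may.csv hathi_full_dedupe_may.csv
```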
Reduce the filtered data to one pair per OCLC number, where the first pair wins:
sort -t, -k2 -u hathi_filtered_by_overlap_may.csv > final_overlap_may.csv
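Which of several pairs sharing an OCLC number survives `sort -u` can depend on the sort implementation. A sketch that makes "first wins" explicit, keeping the first pair in input order for each OCLC number; `first_wins` is a hypothetical name, and the sketch assumes plain htid,oclc lines without quoted fields.

```shell
# Hypothetical helper: print only the first line seen for each value of
# column 2 (the OCLC number), preserving input order.
first_wins() {
  awk -F, '!seen[$2]++' "$1"
}

# Usage (not run here):
# first_wins hathi_filtered_by_overlap_may.csv > final_overlap_may.csv
```

Unlike the sort-based step, the output here stays in input order rather than being sorted by OCLC number.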
We import this file into our Solr repository to produce links for checking out materials from HathiTrust, based on an existing match on OCLC number. Importantly, the import also affects the "Access Online" facet, so that patrons can see a much larger set of items available online.
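How the catalog turns an htid into a link is not described here. As one possibility, HathiTrust items can be addressed through handle URLs of the form https://hdl.handle.net/2027/ followed by the htid; the sketch below pairs each OCLC number with such a URL. The helper name `hathi_links` and the two-column output are assumptions for illustration, not the import format our Solr setup uses.

```shell
# Hypothetical helper: turn "htid,oclc" pairs into "oclc,url" lines.
# Assumption: links use the HathiTrust handle prefix shown below.
hathi_links() {
  awk -F, '{ print $2 "," "https://hdl.handle.net/2027/" $1 }' "$1"
}

# Usage (not run here):
# hathi_links final_overlap_may.csv
```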