Synthesizing overlap data from HathiTrust
We provide links to full-text digital copies of physical books in our collection that cannot be checked out physically because of the COVID-19 pandemic. Patrons can instead check these items out digitally from HathiTrust. This document explains where the data comes from and how it is processed to create links for checking out a digitized version of an item.
Thanks to Chad Nelson from Temple University Libraries for posting an article on his process, which we cribbed from heavily. The steps below are very similar to Chad's, with a couple of small changes:
- Download the monthly HathiTrust file from the HathiFiles page.
- Pare the monthly file down to just the needed data: OCLC number, HathiTrust id, HathiTrust bib_key and HathiTrust access code
  - csvcut to limit to just the needed columns
  - csvgrep to eliminate rows without required fields (compact)
  - sort and uniq to eliminate duplicates
  gunzip -c hathi_full_#{args[:period]}.txt.gz | \
  csvcut -t -c 8,1,4,2 -z 1310720 | \
  csvgrep -c 1,2,3,4 -r ".+" | \
  sort | uniq > hathi_all_dedupe.csv
- Process the lines with multiple OCLC numbers
  - Extract lines with multiple OCLC numbers
    csvgrep -c 1 -r "," hathi_all_dedupe_with_headers.csv > hathi_multi_oclc.csv
  - Split multiple OCLC numbers
  - Merge the split lines into the deduped full file and add back headers
    csvgrep -c 1 -r "," -i hathi_all_dedupe_with_headers.csv > hathi_single_oclc.csv
    cat hathi_single_oclc.csv hathi_multi_oclc_split.csv > hathi_full_dedupe_with_headers.csv
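The split step above is listed without a command. One way it could be done is sketched below in Python. This is a minimal sketch, not the code used in psulib_traject: it assumes the OCLC numbers in the first column are separated by commas, that every other column should be repeated on each output row, and that the header row written by csvgrep should be dropped so the result can be appended with cat. The names `split_multi_oclc` and `split_file` are hypothetical.

```python
import csv


def split_multi_oclc(rows):
    """Yield one row per OCLC number for rows whose first column lists several.

    Assumption: OCLC numbers in column 1 are comma-separated; the remaining
    columns are copied unchanged onto each output row.
    """
    for row in rows:
        if not row:
            continue  # skip blank lines
        for oclc in row[0].split(","):
            oclc = oclc.strip()
            if oclc:
                yield [oclc] + row[1:]


def split_file(src, dst, has_header=True):
    """Read CSV rows from file object src and write the split rows to dst."""
    reader = csv.reader(src)
    if has_header:
        next(reader, None)  # assumed: drop the header before merging with cat
    csv.writer(dst, lineterminator="\n").writerows(split_multi_oclc(reader))
```

To produce hathi_multi_oclc_split.csv, `split_file` would be called with hathi_multi_oclc.csv opened for reading and the output file opened for writing, both with `newline=""`.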
Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers for records that have some value for access (which will be either allow or deny). The deny values are the ones available to check out through ETAS:
csvgrep -t -c 4 -r ".+" #{args[:overlap_file]} | \
csvcut -c 1,3 | \
sort | uniq > overlap_all_unique.csv
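For readers less familiar with csvkit, the pipeline above can be approximated in Python as follows. This is a rough sketch under assumptions taken from the column numbers in the command: the overlap report is tab-delimited with a header row, column 1 holds the OCLC number, column 3 the item type, and column 4 the access value. The function name `unique_overlap_pairs` is hypothetical, and the sketch drops the header row rather than passing it through.

```python
import csv


def unique_overlap_pairs(lines):
    """Return sorted unique (oclc, item_type) pairs for rows with an access value.

    Assumptions: tab-delimited input with a header row; column 1 is the OCLC
    number, column 3 the item type, column 4 the access value.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    pairs = set()
    for row in reader:
        # Keep only rows where the access column is present and non-empty.
        if len(row) >= 4 and row[3] != "":
            pairs.add((row[0], row[2]))
    return sorted(pairs)
```

Any iterable of lines works as input, including an open file handle for the overlap report.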
Then filter the pared-down HathiTrust data using the overlap OCLC numbers, in two steps:
- Split the overlap report by item_type: mono and multi/serial
  csvgrep -H -c 2 -m "mono" overlap_all_unique.csv | \
  csvcut -c 1 | \
  sort | uniq > overlap_mono_unique.csv
  csvgrep -H -c 2 -m "mono" -i overlap_all_unique.csv | \
  csvcut -c 1 | \
  sort | uniq > overlap_multi_unique.csv
- Filter the pared down HathiTrust data using the overlap OCLC numbers as the filter input:
  csvgrep -c 1 -f overlap_mono_unique.csv hathi_full_dedupe_with_headers.csv | \
  csvcut -C 3 | \
  sort | uniq > final_hathi_mono_overlap.csv
  csvgrep -c 1 -f overlap_multi_unique.csv hathi_full_dedupe_with_headers.csv | \
  csvcut -C 2 | \
  sort | uniq > final_hathi_multi_overlap.csv
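The filter step can be read as a set-membership test on the OCLC column followed by dropping one column. The sketch below illustrates that logic in Python. It assumes rows are ordered as OCLC number, HathiTrust id, bib_key, access code, matching the column order of the pare-down step. The function name `filter_by_oclc` and its parameters are hypothetical.

```python
def filter_by_oclc(rows, oclc_numbers, drop_index):
    """Keep rows whose first column is in oclc_numbers, minus one column.

    drop_index is 0-based, so csvcut's `-C 3` corresponds to drop_index=2
    and `-C 2` corresponds to drop_index=1.
    """
    wanted = set(oclc_numbers)
    kept = set()
    for row in rows:
        # Exact match on the OCLC column, like csvgrep -f.
        if row and row[0] in wanted:
            kept.add(tuple(v for i, v in enumerate(row) if i != drop_index))
    return sorted(kept)  # sorted, de-duplicated rows (sort | uniq)
```

With this layout, the mono file keeps OCLC number, HathiTrust id and access code (`drop_index=2`), while the multi file keeps OCLC number, bib_key and access code (`drop_index=1`).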
To generate these two files locally, run RUBY_ENVIRONMENT=dev bundle exec rake hathitrust:process_hathi_overlap in the psulib_traject repo. Make sure to edit the hathi_overlap_path, hathi_load_period and overlap_file settings if needed before running the hathitrust:process_hathi_overlap task.
We upload these two files to blackcat01qa and import them into our Solr repository. The import produces links for checking out materials from HathiTrust based on an existing OCLC number match and, importantly, affects the "Access Online" facet so that patrons can see a much larger set of items available online. More info on importing Hathi data: https://github.com/psu-libraries/psulib_blacklight_deploy#hathitrust-files