
Synthesizing overlap data from HathiTrust


We provide links to full-text digital copies of physical books in our collection that cannot be checked out physically during the COVID-19 pandemic; instead, patrons can check them out digitally from HathiTrust. This document explains where the data comes from and how it is processed to create links to check out a digitized item.

Thanks to Chad Nelson from Temple University Libraries for posting an article on his process, which we cribbed from heavily. The steps below are very similar to Chad's, with a couple of small changes:

HathiTrust full data process

  1. Download the monthly HathiTrust file from the HathiFiles page.

  2. Pare the monthly file down to just the needed data: OCLC number, HathiTrust id, HathiTrust bib_key, and HathiTrust access code

    1. csvcut to limit the tab-separated input to just the needed columns (8 = OCLC number, 1 = HathiTrust id, 4 = bib_key, 2 = access code); -z raises csvkit's maximum field size so very long rows don't abort the run
    2. csvgrep to eliminate rows missing any of the required fields
    3. sort and uniq to eliminate duplicates
    gunzip -c hathi_full_#{args[:period]}.txt.gz | \
    csvcut -t -c 8,1,4,2 -z 1310720 | \
    csvgrep -c 1,2,3,4 -r ".+" | \
    sort | uniq > hathi_all_dedupe.csv
  3. Process the lines with multiple OCLC numbers (hathi_all_dedupe_with_headers.csv below is the output of step 2 with its header row added back)

    1. Extract the lines with multiple OCLC numbers
    csvgrep -c 1 -r "," hathi_all_dedupe_with_headers.csv > hathi_multi_oclc.csv
    2. Split the multiple OCLC numbers into one row each (see the awk sketch after this list)
    3. Merge the split lines back with the single-OCLC lines, keeping the headers
    csvgrep -c 1 -r "," -i hathi_all_dedupe_with_headers.csv > hathi_single_oclc.csv
    cat hathi_single_oclc.csv hathi_multi_oclc_split.csv > hathi_full_dedupe_with_headers.csv
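
The splitting step (3.2) isn't shown above. Below is a minimal awk sketch of one way to do it, assuming csvgrep quotes the multi-value first column (e.g. "1570245,1570246",htid,bib_key,access); the file name split_multi_oclc.awk is hypothetical, not the original script:

# split_multi_oclc.awk -- a sketch, not the original script.
# Treating the quote character as the field separator makes $2 the
# comma-separated OCLC list and $3 the rest of the row (it begins
# with a comma). The unquoted header row has an empty $2, so split()
# returns 0 and the header is dropped, which is what the merge step
# expects (hathi_single_oclc.csv already carries the header).
BEGIN { FS = "\"" }
{
  n = split($2, oclc, ",")
  for (i = 1; i <= n; i++) print oclc[i] $3
}

Run it as: awk -f split_multi_oclc.awk hathi_multi_oclc.csv > hathi_multi_oclc_split.csv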

Overlap report data process

Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers for records that have any value for access (either allow or deny). The deny values are the ones available to check out through ETAS:

csvgrep -t -c 4 -r ".+" #{args[:overlap_file]} | \
csvcut -c 1,3 | \
sort | uniq > overlap_all_unique.csv
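
As an optional sanity check (not part of the original process), you can tally how many records carry each access value before cutting the columns; overlap_report.tsv below stands in for the real report file:

csvgrep -t -c 4 -r ".+" overlap_report.tsv | \
csvcut -c 4 | \
sort | uniq -c

Per the note above, only allow and deny should appear here, with the deny rows being the ETAS-eligible ones.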

Synthesize

Then synthesize the two: split the overlap report by item type and use its OCLC numbers to filter the pared-down HathiTrust data:

  1. Split the overlap report by item_type: mono and multi/serial

    csvgrep -H -c 2 -m "mono" overlap_all_unique.csv | \
    csvcut -c 1 | \
    sort | uniq > overlap_mono_unique.csv

    csvgrep -H -c 2 -m "mono" -i overlap_all_unique.csv | \
    csvcut -c 1 | \
    sort | uniq > overlap_multi_unique.csv
  2. Filter the pared-down HathiTrust data using the overlap OCLC numbers as the filter input (a consolidated sketch of both commands follows this list):

    csvgrep -c 1 -f overlap_mono_unique.csv hathi_full_dedupe_with_headers.csv | \
    csvcut -C 3 | \
    sort | uniq > final_hathi_mono_overlap.csv
    
    csvgrep -c 1 -f overlap_multi_unique.csv hathi_full_dedupe_with_headers.csv | \
    csvcut -C 2 | \
    sort | uniq > final_hathi_multi_overlap.csv
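
The two filter commands differ only in the OCLC list and the dropped column: monos keep the HathiTrust id and drop the bib_key (-C 3), while multi/serials do the opposite (-C 2), presumably because serials link at the bib level rather than to a single digitized item. A small bash wrapper, sketched below under the assumption that csvkit is installed and the files are named as above (the filter_overlap helper is hypothetical, not part of the original process), avoids repeating the pipeline:

# filter_overlap -- hypothetical helper, a minimal sketch
filter_overlap () {
  local oclc_list=$1 drop_col=$2 out=$3
  csvgrep -c 1 -f "$oclc_list" hathi_full_dedupe_with_headers.csv | \
    csvcut -C "$drop_col" | \
    sort | uniq > "$out"
}

filter_overlap overlap_mono_unique.csv 3 final_hathi_mono_overlap.csv    # drop ht_bib_key
filter_overlap overlap_multi_unique.csv 2 final_hathi_multi_overlap.csv  # drop HathiTrust id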

To generate these two files locally, run RUBY_ENVIRONMENT=dev bundle exec rake hathitrust:process_hathi_overlap in the psulib_traject repo. Before running the task, edit the hathi_overlap_path, hathi_load_period, and overlap_file settings if needed.
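
The setting names above come from this page; the values in this excerpt are purely illustrative (a hypothetical YAML sketch, assuming a settings file in psulib_traject):

# Hypothetical example values -- adjust to your environment
hathi_overlap_path: /tmp/hathi_overlap
hathi_load_period: 20200601
overlap_file: overlap_20200601_psu.tsv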

We upload these two files to blackcat01qa and import them into our Solr repository to produce links to check out materials from HathiTrust based on an existing OCLC number match and, importantly, to update the "Access Online" facet so that patrons can see a much larger set of items available online. More info on importing HathiTrust data: https://github.com/psu-libraries/psulib_blacklight_deploy#hathitrust-files
