
Synthesizing overlap data from HathiTrust

Charlie Morris edited this page Jun 2, 2020 · 14 revisions

We are providing links to full-text digital copies of physical books in our collection that may not be checked out physically because of the COVID-19 pandemic but can be checked out digitally from HathiTrust. This document explains where the data comes from and how it is processed to create links for checking out the digitized version of an item.

Thanks to Chad Nelson from Temple University Libraries for posting this article on his process, which we cribbed from heavily. The steps below are very similar to Chad's, with a couple of small changes:

HathiTrust full data process

Pare the large file down to just the needed data (htid and oclc) with:

  1. csvcut to keep only the needed columns (htid and oclc)
  2. csvgrep to eliminate rows missing either required field
  3. sort and uniq to eliminate duplicates (sort by htid, delete duplicate rows)

gunzip -c hathi_full_20200501.txt.gz | \
csvcut -t -c 1,8 -z 1310720 | \
csvgrep -c 1,2 -r ".+" | \
sort | uniq > hathi_full_dedupe_may.csv
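
To check the logic of the three steps on a small input, the following is a minimal sketch that swaps csvkit for plain awk. The sample rows and the `sample_*` file names are made up for illustration; the only things taken from the command above are that htid is column 1 and oclc is column 8 of a tab-separated file. Unlike csvkit, this sketch does not treat the first row as a header and does no CSV quoting, so it is only a sanity check, not a replacement.

```shell
# Hypothetical sample: tab-separated, htid in column 1, OCLC number in column 8.
printf 'mdp.001\ta\tb\tc\td\te\tf\t1111\n'  > sample_hathi.tsv
printf 'mdp.002\ta\tb\tc\td\te\tf\t\n'     >> sample_hathi.tsv   # no OCLC number: dropped
printf 'mdp.001\ta\tb\tc\td\te\tf\t1111\n' >> sample_hathi.tsv   # duplicate pair: dropped
printf 'mdp.003\ta\tb\tc\td\te\tf\t2222\n' >> sample_hathi.tsv

# Keep columns 1 and 8, require both to be non-empty, then deduplicate.
awk -F'\t' -v OFS=',' '$1 != "" && $8 != "" { print $1, $8 }' sample_hathi.tsv |
  sort | uniq > sample_dedupe.csv

cat sample_dedupe.csv
```

On this sample, two rows should survive: one for `mdp.001` and one for `mdp.003`.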

Overlap report data process

Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers:

csvgrep -t -c 4 -r ".+" overlap_20200518_psu.tsv | \
  csvcut -c 1 | sort | uniq  > overlap_all_unique_may.csv
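
The same extraction can be sketched on toy data with awk in place of csvkit. The rows, the `allow`/`deny` values, and the `sample_*` file names below are hypothetical; the sketch only assumes what the command above implies, namely that the OCLC number is in column 1 and that rows with an empty column 4 are to be skipped. The toy file has no header row, and the sketch does not handle one.

```shell
# Hypothetical sample overlap report: tab-separated, OCLC number in column 1.
printf '1111\tx\ty\tallow\n'  > sample_overlap.tsv
printf '3333\tx\ty\t\n'      >> sample_overlap.tsv   # column 4 empty: dropped
printf '1111\tz\ty\tallow\n' >> sample_overlap.tsv   # repeated OCLC number: collapsed
printf '2222\tx\ty\tdeny\n'  >> sample_overlap.tsv

# Keep rows with a non-empty column 4, print column 1, then deduplicate.
awk -F'\t' '$4 != "" { print $1 }' sample_overlap.tsv |
  sort | uniq > sample_overlap_unique.csv

cat sample_overlap_unique.csv
```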

Synthesize

Then filter the pared-down Hathi data, using the overlap OCLC numbers as the filter input:

csvgrep -c 2 -f overlap_all_unique_may.csv \
  hathi_full_dedupe_may.csv > hathi_filtered_by_overlap_may.csv
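
Conceptually this step is a lookup: keep each htid,oclc row whose OCLC number appears in the overlap list. A rough awk sketch of that idea, using hypothetical sample data and `sample_*` file names, and assuming plain two-column CSV with no quoted fields:

```shell
# Hypothetical inputs: the unique overlap OCLC numbers and the deduplicated pairs.
printf '1111\n2222\n' > sample_overlap_unique.csv
printf 'mdp.001,1111\nmdp.003,2222\nmdp.004,4444\nuc1.005,1111\n' > sample_dedupe.csv

# While reading the first file, remember each OCLC number.
# While reading the second, print rows whose column 2 is one of them.
awk -F',' 'NR == FNR { keep[$1]; next } $2 in keep' \
  sample_overlap_unique.csv sample_dedupe.csv > sample_filtered.csv

cat sample_filtered.csv
```

Here the row with OCLC number 4444 is dropped, and both rows for 1111 are kept, which is why a further deduplication step is still needed.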

Reduce the filtered Hathi data to one htid per OCLC number (first wins):

sort -t, -k2 -u hathi_filtered_by_overlap_may.csv > final_overlap_may.csv
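
A small illustration of what "first wins" means here, using the same sort options on hypothetical sample rows. With GNU sort, `-u` keeps the first line of each run of equal keys in input order; other sort implementations may choose a different line, so treat the exact survivor as an assumption about the environment.

```shell
# Hypothetical filtered pairs: two different htids share OCLC number 1111.
printf 'mdp.001,1111\nmdp.003,2222\nuc1.005,1111\n' > sample_filtered.csv

# Sort on the second comma-separated field and keep one line per key.
sort -t, -k2 -u sample_filtered.csv > sample_final.csv

cat sample_final.csv
```

With GNU sort, `mdp.001,1111` is kept and `uc1.005,1111` is discarded, leaving one row per OCLC number.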

We then import this file into our Solr repository. The import produces links for checking out materials from Hathi wherever a record matches on OCLC number and, importantly, affects the "Access Online" facet so that patrons can see a much larger set of items available online.

Other useful links:
