"Checklist diff"

NOTE

This repository has been retired in favor of the newer listtools repository.

This repository represents work on a "checklist diff" (diff = difference, inspired by the Unix diff command) tool for the ASU BioKIC Taxon Concepts project from April to October 2020. A "checklist" here could mean either a simple species list or a comprehensive taxonomic hierarchy with synonyms.

This code has a number of bugs and was becoming difficult to work with. I began a major revision in August 2020, which lives in the 'alignment' branch of this repository. I became frustrated with this code and left off this line of work in November 2020.

The latest work along these lines is in the listtools repository. The new code base has a more modular design and is, I hope, more correct than its predecessor.

November 2020

The inputs are two checklists in TSV or CSV format (extension .tsv or .csv). The output is an ad hoc report file.
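As a rough illustration of the input convention, here is a minimal sketch of loading a checklist with the delimiter chosen from the file extension. The function name read_checklist is hypothetical, and the sketch assumes the first row of the file is a header; it is not taken from cldiff.py, which may read its inputs differently.

```python
import csv
from pathlib import Path


def read_checklist(path):
    """Read a checklist file into a list of dicts, one per row.

    Hypothetical helper, not part of cldiff.py. The delimiter is chosen
    from the file extension: tab for .tsv, comma for .csv. Assumes the
    first row holds the column names.
    """
    delimiters = {".tsv": "\t", ".csv": ","}
    suffix = Path(path).suffix.lower()
    if suffix not in delimiters:
        raise ValueError(f"expected a .tsv or .csv file, got {path!r}")
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiters[suffix]))
```

Every value comes back as a string, so any numeric columns would need converting by the caller.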

Examples

Documentation:

Example: NCBI Primates vs. GBIF Primates

You can run a similar example (comparing two versions of NCBI Taxonomy) simply by running make. In detail, though, the procedure has several steps.

Get NCBI taxonomy from FTP site

We'll put everything related to the January 2020 version of NCBI taxonomy under work/ncbi/2020-01-01. Start with the release (dump).

mkdir -p work/ncbi/2020-01-01/dump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2020-01-01.zip
unzip -d work/ncbi/2020-01-01/dump taxdmp_2020-01-01.zip

Of course you can do this with any version of NCBI you like, by substituting the date.
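To make the date substitution concrete, here is a small sketch that builds the three commands above for a given release date. The function name ncbi_dump_commands and the work_dir parameter are hypothetical; the URL prefix and directory layout are copied from the commands above. The sketch only formats strings and does not check that an archive exists for the chosen date.

```python
from datetime import date

# URL prefix taken from the wget command above.
ARCHIVE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive"


def ncbi_dump_commands(release, work_dir="work"):
    """Return the shell commands that fetch and unpack one NCBI release.

    Hypothetical helper: `release` is a datetime.date, and the returned
    strings follow the layout work/ncbi/<date>/dump used above.
    """
    stamp = release.isoformat()  # e.g. "2020-01-01"
    dump_dir = f"{work_dir}/ncbi/{stamp}/dump"
    archive = f"taxdmp_{stamp}.zip"
    return [
        f"mkdir -p {dump_dir}",
        f"wget {ARCHIVE_URL}/{archive}",
        f"unzip -d {dump_dir} {archive}",
    ]


for command in ncbi_dump_commands(date(2020, 1, 1)):
    print(command)
```

Called with date(2020, 1, 1), it prints the same three commands shown above.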

Get GBIF taxonomy from GBIF site

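Similarly, everything related to the GBIF backbone taxonomy goes under work/gbif/2019-09-16. The release is a Darwin Core Archive (DwCA) file, hence the directory name dwca.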

mkdir -p work/gbif/2019-09-16/dwca
wget http://rs.gbif.org/datasets/backbone/2019-09-06/backbone.zip
unzip -d work/gbif/2019-09-16/dwca -q backbone.zip

For further information see the backbone taxonomy landing page.

Other archived versions of GBIF are available on their site.

Compare them

python3 src/cldiff.py work/gbif/2019-09-16/primates.csv \
  work/ncbi/2020-01-01/primates.csv --out diff.out

The first checklist (or taxonomy) is the "A checklist" and the second is the "B checklist".

If you put the two in the opposite order, you'll get the comparison ordered by the other taxonomy:

python3 src/cldiff.py work/ncbi/2020-01-01/primates.csv work/gbif/2019-09-16/primates.csv
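To illustrate why argument order matters, here is a toy sketch of a report that follows the order of the A checklist. It is not the algorithm in cldiff.py: it compares plain name strings only and ignores hierarchy and synonyms. The function name compare_names and the two short lists are made up for the example.

```python
def compare_names(a_names, b_names):
    """Classify names from two checklists, listed in A order.

    Toy illustration only: exact string matching, no hierarchy, no
    synonyms. Names found only in B are appended at the end.
    """
    a_set = set(a_names)
    b_set = set(b_names)
    report = []
    for name in a_names:
        status = "in both" if name in b_set else "only in A"
        report.append((name, status))
    for name in b_names:
        if name not in a_set:
            report.append((name, "only in B"))
    return report


# Made-up miniature checklists, not real GBIF or NCBI content.
first = ["Homo sapiens", "Pan paniscus", "Gorilla gorilla"]
second = ["Pan troglodytes", "Homo sapiens", "Pan paniscus"]

for name, status in compare_names(first, second):
    print(f"{status:10} {name}")
```

Swapping the arguments, compare_names(second, first), yields the same classification with the A and B labels exchanged and the rows in the order of the other list.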