phylosynth/NCBI_Taxonomic_Matcher

Taxonomic_Matcher

Scripts and results from matching the NCBI and WCSP taxonomies.

Note:

  • The data folder has not been uploaded yet because it is larger than 1 GB, e.g.
    `3.9G Feb 12 16:30 Spermatophyta_plnDB_02012020.txt`
  • The Python scripts were developed in a Python 3 environment.

General Workflow

[Figures: Match strategy 0, Match strategy 1, Match strategy 2]

The steps below show how to produce the final clean data.

Step 1: Run ncbi_name_extract_V3.py

  • This script takes the database of the plant division from NCBI (e.g., "plnDB20191101.db", generated by phlawd db maker).
  • It then uses SQL to extract the taxon group of interest ("tid") and the relevant columns from the database:
    sqlcmd = "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy WHERE name_class ='scientific name' OR name_class = 'authority'"
  • Output file name: "Spermatophyta_plnDB_02012020.txt"
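
The extraction in this step can be sketched as follows. This is a minimal sketch, not the contents of ncbi_name_extract_V3.py: the function name `extract_names`, the tab-separated output, and the in-memory demo table are assumptions made for illustration, and the "tid" subtree filter is left out. Only the SQL statement is taken from the step above.

```python
import csv
import io
import sqlite3

# SQL statement quoted in Step 1.
SQLCMD = (
    "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy "
    "WHERE name_class ='scientific name' OR name_class = 'authority'"
)


def extract_names(conn, out):
    """Write every matching row to `out` as tab-separated text; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in conn.execute(SQLCMD):
        writer.writerow(row)
        count += 1
    return count


# Tiny in-memory stand-in for the real database (made-up demo rows and IDs).
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE taxonomy (ncbi_id INTEGER, parent_ncbi_id INTEGER, "
    "name TEXT, node_rank TEXT, name_class TEXT)"
)
demo_db.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0, "Quercus alba", "species", "scientific name"),
        (1, 0, "Quercus alba L.", "species", "authority"),
        (1, 0, "white oak", "species", "common name"),
    ],
)
buffer = io.StringIO()
n_rows = extract_names(demo_db, buffer)
```

For the real data, `sqlite3.connect("plnDB20191101.db")` and an open output file would take the place of the in-memory database and the `StringIO` buffer.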

Step 2: Run remove_duplicate.py

  • Removes duplicate records to reduce the data size.

  • Output file name: "Spermatophyta58024_plnDB02032020_nodupl.csv"
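
A minimal sketch of order-preserving de-duplication, assuming that a duplicate means an identical line. The actual criteria in remove_duplicate.py may differ, and `drop_duplicate_lines` is a hypothetical name.

```python
def drop_duplicate_lines(lines):
    """Yield each distinct line once, keeping first-seen order."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield line


# Made-up demo records (IDs are not real NCBI identifiers).
records = [
    "10\t1\tQuercus\tgenus\n",
    "11\t10\tQuercus rubra\tspecies\n",
    "10\t1\tQuercus\tgenus\n",
]
unique_records = list(drop_duplicate_lines(records))
```

Because the function is a generator, it can be fed an open file handle directly, so a multi-gigabyte input is never held in memory as a list; only the set of distinct lines is kept.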

Step 3: Run Spermatophyta_plnDB_cleanerV1.1.sh

  • Runs a bash script to remove further "unwanted" records (see the comments inside the script for details):

    #subfamily
    #subgenus
    #suborder
    #subsection
    #subtribe
    #tribe
    #cf.
    #aff.
    #environmental
    #unclassified_
    #_incertae_sedis
    #_clade
    #_superclade
    #Group
    #group
    #complex
    #(type*)
    #lineages
    #C3
    #C4
    #_sensu_lato
    #_samples
    #_alliance
    #_division
    #hybrid
    #_cultivar
    #subgroup
    #form
    #ungrouped
    #unpublished

  • Output file name: "Spermatophyta58024_plnDB02032020_nodupl_bashcleaned.csv"
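
The filtering idea can be sketched in Python as below. This is an illustration only: the real step is a bash script, the tuple holds just a few of the patterns listed above, and plain substring matching is an assumption that may be broader or narrower than what the script does. `keep_record` is a hypothetical name.

```python
# A few of the patterns listed above; the full list lives in the bash script.
UNWANTED = (
    "subfamily", "subgenus", "tribe", "cf.", "aff.", "environmental",
    "unclassified_", "_incertae_sedis", "hybrid", "_cultivar",
)


def keep_record(line, patterns=UNWANTED):
    """Return True when the line contains none of the unwanted substrings."""
    return not any(pattern in line for pattern in patterns)


# Made-up demo rows.
rows = [
    "10,Quercus_alba,species",
    "11,Quercus_cf._alba,species",
    "12,unclassified_Quercus,no rank",
    "13,Quercoideae,subfamily",
]
cleaned = [row for row in rows if keep_record(row)]
```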

Step 4: Run Spermatophyta_sp_authority_format_v4.py

This script reformats the NCBI taxonomy file, mainly by splitting the taxon authority into a separate column and correcting the taxon status (for example, a "varietas" that was assigned the rank "species").

It also generates 4 output files:

  • The main reformatted file
    Spermatophyta58024_plnDB_pyphlawd02172020_reformated.csv

  • Cultivars
    Spermatophyta58024_plnDB_pyphlawd02172020_cultivar.csv

  • Hybrids
    Spermatophyta58024_plnDB_pyphlawd02172020_ill.hybrids.csv

  • "genus_sp." cases
    Spermatophyta58024_plnDB_pyphlawd02172020_sp.csv
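
The two operations described for this step can be sketched as follows. The sketch assumes that an "authority" record consists of the scientific name followed by the author string, and that an infraspecific marker such as "var." in the name signals the correct rank. Both assumptions, the marker table, and the function names are illustrative and are not taken from Spermatophyta_sp_authority_format_v4.py.

```python
# Hypothetical mapping from infraspecific markers to rank labels.
INFRA_MARKERS = {"var.": "varietas", "subsp.": "subspecies", "f.": "forma"}


def split_authority(scientific_name, authority_name):
    """Return the author part of `authority_name`, or '' if the prefix differs."""
    if authority_name.startswith(scientific_name):
        return authority_name[len(scientific_name):].strip()
    return ""


def corrected_rank(scientific_name, rank):
    """Replace `rank` when the name itself carries an infraspecific marker."""
    for token in scientific_name.split():
        if token in INFRA_MARKERS:
            return INFRA_MARKERS[token]
    return rank


# Illustrative values.
name = "Acer saccharum var. nigrum"
authority = split_authority(name, "Acer saccharum var. nigrum (F.Michx.) Britton")
rank = corrected_rank(name, "species")
```

Checking the marker only in the scientific name, not in the full authority string, avoids misreading author abbreviations that also end in "f.".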

Step 5: Run Spermatophyta_clean3.14snakeV3.py

  • This script parses the NCBI taxonomic information into the fields order, family, genus_hybrid, genus, species_hybrid, species, infraspecific_rank, infraspecies, taxon_authority, taxon_rank, ncbi_id, in preparation for matching against the taxonomy database of the World Checklist of Selected Plant Families (WCSP).

  • The script also counts the species under each genus, family, and order, which informs the strategy for future supertree reconstruction.

  • It generates 3 output files:

    • The main file
      Spermatophyta_clean02172020.csv

    • The file listing genera that have no species records
      Spermatophyta_Nospecies_NCBI.csv

    • The table of species richness for each genus
      Spermatophyta_Richness_NCBI.csv
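
The richness count and the "no species" list can be sketched as below. This is a minimal sketch under stated assumptions: records are reduced to (genus, species, taxon_rank) tuples, only rows ranked "species" are counted, and `species_per_genus` is a hypothetical name. The actual script may count differently.

```python
from collections import Counter


def species_per_genus(records):
    """Count distinct species names under each genus.

    `records` is an iterable of (genus, species, taxon_rank) tuples.
    """
    seen = set()
    counts = Counter()
    for genus, species, rank in records:
        if rank == "species" and (genus, species) not in seen:
            seen.add((genus, species))
            counts[genus] += 1
    return counts


# Made-up demo records.
demo_records = [
    ("Quercus", "alba", "species"),
    ("Quercus", "rubra", "species"),
    ("Quercus", "alba", "varietas"),
    ("Acer", "saccharum", "species"),
    ("Fagus", "", "genus"),
]
richness = species_per_genus(demo_records)
all_genera = {genus for genus, _, _ in demo_records}
no_species = sorted(all_genera - set(richness))
```

The same counting applies to family and order once those fields replace the genus as the key.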

(To be continued ...)

External links

Phylosynth

BIEN taxonomy match

Last update:

Mon Feb 17 14:42:32 2020
