This repository stores scripts and results from matching the NCBI and WCSP taxonomies.
Note:
- The data folder has not been uploaded yet because its size exceeds 1 GB.
`3.9G Feb 12 16:30 Spermatophyta_plnDB_02012020.txt`
- The Python scripts were developed in a Python 3 environment.
The steps below show how to produce the final clean data.
Step 1: Run ncbi_name_extract_V3.py
- This script takes the plant division database from NCBI as input (e.g., "plnDB20191101.db", generated by phlawd db maker).
- It then uses SQL to extract the taxon group of interest ("tid") and the required columns from the database:
sqlcmd = "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy WHERE name_class ='scientific name' OR name_class = 'authority'"
- Output file name: "Spermatophyta_plnDB_02012020.txt"
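The extraction in Step 1 can be sketched roughly as follows. This is a minimal sketch, not the contents of ncbi_name_extract_V3.py: it assumes the database is a SQLite file with a `taxonomy` table holding the columns named in the query above. The function name `extract_names` and the tab-separated output format are assumptions made for illustration, and the restriction to a taxon group ("tid") is left out.

```python
import csv
import sqlite3

# The query is taken from the README; everything else here is illustrative.
SQLCMD = (
    "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy "
    "WHERE name_class ='scientific name' OR name_class = 'authority'"
)


def extract_names(db_path, out_path):
    """Write the selected taxonomy rows to a tab-separated text file.

    Returns the number of rows written. The output format is an assumption.
    """
    conn = sqlite3.connect(db_path)
    count = 0
    try:
        with open(out_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            # Iterate over the cursor so the full result set is never
            # held in memory at once.
            for row in conn.execute(SQLCMD):
                writer.writerow(row)
                count += 1
    finally:
        conn.close()
    return count
```

Streaming the rows straight to disk seems a reasonable choice here, since the resulting text file is several gigabytes in size.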
Step 2: Run remove_duplicate.py
- Removes duplicated records to reduce the data size
- Output file name: "Spermatophyta58024_plnDB02032020_nodupl.csv"
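The idea behind Step 2 can be sketched as below. This is a minimal sketch that assumes a duplicate means an identical line of text; the actual criterion used by remove_duplicate.py is not described here, and the function name `remove_duplicate_lines` is hypothetical.

```python
def remove_duplicate_lines(in_path, out_path):
    """Copy in_path to out_path, keeping only the first copy of each line.

    Returns the number of lines kept. The original order is preserved.
    """
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = line.rstrip("\n")
            if record in seen:
                continue  # exact duplicate of an earlier line
            seen.add(record)
            dst.write(record + "\n")
            kept += 1
    return kept
```

The set holds every distinct line in memory, so this approach is only practical when the number of unique records is much smaller than the raw file.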
Step 3: Run Spermatophyta_plnDB_cleanerV1.1.sh
- Runs a bash script to further remove some "unwanted" records (see the comments inside the script for details):
#subfamily
#subgenus
#suborder
#subsection
#subtribe
#tribe
#cf.
#aff.
#environmental
#unclassified_
#_incertae_sedis
#_clade
#_superclade
#Group
#group
#complex
#(type*)
#lineages
#C3
#C4
#_sensu_lato
#_samples
#_alliance
#_division
#hybrid
#_cultivar
#subgroup
#form
#ungrouped
#unpublished
- Output file name: "Spermatophyta58024_plnDB02032020_nodupl_bashcleaned.csv"
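The filtering in Step 3 is done by a bash script, whose exact matching rules are not reproduced here. The same idea can be sketched in Python under the assumption that a record is dropped when it contains one of the listed keywords as a plain substring. Only a subset of the keywords above is used, because short ones such as "form" or "C3" would need stricter matching than a substring test; the names `UNWANTED`, `is_unwanted`, and `filter_records` are hypothetical.

```python
# A subset of the keywords listed above; substring matching is an assumption.
UNWANTED = (
    "subfamily",
    "subgenus",
    "suborder",
    "subsection",
    "subtribe",
    "cf.",
    "aff.",
    "environmental",
    "unclassified_",
    "_incertae_sedis",
)


def is_unwanted(record, patterns=UNWANTED):
    """Return True if the record contains any of the unwanted keywords."""
    return any(pattern in record for pattern in patterns)


def filter_records(records, patterns=UNWANTED):
    """Return the records that contain none of the unwanted keywords."""
    return [record for record in records if not is_unwanted(record, patterns)]
```

For example, `filter_records(lines)` applied to the lines of the deduplicated file would keep only the records that none of the keywords match.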
Step 4: Run Spermatophyta_sp_authority_format_v4.py
- This script reformats the NCBI taxonomy file, mainly by splitting the taxon authority into a separate column and correcting the taxon status (for example, a "varietas" assigned to "species").
- It generates 4 output files:
  - The main reformatted file: Spermatophyta58024_plnDB_pyphlawd02172020_reformated.csv
  - The cultivars: Spermatophyta58024_plnDB_pyphlawd02172020_cultivar.csv
  - The hybrids: Spermatophyta58024_plnDB_pyphlawd02172020_ill.hybrids.csv
  - The "genus_sp." cases: Spermatophyta58024_plnDB_pyphlawd02172020_sp.csv
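The two operations in Step 4 can be illustrated with a small sketch. It assumes that an NCBI "authority" entry consists of the scientific name followed by the author citation (for example "Quercus alba L."), and that names are space-separated. The helper names `split_authority` and `corrected_rank` and the marker table are hypothetical, and this is not the logic of Spermatophyta_sp_authority_format_v4.py itself.

```python
# Infraspecific rank markers mapped to rank names (assumed mapping).
RANK_MARKERS = {"var.": "varietas", "subsp.": "subspecies", "f.": "forma"}


def split_authority(scientific_name, authority_name):
    """Return the author citation that follows the scientific name.

    Returns an empty string when the authority entry does not start
    with the scientific name.
    """
    prefix = scientific_name + " "
    if authority_name.startswith(prefix):
        return authority_name[len(prefix):].strip()
    return ""


def corrected_rank(scientific_name, node_rank):
    """Return a rank derived from an infraspecific marker, if one is present.

    Only tokens after the genus and the specific epithet are inspected;
    otherwise the rank given in the database is returned unchanged.
    """
    tokens = scientific_name.split()
    for marker, rank in RANK_MARKERS.items():
        if marker in tokens[2:]:
            return rank
    return node_rank
```

With these helpers, a record named "Acer saccharum var. rugelii" that carries the rank "species" would be relabelled as "varietas".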
Step 5: Run Spermatophyta_clean3.14snakeV3.py
- This script parses the NCBI taxonomic information into the columns `order,family,genus_hybrid,genus,species_hybrid,species,infraspecific_rank,infraspecies,taxon_authority,taxon_rank,ncbi_id` in preparation for matching with the taxonomy database of the World Checklist of Selected Plant Families (WCSP).
- It also calculates the species count under each genus, family, and order, which informs the strategy for future supertree reconstruction.
- It generates 3 output files:
  - The main file: Spermatophyta_clean02172020.csv
  - The genera without species records: Spermatophyta_Nospecies_NCBI.csv
  - The table of species richness under each genus: Spermatophyta_Richness_NCBI.csv
- (To be continued ...)
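The species count from Step 5 can be sketched as follows. This is a minimal sketch that assumes each record is a dictionary keyed by the column names listed above (for instance, rows read with `csv.DictReader`) and that richness means the number of distinct genus and species combinations. The function name `species_richness` is hypothetical.

```python
from collections import defaultdict


def species_richness(rows, group_key="genus"):
    """Count distinct species under each value of group_key.

    rows is an iterable of dicts with at least "genus", "species" and
    group_key. Rows with an empty species field are ignored.
    """
    species_by_group = defaultdict(set)
    for row in rows:
        if row.get("species"):
            # Use (genus, species) so that identical epithets in different
            # genera count as different species within a family or order.
            species_by_group[row[group_key]].add((row["genus"], row["species"]))
    return {group: len(names) for group, names in species_by_group.items()}
```

Calling `species_richness(rows, group_key="family")` or `group_key="order"` gives the counts at the higher ranks. Groups that have no species records do not appear in the result.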
External link
Last update:
Mon Feb 17 14:42:32 2020