Taxonomic_Matcher

Scripts and results from matching the NCBI and WCSP taxonomies.

Note:

The data folder has not been uploaded yet because it is larger than 1 GB.

`3.9G Feb 12 16:30 Spermatophyta_plnDB_02012020.txt`

The Python scripts were developed in a Python 3 environment.

General Workflow

These steps show how to produce the final cleaned data.

Step 1: Run ncbi_name_extract_V3.py

This script takes the plant-division database from NCBI (e.g., "plnDB20191101.db", generated by the phlawd db maker)
and uses SQL to extract the taxon group of interest ("tid") and the columns of interest from the database:
sqlcmd = "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy WHERE name_class ='scientific name' OR name_class = 'authority'"
Output file name: "Spermatophyta_plnDB_02012020.txt"
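
As a rough illustration of the extraction in Step 1, the sketch below runs the query quoted above against a SQLite database and writes the rows to a text file. It is not the content of ncbi_name_extract_V3.py: the function name `extract_names` and the tab delimiter are assumptions, and the restriction to a taxon group ("tid") is not shown.

```python
import csv
import sqlite3

# Query quoted from this README; everything else in this sketch is illustrative.
SQLCMD = (
    "SELECT ncbi_id,parent_ncbi_id,name,node_rank FROM taxonomy "
    "WHERE name_class ='scientific name' OR name_class = 'authority'"
)


def extract_names(db_path, out_path, delimiter="\t"):
    """Run SQLCMD on the database and write one delimited row per record.

    The delimiter is an assumption; returns the number of rows written.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(SQLCMD).fetchall()
    finally:
        conn.close()
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return len(rows)
```

A call such as `extract_names("plnDB20191101.db", "Spermatophyta_plnDB_02012020.txt")` would then produce the Step 1 output file under these assumptions.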

Step 2: Run remove_duplicate.py

Removes duplicated records to reduce the data size.
Output file name: "Spermatophyta58024_plnDB02032020_nodupl.csv"
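
A minimal sketch of the deduplication idea, assuming that a duplicate means an identical line and that the first occurrence is kept. The function name `remove_duplicate_lines` is hypothetical, and remove_duplicate.py may define duplicates differently.

```python
def remove_duplicate_lines(in_path, out_path):
    """Copy in_path to out_path, keeping only the first copy of each line.

    Returns the number of lines kept. Every distinct line is held in memory,
    so a multi-gigabyte input may need a different strategy.
    """
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = line.rstrip("\n")
            if record in seen:
                continue  # already written once, skip the repeat
            seen.add(record)
            dst.write(record + "\n")
            kept += 1
    return kept
```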

Step 3: Run Spermatophyta_plnDB_cleanerV1.1.sh

Run the bash script to remove further "unwanted" records (see the comments inside the script for details):

#subfamily
#subgenus
#suborder
#subsection
#subtribe
#tribe
#cf.
#aff.
#environmental
#unclassified_
#_incertae_sedis
#_clade
#_superclade
#Group
#group
#complex
#(type*)
#lineages
#C3
#C4
#_sensu_lato
#_samples
#_alliance
#_division
#hybrid
#_cultivar
#subgroup
#form
#ungrouped
#unpublished
Output file name: "Spermatophyta58024_plnDB02032020_nodupl_bashcleaned.csv"
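
The filtering above is done by a bash script; the Python sketch below only illustrates the idea of dropping records that contain an unwanted token. It assumes plain, case-sensitive substring matching and uses a subset of the tokens listed above. The names `UNWANTED_TOKENS`, `is_unwanted`, and `drop_unwanted` are hypothetical. Short tokens such as "form" or "C3" are left out on purpose, because substring matching would also hit ordinary epithets (for example "formosana"); how the bash script handles them is not shown here.

```python
# Subset of the tokens listed in this README; matching rule is an assumption.
UNWANTED_TOKENS = (
    "cf.",
    "aff.",
    "environmental",
    "unclassified_",
    "_incertae_sedis",
    "_clade",
    "_sensu_lato",
    "_cultivar",
)


def is_unwanted(record, tokens=UNWANTED_TOKENS):
    """Return True if the record contains any unwanted token as a substring."""
    return any(token in record for token in tokens)


def drop_unwanted(records, tokens=UNWANTED_TOKENS):
    """Return the records that contain none of the unwanted tokens."""
    return [record for record in records if not is_unwanted(record, tokens)]
```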

Step 4: Run Spermatophyta_sp_authority_format_v4.py

This script reformats the NCBI taxonomy file. It mainly splits the taxon authority into a separate column and corrects the taxon rank (for example, a "varietas" that was assigned to "species").

It generates four output files:

- The main reformatted file
  Spermatophyta58024_plnDB_pyphlawd02172020_reformated.csv
- Cultivars
  Spermatophyta58024_plnDB_pyphlawd02172020_cultivar.csv
- Hybrids
  Spermatophyta58024_plnDB_pyphlawd02172020_ill.hybrids.csv
- "genus_sp." cases
  Spermatophyta58024_plnDB_pyphlawd02172020_sp.csv
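
The two operations described for Step 4 can be sketched as follows. This is an illustration, not the logic of Spermatophyta_sp_authority_format_v4.py. It assumes that names are space-separated, that an authority string begins with the scientific name, and that the infraspecific markers are "subsp.", "var.", and "f."; the helper names `split_authority` and `corrected_rank` are hypothetical.

```python
# Assumed mapping from infraspecific markers to rank names.
INFRA_MARKERS = {"subsp.": "subspecies", "var.": "varietas", "f.": "forma"}


def split_authority(scientific_name, authority_string):
    """Return the author part of an authority string.

    Gives "" when the authority string does not start with the name.
    """
    if authority_string.startswith(scientific_name):
        return authority_string[len(scientific_name):].strip()
    return ""


def corrected_rank(scientific_name, rank):
    """Return the rank implied by an infraspecific marker in the name.

    Falls back to the given rank when no marker is present.
    """
    for token in scientific_name.split():
        if token in INFRA_MARKERS:
            return INFRA_MARKERS[token]
    return rank
```

For example, `split_authority("Quercus alba", "Quercus alba L.")` yields the author string, and `corrected_rank("Acer saccharum var. rugelii", "species")` yields "varietas" instead of "species".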

Step 5: Run Spermatophyta_clean3.14snakeV3.py

This script parses the NCBI taxonomic information into the columns order, family, genus_hybrid, genus, species_hybrid, species, infraspecific_rank, infraspecies, taxon_authority, taxon_rank, and ncbi_id, in preparation for matching against the taxonomy database of the World Checklist of Selected Plant Families (WCSP).
It also counts the species under each genus, family, and order, which informs the strategy for future supertree reconstruction.
It generates three output files:
- The main file
  Spermatophyta_clean02172020.csv
- Genera that have no species records
  Spermatophyta_Nospecies_NCBI.csv
- Species richness for each genus
  Spermatophyta_Richness_NCBI.csv
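
A hedged sketch of the name parsing and the per-genus species count described above. It covers only the six name-derived columns, assumes space-separated names, and assumes "x" as the hybrid marker and "subsp.", "var.", "f." as infraspecific ranks. `parse_name` and `species_richness` are hypothetical helpers, not functions from Spermatophyta_clean3.14snakeV3.py.

```python
from collections import defaultdict

HYBRID_MARK = "x"  # assumed hybrid marker
INFRA_RANKS = ("subsp.", "var.", "f.")  # assumed infraspecific rank markers
FIELDS = (
    "genus_hybrid", "genus", "species_hybrid",
    "species", "infraspecific_rank", "infraspecies",
)


def parse_name(name):
    """Split a scientific name into the name-derived columns (empty if absent)."""
    tokens = name.split()
    fields = dict.fromkeys(FIELDS, "")
    if tokens and tokens[0] == HYBRID_MARK:
        fields["genus_hybrid"] = tokens.pop(0)
    if tokens:
        fields["genus"] = tokens.pop(0)
    if tokens and tokens[0] == HYBRID_MARK:
        fields["species_hybrid"] = tokens.pop(0)
    if tokens:
        fields["species"] = tokens.pop(0)
    if len(tokens) >= 2 and tokens[0] in INFRA_RANKS:
        fields["infraspecific_rank"] = tokens[0]
        fields["infraspecies"] = tokens[1]
    return fields


def species_richness(names):
    """Count distinct species epithets under each genus."""
    species_by_genus = defaultdict(set)
    for name in names:
        fields = parse_name(name)
        if fields["genus"] and fields["species"]:
            species_by_genus[fields["genus"]].add(fields["species"])
    return {genus: len(species) for genus, species in species_by_genus.items()}
```

Infraspecific taxa count toward their parent species in this sketch, so "Quercus alba" and "Quercus alba var. latiloba" contribute one species to Quercus. A genus that appears only as a bare genus name gets no entry, which corresponds to the "no species records" case.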

(To be continued ...)

External links

Phylosynth

BIEN taxonomy match

Last update:

Mon Feb 17 14:42:32 2020
