Merge pull request #62 from pirovc/dev
genome_updater v0.5.0
pirovc authored May 12, 2022
2 parents a5df18f + cf8f493 commit 1e046dd
Showing 28 changed files with 1,443 additions and 608 deletions.
7 changes: 7 additions & 0 deletions .codecov.yml
@@ -0,0 +1,7 @@
codecov:
  ci:
    - "travis.org"

ignore:
  - ".git/"
  - "tests/"
4 changes: 4 additions & 0 deletions .simplecov
@@ -0,0 +1,4 @@
require 'codecov'
require 'simplecov'

SimpleCov.formatter = Codecov::SimpleCov::Formatter
9 changes: 8 additions & 1 deletion .travis.yml
@@ -1,10 +1,17 @@
language: bash
dist: focal

before_install:
  - gem install bashcov codecov
  - sudo apt-get install parallel

script:
  - tests/libs/bats/bin/bats tests/integration_offline.bats
  - bashcov tests/libs/bats/bin/bats tests/integration_offline.bats

after_success:
  - curl -Os https://uploader.codecov.io/latest/linux/codecov
  - chmod +x codecov
  - ./codecov -f coverage/codecov-result.json -Z

notifications:
  email: false
324 changes: 173 additions & 151 deletions README.md

Large diffs are not rendered by default.

1,080 changes: 700 additions & 380 deletions genome_updater.sh

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion tests/README.md
@@ -2,4 +2,4 @@

genome_updater uses the [bats](https://github.com/bats-core/bats-core) testing framework for Bash.

Use the `download_test_set.sh` to re-create a random set of offline files to test. Files will be downloaded to `files/genomes`.
Use the `download_test_set.sh` to re-create a random set of offline files to test. Files will be downloaded to `files/genomes` and filtered taxonomies to `files/pub/taxonomy/new_taxdump` [ncbi] and `releases/latest` [gtdb].
29 changes: 29 additions & 0 deletions tests/download_test_set.sh
@@ -23,9 +23,38 @@ do
fi
head -n 2 "full_assembly_summary.txt" > "${out_as}"
tail -n+3 "full_assembly_summary.txt" | shuf | head -n ${entries} >> "${out_as}"
# create a dummy historical for gtdb tests (just a copy)
cp "${out_as}" "${out_as%.*}_historical.txt"
# Download files
tail -n+3 "${out_as}" | cut -f 20 | sed 's/https:/ftp:/g' | xargs -P ${entries} wget --quiet --show-progress --directory-prefix="${outfld}" --recursive --level 2 --accept "${ext}"
cp -r "${outfld}ftp.ncbi.nlm.nih.gov/genomes/" "${outfld}"
rm -rf "full_assembly_summary.txt" "${outfld}ftp.ncbi.nlm.nih.gov/"
done
done

# Download and filter taxonomies for used accessions/taxids

# Get used accessions and taxids
cut -f 1,6 ${outfld}genomes/*/assembly_summary_*.txt ${outfld}genomes/*/*/assembly_summary.txt | grep -v "^#" | sort | uniq > ${outfld}accessions_taxids.txt
# ncbi new_taxdump
wget --quiet --show-progress --output-document "${outfld}new_taxdump.tar.gz" "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
tar xf "${outfld}new_taxdump.tar.gz" -C "${outfld}" taxidlineage.dmp rankedlineage.dmp
mkdir -p "${outfld}pub/taxonomy/new_taxdump/"
cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "[^0-9]${1}[^0-9]" "'${outfld}'taxidlineage.dmp"' >> "${outfld}pub/taxonomy/new_taxdump/taxidlineage.dmp"
cat "${outfld}accessions_taxids.txt" | xargs -l bash -c 'grep "^${1}[^0-9]" "'${outfld}'rankedlineage.dmp"' >> "${outfld}pub/taxonomy/new_taxdump/rankedlineage.dmp"
find "${outfld}pub/taxonomy/new_taxdump/" -printf "%P\n" | tar -czf "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz" --no-recursion -C "${outfld}pub/taxonomy/new_taxdump/" -T -
md5sum "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz" > "${outfld}pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5"
rm "${outfld}new_taxdump.tar.gz" "${outfld}taxidlineage.dmp" "${outfld}rankedlineage.dmp" "${outfld}pub/taxonomy/new_taxdump/taxidlineage.dmp" "${outfld}pub/taxonomy/new_taxdump/rankedlineage.dmp"

#gtdb
gtdb_out="${outfld}releases/release207/207.0/"
mkdir -p "${gtdb_out}"
gtdb_tax=( "ar53_taxonomy_r207.tsv.gz" "bac120_taxonomy_r207.tsv.gz" )
for tax in "${gtdb_tax[@]}"; do
wget --quiet --show-progress --output-document "${outfld}${tax}" "https://data.gtdb.ecogenomic.org/releases/release207/207.0/${tax}"
join -1 1 -2 1 <(cut -f 1 "${outfld}accessions_taxids.txt" | sort) <(zcat "${outfld}${tax}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$1,$2}' | sed -r 's/^.{3}//' | sort) -t$'\t' -o "2.2,2.3" | gzip > "${gtdb_out}${tax}"
rm "${outfld}${tax}"
done

md5sum ${gtdb_out}*.tsv.gz > "${gtdb_out}MD5SUM"
rm ${outfld}accessions_taxids.txt
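
The GTDB filtering step in the script above can be hard to read on one line, so here is a minimal sketch of the same join logic on toy data. The accessions, lineages, and file names are made up for illustration. The sketch also writes intermediate files where the script uses process substitution, and it uses `gzip -dc` and `sed -E` where the script uses `zcat` and `sed -r`. It is an illustration of the technique, not the script itself.

```bash
#!/usr/bin/env bash
# Sketch only: toy data and hypothetical file names.
set -eu
export LC_ALL=C  # keep sort and join collation consistent

tmp="$(mktemp -d)"
trap 'rm -rf "${tmp}"' EXIT
tab="$(printf '\t')"

# Toy stand-in for accessions_taxids.txt: accession <tab> taxid
printf 'GCA_000000001.1\t562\nGCA_000000003.1\t1396\n' > "${tmp}/accessions_taxids.txt"

# Toy stand-in for a GTDB taxonomy file: prefixed accession <tab> lineage
printf 'GB_GCA_000000001.1\td__Bacteria;p__Example\nRS_GCF_000000002.1\td__Bacteria;p__Other\nGB_GCA_000000003.1\td__Archaea;p__Example\n' \
  | gzip > "${tmp}/toy_taxonomy.tsv.gz"

# Left side of the join: the accessions in use, sorted
cut -f 1 "${tmp}/accessions_taxids.txt" | sort > "${tmp}/wanted.txt"

# Right side: duplicate column 1, then strip the 3-character prefix
# ("GB_" or "RS_") from the first copy so it can serve as the join key
gzip -dc "${tmp}/toy_taxonomy.tsv.gz" \
  | awk 'BEGIN{FS=OFS="\t"}{print $1,$1,$2}' \
  | sed -E 's/^.{3}//' \
  | sort > "${tmp}/keyed.txt"

# Join on the unprefixed key; output the original prefixed accession and lineage
filtered="$(join -t "${tab}" -1 1 -2 1 -o 2.2,2.3 "${tmp}/wanted.txt" "${tmp}/keyed.txt")"
printf '%s\n' "${filtered}"
```

On this toy input only the two `GB_` rows are kept. The `RS_GCF_000000002.1` row is dropped because its accession is not in the accession list.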
@@ -0,0 +1,22 @@
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_903930505.1 PRJEB38681 SAMEA6952057 CAIYYQ000000000.1 na 2026739 2026739 Euryarchaeota archaeon AlinenSedimentsCore2_bin-0840 latest Contig Major Full 2020/07/18 freshwater MAG --- AlinenSedimentsCore2_bin-0840 BILS na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/903/930/505/GCA_903930505.1_freshwater_MAG_---_AlinenSedimentsCore2_bin-0840 derived from metagenome; genus undefined na
GCA_903858355.1 PRJEB38681 SAMEA6954579 CAIOIP000000000.1 na 2220064 2220064 uncultured Candidatus Micrarchaeota archaeon AlinenSedimentsD1_bin-0133 latest Contig Major Full 2020/07/16 freshwater MAG --- AlinenSedimentsD1_bin-0133 BILS na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/903/858/355/GCA_903858355.1_freshwater_MAG_---_AlinenSedimentsD1_bin-0133 derived from environmental source; derived from metagenome na
GCA_016839815.1 PRJNA680430 SAMN16492231 JAEOTM000000000.1 na 2800102 2800102 Candidatus Hodarchaeota archaeon YT2_004 latest Contig Major Full 2021/02/09 ASM1683981v1 Shenzhen Univeristy na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/839/815/GCA_016839815.1_ASM1683981v1 derived from metagenome; genus undefined na
GCA_011389385.1 PRJNA480137 SAMN09639886 DTGE00000000.1 na 2026714 2026714 Candidatus Bathyarchaeota archaeon SpSt-755 latest Contig Major Full 2020/03/17 ASM1138938v1 The University of Hong Kong na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/389/385/GCA_011389385.1_ASM1138938v1 derived from metagenome; genus undefined na
GCA_017656495.1 PRJNA635695 SAMN15049706 JACDNS000000000.1 na 35749 35749 Thermococcus sp. GB_MAG1_027 latest Contig Major Full 2021/04/01 ASM1765649v1 Marine Biological Laboratory na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/656/495/GCA_017656495.1_ASM1765649v1 derived from metagenome na
GCA_018645535.1 PRJNA630981 SAMN14913871 JABGWN000000000.1 na 2026739 2026739 Euryarchaeota archaeon SI034_bin52 latest Contig Major Full 2021/06/02 ASM1864553v1 The University of Melbourne na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/645/535/GCA_018645535.1_ASM1864553v1 derived from metagenome; genus undefined na
GCA_002499365.1 PRJNA348753 SAMN06027185 DALD00000000.1 na 1915872 1915872 Euryarchaeota archaeon UBA29 UBA29 latest Scaffold Major Full 2017/10/10 ASM249936v1 University of Queensland na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/499/365/GCA_002499365.1_ASM249936v1 derived from metagenome; genus undefined na
GCA_004525575.1 PRJNA511814 SAMN11127074 SPCB00000000.1 na 2053491 2053491 Candidatus Thorarchaeota archaeon das_tool.maxbin2.13 latest Contig Major Full 2019/03/30 ASM452557v1 Radboud University Njmegen na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/525/575/GCA_004525575.1_ASM452557v1 derived from metagenome; genus undefined na
GCA_011335015.1 PRJNA480137 SAMN09639889 DTGH00000000.1 na 2250274 2250274 Candidatus Micrarchaeota archaeon SpSt-758 latest Contig Major Full 2020/03/16 ASM1133501v1 The University of Hong Kong na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/335/015/GCA_011335015.1_ASM1133501v1 derived from metagenome; genus undefined na
GCA_002069705.1 PRJNA321808 SAMN05004159 MWBV00000000.1 na 1852841 1852841 Candidatus Diapherotrites archaeon ADurb.Bin253 ADurb.Bin253 latest Contig Major Full 2017/03/22 ASM206970v1 University of Illinois na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/069/705/GCA_002069705.1_ASM206970v1 derived from metagenome; genus undefined na
GCA_900316635.1 PRJEB21624 SAMEA104666887 ONDQ00000000.1 na 253161 253161 uncultured Methanobrevibacter sp. RUG201 latest Scaffold Major Full 2018/03/21 Rumen uncultured genome RUG201 THE ROSLIN INSTITUTE na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/316/635/GCA_900316635.1_Rumen_uncultured_genome_RUG201 derived from environmental source na
GCA_011388575.1 PRJNA480137 SAMN09638894 DRUB00000000.1 na 334771 334771 Ignisphaera aggregans SpSt-1 latest Contig Major Full 2020/03/17 ASM1138857v1 The University of Hong Kong na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/388/575/GCA_011388575.1_ASM1138857v1 derived from metagenome na
GCA_018304485.1 PRJNA288027 SAMN18341270 JAGVWB000000000.1 na 2026736 2026736 Candidatus Diapherotrites archaeon RIFCSPLOWO2_01_FULL_43_13 latest Scaffold Major Full 2021/05/07 ASM1830448v1 Banfield Lab, University of California, Berkeley na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/304/485/GCA_018304485.1_ASM1830448v1 derived from metagenome; genus undefined na
GCA_018676255.1 PRJNA630981 SAMN14914095 JABHFD000000000.1 na 2026739 2026739 Euryarchaeota archaeon SI037_bin172 latest Contig Major Full 2021/06/02 ASM1867625v1 The University of Melbourne na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/676/255/GCA_018676255.1_ASM1867625v1 derived from metagenome; genus undefined na
GCA_016196285.1 PRJNA640378 SAMN15435488 JACPXY000000000.1 na 2026773 2026773 Candidatus Pacearchaeota archaeon NC_groundwater_849_Pr1_B-0.1um_42_10 latest Contig Major Full 2020/12/21 ASM1619628v1 Innovative Genomics Institute na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/196/285/GCA_016196285.1_ASM1619628v1 derived from metagenome; genus undefined na
GCA_002497565.1 PRJNA348753 SAMN06027207 DADS00000000.1 na 1915824 1915824 Euryarchaeota archaeon UBA179 UBA179 latest Scaffold Major Full 2017/10/10 ASM249756v1 University of Queensland na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/497/565/GCA_002497565.1_ASM249756v1 derived from metagenome; genus undefined na
GCA_902383905.1 PRJEB33885 SAMEA5851664 representative genome 1406512 1406512 Candidatus Methanomassiliicoccus intestinalis MGYG-HGUT-02160 latest Complete Genome Major Full 2019/08/10 UHGG_MGYG-HGUT-02160 EMG GCF_902383905.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/383/905/GCA_902383905.1_UHGG_MGYG-HGUT-02160 na
GCA_018692575.1 PRJNA630981 SAMN14914238 JABHKQ000000000.1 na 2026803 2026803 Candidatus Woesearchaeota archaeon SI037S2_bin24 latest Contig Major Full 2021/06/02 ASM1869257v1 The University of Melbourne na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/692/575/GCA_018692575.1_ASM1869257v1 derived from metagenome; genus undefined na
GCA_013390775.1 PRJNA640238 SAMN15312031 JACATB000000000.1 na 2511932 2511932 Marine Group I thaumarchaeote strain=D11 latest Scaffold Major Full 2020/07/06 ASM1339077v1 National Science Foundation of China GCF_013390775.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/390/775/GCA_013390775.1_ASM1339077v1 genus undefined na
GCA_002727275.1 PRJNA391943 SAMN07618837 PBWO00000000.1 na 2026739 2026739 Euryarchaeota archaeon RS814 latest Contig Major Full 2017/10/26 ASM272727v1 Tara Oceans Consortium na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/727/275/GCA_002727275.1_ASM272727v1 derived from metagenome; genus undefined na
@@ -0,0 +1,22 @@
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_002566855.1 PRJNA400804 SAMN07598389 NUZM00000000.1 na 1396 1396 Bacillus cereus strain=AFS074515 latest Scaffold Major Full 2017/10/17 ASM256685v1 UNC Chapel Hill GCF_002566855.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/566/855/GCA_002566855.1_ASM256685v1 na
GCA_902635445.1 PRJEB33281 SAMEA6073950 CACPNU000000000.1 na 198431 198431 uncultured prokaryote latest Contig Major Full 2019/11/05 AG-915-F08 WOODS HOLE OCEANOGRAPHIC INSTITUTION na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/635/445/GCA_902635445.1_AG-915-F08 derived from environmental source; derived from metagenome na
GCA_017159575.1 PRJNA287430 SAMN17764286 AAZEKK000000000.1 na 197 197 Campylobacter jejuni strain=FSIS12137393 latest Contig Major Full 2021/03/03 PDT000946857.1 USDA FSIS na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/159/575/GCA_017159575.1_PDT000946857.1 from large multi-isolate project na
GCA_005728625.1 PRJNA280335 SAMN10715290 AADQWW000000000.1 na 28901 28901 Salmonella enterica strain=ADRDL-2252 latest Contig Major Full 2019/05/23 PDT000448312.1 US Food and Drug Administration na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/005/728/625/GCA_005728625.1_PDT000448312.1 from large multi-isolate project na
GCA_013911495.1 PRJNA638822 SAMN15215249 JACETB000000000.1 na 1131 1131 Synechococcus sp. MCMED-G31 latest Contig Major Full 2020/07/29 ASM1391149v1 Evolutionary Genomics Group na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/911/495/GCA_013911495.1_ASM1391149v1 derived from metagenome na
GCA_004008395.1 na 2499034 2499034 Mycobacterium phage Cici latest Complete Genome Major Full 2019/01/08 ASM400839v1 na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/004/008/395/GCA_004008395.1_ASM400839v1 na
GCA_021355205.1 na 2894335 2894335 Burkholderia phage BgManors32 latest Complete Genome Major Full 2021/11/22 ASM2135520v1 na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/021/355/205/GCA_021355205.1_ASM2135520v1 na
GCA_003635585.1 PRJNA374603 SAMN06329599 MVSU00000000.1 na 210 210 Helicobacter pylori strain=HPAS14 latest Contig Major Full 2018/10/12 ASM363558v1 University of Western Australia GCF_003635585.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/635/585/GCA_003635585.1_ASM363558v1 na
GCA_012763735.1 PRJNA277984 SAMN04510396 AATCVN000000000.1 na 562 562 Escherichia coli strain=CDPHFDLB-F1602032-026A latest Contig Major Full 2020/04/23 PDT000113200.3 US Food and Drug Administration na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/763/735/GCA_012763735.1_PDT000113200.3 from large multi-isolate project na
GCA_013619715.1 PRJNA615626 SAMN14453445 JACEKU000000000.1 na 287 287 Pseudomonas aeruginosa strain=LiP14 latest Contig Major Full 2020/07/24 ASM1361971v1 University of Oxford GCF_013619715.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/619/715/GCA_013619715.1_ASM1361971v1 na
GCA_008787855.1 PRJNA292661 SAMN12842867 AALEUD000000000.1 na 28901 28901 Salmonella enterica strain=CVM N19S0343 latest Contig Major Full 2019/10/01 PDT000594120.1 FDA na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/008/787/855/GCA_008787855.1_PDT000594120.1 from large multi-isolate project na
GCA_903218915.1 PRJEB35770 SAMEA6813852 CAEZVL000000000.1 na 449393 449393 freshwater metagenome latest Contig Major Full 2020/06/05 UFOp-RE-23may17-586 BIOLOGY CENTRE ASCR, V.V.I., INSTITUTE OF HYDROBIO na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/903/218/915/GCA_903218915.1_UFOp-RE-23may17-586 derived from environmental source; metagenome na
GCA_008201245.1 PRJNA248792 SAMN03479222 AAJWIJ000000000.1 na 90371 28901 Salmonella enterica subsp. enterica serovar Typhimurium strain=7397 latest Contig Major Full 2019/09/02 PDT000058697.2 Public Health England na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/008/201/245/GCA_008201245.1_PDT000058697.2 from large multi-isolate project na
GCA_011078725.1 PRJNA248792 SAMN03168749 AAPFHW000000000.1 na 90371 28901 Salmonella enterica subsp. enterica serovar Typhimurium strain=H120980533 latest Contig Major Full 2020/03/09 PDT000042974.4 Public Health England na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/078/725/GCA_011078725.1_PDT000042974.4 from large multi-isolate project na
GCA_013549135.1 PRJNA230403 SAMN15522001 AATZYI000000000.1 na 28901 28901 Salmonella enterica strain=PNUSAS152956 latest Contig Major Full 2020/07/23 PDT000787515.1 CDC na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/549/135/GCA_013549135.1_PDT000787515.1 from large multi-isolate project na
GCA_018937815.1 PRJNA218110 SAMN19697485 ABAWPX000000000.1 na 562 562 Escherichia coli strain=PNUSAE074529 latest Contig Major Full 2021/06/17 PDT001069867.1 CDC na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/937/815/GCA_018937815.1_PDT001069867.1 from large multi-isolate project na
GCA_005603115.1 PRJNA230403 SAMN11552442 AADIAU000000000.1 na 28901 28901 Salmonella enterica strain=PNUSAS073825 latest Contig Major Full 2019/05/21 PDT000496874.1 CDC na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/005/603/115/GCA_005603115.1_PDT000496874.1 from large multi-isolate project na
GCA_019997905.1 PRJNA685966 SAMN21249929 na 283734 283734 Staphylococcus pseudintermedius strain=HSP149 latest Complete Genome Major Full 2021/09/15 ASM1999790v1 Universitat Autonoma de Barcelona na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/997/905/GCA_019997905.1_ASM1999790v1 from large multi-isolate project na
GCA_011897165.1 PRJNA218110 SAMN12361411 AARDFA000000000.1 na 562 562 Escherichia coli strain=PNUSAE027109 latest Contig Major Full 2020/04/02 PDT000549212.2 CDC na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/897/165/GCA_011897165.1_PDT000549212.2 from large multi-isolate project na
GCA_015893745.1 PRJNA514245 SAMN15566993 DACSEB000000000.1 na 575 575 Raoultella planticola MISC077 latest Contig Major Full 2020/12/09 PDT000883933.1 National Center for Biotechnology Information na na https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/893/745/GCA_015893745.1_PDT000883933.1 from large multi-isolate project na
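
For reference, `tests/download_test_set.sh` extracts accession/taxid pairs from summaries like the ones above with `cut -f 1,6`, dropping comment lines and duplicates. Below is a minimal sketch of that step. It assumes tab-separated input, truncates the rows to the first seven columns, and repeats one row on purpose. The temporary file name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: truncated rows in a hypothetical temporary file.
set -eu

tmp="$(mktemp -d)"
trap 'rm -rf "${tmp}"' EXIT
summary="${tmp}/assembly_summary.txt"

{
  printf '# See README_assembly_summary.txt for a description of the columns in this file.\n'
  printf '# assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category\ttaxid\tspecies_taxid\n'
  printf 'GCA_002566855.1\tPRJNA400804\tSAMN07598389\tNUZM00000000.1\tna\t1396\t1396\n'
  printf 'GCA_008201245.1\tPRJNA248792\tSAMN03479222\tAAJWIJ000000000.1\tna\t90371\t28901\n'
  # Deliberate duplicate row, to show the effect of sort | uniq
  printf 'GCA_002566855.1\tPRJNA400804\tSAMN07598389\tNUZM00000000.1\tna\t1396\t1396\n'
} > "${summary}"

# Column 1 is the assembly accession, column 6 the taxid;
# both header lines start with "#" and are filtered out
pairs="$(cut -f 1,6 "${summary}" | grep -v "^#" | LC_ALL=C sort | uniq)"
printf '%s\n' "${pairs}"
```

Note that column 6 is `taxid`, not `species_taxid`. For the Typhimurium row the pair therefore carries 90371 rather than 28901.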