Code corresponding to the paper:
Diverse and abundant viruses exploit conjugative plasmids (2023)
[bioRxiv]
Natalia Quinones-Olvera*, Siân V. Owen*, Lucy M. McCully, Maximillian G. Marin, Eleanor A. Rand, Alice C. Fan, Oluremi J. Martins Dosumu, Kay Paul, Cleotilde E. Sanchez Castaño, Rachel Petherbridge, Jillian S. Paull, Michael Baym
- Code
- Snakemake pipeline for genome assembly and annotation
- Data
- SRA BioProject:
PRJNA954020
- Sequencing runs:
SRR24145707-SRR24145772
- Metadata
phage_metadata.tsv
- Assemblies
genomes/data/assemblies_oriented
- Genbank annotations
genomes/data/annotation/_gbks/
- Code
- Jupyter notebook with image processing:
Fig1.ipynb
- Data
- Image files:
Fig1/data
- Code
- Jupyter notebook: Whole genome alignment and tree building
- Key commands:
    # whole genome alignment
    clustalo -i <alphatv.fasta> -o <alphatv.msa.fasta> --outfmt=fa
    # tree building
    iqtree -st DNA -m MFP -bb 1000 -alrt 1000 -s <alphatv.msa.trim.fasta>
- Data
- Genomes used (unaligned fasta):
alphatv.fasta
- Whole-genome alignment (trimmed aligned fasta):
alphatv.msa.trim.fasta
- Newick tree file (iqtree output):
alphatv.msa.trim.fasta.treefile
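To check quickly which genomes ended up in the tree, the leaf labels can be pulled out of the Newick file without any phylogenetics library. The following is only a minimal sketch: `newick_leaf_names` is a hypothetical helper (not part of this repository), it assumes unquoted labels, and the toy tree string is made up for illustration.

```python
import re


def newick_leaf_names(newick: str) -> list[str]:
    """Return leaf labels of an unquoted Newick string, in order of appearance."""
    # A leaf label directly follows "(" or ","; internal node labels
    # (e.g. support values) follow ")" and are therefore skipped.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Toy tree with support values on the internal node (illustrative only).
toy_tree = "((genomeA:0.1,genomeB:0.2)95/100:0.05,genomeC:0.3);"
print(newick_leaf_names(toy_tree))

# For the real file, something like:
# with open("alphatv.msa.trim.fasta.treefile") as fh:
#     print(newick_leaf_names(fh.read()))
```

The regular expression only handles plain labels; quoted labels or labels containing spaces would need a proper Newick parser.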
- Code
- Jupyter notebook: Map figure
- Data
- Coordinates and references for map:
coordinates.tsv
- Code
- Snakemake: Pipeline producing the alignments and nucleotide diversity calculation.
- Key commands:
    # align each assembly to reference
    minimap2 -ax asm20 -B2 -O6,26 --end-bonus 100 --cs <NC_001421.fasta> <assembly> > <output.sam>
    # calculate nucleotide diversity
    vcftools --vcf <merged_vcf> --window-pi 100 --window-pi-step 1 --out <NucDiv.100bp.slideby1.windowed.pi>
- Jupyter notebook: Plot, heatmap, and genome map.
- Data
- Nucleotide diversity values for sliding window size 100 bp:
NucDiv.100bp.slideby1.windowed.pi
- PRD1 reference annotation (curated version of NC_001421):
PRD1_updated.gb
- Assemblies:
genomes/data/assemblies_oriented
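For readers who want to inspect the nucleotide diversity values outside the notebook, the table written by `vcftools --window-pi` is tab-separated with the columns `CHROM`, `BIN_START`, `BIN_END`, `N_VARIANTS` and `PI`. The sketch below assumes that layout; the helper names and the toy rows are made up for illustration and are not taken from this repository.

```python
import csv
import io


def read_windowed_pi(handle):
    """Parse a vcftools .windowed.pi table into (bin_start, bin_end, pi) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(int(r["BIN_START"]), int(r["BIN_END"]), float(r["PI"])) for r in reader]


def most_diverse_window(windows):
    """Return the window with the highest nucleotide diversity."""
    return max(windows, key=lambda w: w[2])


# Toy table with invented values, only to show the expected layout.
toy = io.StringIO(
    "CHROM\tBIN_START\tBIN_END\tN_VARIANTS\tPI\n"
    "NC_001421\t1\t100\t3\t0.0125\n"
    "NC_001421\t2\t101\t4\t0.0200\n"
)
windows = read_windowed_pi(toy)
print(most_diverse_window(windows))

# For the real file, something like:
# with open("NucDiv.100bp.slideby1.windowed.pi") as fh:
#     windows = read_windowed_pi(fh)
```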
- Code
- Jupyter notebook: Processing growth curves, calculating area under the curve and liquid assay score, producing heatmap.
- Jupyter notebook: Plotting sample curves and heatmap.
- Custom functions imported in notebooks:
EOL_tools.py
- Data
- Raw growth curve data:
all_growthcurves.tsv
- Liquid assay score values:
all_liquidasssayscores.tsv
- Phage tree: (see Figure 2)
- Strain 16S alignment:
16S.afa
- Strain tree:
16S.tree
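As a rough illustration of the area-under-the-curve step mentioned above, the sketch below integrates a growth curve with the trapezoidal rule. This is an assumption about how such an area can be computed, not a copy of the code in the notebooks or in `EOL_tools.py`; the function name and the toy measurements are hypothetical, and the liquid assay score itself is not reproduced here.

```python
def area_under_curve(times, values):
    """Trapezoidal area under a curve sampled at the given time points."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    area = 0.0
    # Sum the trapezoid between each pair of consecutive measurements.
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (values[i] + values[i - 1]) / 2
    return area


# Toy optical density readings at hourly intervals (invented values).
hours = [0, 1, 2]
od = [0.0, 0.5, 1.0]
print(area_under_curve(hours, od))
```

The trapezoidal rule makes no assumption about the shape of the curve, which suits unevenly spaced plate-reader time points as well.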
- Code
- Plot abundance
abundance.ipynb
- Data
- Raw counts
counts.tsv
- Data
- NCBI RefSeq/Genbank tectiviruses:
NCBI/tectivirus_metadata.tsv
- JGI IMG/VR matches:
JGI_IMGVR/JGI_metadata.tsv
- From Yutin et al. (2018) (paper):
Yutin/yutin_metadata.tsv
- Genbank files of shown genomes:
figures/Fig4/data/tecti_genomes/gb
- HMM models used for color annotations:
Fig4/data/models/hmm
- Code
- Jupyter notebook: Build ATPase tree
- Key commands:
    # align ATPase sequences from all tectiviruses with ATPase hmm model
    hmmalign --trim <IX.2.hmm> <P9.faa> | esl-reformat --gapsym='-' afa - > <P9.afa>
    # build tree
    phyml -d aa -m LG -b -4 -v 0.0 -c 4 -a e -f e --no_memory_check -i <P9.phy>
- Data
- ATPase hmm model:
IX.2.hmm
- ATPase sequences used (unaligned fasta):
P9.faa
- ATPase alignment (aligned fasta):
P9.afa
- Newick tree:
P9.phy_phyml_tree
- Code
- Jupyter notebook: Build kraken database with viral database + tectiviruses from this study.
- Snakemake pipeline: To run kraken on metagenomic datasets.
- Key commands:
kraken2 --paired --report <kraken_report> --db <custom_db> <fastq_1> <fastq_2> > <kraken_results>
- Jupyter notebook: Extract kraken results and produce plot.
- Data
- Kraken results summary
results.tsv
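To give an idea of how a Kraken 2 report can be summarised, the sketch below looks up the number of reads assigned to a clade by name. It assumes the standard tab-separated report layout (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name); `clade_reads` is a hypothetical helper, not the code used in the notebook, and every value in the toy report, including the taxonomy ID, is invented.

```python
import io


def clade_reads(report_handle, taxon_name):
    """Return the clade-level read count for taxon_name, or 0 if it is absent."""
    for line in report_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The name is the last column and is indented to show taxonomy depth.
        if fields[-1].strip() == taxon_name:
            return int(fields[1])
    return 0


# Toy report (all numbers, including the taxonomy ID, are placeholders).
toy_report = (
    " 98.00\t9800\t9800\tU\t0\tunclassified\n"
    "  2.00\t200\t0\tR\t1\troot\n"
    "  0.50\t50\t5\tG\t999999\t      Alphatectivirus\n"
)
print(clade_reads(io.StringIO(toy_report), "Alphatectivirus"))
```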
- Code
- Jupyter notebook: Align metagenomic reads to reference PRD1 genome
- Data
- SRA BioProject:
PRJNA954020
- Runs:
SRR24211943-SRR24211944
- Metagenomic reads classified as alphatectivirus
- Mapped reads
- Code
- Jupyter notebook: Produce trees
- Data
- Genomes used (unaligned fasta)
- Emesvirus
emesvirus.fasta
- Qubevirus
qubevirus.fasta
- Inovirus
inovirus.fasta
- Alignments (trimmed aligned fasta)
- Emesvirus
emesvirus.trim.afa
- Qubevirus
qubevirus.trim.afa
- Inovirus
inovirus.trim.afa
- Newick trees
- Emesvirus
emesvirus.tree
- Qubevirus
qubevirus.tree
- Inovirus
inovirus.tree
- Code
- Jupyter notebook: Produce genome map graphic.
- Data
- FtMidnight genbank file
FtMidnight.rotated.gb
Everything in the notebooks should be able to run after installing this conda environment.
conda env create -f envs/pdep.yml
I have tried to include all the raw files in this repository, except for large files such as sequencing runs, which can be accessed through the SRA (see the accessions listed in the relevant sections). Likewise, some intermediate files might be absent, but all of them should be obtainable by running the code in the notebooks.
The snakemake pipelines should also run from the same conda environment. Additional dependencies of each pipeline are included in the envs/ directory, next to the corresponding Snakefile, and are handled by snakemake. For each pipeline I've included a run_snakemake.sh and a run_snakemake.loc.sh file, which show how to run it on a computer cluster or locally, respectively.
If you have trouble finding or running anything shown here, please do get in contact. You can submit an issue or send me an email: [email protected]