The reward for doing good is the opportunity to do more. -Jonas Salk
The reward for a good deed is another good deed. -Ben Azai.
The VirSieve pipeline began with the realization that the bioinformatic analysis of viral strains in wastewater might bioinformatically resemble the bioinformatics of analyzing intratumoral heterogeniety from formalin-fixed paraffin-embedded tissues. We hope our approach will prove to be a useful tool for the analysis and surveillance of emerging and established viral variants using wastewater-based epidemiology.
We are sponsoring a wastewater-based epidemiology symposium in January 2025. Click here to register )
This pipeline is still underdevelopment and testing, but will remain open-source and will always invite community participation and feedback. If you have any questions or suggestions (or a desire to collaborate), please email mweinstein @t zymoresearch .com.
sudo apt update
sudo apt install git
sudo apt install docker.io
sudo gpasswd -a $USER docker
git clone --recursive https://github.com/Zymo-Research/VirSieve.git
cd VirSieve
python3 containerBuild.py # This command will build the containers. It and all commands above only need to be run once per system as part of the setup.
python3 runPipeline.py /path/to/working/directory # This command will also be used to initiate subsequent analyses on new data
This software should be able to run on nearly any system with Docker installed on an account with sufficient privileges to run Docker containers. Having Git on the system will help with recursively cloning the relevant submodules and keeping the pipeline up to date, but the code can always be manually copied from Github. The Python code run outside the container (to build the containers and then run them) is intentionally kept simple and should be compatible with any currently-supported version of Python along with several versions long since deprecated.
All but the last line of the code given above is part of the installation process and should only need to be run one
time per system. If Docker and Git are already present on the system, the first three lines can be skipped. Additionally,
if Docker is already present and running on the system, the user is likely to already have permission to run Docker containers
without having to SUDO. If so, the fourth line can be skipped as well (this can be tested by running docker container run hello-world
).
To run the pipeline, simply use the following command:
python3 runPipeline.py /path/to/working/directory
The working directory will be expected to have the following structure for the pipeline to run automatically:
.
+-- workingDirecctoryName
| +-- rawFASTQ
| +-- reads-pe1.fastq
| +-- reads-pe2.fastq
| +-- adapters.fa
| +-- primers.bed
The adapters.fa and primers.bed files should have those exact names in those exact cases. The primer file should be a BED-formatted file containing the primer binding intervals within the genome for use in the iVar portion of the pipeline to trim the primers. The adapters.fa file should be a FASTA-formatted file containing the adapter sequences to be trimmed from the reads. The forward- and reverse-read FASTQ files can have any name, but they will be expected to follow the Illumina standard of using underscores as a delimiter with the first field being the sample name (if you have underscores within your sample names, they will likely be truncated). There should only be two FASTQ files present and they should end in some appropriate file extension to identify them as such.
THis pipeline will retain all intermediate files for traceability and diagnostic purposes. The folders created within the working directory will give a clue as to the source of the files contained within. The freyjaOutput folder will contain data relevant to Freyja as well as the result of the demixing of the sample. The vepOutputs folder will have the reported variants along with their expected consequences. The filteredVCF folder will have the VCF that went through filtering in GATK for different attributes affecting variant confidence. By default, only variants with 1% or greater abundance will be reported. The BQSR BAM file will contain all of the aligned reads with recalibrated base quality scores that was used for variant calling while the Mutect2-generated BAM file showing any realignments at active sites (which can be found in the rawVCF folder) will contain the read data as seen by Mutect2 from de novo reassembly around active sites.
We welcome and encourage contributions to this project from the community and will happily accept and acknowledge input (and possibly provide some free kits as a thank you). We aim to provide a positive and inclusive environment for contributors that is free of any harassment or excessively harsh criticism. Our Golden Rule: Treat others as you would like to be treated.
We use a modification of Semantic Versioning to identify our releases.
Release identifiers will be major.minor.patch
Major release: Newly required parameter or other change that is not entirely backwards compatible Minor release: New optional parameter Patch release: No changes to parameters
- Michael M. Weinstein - Project Lead, Programming and Design - michael-weinstein
This project is still in development, but we expect to keep it fully open-source with minimal restrictions on usage.
We would like to thank the following, without whom this would not have happened:
- The Python Foundation
- The staff at Zymo Research
- Our customers