coproID helps you identify the "true maker" of Illumina-sequenced coprolites/paleofeces by checking the microbiome composition and the endogenous DNA.
It combines the analysis of putative host ancient DNA with a machine learning prediction of the feces source based on microbiome taxonomic composition:
- (A) First, coproID performs a comparative mapping of all reads against two (or three) target genomes (genome1, genome2, and optionally genome3) and computes a host-DNA species ratio (NormalizedRatio).
- (B) Then coproID performs a metagenomic taxonomic profiling, and compares the obtained profiles to modern reference samples of the target species metagenomes. Using machine learning, coproID then estimates the host source from the metagenomic taxonomic composition (prop_microbiome).
- Finally, coproID combines A and B to predict the likely host of the metagenomic sample.
The coproID pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.
A detailed description of coproID can be found in the article published in PeerJ.
i. Install nextflow
ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)
iii. Download the pipeline and test it on a minimal dataset with a single command
nextflow run nf-core/coproid -profile test,<docker/singularity/conda/institute>
Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile institute in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
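If no institutional profile exists, the profile has to be chosen by hand. The sketch below shows one way to pick it from whatever is installed, following the stated preference for Docker or Singularity over Conda. The helper name first_available is hypothetical and is not part of coproID or nf-core; it assumes that the executable names match the profile names.

```shell
# Hypothetical helper (not part of coproID): print the first command from the
# argument list that is available on PATH, or return 1 if none is found.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use: prefer Docker, then Singularity, and fall back to Conda last.
# engine=$(first_available docker singularity conda) &&
#     nextflow run nf-core/coproid -profile "test,${engine}"
```

The nextflow call is left commented out so that sourcing the snippet only defines the helper and does not start a run.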
iv. Start running your own analysis!
nextflow run maxibor/coproid --genome1 'GRCh37' --genome2 'CanFam3.1' --name1 'Homo_sapiens' --name2 'Canis_familiaris' --reads '*_R{1,2}.fastq.gz' --krakendb 'path/to/minikraken_db' -profile docker
This command runs coproID to estimate whether the test samples (--reads '*_R{1,2}.fastq.gz') come from a human (--genome1 'GRCh37' --name1 'Homo_sapiens') or a dog (--genome2 'CanFam3.1' --name2 'Canis_familiaris'), and specifies the path to the minikraken database (--krakendb 'path/to/minikraken_db').
NB: The example above assumes access to iGenomes.
See usage docs for all of the available options when running the pipeline.
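Before launching a run, it can help to confirm that the input files actually match the --reads pattern used in the example above. The following is a minimal sketch, assuming paired-end data named with the _R1.fastq.gz / _R2.fastq.gz suffixes from that glob; the helper name missing_mates is hypothetical and not part of coproID.

```shell
# Hypothetical pre-flight helper (not part of coproID): list every
# *_R1.fastq.gz file in a directory that has no matching *_R2.fastq.gz mate.
missing_mates() {
    dir=${1:-.}
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # expected name of the mate file
        [ -e "$r2" ] || printf '%s\n' "$r1"   # report forward reads without a mate
    done
}

# Example use: print unpaired forward-read files in the current directory.
# missing_mates .
```

An empty output means every forward-read file found has a mate with the expected name.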
The nf-core/coproid pipeline comes with documentation about the pipeline, found in the docs/ directory and at the following address: coproid.readthedocs.io
- Installation
- Pipeline configuration
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
nf-core/coproid was written by Maxime Borry.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).
coproID has been published in PeerJ. The BibTeX citation is available below:
@article{borry_coproid_2020,
title = {{CoproID} predicts the source of coprolites and paleofeces using microbiome composition and host {DNA} content},
volume = {8},
issn = {2167-8359},
url = {https://peerj.com/articles/9001},
doi = {10.7717/peerj.9001},
language = {en},
urldate = {2020-04-20},
journal = {PeerJ},
author = {Borry, Maxime and Cordova, Bryan and Perri, Angela and Wibowo, Marsha and Honap, Tanvi Prasad and Ko, Jada and Yu, Jie and Britton, Kate and Girdland-Flink, Linus and Power, Robert C. and Stuijts, Ingelise and Salazar-García, Domingo C. and Hofman, Courtney and Hagan, Richard and Kagoné, Thérèse Samdapawindé and Meda, Nicolas and Carabin, Helene and Jacobson, David and Reinhard, Karl and Lewis, Cecil and Kostic, Aleksandar and Jeong, Choongwon and Herbig, Alexander and Hübner, Alexander and Warinner, Christina},
month = apr,
year = {2020},
note = {Publisher: PeerJ Inc.},
pages = {e9001}
}
- AdapterRemoval v2 Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. https://doi.org/10.1186/s13104-016-1900-2
- FastQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Bowtie2 Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357. https://doi.org/10.1038/nmeth.1923
- Samtools Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
- Kraken2 Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. BioRxiv, 762302. https://doi.org/10.1101/762302
- PMDTools Skoglund, P., Northoff, B. H., Shunkov, M. V., Derevianko, A. P., Pääbo, S., Krause, J., & Jakobsson, M. (2014). Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America, 111(6), 2229–2234. https://doi.org/10.1073/pnas.1318934111
- DamageProfiler Neukamm, J. (Unpublished). https://doi.org/10.5281/zenodo.1064062
- Sourcepredict Borry, M. (2019). Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification. The Journal of Open Source Software. https://doi.org/10.21105/joss.01540
- MultiQC Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354