DiTaxa

Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection

Asgari E., Münch P.C., Lesker T.R., McHardy A.C.★ and Mofrad M.R.K.★, Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection. bioRxiv, 2018. Available at: bioRxiv 334722; doi: https://doi.org/10.1101/334722

The datasets are also available for download.

Contact/Developer: Ehsaneddin Asgari (asgari [at] berkeley [dot] edu)
Project page: http://llp.berkeley.edu/ditaxa
PIs: Prof. Alice McHardy* and Prof. Mohammad Mofrad*

Summary

We propose subsequence-based 16S rRNA data processing as a new paradigm for sequence phenotype classification and biomarker detection. The method and software, called DiTaxa, replaces standard OTU clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as the data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation, called nucleotide-pair encoding (NPE), is an unsupervised, data-driven segmentation inspired by byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 13 out of 21 taxa with confirmed links to periodontitis (recall=0.62), relative to 3 out of 21 taxa (recall=0.14) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to 0.91 and 0.90, respectively. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively with the state of the art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features, with a macro-F1 score of 0.76 compared to 0.65. Owing to its alignment- and reference-free nature, DiTaxa can run efficiently on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ~1.5 hours on 20 cores, while the standard pipeline needed ~6.5 hours in the same setting.
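
To make the segmentation idea concrete, the following is a minimal sketch of byte-pair-encoding-style merging applied to nucleotide strings. It is not the DiTaxa implementation: the function names (`learn_merges`, `apply_merge`, `segment`), the tie-breaking rule and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of merge operations from a set of sequences."""
    # Start from single nucleotides.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption
        # of this sketch) so that the result is deterministic.
        best = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a new sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy reads, invented for illustration.
reads = ["ACGTACGT", "ACGTTT"]
merges = learn_merges(reads, num_merges=3)
print(segment("ACGTACGA", merges))
```

In a setting like the one described above, the resulting variable-length subsequences would then be counted per sample to build the feature representation; that step is omitted here.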

Installation

DiTaxa is implemented in Python 3.x and uses the scikit-learn and Keras frameworks for machine learning. To install the dependencies, use the following command:

pip install -r requirements.txt

Please cite the bioRxiv version:

@article {Asgari334722,
	author = {Asgari, Ehsaneddin and M{\"u}nch, Philipp C. and Lesker, Till R. and McHardy, Alice Carolyn and Mofrad, Mohammad R.K.},
	title = {Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection},
	year = {2018},
	doi = {10.1101/334722},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2018/05/30/334722},
	eprint = {https://www.biorxiv.org/content/early/2018/05/30/334722.full.pdf},
	journal = {bioRxiv}
}

User Manual

python3 ditaxa.py --indir address_of_samples --ext extension_of_the_files --outdir output_directory --dbname database_name --cores 20 --filelis list_of_files_in_a_file --label label_files --label_vals mapping_between_labels_to_1_or_0

The command above runs all steps sequentially and organizes the output in subdirectories. A detailed manual is in progress. In the meantime, you can adapt the sample runs in main/DiTaxa.py or the command example above.
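
If you prefer to launch the pipeline from a Python script, the command line can be assembled as an argument list. This is only a sketch: `build_ditaxa_command` is a hypothetical helper, the paths and database name are placeholders, and only the `--indir`, `--ext`, `--outdir`, `--dbname` and `--cores` options from the command shown above are set explicitly. Any further options are passed through unchanged via `extra_args`.

```python
import shlex


def build_ditaxa_command(indir, ext, outdir, dbname, cores=20, extra_args=None):
    """Assemble the ditaxa.py invocation as a list of arguments (hypothetical helper)."""
    cmd = [
        "python3", "ditaxa.py",
        "--indir", indir,
        "--ext", ext,
        "--outdir", outdir,
        "--dbname", dbname,
        "--cores", str(cores),
    ]
    # Additional options (for example the label-related ones) are appended as given.
    cmd.extend(extra_args or [])
    return cmd


# Placeholder values, not real paths.
cmd = build_ditaxa_command("samples/", "fastq", "results/", "my_dataset")
print(shlex.join(cmd))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when directory names contain spaces.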

Local ezCloud blast and GraPhlAn setup

On line 27 of marker_detection/npe_generate_taxa_tree.py, specify the path to your blastn executable.
On lines 252 and 253, provide the GraPhlAn paths.
On line 576, provide the path to your local BLAST setup.
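
Before editing those lines, it can help to confirm where the tools are installed. The sketch below looks up executables on the `PATH` with the standard library; `resolve_tool` is a hypothetical helper, and the GraPhlAn script names are assumptions about a typical installation, so adjust them to match yours.

```python
import shutil
from pathlib import Path


def resolve_tool(name_or_path):
    """Return the absolute path of an executable, or None if it cannot be found."""
    found = shutil.which(name_or_path)
    return str(Path(found).resolve()) if found else None


# Tool names are assumptions; replace them with the ones used on your system.
for tool in ("blastn", "graphlan.py", "graphlan_annotate.py"):
    location = resolve_tool(tool)
    print(f"{tool}: {location if location else 'not found on PATH'}")
```

The printed paths are the values to paste into the lines listed above.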
