Atlas and Eukaryotic reads #427

Sofie8 · 2021-01-13T15:25:08Z

Sofie8
Jan 13, 2021
Collaborator

My question is somewhat related to the RNA metatranscriptomics question (#143), but concerns DNA metagenomes. Can we use parts of the atlas pipeline for finding and annotating eukaryotic contigs/genes in metagenome datasets?

It is hard to find a workflow for analysing metagenome datasets for eukaryotes, so here we go.

Suggestion:

QC step: same for prokaryotes and eukaryotes. Read-extension step with Tadpole?
Assembly: is there an assembler better tailored to eukaryotic reads, or is SPAdes the way to go here?
Host DNA removal: can you invert the step in which you remove eukaryotic host DNA, so that it removes the prokaryotic reads and keeps the eukaryotic ones?
ORF/protein prediction: MetaEuk?
Binning or gene-centered? I guess the latter, and then the choice between taxonomy and other annotations (Pfam, CAZy, ...).
Annotation: MMseqs2 on the assembled eukaryotic contigs, or on the QC reads?
Results file: an abundance table that can be used, e.g., in phyloseq for downstream analyses and graphs.

What I tried now:

QC steps of atlas
Assembly by atlas
Using the contigs as input for mmseqs:

take the taxonomic mmseqs swiss-prot database:

mmseqs databases UniProtKB/Swiss-Prot swissprot tmp

run mmseqs easy-taxonomy on all contig files:

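# Loop over all contig files; strip the suffix to build the output prefix
for file1 in *_final_contigs.fasta
do
    out="${file1%_final_contigs.fasta}_output"
    # --search-type 2: translated search of nucleotide contigs against the protein database
    mmseqs easy-taxonomy "$file1" "$databasedir/swissprot" "$resultdir/$out" "$resultdir/tmp" --search-type 2
done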

That's it. I am a little bit stuck here, so I would like to hear others' suggestions!

I have DNA metagenome datasets (shallow metagenomes, 2x150 bp) from a freshwater stream available to test things out.

Cheers,
Sofie

SilasK · 2021-01-14T09:33:04Z

SilasK
Jan 14, 2021
Maintainer

Thank you for the suggestion. @jmtsuji also asked for the MMseqs taxonomy annotation.
Let's start a new branch.

As you proposed:

QC
Assembly: as far as I know, SPAdes is still a good option.
MMseqs taxonomy on contigs
- Filter for the eukaryotic contigs
- Run MetaEuk
Binning: I read that only CONCOCT can bin eukaryotes, maybe also the new VAMB.
Bin quality with BUSCO
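The "filter for the eukaryotic contigs" step could be sketched roughly as below. This is only a minimal sketch under stated assumptions: it assumes `mmseqs easy-taxonomy` was rerun with `--tax-lineage 1`, so that the last column of the `_lca.tsv` file holds the lineage string, and that a eukaryotic lineage contains the word `Eukaryota`. The function names `euk_contig_ids` and `subset_fasta` are hypothetical helpers, not part of atlas or MMseqs2.

```shell
#!/usr/bin/env bash
# Sketch: pick eukaryotic contigs from an MMseqs2 LCA table and subset the FASTA.
# Assumption: the last column holds the lineage string (run with --tax-lineage 1).

# Print the IDs of contigs whose lineage (last column) mentions Eukaryota.
euk_contig_ids() {
    local lca_tsv=$1
    awk -F'\t' '$NF ~ /Eukaryota/ { print $1 }' "$lca_tsv"
}

# Print only the FASTA records whose ID (first word of the header)
# is listed in the ID file (one ID per line).
subset_fasta() {
    local ids=$1 fasta=$2
    awk 'NR == FNR { keep[$1] = 1; next }
         /^>/ { id = substr($1, 2); printing = (id in keep) }
         printing' "$ids" "$fasta"
}
```

For example, `euk_contig_ids sample_output_lca.tsv > euk_ids.txt` followed by `subset_fasta euk_ids.txt sample_final_contigs.fasta > sample_euk_contigs.fasta` would give a FASTA file that could then be passed to MetaEuk for gene prediction. The file names here are placeholders.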

Can you send me an output file from mmseqs easy-taxonomy?

0 replies

SilasK · 2021-01-15T08:27:34Z

SilasK
Jan 15, 2021
Maintainer

@Sofie8 Do you want to work on this? I can help you get going, but you would need to write the same steps as Snakemake rules, which isn't more complicated than the bash you already wrote.

0 replies

Sofie8 · 2021-01-15T09:43:02Z

Sofie8
Jan 15, 2021
Collaborator Author

@SilasK yes gladly!
I just have zero knowledge of Snakemake (yet). But if you tell me how to start and what to do, I want to try it. If I am too slow or super inefficient, I can always test what you put together in the EUK branch, if that helps take some work off your shoulders. I shared a Google Drive folder with the output files of mmseqs easy-taxonomy, plus the final contigs, QC reads and raw reads. So we have a test dataset: shallow freshwater metagenomes.

It is part of a citizen science project in which we sequence 10 locations along the stream each month. The samples you see are from October; I also have the data from November (December was not possible), and we continue from January until August 2021. Samples are taken from the free-flowing water body, filtered over 0.2 µm and 3 µm, and then DNA-extracted and sequenced separately. I thought this might help assembly: I should have a eukaryote-enriched fraction (3 µm) and a prokaryotic fraction (0.2 µm). We are interested in (1) what's in the water, (2) the seasonal influence on biodiversity, and (3) the impact of sewage overflows: at which locations do we pick up DNA from fish and frogs, and where is it 'too' dirty? So the end result for me would be a taxonomy and function 'abundance' table. Binning would be hard unless I try it on the merged pool of reads from all locations.
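As a first rough step towards such a table, the per-sample `_lca.tsv` files from `mmseqs easy-taxonomy` could be tallied per taxon. The sketch below is illustrative only: it assumes the files are named `<sample>_output_lca.tsv`, that column 2 is the taxon ID (with 0 for unclassified contigs) and column 4 the taxon name, and `taxon_counts` is a hypothetical helper name. It counts contigs, not reads, so it is not a true abundance estimate; coverage from read mapping would be needed for that.

```shell
#!/usr/bin/env bash
# Sketch: count classified contigs per taxon across samples (long format).
# Assumptions: files are named <sample>_output_lca.tsv, column 2 is the
# taxon ID (0 = unclassified) and column 4 is the taxon name.

taxon_counts() {
    local f sample
    printf 'sample\ttaxon\tcontigs\n'
    for f in "$@"; do
        sample=$(basename "$f" _output_lca.tsv)
        awk -F'\t' -v s="$sample" '
            $2 != 0 { n[$4]++ }    # skip unclassified contigs
            END { for (t in n) printf "%s\t%s\t%d\n", s, t, n[t] }
        ' "$f"
    done | LC_ALL=C sort           # sort rows only; the header stays on top
}
```

Calling `taxon_counts results/*_output_lca.tsv > taxon_counts.tsv` (path is a placeholder) would produce one row per sample and taxon, which could then be pivoted into a wide table for phyloseq.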

So let's get started :-)

0 replies

LeeBergstrand · 2021-09-30T07:15:11Z

LeeBergstrand
Sep 30, 2021

@SilasK Your contig classification steps for the eukaryotic genes could probably tie into what I was talking about in #455.

0 replies

LeeBergstrand · 2021-10-22T07:20:13Z

LeeBergstrand
Oct 22, 2021

@SilasK What effect would different assemblers have on the ability to generate eukaryotic contigs? My understanding is that assemblers can perform differently on different genomes. For example, there was a now-discontinued version of SPAdes that was designed to better handle diploid genomes. Or is metaSPAdes good enough for what we are trying to do?

1 reply

SilasK Oct 27, 2021
Maintainer

I know that the binners are biased towards prokaryotes, but I don't know about the assemblers. You probably won't get complete diploid genomes from metagenomes anyway.

LeeBergstrand · 2021-10-22T07:48:01Z

LeeBergstrand
Oct 22, 2021

This tool might be an interesting eukaryotic replacement for CheckM.

Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC

https://doi.org/10.1186/s13059-020-02155-4

0 replies

SilasK · 2021-12-13T08:14:49Z

SilasK
Dec 13, 2021
Maintainer

Thank you @LeeBergstrand @jmtsuji

FYI, I thought this pre-print might interest you, given that you've been looking into binning eukaryotic genomes: https://doi.org/10.1101/2021.11.15.468626

Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure
Lotte J. U. Pronk, Marnix H. Medema

Abstract
Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic. However, because of marked differences in gene structure, prokaryotic gene prediction tools fail to accurately predict eukaryotic genes. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in gene structure. We first developed a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated accuracy of 97%, this classifier with principled features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By re-training our classifier with Tiara predictions as additional feature, weaknesses of both types of classifiers are compensated; the result is an enhanced classifier that outperforms all individual classifiers, with an F1-score of 1.00 on precision, recall and accuracy for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endosphere microbial community, we show how using Whokaryote to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Our enhanced classifier, which we call ‘Whokaryote’, is wrapped in an easily installable package and is freely available from https://git.wageningenur.nl/lotte.pronk/whokaryote.

0 replies
