Number of mouse transcripts in annotation #49
Comments
Hi @apredeus, I noticed this too. It is an inconsistency between the Ensembl GTF and the cDNA file. For kallisto mapping, DEE2 uses the cDNA.
Hello @markziemann, I contacted Ensembl for clarification. Apparently Ensembl splits its annotation across several files, and the cDNA file is meant to include mostly protein-coding transcripts. On closer examination that doesn't hold true either: the cDNA file contains protein-coding genes plus all possible types of pseudogenes. If you look at what's actually included, here's the breakdown of the "gene_type" field from the master GTF:
What's missing is all the lincRNAs, antisense, and a bunch of other small and misc RNA types; these are aggregated in a separate file at ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/ncrna/. Additionally, and quite annoyingly, the cDNA file includes entities found exclusively on patches and alt-contigs. People have discussed this for quite a while, and (I think) the overall consensus is that it's best to use the "primary" version of the human/mouse assembly, together with the matching annotation.
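To see what a given cDNA or ncRNA FASTA actually contains, one option is to tally its headers. The sketch below is only an illustration: it assumes Ensembl-style headers of the form `>ID cdna chromosome:ASSEMBLY:NAME:START:END:STRAND ... transcript_biotype:TYPE`, and the transcript IDs and region names in the example are made up, not taken from release 90.

```python
from collections import Counter

# Location prefixes assumed to appear in Ensembl FASTA headers.
REGION_TYPES = ("chromosome", "scaffold", "supercontig")

def parse_header(header):
    """Return (transcript_id, region_name, transcript_biotype) from one header line."""
    fields = header.lstrip(">").split()
    transcript_id = fields[0] if fields else None
    region_name = None
    biotype = None
    for field in fields[1:]:
        parts = field.split(":")
        # Keep only the first location-like field, e.g. chromosome:GRCm38:14:1:100:1
        if region_name is None and parts[0] in REGION_TYPES and len(parts) >= 3:
            region_name = parts[2]
        elif biotype is None and parts[0] == "transcript_biotype" and len(parts) == 2:
            biotype = parts[1]
    return transcript_id, region_name, biotype

def tally(lines):
    """Count records per transcript biotype and per sequence region."""
    biotypes = Counter()
    regions = Counter()
    for line in lines:
        if line.startswith(">"):
            _, region_name, biotype = parse_header(line.strip())
            biotypes[biotype] += 1
            regions[region_name] += 1
    return biotypes, regions

# Made-up example records (hypothetical IDs and region names).
example = [
    ">TX0001.1 cdna chromosome:GRCm38:3:100:200:-1 gene:G0001.1 transcript_biotype:protein_coding",
    "ACGTACGT",
    ">TX0002.1 cdna chromosome:GRCm38:CHR_TEST_PATCH:1:50:1 gene:G0002.1 transcript_biotype:processed_pseudogene",
    "GGGTTT",
]
biotypes, regions = tally(example)
print(biotypes)
print(regions)
```

For a real file, the lines could come from `gzip.open(path, "rt")`, where `path` points at whichever FASTA was downloaded.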
Thanks for investigating this, Alex. In the next version of DEE we would like to include lincRNAs as well. So is it safe to concatenate the ncRNA.fa and cDNA.fa and then remove any contigs not on the primary assembly?
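For reference, the concatenate-then-filter idea could be sketched roughly as below. This is a minimal sketch, not what DEE2 does: it assumes Ensembl-style headers, assumes that patch and alt-contig regions can be recognised by a `CHR_` prefix in the region name, and uses made-up record IDs. Whether the result is a safe reference is a separate question.

```python
def region_name(header):
    """Extract the sequence-region name from an Ensembl-style FASTA header."""
    for field in header.split()[1:]:
        parts = field.split(":")
        if parts[0] in ("chromosome", "scaffold") and len(parts) >= 3:
            return parts[2]
    return None

def primary_records(lines, seen):
    """Yield FASTA lines for records kept under the assumed filtering rule.

    A record is dropped if its region name starts with "CHR_" (assumed to mark
    patches and alt-contigs), if no region is found, or if its ID was already seen.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            transcript_id = fields[0] if fields else ""
            name = region_name(line)
            keep = (
                name is not None
                and not name.startswith("CHR_")
                and transcript_id not in seen
            )
            if keep:
                seen.add(transcript_id)
        if keep:
            yield line

def merge_primary(sources):
    """Concatenate several FASTA line sources, keeping only the filtered records."""
    seen = set()
    merged = []
    for lines in sources:
        merged.extend(primary_records(lines, seen))
    return merged

# Made-up stand-ins for the cDNA and ncRNA files (hypothetical IDs and regions).
cdna = [
    ">TX1.1 cdna chromosome:GRCm38:1:1:10:1 transcript_biotype:protein_coding\n",
    "ACGT\n",
    ">TX2.1 cdna chromosome:GRCm38:CHR_TEST_PATCH:1:10:1 transcript_biotype:protein_coding\n",
    "GGGG\n",
]
ncrna = [
    ">TX3.1 ncrna scaffold:GRCm38:SCAF1.1:1:10:1 transcript_biotype:lincRNA\n",
    "TTTT\n",
    ">TX1.1 ncrna chromosome:GRCm38:1:1:10:1 transcript_biotype:lincRNA\n",
    "CCCC\n",
]
merged = merge_primary([cdna, ncrna])
kept_ids = [line[1:].split()[0] for line in merged if line.startswith(">")]
print(kept_ids)
```

Records on extra scaffolds are kept here, since they are part of the primary assembly; only `CHR_`-prefixed regions and duplicate IDs are dropped.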
From my previous experience and discussions with other RNA-Seq bioinformaticians, Gencode seemed a bit better in terms of consistency and curation, while having the benefit of the same gene/transcript IDs as Ensembl. So I think it's a good idea to take the latest Gencode annotation for both human and mouse. What we would usually do is take the so-called primary version of the genome assembly (meaning reference chromosomes AND extra scaffolds, but no patches or alt-contigs, since they increase ambiguity and multi-mapping) and the matching primary GTF, and then just use rsem-prepare-reference (from RSEM) to generate transcript sequences that exactly match the genome/GTF. Alternatively, Gencode provides pre-extracted transcript sequences as well, but these do not include the ones located on extra scaffolds. However, in the latest mouse version these account for only 95 out of ~140k, so they could probably safely be ignored :)
Hello,
I was wondering about the annotation version you were using for processing mouse experiments with kallisto. The Ensembl 90 annotation has 131,195 unique transcripts; however, the cDNA file you've used contains only 109,282. Could you tell me why that is, and why some of the transcripts were dropped?
Thank you!