Number of mouse transcripts in annotation #49

apredeus · 2019-05-17T13:16:47Z

Hello,

I was wondering which annotation version you use for processing mouse experiments with Kallisto. The Ensembl 90 annotation has 131,195 unique transcripts; however, the cDNA file you've used contains only 109,282. Could you tell me why that is, and why some of the transcripts were dropped?

Thank you!

markziemann · 2019-05-17T14:49:35Z

Hi @apredeus, I noticed this too. It is an inconsistency between the Ensembl GTF and the cDNA file. For kallisto mapping, DEE2 uses the cDNA.

$ wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
$ zgrep -c '>' Mus_musculus.GRCm38.cdna.all.fa.gz
109282

I'm not sure about the reasons behind the discrepancy between the two files.
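To put the two numbers side by side, here is a minimal sketch that counts records in a cDNA FASTA and unique transcript IDs in a GTF. The function names are hypothetical, and it assumes the GTF has `transcript` feature rows carrying a `transcript_id "..."` attribute in column 9.

```shell
# Count FASTA records (header lines starting with '>').
# gzip -cdf decompresses gzipped input and passes plain text through unchanged.
count_fasta_records() {
    gzip -cdf "$1" | grep -c '^>'
}

# Count unique transcript IDs among the 'transcript' rows of a GTF.
# Assumes column 9 contains: transcript_id "SOME_ID";
count_gtf_transcripts() {
    gzip -cdf "$1" \
        | awk -F'\t' '$3 == "transcript" && match($9, /transcript_id "[^"]+"/) {
              # Skip the 15-character prefix and drop the closing quote.
              print substr($9, RSTART + 15, RLENGTH - 16)
          }' \
        | sort -u | wc -l | tr -d ' '
}

# Usage (file names are placeholders):
#   count_fasta_records   Mus_musculus.GRCm38.cdna.all.fa.gz
#   count_gtf_transcripts Mus_musculus.GRCm38.90.gtf.gz
```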

apredeus · 2019-05-20T14:29:23Z

Hello @markziemann ,

I contacted Ensembl for clarification. Apparently Ensembl splits its annotation across separate files, and the cDNA file is meant to include mostly protein-coding transcripts. On closer examination that doesn't quite hold either: the cDNA file contains protein-coding genes plus all possible types of pseudogenes (along with IG and TR gene segments). Here's the breakdown of the "gene_type" field from the master GTF for what's actually included:

     22 IG_C_gene
      1 IG_C_pseudogene
     20 IG_D_gene
      3 IG_D_pseudogene
     18 IG_J_gene
      4 IG_LV_gene
      2 IG_pseudogene
    306 IG_V_gene
    155 IG_V_pseudogene
    142 polymorphic_pseudogene
   8616 processed_pseudogene
  94937 protein_coding
     74 pseudogene
    499 transcribed_processed_pseudogene
     48 transcribed_unitary_pseudogene
    655 transcribed_unprocessed_pseudogene
     11 TR_C_gene
      5 TR_D_gene
     76 TR_J_gene
     10 TR_J_pseudogene
    194 TR_V_gene
     34 TR_V_pseudogene
     18 unitary_pseudogene
   2549 unprocessed_pseudogene
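
A similar tally can probably be produced straight from the cDNA FASTA rather than the GTF. The sketch below is an assumption-laden illustration: it relies on each header carrying a `transcript_biotype:...` token (as Ensembl cDNA headers do), and the function name is hypothetical.

```shell
# Tally transcript biotypes from the header lines of an Ensembl-style FASTA.
# Headers without a transcript_biotype: token are counted as "unknown".
tally_transcript_biotypes() {
    gzip -cdf "$1" \
        | awk '/^>/ {
              biotype = "unknown"
              for (i = 1; i <= NF; i++)
                  if ($i ~ /^transcript_biotype:/) biotype = substr($i, 20)
              count[biotype]++
          }
          END { for (b in count) printf "%d\t%s\n", count[b], b }' \
        | LC_ALL=C sort -k2,2
}

# Usage (file name is a placeholder):
#   tally_transcript_biotypes Mus_musculus.GRCm38.cdna.all.fa.gz
```

The output is one `count<TAB>biotype` line per type, sorted by biotype name, so it can be compared with the list above.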

What's missing is all the lincRNAs, antisense transcripts, and a bunch of other small and miscellaneous RNA types; these are aggregated in a separate file at ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/ncrna/.

Additionally, and quite annoyingly, the cDNA file includes entities found exclusively on patches and alt-contigs. People have discussed this for quite a while, and (I think) the overall consensus is that it's best to use the "primary" version of the human/mouse assembly, together with the matching annotation.

markziemann · 2019-05-20T23:23:50Z

Thanks for investigating this, Alex. In the next version of DEE we would like to include lincRNAs as well. So is it safe to concatenate the ncRNA.fa and cDNA.fa files and then remove any contigs not on the primary assembly?
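
For reference, the concatenate-then-filter idea could look roughly like the sketch below. It is only an illustration under two assumptions that should be checked against the real files: that the third whitespace-separated header token is the location (`type:assembly:region:start:end:strand`), and that patch/alt regions are the ones whose region name starts with `CHR_`. The function names are hypothetical.

```shell
# Read FASTA on stdin and drop records located on regions named CHR_*
# (assumed here to mark patches and alt-contigs).
drop_patch_records() {
    awk '/^>/ {
             split($3, loc, ":")            # loc[3] is the region name
             keep = (loc[3] !~ /^CHR_/)
         }
         keep'
}

# Concatenate any number of (gzipped or plain) FASTA files and filter them.
merge_primary_transcripts() {
    gzip -cdf "$@" | drop_patch_records
}

# Usage (file names are placeholders):
#   merge_primary_transcripts cdna.all.fa.gz ncrna.fa.gz > primary_transcripts.fa
```

This does not check for duplicate transcript IDs across the two inputs, so that would need a separate look.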

apredeus · 2019-05-21T13:40:31Z

In my experience, and from discussions with other RNA-Seq bioinformaticians, Gencode seems a bit better in terms of consistency and curation, while having the benefit of the same gene/transcript IDs as Ensembl. So I think it's a good idea to take the latest Gencode annotation for both human and mouse.

What we usually do is take the so-called primary version of the genome assembly (meaning reference chromosomes AND extra scaffolds, but no patches or alt-contigs, since they increase ambiguity and multi-mapping), the matching primary GTF, and then just use rsem-prepare-reference (from RSEM) to generate transcript sequences that exactly match the genome/GTF.
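
As a rough sketch of that step: the wrapper below calls `rsem-prepare-reference --gtf <gtf> <genome.fa> <prefix>`. The function name and the `RSEM_PREPARE` override variable are hypothetical conveniences for this example, and the file names in the usage comment are placeholders.

```shell
# Build transcript sequences from a genome FASTA and a matching GTF via RSEM.
# RSEM_PREPARE is a hypothetical override for the executable (handy for dry runs).
build_transcriptome() {
    genome_fa=$1
    gtf=$2
    prefix=$3
    "${RSEM_PREPARE:-rsem-prepare-reference}" --gtf "$gtf" "$genome_fa" "$prefix"
}

# Usage (placeholders; RSEM is expected to write <prefix>.transcripts.fa):
#   build_transcriptome primary_assembly.genome.fa primary_assembly.annotation.gtf mouse_ref
#   kallisto index -i mouse_ref.idx mouse_ref.transcripts.fa
```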

Alternatively, Gencode also provides pre-extracted transcript sequences, but these do not include the ones located on extra scaffolds. However, in the latest mouse version these account for only 95 out of ~140k, so they can probably be safely ignored :)

markziemann added the pipeline (Issues to resolve in next pipeline version) and wontfix labels · Aug 25, 2019

This was referenced Nov 13, 2024

Modernisation progress for pipeline #103

Closed

Correct representation of ncRNA in kallisto ressults #104

Open
