Number of mouse transcripts in annotation #49

Open
apredeus opened this issue May 17, 2019 · 4 comments
Labels
pipeline (Issues to resolve in next pipeline version), wontfix

Comments

@apredeus

Hello,

I was wondering which annotation version you used when processing the mouse experiments with Kallisto. The Ensembl 90 annotation has 131,195 unique transcripts; however, the cDNA file you used contains only 109,282. Could you tell me why that is, and why some of the transcripts were dropped?

Thank you!

@markziemann
Owner

markziemann commented May 17, 2019

Hi @apredeus, I noticed this too. It is an inconsistency between the Ensembl GTF and the cDNA file. For kallisto mapping, DEE2 uses the cDNA file.

$ wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
$ zgrep -c '>' Mus_musculus.GRCm38.cdna.all.fa.gz
109282
I'm not sure about the reasons behind the discrepancy between the two files.
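
For comparison, the GTF side of the count can be obtained in a similar way. The following is a minimal sketch, not part of DEE2: the function names are hypothetical, the file names in the usage comment are assumptions, and it assumes a GTF whose records carry `transcript_id "..."` attributes and a gzip that passes plain-text input through with `-cdf` (GNU gzip does).

```shell
# Count unique transcript_id values in a GTF (gzipped or plain).
count_gtf_transcripts() {
    gzip -cdf "$1" \
        | grep -v '^#' \
        | grep -o 'transcript_id "[^"]*"' \
        | sort -u \
        | awk 'END { print NR }'
}

# Count FASTA records (header lines starting with '>').
count_fasta_records() {
    gzip -cdf "$1" | grep -c '^>'
}

# Usage (file names are assumptions):
#   count_gtf_transcripts Mus_musculus.GRCm38.90.gtf.gz
#   count_fasta_records   Mus_musculus.GRCm38.cdna.all.fa.gz
```

If the two numbers differ, the GTF and the FASTA file do not describe the same set of transcripts.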

@apredeus
Author

apredeus commented May 20, 2019

Hello @markziemann,

I contacted Ensembl for clarification. Apparently Ensembl splits its annotation across several files, and the cDNA file is meant to include mostly protein-coding transcripts. On closer examination that doesn't hold true either: the cDNA file contains protein-coding genes, IG/TR gene segments, and all possible types of pseudogenes. Here is the breakdown of what is actually included, by the "gene_type" field from the master GTF:

     22 IG_C_gene
      1 IG_C_pseudogene
     20 IG_D_gene
      3 IG_D_pseudogene
     18 IG_J_gene
      4 IG_LV_gene
      2 IG_pseudogene
    306 IG_V_gene
    155 IG_V_pseudogene
    142 polymorphic_pseudogene
   8616 processed_pseudogene
  94937 protein_coding
     74 pseudogene
    499 transcribed_processed_pseudogene
     48 transcribed_unitary_pseudogene
    655 transcribed_unprocessed_pseudogene
     11 TR_C_gene
      5 TR_D_gene
     76 TR_J_gene
     10 TR_J_pseudogene
    194 TR_V_gene
     34 TR_V_pseudogene
     18 unitary_pseudogene
   2549 unprocessed_pseudogene
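
A similar tally can be sketched directly from the FASTA headers. This is an illustration under assumptions, not necessarily how the table above was produced: it assumes Ensembl-style cDNA headers that contain space-separated `gene_biotype:...` and `transcript_biotype:...` tokens, and the function name is hypothetical.

```shell
# Tally FASTA records per biotype. The second argument selects the header
# field to count (default: gene_biotype, assumed to be present in headers).
tally_biotypes() {
    gzip -cdf "$1" \
        | grep '^>' \
        | grep -o "${2:-gene_biotype}:[^ ]*" \
        | cut -d: -f2 \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $1 "\t" $2 }'
}

# Usage (file name taken from the download earlier in this thread):
#   tally_biotypes Mus_musculus.GRCm38.cdna.all.fa.gz
#   tally_biotypes Mus_musculus.GRCm38.cdna.all.fa.gz transcript_biotype
```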

What's missing is all the lincRNAs, antisense transcripts, and a bunch of other small and miscellaneous RNA types; these are collected in a separate file at ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/ncrna/.

Additionally, and quite annoyingly, the cDNA file includes entries found exclusively on patches and alt-contigs. People have discussed this for quite a while, and (I think) the overall consensus is that it's best to use the "primary" version of the human/mouse assembly, together with the matching annotation.
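
One way to drop such entries from an Ensembl-style FASTA file is sketched below. It rests on an assumption that should be checked against the actual headers: that patch and alt-contig regions appear in the location field with a `CHR_` prefix (for example `chromosome:GRCm38:CHR_MG4151_PATCH:...`). The function name is hypothetical.

```shell
# Print only FASTA records whose header does NOT point at a region named
# CHR_* (assumed naming of patches and alt-contigs in Ensembl headers).
drop_non_primary() {
    gzip -cdf "$1" | awk '
        /^>/ { keep = ($0 !~ /:CHR_/) }   # decide once per record
        keep { print }                    # header and its sequence lines
    '
}

# Usage (output file name is an assumption):
#   drop_non_primary Mus_musculus.GRCm38.cdna.all.fa.gz > cdna.primary.fa
```

Records on ordinary chromosomes and on unplaced scaffolds pass through unchanged, which matches the "primary" definition of reference chromosomes plus extra scaffolds.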

@markziemann
Owner

Thanks for investigating this, Alex. In the next version of DEE we would like to include lincRNAs as well. So is it safe to concatenate ncRNA.fa and cDNA.fa, and then remove any entries on contigs that are not part of the primary assembly?
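
The concatenation step in that question could look roughly like the sketch below, which also checks that no transcript ID ends up in the merged file twice. The function names and file names are hypothetical, and it assumes the record ID is the first space-delimited token of each header.

```shell
# Concatenate several FASTA files (gzipped or plain) into one stream.
merge_fasta() {
    gzip -cdf "$@"
}

# List record IDs that occur more than once in a plain FASTA file.
duplicated_ids() {
    grep '^>' "$1" | cut -d' ' -f1 | sed 's/^>//' | LC_ALL=C sort | uniq -d
}

# Usage (file names are assumptions):
#   merge_fasta cdna.all.fa.gz ncrna.fa.gz > merged.fa
#   duplicated_ids merged.fa     # prints nothing when every ID is unique
```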

@apredeus
Author

In my experience, and from discussions with other RNA-Seq bioinformaticians, Gencode seems a bit better in terms of consistency and curation, while having the benefit of using the same gene/transcript IDs as Ensembl. So I think it's a good idea to take the latest Gencode annotation for both human and mouse.

What we usually do is take the so-called primary version of the genome assembly (meaning reference chromosomes AND extra scaffolds, but no patches or alt-contigs, since they increase ambiguity and multi-mapping) and the matching primary GTF, and then just use rsem-prepare-reference (from RSEM) to generate transcript sequences that exactly match the genome/GTF.
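
As a rough sketch of that step: `rsem-prepare-reference` takes the annotation via `--gtf`, followed by the genome FASTA and an output prefix, and writes the extracted transcripts to `<prefix>.transcripts.fa`. The wrapper function, the `DRY_RUN` switch, and the file names below are assumptions added for illustration; RSEM must be installed for the real run.

```shell
# Build transcript sequences from a genome FASTA and a matching GTF.
# With DRY_RUN=1 the command is only printed, not executed.
prepare_transcriptome() {
    # $1 = genome FASTA, $2 = GTF, $3 = output prefix
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rsem-prepare-reference --gtf $2 $1 $3"
    else
        rsem-prepare-reference --gtf "$2" "$1" "$3"
    fi
}

# Usage (file names are assumptions):
#   prepare_transcriptome GRCm38.primary_assembly.genome.fa \
#       gencode.primary_assembly.annotation.gtf ref/mouse
#   # transcripts are then expected in ref/mouse.transcripts.fa
```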

Alternatively, Gencode provides pre-extracted transcript sequences as well, but these do not include transcripts located on extra scaffolds. However, in the latest mouse version these account for only 95 out of ~140k, so they could probably be safely ignored :)
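
A count like this can be checked by listing the transcript IDs that appear in the GTF but not in the pre-extracted FASTA. The sketch below assumes Gencode-style files, where the GTF `transcript_id` includes the version suffix and the FASTA header starts with the same ID followed by `|`; the function name and file names are hypothetical.

```shell
# Print transcript IDs present in the GTF ($1) but absent from the FASTA ($2).
ids_missing_from_fasta() {
    gtf_ids=$(mktemp)
    fa_ids=$(mktemp)
    gzip -cdf "$1" | grep -o 'transcript_id "[^"]*"' | cut -d'"' -f2 \
        | LC_ALL=C sort -u > "$gtf_ids"
    # Keep only the ID: text before the first '|' or space in each header.
    gzip -cdf "$2" | grep '^>' | sed 's/^>//' | cut -d'|' -f1 | cut -d' ' -f1 \
        | LC_ALL=C sort -u > "$fa_ids"
    LC_ALL=C comm -23 "$gtf_ids" "$fa_ids"
    rm -f "$gtf_ids" "$fa_ids"
}

# Usage (file names are assumptions):
#   ids_missing_from_fasta annotation.gtf.gz transcripts.fa.gz | wc -l
```

With Ensembl files the GTF IDs are unversioned while the FASTA IDs carry a version suffix, so the version would need to be stripped before comparing.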

@markziemann added the pipeline (Issues to resolve in next pipeline version) and wontfix labels on Aug 25, 2019