low mapping percentage if index generated based on gtf with lncRNAs #279

Open
KoenDeserranno opened this issue Dec 11, 2024 · 2 comments

Comments

@KoenDeserranno

Describe the issue
Dear,

Thank you for maintaining this great tool. Could you clarify the following for me:

I ran kb count on the same input .fastq.gz files with two different indices, both generated using kb ref. The indices differ in the .gtf file used: the first was built from the standard GENCODE v47 .gtf file, the second from a .gtf to which additional lncRNA entries were added.

With the standard index, I get mapping percentages of about 60 %. With the custom .gtf, the mapping percentages drop to about 15 %. I first figured that this is probably due to the fact that reads multimap to genes and their antisense lncRNAs in the latter case, and are thereby discarded. While Smart-seq3 UMI reads are stranded, the internal reads are not. So I reran with the --mm flag, but the mapping percentages remained the same. Is this expected, or should the percentage mapped increase if --mm is supplied?

Thanks in advance!

What is the exact command that was run?

kb count --verbose -o /count_genes_mm/ -w {bc_whitelist} -t {threads} --h5ad -i /path/to/.idx -g /path/to/.txt -x SMARTSEQ3 --verbose {I1_fastq} {I2_fastq} {R1_fastq} {R2_fastq}

Command output (with --verbose flag)

[2024-12-11 14:07:04,663]   DEBUG [main] Printing verbose output
[2024-12-11 14:07:06,891]   DEBUG [main] kallisto binary located at /kallisto/kallisto
[2024-12-11 14:07:06,892]   DEBUG [main] bustools binary located at /bustools/bustools
[2024-12-11 14:07:06,892]   DEBUG [main] Creating `/count_genes_mm/tmp` directory
[2024-12-11 14:07:06,893]   DEBUG [main] Namespace(list=False, command='count', tmp=None, keep_tmp=False, verbose=True, i='.idx', g='.txt', x='SMARTSEQ3', o='./count_genes_mm/', num=False, w='./indices_SS3x.txt', r=None, t=20, m='2G', strand=None, inleaved=False, genomebam=False, aa=False, gtf=None, chromosomes=None, workflow='standard', em=False, mm=True, tcc=False, filter=None, filter_threshold=None, c1=None, c2=None, overwrite=False, dry_run=False, batch_barcodes=False, loom=False, h5ad=True, loom_names='barcode,target_name', sum='none', cellranger=False, gene_names=False, N=None, report=False, no_inspect=False, long=False, threshold=0.8, error_rate=None, platform='ONT', kallisto='./kallisto', bustools='./bustools', opt_off=False, k=31, no_validate=False, no_fragment=False, union=False, no_jump=False, quant_umis=False, keep_flags=False, parity=None, fragment_l=None, fragment_s=None, bootstraps=None, matrix_to_files=False, matrix_to_directories=False, fastqs=['/I1.fastq.gz', './I2.fastq.gz', './R1.fastq.gz', './R2.fastq.gz'])
[2024-12-11 14:07:09,459]    INFO [count] Using index ./.idx to generate BUS file to ./count_genes_mm/ from
[2024-12-11 14:07:09,460]    INFO [count]         I1.fastq.gz
[2024-12-11 14:07:09,460]    INFO [count]         I2.fastq.gz
[2024-12-11 14:07:09,460]    INFO [count]         R1.fastq.gz
[2024-12-11 14:07:09,460]    INFO [count]         R2.fastq.gz
[2024-12-11 14:07:09,460]   DEBUG [count] kallisto bus -i .idx -o /count_genes_mm/ -x SMARTSEQ3 -t 20 --paired I1.fastq.gz I2.fastq.gz Unassigned_R1.fastq.gz Unassigned_R2.fastq.gz
[2024-12-11 14:07:09,577]   DEBUG [count] 
[2024-12-11 14:07:09,577]   DEBUG [count] [bus] Using ATTGCGCAATG as UMI tag sequence
[2024-12-11 14:07:09,577]   DEBUG [count] [bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[2024-12-11 14:07:11,482]   DEBUG [count] [index] k-mer length: 31
[2024-12-11 14:07:12,985]   DEBUG [count] [index] number of targets: 146,097
[2024-12-11 14:07:12,985]   DEBUG [count] [index] number of k-mers: 93,004,815
[2024-12-11 14:07:12,985]   DEBUG [count] [index] number of D-list k-mers: 5,371,744
[2024-12-11 14:07:12,985]   DEBUG [count] [quant] running in paired-end mode
[2024-12-11 14:07:12,985]   DEBUG [count] [quant] will process sample 1: I1.fastq.gz
[2024-12-11 14:07:12,985]   DEBUG [count] I2.fastq.gz
[2024-12-11 14:07:12,985]   DEBUG [count] R1.fastq.gz
[2024-12-11 14:07:12,985]   DEBUG [count] R2.fastq.gz
[2024-12-11 14:07:21,100]   DEBUG [count] [quant] finding pseudoalignments for the reads ...
[2024-12-11 14:07:27,311]   DEBUG [count] [progress] 1M reads processed (15.3% mapped)
[2024-12-11 14:07:33,521]   DEBUG [count] [progress] 2M reads processed (15.3% mapped)
[2024-12-11 14:07:39,431]   DEBUG [count] [progress] 3M reads processed (15.3% mapped)
[2024-12-11 14:07:45,641]   DEBUG [count] [progress] 4M reads processed (15.3% mapped)
[2024-12-11 14:07:51,751]   DEBUG [count] [progress] 5M reads processed (15.3% mapped)
[2024-12-11 14:07:57,862]   DEBUG [count] [progress] 6M reads processed (15.3% mapped)
[2024-12-11 14:08:03,871]   DEBUG [count] [progress] 7M reads processed (15.3% mapped)
[2024-12-11 14:08:09,881]   DEBUG [count] [progress] 8M reads processed (15.3% mapped)
[2024-12-11 14:08:15,894]   DEBUG [count] [progress] 9M reads processed (15.3% mapped)
[2024-12-11 14:08:21,908]   DEBUG [count] [progress] 10M reads processed (15.3% mapped)
[2024-12-11 14:08:27,923]   DEBUG [count] [progress] 11M reads processed (15.3% mapped)
[2024-12-11 14:08:33,136]   DEBUG [count] [progress] 12M reads processed (15.3% mapped)
[2024-12-11 14:08:39,352]   DEBUG [count] [progress] 13M reads processed (15.3% mapped)
[2024-12-11 14:08:45,467]   DEBUG [count] [progress] 14M reads processed (15.3% mapped)
[2024-12-11 14:08:51,682]   DEBUG [count] [progress] 15M reads processed (15.3% mapped)
[2024-12-11 14:08:57,998]   DEBUG [count] [progress] 16M reads processed (15.3% mapped)
[2024-12-11 14:09:04,113]   DEBUG [count] [progress] 17M reads processed (15.3% mapped)
[2024-12-11 14:09:10,428]   DEBUG [count] [progress] 18M reads processed (15.3% mapped)
[2024-12-11 14:09:16,343]   DEBUG [count] [progress] 19M reads processed (15.3% mapped)
[2024-12-11 14:09:22,558]   DEBUG [count] [progress] 20M reads processed (15.3% mapped)
[2024-12-11 14:09:28,674]   DEBUG [count] [progress] 21M reads processed (15.3% mapped)
[2024-12-11 14:09:34,689]   DEBUG [count] [progress] 22M reads processed (15.3% mapped)
[2024-12-11 14:09:40,904]   DEBUG [count] [progress] 23M reads processed (15.3% mapped)
[2024-12-11 14:09:47,019]   DEBUG [count] [progress] 24M reads processed (15.3% mapped)
[2024-12-11 14:09:53,435]   DEBUG [count] [progress] 25M reads processed (15.3% mapped)
[2024-12-11 14:09:59,350]   DEBUG [count] [progress] 26M reads processed (15.3% mapped)
[2024-12-11 14:10:05,465]   DEBUG [count] [progress] 27M reads processed (15.3% mapped)
[2024-12-11 14:10:11,681]   DEBUG [count] [progress] 28M reads processed (15.3% mapped)
[2024-12-11 14:10:17,796]   DEBUG [count] [progress] 29M reads processed (15.3% mapped)
[2024-12-11 14:10:23,911]   DEBUG [count] [progress] 30M reads processed (15.3% mapped)
[2024-12-11 14:10:29,826]   DEBUG [count] [progress] 31M reads processed (15.3% mapped)
[2024-12-11 14:10:35,842]   DEBUG [count] [progress] 32M reads processed (15.3% mapped)
[2024-12-11 14:10:41,957]   DEBUG [count] [progress] 33M reads processed (15.3% mapped)
[2024-12-11 14:10:47,972]   DEBUG [count] [progress] 34M reads processed (15.3% mapped)
[2024-12-11 14:10:53,585]   DEBUG [count] [progress] 35M reads processed (15.4% mapped)
[2024-12-11 14:10:59,400]   DEBUG [count] [progress] 36M reads processed (15.3% mapped)
[2024-12-11 14:11:05,415]   DEBUG [count] [progress] 37M reads processed (15.3% mapped)
[2024-12-11 14:11:11,330]   DEBUG [count] [progress] 38M reads processed (15.3% mapped)
[2024-12-11 14:11:17,345]   DEBUG [count] [progress] 39M reads processed (15.3% mapped)
[2024-12-11 14:11:23,360]   DEBUG [count] [progress] 40M reads processed (15.3% mapped)
[2024-12-11 14:11:29,375]   DEBUG [count] [progress] 41M reads processed (15.4% mapped)
[2024-12-11 14:11:35,590]   DEBUG [count] [progress] 42M reads processed (15.4% mapped)
[2024-12-11 14:11:41,605]   DEBUG [count] [progress] 43M reads processed (15.4% mapped)
[2024-12-11 14:11:47,721]   DEBUG [count] [progress] 44M reads processed (15.4% mapped)
[2024-12-11 14:11:48,322]   DEBUG [count] [progress] 45M reads processed (15.4% mapped)              done
[2024-12-11 14:11:48,322]   DEBUG [count] [quant] processed 45,622,090 reads, 7,004,503 reads pseudoaligned
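
For reference, the overall rate can be recomputed from the final summary line of the log. The following is only a minimal sketch: the function name `rate_from_summary` and the regular expression are illustrative and not part of kb-python or kallisto.

```python
import re


def rate_from_summary(line: str) -> float:
    """Return the percentage of pseudoaligned reads from a kallisto summary line."""
    match = re.search(
        r"processed ([\d,]+) reads, ([\d,]+) reads pseudoaligned", line
    )
    if match is None:
        raise ValueError("no read summary found in line")
    # Strip the thousands separators before converting to integers.
    processed, aligned = (int(g.replace(",", "")) for g in match.groups())
    return 100.0 * aligned / processed


summary = "[quant] processed 45,622,090 reads, 7,004,503 reads pseudoaligned"
print(f"{rate_from_summary(summary):.1f}% mapped")
```

This agrees with the roughly 15 % shown in the progress lines above.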
@Yenaled
Collaborator

Yenaled commented Dec 11, 2024

The --mm flag has nothing to do with read mapping (it is only used at the UMI counting step).
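
To illustrate why the reported mapping percentage does not move, here is a toy sketch in Python. It is not how kallisto or bustools are implemented: the gene names, the `reads` list, and the even-split policy for multi-gene reads are all assumptions made for illustration, and UMI collapsing is ignored entirely.

```python
from collections import Counter

# Toy data: each read is the set of genes it is compatible with.
# An empty set means the read did not pseudoalign at all.
reads = [{"GENE_A"}, {"GENE_A", "LNC_A_AS"}, set(), {"GENE_B"}, set()]


def mapping_rate(reads):
    """Fraction of reads compatible with at least one target.

    No counting option enters this calculation.
    """
    return sum(1 for genes in reads if genes) / len(reads)


def count_genes(reads, multimapping=False):
    """Toy gene counts computed after mapping.

    Reads compatible with several genes are dropped unless
    multimapping is True, in which case they are split evenly
    (an assumed policy, chosen only for this illustration).
    """
    counts = Counter()
    for genes in reads:
        if len(genes) == 1:
            counts[next(iter(genes))] += 1.0
        elif len(genes) > 1 and multimapping:
            for gene in genes:
                counts[gene] += 1.0 / len(genes)
    return dict(counts)


print(mapping_rate(reads))                     # same value whatever the counting mode
print(count_genes(reads))                      # multi-gene read discarded
print(count_genes(reads, multimapping=True))   # multi-gene read shared
```

In this toy model the multi-gene read already counts as mapped, so switching the counting mode changes the gene counts but not the mapping rate.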

As for why you are seeing a lower mapping rate when lncRNAs are added in, it’s hard to say. Usually you should get a higher mapping rate (unless perhaps the GTF was modified incorrectly, thereby messing up the mapping?).
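
One way to look for an incorrectly modified GTF is a quick structural check of the added entries. The sketch below is a hypothetical helper, not part of kb-python; the checks it applies (nine tab-separated fields, numeric start <= end, strand of + or -, `gene_id` and `transcript_id` attributes) are common GTF expectations but may not cover whatever went wrong in a particular file.

```python
def check_gtf_lines(lines):
    """Return a list of (line_number, problem) tuples for suspicious GTF records."""
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(
                (n, f"expected 9 tab-separated fields, found {len(fields)}")
            )
            continue
        _, _, feature, start, end, _, strand, _, attrs = fields
        if not (start.isdigit() and end.isdigit()) or int(start) > int(end):
            problems.append((n, "start/end are not valid coordinates"))
        if strand not in {"+", "-"}:
            # Anything else is flagged as suspicious for transcript records.
            problems.append((n, f"unexpected strand {strand!r}"))
        if 'gene_id "' not in attrs:
            problems.append((n, "missing gene_id attribute"))
        if feature != "gene" and 'transcript_id "' not in attrs:
            problems.append((n, "missing transcript_id attribute"))
    return problems


example = [
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tcustom\texon\t300\t250\t.\t.\t.\tgene_id "LNC1";',
]
for line_number, problem in check_gtf_lines(example):
    print(line_number, problem)
```

Passing an open file handle for the custom .gtf instead of `example` would scan the whole file line by line.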

@KoenDeserranno
Author

Thank you for the clarification and the swift reply. I will take another look at the .gtf file.
