index sizes in recent kallisto #478

Open
ajggit opened this issue Jan 30, 2025 · 1 comment

ajggit commented Jan 30, 2025

I'm not sure this is an 'issue', but I recently updated to a newer kallisto (and GENCODE reference) and noticed that the index files are about 5 times smaller, which made me wonder whether I have mangled something.

The later GENCODE FASTA is larger, as expected:
370M gencode.v31.transcripts.fa
585M gencode.v47.transcripts.fa

But the resulting index is much smaller:
3.0G gencode.v31.transcripts.kallisto_0.46.0.idx
624M gencode.v47.transcripts.kallisto_0.50.1.idx
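For reference, the sizes listed above can be turned into ratios with a short sketch. It assumes the figures are binary-unit, human-readable sizes (as printed by `ls -lh` or `du -h`); the helper name `to_mib` is hypothetical.

```python
# Sizes copied from the listings above; assumed to be binary units (M = MiB, G = GiB).
UNITS = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mib(size: str) -> float:
    """Convert a human-readable size such as '3.0G' or '624M' to MiB."""
    return float(size[:-1]) * UNITS[size[-1]]


fasta = {"v31": to_mib("370M"), "v47": to_mib("585M")}
index = {"v31": to_mib("3.0G"), "v47": to_mib("624M")}

# How many times smaller the new index is than the old one.
shrink = index["v31"] / index["v47"]

# Index size relative to the FASTA it was built from, per release.
per_fasta = {release: index[release] / fasta[release] for release in fasta}

print(f"old index / new index: {shrink:.1f}x")
for release, ratio in per_fasta.items():
    print(f"{release}: index is {ratio:.1f}x the FASTA size")
```

With these numbers the old index is roughly 8 times the size of its FASTA, while the new one is close to the size of its FASTA, so the difference is far larger than the change in input size.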

Just curious whether you found a way to greatly improve the efficiency of storing the index? The mapping rates seem similar. The only notable difference on my side is that I simplified the FASTA sequence names to remove some uninformative fields.

e.g.

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|

became

ENST00000450305.2|ENSG00000223972.6|DDX11L1|DDX11L1-201|transcribed_unprocessed_pseudogene

I can't see how this would lead to a 5-fold decrease in index size, though.
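A rough way to check that intuition is to measure how many bytes the renaming removes per record. The sketch below reproduces the renaming on the example header; the function name `simplify_header` and the field order are assumptions inferred from the two headers shown, not the script that was actually used.

```python
# Example header copied from above (pipe-delimited GENCODE transcript header).
OLD = ("ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
       "OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|"
       "transcribed_unprocessed_pseudogene|")


def simplify_header(header: str) -> str:
    """Keep transcript ID, gene ID, gene name, transcript name and biotype."""
    fields = header.rstrip("|").split("|")
    (transcript_id, gene_id, _havana_gene, _havana_transcript,
     transcript_name, gene_name, _length, biotype) = fields[:8]
    return "|".join([transcript_id, gene_id, gene_name, transcript_name, biotype])


new = simplify_header(OLD)
saved = len(OLD) - len(new)  # bytes removed from this one header
print(new)
print(f"bytes saved on this header: {saved}")
```

The saving is a few dozen bytes per transcript name, so even if the index keeps every target name verbatim (an assumption), the shorter names alone would not account for a drop of several gigabytes.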

thanks for reading!

Collaborator

Yenaled commented Jan 30, 2025

The index data structure was completely rewritten (by me and a few others), which made it much more compact.
