index sizes in recent kallisto #478

ajggit · 2025-01-30T06:25:16Z

I'm not sure this is an 'issue', but I recently updated to a newer kallisto (and GENCODE reference) and noticed that the index files are about 5 times smaller, which made me wonder whether I have mangled something.

The later GENCODE release is larger, as expected:
370M gencode.v31.transcripts.fa
585M gencode.v47.transcripts.fa

But the resulting index is much smaller:
3.0G gencode.v31.transcripts.kallisto_0.46.0.idx
624M gencode.v47.transcripts.kallisto_0.50.1.idx
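As a quick sanity check on the numbers above, the ratios can be worked out with a few lines of Python. This is only a sketch: it assumes the listings use binary units (1G = 1024M), and the dictionary names are made up for illustration.

```python
# Sizes copied from the listings above, in MiB.
# Assumption: the "G"/"M" suffixes are binary units (1G = 1024M).
fasta_mb = {"v31": 370, "v47": 585}
index_mb = {"v31": 3.0 * 1024, "v47": 624}

# How much the FASTA grew and how much the index shrank between releases.
fasta_growth = fasta_mb["v47"] / fasta_mb["v31"]
index_shrink = index_mb["v31"] / index_mb["v47"]

# Index size relative to the FASTA it was built from.
ratio_old = index_mb["v31"] / fasta_mb["v31"]
ratio_new = index_mb["v47"] / fasta_mb["v47"]

print(f"FASTA grew {fasta_growth:.2f}x, index shrank {index_shrink:.1f}x")
print(f"index/FASTA: {ratio_old:.1f} (0.46.0) vs {ratio_new:.2f} (0.50.1)")
```

Under that unit assumption, the old index is roughly 8 times the size of its FASTA, while the new one is about the same size as its FASTA.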

Just curious whether you found a way to greatly improve the efficiency of storing the index? The mapping rates seem similar. The only notable difference on my side is that I simplified the FASTA sequence names to remove some uninformative fields.

e.g.

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|

became

ENST00000450305.2|ENSG00000223972.6|DDX11L1|DDX11L1-201|transcribed_unprocessed_pseudogene

I can't see how this would lead to a 5-fold decrease in index size, though.
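For anyone wanting to reproduce the header change, here is a minimal sketch of that kind of simplification. The function name `simplify_header` and the field positions are assumptions based only on the two example headers above, not the exact script that was used, and headers with a different field layout would need different indices.

```python
def simplify_header(header: str) -> str:
    """Reduce a GENCODE-style FASTA header to five '|'-separated fields.

    Assumed input layout (taken from the example above):
    transcript_id|gene_id|havana_gene|havana_transcript|
    transcript_name|gene_name|length|biotype|
    """
    # Drop a leading '>' and the trailing '|' before splitting.
    fields = header.lstrip(">").rstrip("|").split("|")
    if len(fields) < 8:
        raise ValueError(f"unexpected header layout: {header!r}")
    # Keep transcript ID, gene ID, gene name, transcript name, biotype.
    keep = [fields[0], fields[1], fields[5], fields[4], fields[7]]
    return "|".join(keep)


original = (
    "ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|"
    "transcribed_unprocessed_pseudogene|"
)
simplified = simplify_header(original)
print(simplified)
print("characters saved per header:", len(original) - len(simplified))
```

The saving is a few dozen characters per transcript name, which is small next to the sequence data itself, so shorter names alone would not be expected to account for a multi-gigabyte difference.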

thanks for reading!

Yenaled · 2025-01-30T07:08:01Z

The index data structure was completely rewritten (by me and a few others), which made it more compact.
