Tracker: FlatGFA benchmarks #185

sampsyo · 2024-05-18T16:07:29Z

It's high time we started measuring FlatGFA performance on "realistic" tasks against external baselines. I bootstrapped some very simple benchmarking infrastructure a while back, but it can only do two basic things:

Compare parsing/conversion time: FlatGFA vs. odgi.
Compare a trivial "list the names of the paths" task: FlatGFA vs. odgi vs. slow-odgi.
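
For illustration, a comparison like the two above boils down to timing a few command lines and reporting a summary per tool. The following is a minimal sketch of that idea, not the repository's actual harness: the names `time_command` and `compare` are hypothetical, and the demo uses stand-in Python one-liners rather than real FlatGFA or odgi invocations so that it runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal I/O does not dominate the measurement.
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def compare(commands, runs=5):
    """Map each tool label to its median wall-clock time over `runs` runs."""
    return {label: statistics.median(time_command(argv, runs))
            for label, argv in commands.items()}


# Stand-in commands (assumption): replace with the real tool invocations.
demo = {
    "noop": [sys.executable, "-c", "pass"],
    "loop": [sys.executable, "-c", "sum(range(10**6))"],
}
for label, seconds in compare(demo, runs=3).items():
    print(f"{label}: {seconds * 1000:.1f} ms")
```

The median is used rather than the mean so that a single slow run (a cold cache, say) does not skew the result; a real harness would probably also want warm-up runs and a spread measure.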

I think we need to look for workloads that are somehow representative of what people actually want to do, interactively, with these GFA analysis tools. These could be things that odgi already does, or things implemented in other open-source tools that (by virtue of implementing that functionality) seem to be saying that the task is important. Here are some ideas… surely further googling would reveal more:

odgi extract-like functionality. This is one of 3 benchmarks in the odgi paper, and the only one that is a pure-ish "GFA operator." The paper compares against vg chunk, so perhaps we should too. We could even reuse commands from the script. I have actually started an implementation, but it is not yet correct because I don't understand the spec for odgi extract.
odgi viz-like visualization. This is the other non-parsing benchmark in the odgi paper. I am skeptical of this one… it just seems pretty un-fun to do the drawing stuff, and maybe not that computationally intensive. And it's hard to divine the spec for what the visualization should even look like? Anyway, maybe this can be salvaged in a useful way. Perhaps it wants to be implemented in Python instead of in Rust?
Panacus histograms and growth statistics. This tool is written in Rust, so maybe it's a competitive baseline. Let's find out!
Bubble detection, like bubble_gun. AFAICT this is the only tool that does it? The algorithm doesn't seem too hard to re-implement; there's a paper about it that explains it pretty clearly.
Some pancat commands, such as pancat stats. Perhaps pancat edit would be a more interesting benchmark too?
Relatedly, comparing only parsing performance against pancat's gfagraphs library could be instructive. It's pure Python, however, so I'm guessing it can't be much faster than slow-odgi's parser?
gfatools is a C tool. It's not documented, but it seems to at least have odgi extract-like functionality and conversion to FASTA (like odgi flatten).
gretl contains a bunch of statistics calculations that we could try to re-implement. This one maybe seems like a good place to start since the specs may be clear? It's also implemented in Rust, so it may be a competitive baseline.
GFAffix is a special-purpose analysis that I honestly do not yet understand. But it seems like it may be computationally intensive?
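
To make the "statistics" ideas above a bit more concrete, here is a minimal sketch of the kind of simple counts such tools report, computed directly from GFA 1 text. This is an assumption about what a useful starting point looks like: the exact set of statistics that gretl or pancat stats computes is not taken from their documentation, the function name `gfa_stats` is hypothetical, and the example graph is made up.

```python
from collections import Counter


def gfa_stats(lines):
    """Compute a few basic counts from the lines of a GFA 1 file."""
    seg_len = {}        # segment name -> sequence length in bases
    degree = Counter()  # segment name -> number of link endpoints touching it
    links = paths = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        kind = fields[0]
        if kind == "S":
            # S <name> <sequence>; "*" means the sequence is not stored.
            seq = fields[2]
            seg_len[fields[1]] = 0 if seq == "*" else len(seq)
        elif kind == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links += 1
            degree[fields[1]] += 1
            degree[fields[3]] += 1
        elif kind == "P":
            paths += 1
    n = len(seg_len)
    return {
        "segments": n,
        "links": links,
        "paths": paths,
        "total_bp": sum(seg_len.values()),
        "mean_degree": sum(degree.values()) / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


# A tiny made-up graph: three segments, three links, one path.
EXAMPLE = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\tx\t1+,2+,3+\t*",
])
print(gfa_stats(EXAMPLE.splitlines()))
```

A pure-Python pass like this is also roughly what a slow reference implementation would look like, which makes it a convenient correctness oracle for a faster version.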

Before all that, maybe it would also make sense to focus on the 3 specific pipelines that Andrea graciously showed off on Slack:

odgi build -g LPA.gfa -o - | odgi extract -i - -o - -r chm13__LPA__tig00000001:0-100 | odgi paths -i - -L | wc -l
odgi build -g LPA.gfa -o - | odgi extract -i - -o - -r chm13__LPA__tig00000001:0-100 | odgi stats -i - -S
odgi build -g LPA.gfa -o - | odgi extract -i - -o - -r chm13__LPA__tig00000001:0-100 | odgi sort -i - -o - -O -p gYs | odgi viz -i - -o x.png

…which would entail:

Make extract actually work.
Implement sort.
Do something about visualization?
Expose enough through the Python API to allow this type of composition while keeping data structures in memory.
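
Until that in-memory composition exists, the odgi side of the comparison can at least be parameterized so the same three pipelines run over other graphs and regions. The sketch below only assembles the shell command strings quoted above; the helper names `extract_prefix` and `pipelines` and the dictionary keys are hypothetical, and nothing here touches FlatGFA's actual Python API.

```python
import shlex


def extract_prefix(gfa, region):
    """Shared front half of all three pipelines: build, then extract a region."""
    return (f"odgi build -g {shlex.quote(gfa)} -o - | "
            f"odgi extract -i - -o - -r {shlex.quote(region)}")


def pipelines(gfa, region, png="x.png"):
    """Return the three odgi pipelines as shell strings, keyed by a short label."""
    prefix = extract_prefix(gfa, region)
    return {
        "count_paths": f"{prefix} | odgi paths -i - -L | wc -l",
        "stats": f"{prefix} | odgi stats -i - -S",
        "sort_viz": (f"{prefix} | odgi sort -i - -o - -O -p gYs | "
                     f"odgi viz -i - -o {shlex.quote(png)}"),
    }


for label, cmd in pipelines("LPA.gfa", "chm13__LPA__tig00000001:0-100").items():
    print(f"{label}: {cmd}")
```

Each string can be handed to `subprocess.run(cmd, shell=True)` for timing. Note that this measures the whole pipe, including odgi's serialization between stages, which is exactly the overhead an in-memory version would avoid.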

On a very different level, we could consider something like sequence-to-graph alignment. The bottleneck there is unlikely to be the GFA representation, but who knows?

And finally: smoothxg, one of the steps in pggb, apparently composes the functionality from odgi chop and odgi sort. Maybe we can extract a sensible pipeline from that. For example, the CLI help strings state that:

prep is equivalent to odgi chop followed by odgi sort -p sYgs

…so that lays out a pretty specific pipeline.
