Beyond `.og`: a zoo of data representations #146

anshumanmohan · 2023-11-04T19:11:47Z

Back at MemPan23, we heard the following refrain a few times:

Pangenomics is one of those fields that have been "solved" by the discovery of the right data structure. That data structure is the GFA.

For our work thus far, we have relied heavily on ODGI's .og representation. As we know, this representation is an opinionated departure from the GFA with an eye to parallelizability.

Zoo of Data Representations

This issue's job is to bookmark the fact that the .og representation is just one point in a large search space. We should explore this space with additional quick prototypes that, like ODGI, start with GFA, produce some binary representation, and can be round-tripped losslessly to GFA.

Consider, for instance, a representation that analyzes a GFA and preprocesses it into a "path-centric" view. I am leaving the details vague, but the point is that this new representation throws no information away while being especially amenable to users who wish to run path-centric operations on the given graph: reads, such as the sequence traversed by a path, and writes, such as the insertion of new paths. We'd produce a path-centric binary .paths, analogous to the .og binary that ODGI produces.

The hope is that, for example, inserting a path into graph.paths will be:

Faster/cleaner than inserting a path into graph.gfa (because we have done some preprocessing).
Faster/cleaner than inserting a path into graph.og (because we have not done too much preprocessing and are not bogged down by too much cruft).
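To make the path-centric idea a little more concrete, here is a minimal in-memory sketch. Everything in it is hypothetical (`PathStore`, `Step`, the flat `steps` vector, and the name index are illustration only, not a proposed `.paths` layout): paths are the first-class objects, so inserting one is an append plus an index update, and reading a path's sequence is a single linear scan.

```rust
use std::collections::HashMap;

/// One step of a path: a segment id plus its orientation.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Step {
    pub segment: usize,
    pub forward: bool,
}

/// Hypothetical path-centric store: paths are indexed by name.
#[derive(Default, Debug)]
pub struct PathStore {
    /// Segment sequences, indexed by segment id.
    segments: Vec<String>,
    /// Every path's steps, stored back to back in one flat vector.
    steps: Vec<Step>,
    /// Path name -> (start, end) range into `steps`.
    index: HashMap<String, (usize, usize)>,
}

impl PathStore {
    /// Add a segment and return its id.
    pub fn add_segment(&mut self, seq: &str) -> usize {
        self.segments.push(seq.to_string());
        self.segments.len() - 1
    }

    /// Insert a path by appending its steps. Returns false if the name is
    /// already taken or a step refers to an unknown segment.
    pub fn insert_path(&mut self, name: &str, steps: &[Step]) -> bool {
        let unknown = steps.iter().any(|s| s.segment >= self.segments.len());
        if unknown || self.index.contains_key(name) {
            return false;
        }
        let start = self.steps.len();
        self.steps.extend_from_slice(steps);
        self.index.insert(name.to_string(), (start, self.steps.len()));
        true
    }

    /// Spell out the sequence traversed by a path, reverse-complementing
    /// the segments that are visited backward.
    pub fn path_sequence(&self, name: &str) -> Option<String> {
        let &(start, end) = self.index.get(name)?;
        let mut out = String::new();
        for step in &self.steps[start..end] {
            let seq = &self.segments[step.segment];
            if step.forward {
                out.push_str(seq);
            } else {
                out.extend(seq.chars().rev().map(complement));
            }
        }
        Some(out)
    }
}

/// Complement a single base; anything other than A/C/G/T is left as-is.
fn complement(base: char) -> char {
    match base {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        other => other,
    }
}
```

Note what this sketch does not have: any per-node index of which paths visit it. That omission is what keeps path insertion cheap here, and it is also why a node-centric query would be slow or unsupported on such a store.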

The .paths representation may totally lose to .og when it comes to, say, a node-centric operation, and that's okay. In fact, the .paths representation may disallow node-centric operations entirely. Some other representation, one that is node-centric, would probably hold its own against .og on those operations. We can explore a range of such data representations and see how we fare versus .og.

Using Our Zoo in a Principled Way

A follow-up challenge, then, is the principled creation of commands that run on these different representations.

An .og file has all the preprocessing in one place, so this task is relatively easy: just write one suite of commands that run on .og-represented graphs.

If we, for our purposes, preprocess a graph into a bunch of different representations, then we will need a representation-specific version of each high-level command. For instance, running an "insert path" command will look dramatically different on a graph that has been preprocessed to be path-centric versus one that has been preprocessed to be node-centric.

The hope is that our DSL and compiler will save the day. It will take from the user:

The high-level intent.
The data representation that they want to use.

And then it will:

Automatically create the representation-specific command required.
Run that command.
Report the answer.
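As a rough illustration of that split between intent and representation, here is a toy sketch. The names `Intent`, `Graph`, and `run` are hypothetical, and the `match` is hand-written where the real system would presumably generate the representation-specific code. The same high-level intent ("how many steps does this path have?") is answered in two very different ways depending on how the graph was preprocessed.

```rust
use std::collections::HashMap;

/// What the user wants, independent of any representation (hypothetical).
pub enum Intent<'a> {
    /// Count the steps of the named path.
    PathLength(&'a str),
}

/// Two toy representations of the same path information (hypothetical).
pub enum Graph {
    /// Path-centric: each path owns its list of node ids.
    PathCentric(HashMap<String, Vec<usize>>),
    /// Node-centric: each node lists the names of the paths that step on
    /// it, with one entry per step.
    NodeCentric(Vec<Vec<String>>),
}

/// Pick the representation-specific implementation for an intent and run it.
pub fn run(intent: &Intent<'_>, graph: &Graph) -> usize {
    match (intent, graph) {
        // Path-centric: a single lookup.
        (Intent::PathLength(name), Graph::PathCentric(paths)) => {
            paths.get(*name).map_or(0, |steps| steps.len())
        }
        // Node-centric: scan every node and count the matching entries.
        (Intent::PathLength(name), Graph::NodeCentric(nodes)) => nodes
            .iter()
            .flatten()
            .filter(|path| path.as_str() == *name)
            .count(),
    }
}
```

Both arms return the same answer for the same graph; only the cost differs. A compiler for the DSL would be responsible for producing each arm from one description of the intent.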


sampsyo · 2023-11-06T15:31:57Z

Awesome; thanks for getting this conversation started!

I think the first thing to do here is to build a round-trip converter between GFA and some binary format. This binary format should be the lowest-hanging fruit; it need not be performant or novel. Some options include (1) just the most convenient, direct binary analog to the GFA text format, or (2) an attempt to exactly replicate .og files. Whatever gets something working end-to-end (GFA -> binary, binary -> GFA) quickly is good.
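For a sense of how small the end-to-end loop can be, here is a sketch of something even simpler than option (1): it does not parse fields at all and just frames each GFA line as a length-prefixed record. The record layout (a little-endian `u32` length followed by the line's bytes) and the function names are assumptions for illustration, not a proposed format. Its only purpose is to show the GFA -> binary -> GFA check that any later format would also have to pass.

```rust
/// Encode GFA text as a sequence of records. Each record is a u32
/// little-endian length followed by that many bytes of one line.
/// (Lines longer than u32::MAX bytes are not handled in this sketch.)
pub fn gfa_to_binary(gfa: &str) -> Vec<u8> {
    let mut out = Vec::new();
    // Splitting on '\n' (rather than using `lines()`) keeps a trailing
    // newline as a final empty record, so the round trip is byte-exact.
    for line in gfa.split('\n') {
        out.extend_from_slice(&(line.len() as u32).to_le_bytes());
        out.extend_from_slice(line.as_bytes());
    }
    out
}

/// Decode the records back into GFA text. Returns None if the input is
/// truncated or a record is not valid UTF-8.
pub fn binary_to_gfa(mut bin: &[u8]) -> Option<String> {
    let mut lines = Vec::new();
    while !bin.is_empty() {
        if bin.len() < 4 {
            return None;
        }
        let (len_bytes, rest) = bin.split_at(4);
        let len = u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]) as usize;
        if rest.len() < len {
            return None;
        }
        let (line, rest) = rest.split_at(len);
        lines.push(std::str::from_utf8(line).ok()?);
        bin = rest;
    }
    Some(lines.join("\n"))
}
```

A round-trip test then amounts to checking that `binary_to_gfa(&gfa_to_binary(text))` gives back `text` for every input file. A format that parses fields would need a looser notion of equality if it normalizes anything, such as the order of lines or of optional tags.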

Then the goal will be to parameterize this converter so it can produce many different styles of binary format.

For implementation, we will need a GFA parser. We could either write our converter in Python (in which case we can just use mygfa) or do it in Rust (which will be better for performance, but in which case we need to find a Rust GFA parser library). I did a little poking around at the different options that are out there and found one that seems much better than the rest: rs-gfa. It is nice and simple and cleanly factored, and it supports "streaming" ingestion of GFA lines. Parsing an input GFA line by line is as easy as this:

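use gfa::parser::GFAParserBuilder;
use std::io::{self, BufRead};

fn main() {
    // Parse every line type, use `usize` segment IDs, and discard optional
    // fields (that is what the `()` type parameter selects).
    let parser = GFAParserBuilder::all().build_usize_id::<()>();

    // Stream the GFA from stdin, one line at a time.
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = parser.parse_gfa_line(line.unwrap().as_ref());
        // Print the parsed line; this panics on a malformed line.
        dbg!(line.unwrap());
    }
}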

anshumanmohan added the juicy and feature labels Nov 13, 2023

anshumanmohan self-assigned this Nov 13, 2023
