Beyond .og: a zoo of data representations #146

Open
anshumanmohan opened this issue Nov 4, 2023 · 1 comment
Labels
feature: New feature, or a meaningful extension to an existing feature
juicy: More interesting than a chore, yet well-defined enough to be completed without getting too lost

Comments

@anshumanmohan
Contributor

anshumanmohan commented Nov 4, 2023

Back at MemPan23, we heard the following refrain a few times:

Pangenomics is one of those fields that have been "solved" by the discovery of the right data structure. That data structure is the GFA.

For our work thus far, we have relied heavily on ODGI's .og representation. As we know, this representation is an opinionated departure from the GFA with an eye to parallelizability.

Zoo of Data Representations

This issue's job is to bookmark the fact that the .og representation is just one point in a large search space. We should explore this space with additional quick prototypes that, like ODGI, start with GFA, produce some binary representation, and can be round-tripped losslessly to GFA.

Consider, for instance, a representation that analyzes a GFA and preprocesses it into a "path-centric" view. I am leaving the details of this representation vague, but the point is that this new representation will throw no information away, but will somehow be especially amenable to users who wish to run path-centric operations (reads, such as the sequence traversed by a path; writes, such as the insertion of new paths) on the given graph. We'd produce a path-centric binary .paths, analogous to the .og binary that ODGI produces.
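
To make the idea a little more concrete, here is a minimal in-memory sketch of what a path-centric view could look like. Everything in it (`PathStore`, `Step`, `insert_path`, `path_sequence`) is a hypothetical name invented for illustration, and it says nothing about the on-disk layout of a `.paths` file. It only assumes that paths are stored as ordered, oriented steps over a table of segments.

```rust
use std::collections::HashMap;

/// One step of a path: a segment ID plus the orientation in which it is traversed.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Step {
    pub segment: usize,
    pub forward: bool,
}

/// Hypothetical path-centric store: paths are first-class, segments are a lookup table.
#[derive(Default)]
pub struct PathStore {
    segments: HashMap<usize, String>,  // segment ID -> nucleotide sequence
    paths: HashMap<String, Vec<Step>>, // path name -> ordered steps
}

impl PathStore {
    pub fn add_segment(&mut self, id: usize, seq: &str) {
        self.segments.insert(id, seq.to_string());
    }

    /// Write: insert a new path. Fails if any step refers to an unknown segment.
    pub fn insert_path(&mut self, name: &str, steps: Vec<Step>) -> Result<(), String> {
        if let Some(s) = steps.iter().find(|s| !self.segments.contains_key(&s.segment)) {
            return Err(format!("unknown segment {}", s.segment));
        }
        self.paths.insert(name.to_string(), steps);
        Ok(())
    }

    /// Read: spell out the sequence traversed by a path,
    /// reverse-complementing segments that are traversed backward.
    pub fn path_sequence(&self, name: &str) -> Option<String> {
        let steps = self.paths.get(name)?;
        let mut out = String::new();
        for step in steps {
            let seq = self.segments.get(&step.segment)?;
            if step.forward {
                out.push_str(seq);
            } else {
                out.extend(seq.chars().rev().map(complement));
            }
        }
        Some(out)
    }
}

/// Complement a single base; anything other than A, C, G, T is left unchanged.
fn complement(base: char) -> char {
    match base {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        other => other,
    }
}
```

For example, with segment 1 = ACG and segment 2 = TT, a path that walks 1 forward and then 2 backward spells ACGAA. Both operations touch only the path's own steps, which is the kind of locality a path-centric format would be hoping for.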

The hope is that, for example, inserting a path into graph.paths will be:

  1. Faster/cleaner than inserting a path into graph.gfa (because we have done some preprocessing).
  2. Faster/cleaner than inserting a path into graph.og (because we have not done too much preprocessing and are not bogged down by too much cruft).

The .paths representation may totally lose to .og when it comes to, say, a node-centric operation, and that's okay. In fact, the .paths representation may disallow node-centric operations entirely. Some other representation, one that is node-centric, would probably fare well against .og on those operations. We can explore a range of such data representations and see how each fares versus .og.

[Image: IMG 1505]

Using Our Zoo in a Principled Way

A follow-up challenge, then, is the principled creation of commands that run on these different representations.

An .og file has all the preprocessing in one place, so this task is relatively easy: just write one suite of commands that run on .og-represented graphs.

If we, for our purposes, preprocess a graph into a bunch of different representations, then we will need a representation-specific version of each high-level command. For instance, running an "insert path" command will look dramatically different on a graph that has been preprocessed to be path-centric versus one that has been preprocessed to be node-centric.

The hope is that our DSL and compiler will save the day. It will take from the user:

  • The high-level intent.
  • The data representation that they want to use.

And then it will:

  • Automatically create the representation-specific command required.
  • Run that command.
  • Report the answer.
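
The intent-plus-representation idea can be sketched as a simple dispatch. This is a toy under heavy assumptions: `Intent`, `Graph`, and `run` are hypothetical names, the two representations below are not lossless (the node-centric one forgets step order), and a real compiler would generate the representation-specific command rather than hard-code a `match`.

```rust
use std::collections::BTreeMap;

/// High-level intent, independent of how the graph is stored (hypothetical).
pub enum Intent<'a> {
    /// How many steps does this path take?
    PathLength(&'a str),
    /// How many path steps cross this segment?
    NodeDepth(usize),
}

/// Two toy representations of the same (path, segment) step information.
pub enum Graph {
    /// Path name -> ordered segment IDs.
    PathCentric(BTreeMap<String, Vec<usize>>),
    /// Segment ID -> names of the paths stepping on it (one entry per step).
    NodeCentric(BTreeMap<usize, Vec<String>>),
}

/// Pick the representation-specific computation for an intent, run it,
/// and report the answer.
pub fn run(intent: &Intent, graph: &Graph) -> usize {
    match (intent, graph) {
        // Direct lookup: the representation matches the intent.
        (Intent::PathLength(name), Graph::PathCentric(paths)) => {
            paths.get(*name).map_or(0, |steps| steps.len())
        }
        (Intent::NodeDepth(id), Graph::NodeCentric(nodes)) => {
            nodes.get(id).map_or(0, |names| names.len())
        }
        // Full scan: the representation is a poor fit for the intent.
        (Intent::PathLength(name), Graph::NodeCentric(nodes)) => {
            nodes.values().flatten().filter(|n| n.as_str() == *name).count()
        }
        (Intent::NodeDepth(id), Graph::PathCentric(paths)) => {
            paths.values().flatten().filter(|s| *s == id).count()
        }
    }
}
```

The same intent returns the same answer on either representation; only the cost differs (a map lookup versus a scan over every step), which is the trade-off the user would be choosing when they pick a representation.
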
@sampsyo
Collaborator

sampsyo commented Nov 6, 2023

Awesome; thanks for getting this conversation started!

I think the first thing to do here is to build a round-trip converter between GFA and some binary format. This binary format should be the lowest-hanging fruit; it need not be performant or novel. Some options include (1) just the most convenient, direct binary analog to the GFA text format, or (2) an attempt to exactly replicate .og files. Whatever gets something working end-to-end (GFA -> binary, binary -> GFA) quickly is good.
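
As a rough illustration of option (1), here is a std-only sketch of a round trip for segment lines alone. The record layout (little-endian `u32` ID, little-endian `u32` length, then the sequence bytes) and the names `Segment`, `parse_gfa`, `encode`, `decode`, and `to_gfa` are all assumptions made up for this example. It handles only `S` lines with numeric names and no optional fields, and it rejects everything else rather than silently dropping it.

```rust
/// A GFA segment line such as "S\t1\tACGT" (numeric name, no optional fields).
#[derive(Debug, PartialEq)]
pub struct Segment {
    pub id: u32,
    pub seq: String,
}

/// GFA text -> segments. Rejects any line this sketch does not model.
pub fn parse_gfa(text: &str) -> Result<Vec<Segment>, String> {
    let mut segments = Vec::new();
    for line in text.lines() {
        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            ["S", id, seq] => {
                let id = id
                    .parse::<u32>()
                    .map_err(|e| format!("bad segment name {id}: {e}"))?;
                segments.push(Segment { id, seq: seq.to_string() });
            }
            _ => return Err(format!("unsupported line: {line}")),
        }
    }
    Ok(segments)
}

/// Segments -> binary: [id: u32 LE][len: u32 LE][len sequence bytes] per record.
pub fn encode(segments: &[Segment]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in segments {
        out.extend_from_slice(&s.id.to_le_bytes());
        out.extend_from_slice(&(s.seq.len() as u32).to_le_bytes());
        out.extend_from_slice(s.seq.as_bytes());
    }
    out
}

/// Binary -> segments. Fails on truncated input or non-UTF-8 sequence bytes.
pub fn decode(mut bytes: &[u8]) -> Result<Vec<Segment>, String> {
    let mut segments = Vec::new();
    while !bytes.is_empty() {
        let id = read_u32(&mut bytes)?;
        let len = read_u32(&mut bytes)? as usize;
        if bytes.len() < len {
            return Err("truncated sequence".to_string());
        }
        let (seq, rest) = bytes.split_at(len);
        let seq = String::from_utf8(seq.to_vec()).map_err(|e| e.to_string())?;
        segments.push(Segment { id, seq });
        bytes = rest;
    }
    Ok(segments)
}

/// Read one little-endian u32 and advance the slice past it.
fn read_u32(bytes: &mut &[u8]) -> Result<u32, String> {
    if bytes.len() < 4 {
        return Err("truncated integer".to_string());
    }
    let (head, rest) = bytes.split_at(4);
    *bytes = rest;
    Ok(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
}

/// Segments -> GFA text, one S line per segment.
pub fn to_gfa(segments: &[Segment]) -> String {
    segments
        .iter()
        .map(|s| format!("S\t{}\t{}\n", s.id, s.seq))
        .collect()
}
```

Chaining `parse_gfa`, `encode`, `decode`, and `to_gfa` reproduces the input text exactly when every line is a canonical `S` line ending in a newline. A name like `01` would come back as `1`, so a converter that must be lossless on arbitrary GFA would need to keep names as strings, or record the original spelling some other way.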

Then the goal will be to parameterize this converter so it can produce many different styles of binary format.
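
One way the parameterization might look, sketched for a single nucleotide sequence: a trait that each "style" implements, plus a generic round-trip check. The trait `SeqCodec` and the two styles `Plain` and `Packed` are hypothetical, chosen only to show two points in the space. `decode` assumes well-formed input produced by the matching `encode`.

```rust
/// A hypothetical "style" of binary encoding for one nucleotide sequence.
pub trait SeqCodec {
    fn encode(&self, seq: &str) -> Result<Vec<u8>, String>;
    /// Assumes `bytes` came from this codec's `encode`; panics otherwise.
    fn decode(&self, bytes: &[u8]) -> String;
}

/// Style 1: one byte per base, i.e. the text itself.
pub struct Plain;

/// Style 2: a u32 LE base count, then two bits per base (A, C, G, T only).
pub struct Packed;

impl SeqCodec for Plain {
    fn encode(&self, seq: &str) -> Result<Vec<u8>, String> {
        Ok(seq.as_bytes().to_vec())
    }
    fn decode(&self, bytes: &[u8]) -> String {
        String::from_utf8(bytes.to_vec()).expect("Plain bytes are UTF-8")
    }
}

impl SeqCodec for Packed {
    fn encode(&self, seq: &str) -> Result<Vec<u8>, String> {
        let mut out = (seq.len() as u32).to_le_bytes().to_vec();
        for chunk in seq.as_bytes().chunks(4) {
            let mut byte = 0u8;
            for (i, base) in chunk.iter().enumerate() {
                let code: u8 = match *base {
                    b'A' => 0,
                    b'C' => 1,
                    b'G' => 2,
                    b'T' => 3,
                    other => return Err(format!("unsupported base {}", other as char)),
                };
                // Base i of the chunk lives in bits 2i and 2i+1.
                byte |= code << (2 * i);
            }
            out.push(byte);
        }
        Ok(out)
    }
    fn decode(&self, bytes: &[u8]) -> String {
        let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
        (0..len)
            .map(|i| {
                let code = (bytes[4 + i / 4] >> (2 * (i % 4))) & 0b11;
                ['A', 'C', 'G', 'T'][code as usize]
            })
            .collect()
    }
}

/// Does this style reproduce the sequence exactly?
pub fn round_trips(codec: &dyn SeqCodec, seq: &str) -> bool {
    match codec.encode(seq) {
        Ok(bytes) => codec.decode(&bytes) == seq,
        Err(_) => false,
    }
}
```

The two styles already disagree in a way that matters for losslessness: `Packed` stores a nine-base sequence in seven bytes instead of nine, but it refuses anything outside A, C, G, T (an `N`, for instance), whereas `Plain` accepts any text. A parameterized converter would presumably need a check like `round_trips` to confirm that a chosen style is actually lossless on a given GFA.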

For implementation, we will need a GFA parser. We could either write our converter in Python (in which case we can just use mygfa) or do it in Rust (which will be better for performance, but in which case we need to find a Rust GFA parser library). I did a little poking around at the different options that are out there and found one that seems much better than the rest: rs-gfa. It is nice and simple and cleanly factored, and it supports "streaming" ingestion of GFA lines. Parsing an input GFA line by line is as easy as this:

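use gfa::parser::GFAParserBuilder;
use std::io::{self, BufRead};

fn main() {
    // Parse all GFA line types, with usize segment IDs; `()` ignores optional fields.
    let parser = GFAParserBuilder::all().build_usize_id::<()>();

    // Stream GFA from stdin, parsing and printing one line at a time.
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = parser.parse_gfa_line(line.unwrap().as_ref());
        dbg!(line.unwrap());
    }
}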

anshumanmohan added the juicy and feature labels Nov 13, 2023
anshumanmohan self-assigned this Nov 13, 2023