mygfa doesn't parse optional CIGAR strings #124

Open
susan-garry opened this issue Jul 17, 2023 · 3 comments
Labels
triage required: Ideas that need further discussion, whether asynchronous or synchronous, before we can take action

Comments

@susan-garry
Contributor

There are a number of GFA features that mygfa doesn't account for yet, so I'm not sure how high a priority this fix should be, but this issue is preventing mygfa from parsing odgi's generated GFA files.

Note that this is the GFA specification that I'm using as a reference: http://gfa-spec.github.io/GFA-spec/GFA1.html

Essentially, mygfa can't parse certain CIGAR strings, which (as I understand it) specify the alignment of two segments. This particular issue shows up wherever an "alignment" string appears, for example in links:

L 1 + 2 + 0M

and paths:

P path1 1+,2+,2+ 0M,0M

The last column of these lines represents a CIGAR string (or list of CIGAR strings). My understanding is that in either case, this string can be replaced with *:

L       1       +       2       +       *
P       path1   1+,2+,2+         *

This indicates that the overlap is unspecified. According to the spec, if unspecified, "the CIGAR strings are determined by fetching the CIGAR string from the corresponding link records, or by performing a pairwise overlap alignment of the two sequences." I'm not yet sure what the latter involves or how difficult it would be to implement, but the former suggests that, to support this, we may want to pre-process GFA files and sort the lines by type so that Path lines are parsed after Link lines.

@anshumanmohan, based on your knowledge of overlap, does this sound doable? How much of a priority should this be?
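
To make the "fetch from the link records" idea concrete, here is a minimal Python sketch. It is not mygfa's actual code: `parse_cigar` and `resolve_path_overlaps` are hypothetical names, and the sketch assumes tab-separated GFA1 lines. Instead of sorting the file, it makes two passes (links first, then paths), which has the same effect. It does not attempt the pairwise overlap alignment fallback, and it only looks up links in the orientation in which they are written.

```python
import re
from typing import Dict, List, Optional, Tuple

# One CIGAR operation as (length, op), e.g. "4M" -> (4, "M").
CigarOp = Tuple[int, str]
Overlap = Optional[List[CigarOp]]  # None means "unspecified"
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")


def parse_cigar(field: str) -> Overlap:
    """Parse one CIGAR string; "*" means unspecified and yields None."""
    if field == "*":
        return None
    ops = _CIGAR_OP.findall(field)
    # Reject empty input and input the regex did not fully consume.
    if not ops or "".join(n + op for n, op in ops) != field:
        raise ValueError(f"bad CIGAR string: {field!r}")
    return [(int(n), op) for n, op in ops]


def resolve_path_overlaps(gfa_text: str) -> Dict[str, List[Overlap]]:
    """Return each path's overlaps, filling "*" from matching link lines."""
    rows = [line.split("\t") for line in gfa_text.splitlines() if line.strip()]

    # Pass 1: index links by (from, from_orient, to, to_orient).
    links: Dict[Tuple[str, str, str, str], Overlap] = {}
    for f in rows:
        if f[0] == "L":
            links[(f[1], f[2], f[3], f[4])] = parse_cigar(f[5])

    # Pass 2: paths, wherever they appear in the file.
    result: Dict[str, List[Overlap]] = {}
    for f in rows:
        if f[0] != "P":
            continue
        steps = [(s[:-1], s[-1]) for s in f[2].split(",")]
        pairs = list(zip(steps, steps[1:]))
        if f[3] == "*":
            overlaps: List[Overlap] = [None] * len(pairs)
        else:
            overlaps = [parse_cigar(c) for c in f[3].split(",")]
        # Keep an explicit overlap; otherwise fall back to the link, if any.
        result[f[1]] = [
            ov if ov is not None else links.get((a[0], a[1], b[0], b[1]))
            for (a, b), ov in zip(pairs, overlaps)
        ]
    return result


# The path line comes before the link line on purpose.
example = "P\tpath1\t1+,2+,2+\t*\nL\t1\t+\t2\t+\t0M\n"
print(resolve_path_overlaps(example))  # {'path1': [[(0, 'M')], None]}
```

The second overlap stays `None` because the example has no link from `2+` to `2+`; that is the case where the spec's alignment fallback would be needed.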

@anshumanmohan
Contributor

We can parse these; we just don't do a particularly careful job, because odgi doesn't seem to either. See #80 for more.

> but this issue is preventing mygfa from parsing odgi's generated GFA files.

Is there a specific point where this seems to be breaking?
Could you please say more, or maybe push a minimal breaking example in a branch?

@anshumanmohan
Contributor

Happy to be outvoted on this, but I don't think that going over the paths and actually computing overlaps is of interest to us. If that's a feature you're proposing, I'd put that at rather low priority. If something is breaking because of the current treatment, that's high priority for sure!

@sampsyo
Collaborator

sampsyo commented Jul 18, 2023

I think this is another "YAGNI" situation: let's parse these if (and only if) we know of a specific odgi command that needs to know about them. @susan-garry, did you have a specific command that needs to process overlaps?

@anshumanmohan added the "triage required" label Nov 13, 2023