Relational semantics in the Pollen DSL #75

sampsyo · 2023-04-29T00:45:32Z

sampsyo
Apr 29, 2023
Maintainer

This is a philosophy-level post about the design of our DSL. For simplicity of exposition, I'm writing it in a style that's like "here's how it is" rather than "here's a humble proposal for how it might be." But I really do mean it as a way to set up a debate, start discussion, etc., and I don't mean it to be set in stone at all. (I think some of this may be painfully obvious and some of it may be more controversial, and I would love to find where that line is.)

What is the goal of the Pollen DSL?

We want a DSL that has abstract semantics and gets good performance. Specifically, the kind of abstract semantics we're imagining can be called "relational semantics," wherein every program appears to be directly manipulating the relations defined elsewhere in relational.md. The advantages of relational semantics should be:

easy for programmers to think about, and harder to write buggy code
parallelizable
possible to retarget to different physical implementations (data structures)
portable to different hardware (CPU and FPGA, most importantly)

The problem that makes this research is that an "obvious" implementation of relational programs would be slow. That "obvious" implementation would, basically, be an interpreter that is implemented using only one, extremely basic kind of data structure: the big sets that make up the fundamental relational representation. The thing that makes this DSL design (and compiler design) problem both interesting and hard is that we want to exploit high-performance data structures without breaking the relational abstraction.

What does "relational semantics" really mean?

(For simplicity, this section omits "links"/"edges" altogether. But I think the principles generalize.)

From the programmer's perspective, the way programs work is indistinguishable from a situation where we implement Pollen with the dumbest, slowest possible data structures. That is, it should look to the user exactly like the only data structures that exist to represent a given GFA-style variation graph are:

sets of opaque identities (names) for segments and paths
a relation mapping segments to their concrete nucleotide sequences ("strands")
a relation that defines all the steps for all the paths over all the segments

In other words, the interesting stuff happens in one, big relation $R \subseteq \text{Path} \times \text{Index} \times \text{Direction} \times \text{Segment}$.

Now, our DSL will of course not want to make people interact directly with this big relation $R$ all the time! It would be really inconvenient to write programs that have to iterate over the huge set $R$ just to get, for example, the sequence of steps that make up a given path or the set of paths that "cross" a given segment. So, we will have shortcuts—but these shortcuts are all semantically defined in terms of how they read or write the sets above. For example:

We want to let people write loops that look something like for step in path.steps {...}. Each value step in the body of that loop should look like a record with fields like path, index, and handle. However, the idea is that, from a semantic point of view, the programmer shouldn't need to think of this as iterating over a special per-path step set data structure that actually, literally exists. Instead, iterating over path.steps should (again, semantically speaking) work as if it looked in the big relation $R$ to find all the tuples whose first element is path. A naive implementation would literally do this: it would have a big set written down called R, and to implement this loop, it would do something like the Python expression sorted([Step(path, index, Handle(dir, seg)) for (path_, index, dir, seg) in R if path_ == path], key=lambda s: s.index). That is, it would slowly walk over the entire set R to pull out all the data it needs.
We want to let people access the set of steps that "cross" a given segment by writing seg.steps. Again, in terms of semantics, the program should look as if it is computing this set on the fly by looking at the one, true canonical set $R$.
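To make the two shortcuts concrete, here is a minimal Python sketch of their naive meaning. The names `Step`, `Handle`, `path_steps`, and `seg_steps` are hypothetical, and `R` is modeled as a plain set of `(path, index, direction, segment)` tuples. This only illustrates the semantics; it is not how a real implementation is expected to work.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    direction: str  # "+" or "-" (an assumed encoding of Direction)
    segment: str


class Step(NamedTuple):
    path: str
    index: int
    handle: Handle


# The one canonical relation R: a set of (path, index, direction, segment).
R = {
    ("p1", 0, "+", "s1"),
    ("p1", 1, "+", "s2"),
    ("p2", 0, "-", "s2"),
}


def path_steps(R, path):
    """Meaning of `path.steps`: scan all of R, keep this path's tuples, in index order."""
    return sorted(
        (Step(p, i, Handle(d, s)) for (p, i, d, s) in R if p == path),
        key=lambda st: st.index,
    )


def seg_steps(R, seg):
    """Meaning of `seg.steps`: scan all of R, keep the tuples that cross this segment."""
    return {Step(p, i, Handle(d, s)) for (p, i, d, s) in R if s == seg}
```

Both functions walk the whole of `R` on every call, which is exactly the slowness that a faster implementation would want to avoid without changing the answers.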

It's worth noting that, in proper odgi itself, the set seg.steps is not an illusion: odgi actually pre-computes a set of steps associated with every segment (which it calls a "node"). This is good for performance! But our goal here, in defining the relational semantics, is to divorce ourselves from that kind of implementation concern. We would really like it if high-performance implementations of the Pollen DSL would also actually reify this seg.steps set for quick access---but, according to our goals set out in the previous answer, we also want this reification to be invisible to the programmer. That is, looking only at the outputs of programs and not at performance, a programmer should not be able to tell apart different implementations of Pollen---for instance, the naive one that uses R and one that literally uses odgi's data structures. No matter how clever a programmer might get with writing tricky programs, we would like to guarantee that any program produces the same answer in either implementation.

How can you tell that we have relational semantics?

A good way to understand programming language semantics is through litmus tests: little programs with two plausible outputs, and whose output tells you which of two semantic worlds we're living in. So one thing we can do is write little litmus-test programs and say "under relational semantics, the output would be X; under some not-quite-right semantics, the output would be Y."

To establish a baseline, let's write a "no-op" Pollen program that re-emits an input graph unmodified. (This involves imagining a little bit about our syntax and features and stuff; please forgive any mismatches with current proposals or the current parser implementation.) Here's a no-op:

graph g;
parset out_segs[Segment, g];
parset out_paths[Path, g];
parset out_steps[Step, g];

for seg in graph.segments {
  emit seg to out_segs;
}
for path in graph.paths {
  emit path to out_paths;
}
for step in graph.steps {
  emit step to out_steps;
}

From here on out, I'm going to omit the graph and parset declarations at the top for brevity. Here's a different program with identical (no-op) semantics:

for seg in graph.segments {
  emit seg to out_segs;
  for step in seg.steps {
    emit step to out_steps;
  }
}
for path in graph.paths {
  emit path to out_paths;
}

That is, walking over the steps from the "point of view" of individual segments should yield the same overall set of steps as iterating over the monolithic, aggregate set graph.steps. After all, semantically speaking, each seg.steps is not actually a separate little set; it is just a convenient way to pull out a little subset of the big graph.steps. In the same sense, this program is also (equivalently) a no-op:

for seg in graph.segments {
  emit seg to out_segs;
}
for path in graph.paths {
  emit path to out_paths;
  for step in path.steps {
    emit step to out_steps;
  }
}

One last version of the program should also be a no-op, given our decision that emit is deduplicating:

for seg in graph.segments {
  emit seg to out_segs;
  for step in seg.steps {
    emit step to out_steps;
  }
}
for path in graph.paths {
  emit path to out_paths;
  for step in path.steps {
    emit step to out_steps;
  }
}

That is, we're redundantly emitting every single step twice: once from the segment's point of view, and once from the path's point of view. But the steps are always the same, so the output set is just identical to graph.steps.
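The equivalence of these no-op variants can be checked on a toy model in Python. This is only a sketch under assumptions: steps are bare `(path, index, direction, segment)` tuples, the deduplicating `emit` is modeled as set union, and the helper names `seg_view` and `path_view` are made up for illustration.

```python
# Toy model of graph.steps as one set of (path, index, direction, segment).
steps = {
    ("p1", 0, "+", "s1"),
    ("p1", 1, "+", "s2"),
    ("p2", 0, "-", "s2"),
}
segments = {s for (_, _, _, s) in steps}
paths = {p for (p, _, _, _) in steps}


def seg_view(seg):
    """seg.steps: the subset of the big relation that crosses `seg`."""
    return {t for t in steps if t[3] == seg}


def path_view(path):
    """path.steps (ignoring order): the subset belonging to `path`."""
    return {t for t in steps if t[0] == path}


# Second no-op: emit steps from each segment's point of view.
via_segments = set()
for seg in segments:
    via_segments |= seg_view(seg)  # set union models the deduplicating emit

# Third no-op: emit steps from each path's point of view.
via_paths = set()
for path in paths:
    via_paths |= path_view(path)

# Fourth no-op: emit everything twice; deduplication hides the redundancy.
via_both = via_segments | via_paths
```

All three output sets are equal to `steps`, because each view is just a filter over the same relation.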

Here is an extremely contrived program that acts as a litmus test:

for seg in graph.segments {
  emit seg to out_segs;
  for step in seg.steps {
    emit Step { segment: graph.segments[0], ...step} to out_steps;
  }
}
for path in graph.paths {
  emit path to out_paths;
}

This program "modifies" the steps, in the first loop, to always cross the first segment. The relevant question about this program is: after it finishes, if you were to look at path.steps for every path, would you see those changes (i.e., would you see every step crossing the same segment)? Or would you not---meaning that we would have "inconsistent" views of all the steps depending on whether we looked at them using path.steps for every path or seg.steps for every segment?

Under relational semantics, we would see the changes regardless of where we view them from. Semantically, there is only one big step relation; it is not possible for path.steps and seg.steps to "disagree." You can imagine a different semantics---one in which seg.steps is actually a different set of data from what you see from path.steps---under which this would not be true, i.e., "changing" the steps from the point of view of segments does not affect them from the point of view of paths, so they can become inconsistent.

In other words, this kind of inconsistency, where the data looks different from different viewpoints, is impossible under relational semantics.
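Here is a rough Python rendering of the litmus test, contrasting the two worlds. Everything in it is an assumption made for illustration: the tuple encoding of steps, the helper names, and especially the "split" model, which stands in for a hypothetical implementation that keeps a separate per-path copy of the steps.

```python
# Input relation: (path, index, direction, segment).
R = {("p1", 0, "+", "s1"), ("p1", 1, "+", "s2")}
first_seg = "s1"  # stands in for graph.segments[0]

# The litmus program: re-emit every step, redirected to the first segment.
out_R = {(p, i, d, first_seg) for (p, i, d, _s) in R}


def path_steps(rel, path):
    return {t for t in rel if t[0] == path}


def seg_steps(rel, seg):
    return {t for t in rel if t[3] == seg}


# Relational semantics: both views are queries over the single relation out_R.
relational_by_path = path_steps(out_R, "p1")
relational_by_seg = seg_steps(out_R, first_seg)

# Hypothetical non-relational semantics: the per-path store is its own copy
# of the data and never hears about the rewrite done through seg.steps.
split_by_path = path_steps(R, "p1")  # stale copy taken from the input
split_by_seg = seg_steps(out_R, first_seg)
```

In the relational world the two views agree and every step crosses `s1`. In the split world the per-path view still shows a step on `s2`, which is the disagreement the litmus test is designed to expose.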

Is it still possible to do bad things, even within relational semantics?

Yes! While relational semantics prevents the aforementioned form of inconsistency, this kind of inconsistency is not the only bad thing that programs can do.

A big category of bad thing that programs can still do is to refer to objects that don't exist. For example, how about this program:

for seg in graph.segments {
  emit seg to out_segs;
}
for path in graph.paths[:5] {
  emit path to out_paths;
}
for step in graph.steps {
  emit step to out_steps;
}

That graph.paths[:5] is imaginary syntax that just iterates over the first 5 paths, arbitrarily chosen. So this program is like a no-op, except that it omits most of the paths. The result is that we will emit a bunch of steps that refer to paths that don't exist. This is bad, but it's not prevented by relational semantics.

anshumanmohan · 2023-05-03T20:02:48Z

anshumanmohan
May 3, 2023
Maintainer

Thanks for this, Adrian! It is super helpful to see it all written out. By and large, this lines up with my intuition nicely. This has clarified one misunderstanding of mine, and I think I can hazard an explanation of Susan's concern using this background. Those are below.

My misunderstanding: pre-computations versus views

I've been saying that seg.steps is pre-computed for the user's convenience, and so, after certain kinds of modification to the graph, we need to perform housekeeping on seg.steps to make it consistent with the new graph. I say "certain kinds" of modification because

Simply modifying the sequences of segments (à la crush) will not change seg.steps, so ideally we cleverly avoid this no-op housekeeping.
Adding paths to the graph (à la inject) will change seg.steps, so ideally we perform appropriate housekeeping.

But of course the naïve version of this is to just always do housekeeping, no-op or not.

Now I see that, living purely in relational semantics-land, this is a non-issue. seg.steps is just a view into $R$. The modifications that would prompt us to housekeep seg.steps are themselves doing the housekeeping! They aren't even doing anything fancy; the very act of modifying $R$ for algorithmic purposes is also modifying $R$ for housekeeping purposes. Indeed, housekeeping is a fiction.

A high-performance implementation of the relational model may very well pre-compute seg.steps, meaning that their seg.steps will not be a "live view" of $R$ but a "snapshot" of $R$ at the moment of pre-computation. They will have to occasionally recompute, and they can do whatever fanciness they want to recompute as infrequently as they can get away with. There is a naïve way, and it lines up with that above: just re-dip in the holy waters of $R$ every time $R$ is modified.
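A small Python sketch of that "snapshot plus naive recompute" strategy, under loud assumptions: the `Graph` class, its method names, and the tuple encoding are all hypothetical, and real implementations would presumably invalidate far more selectively.

```python
class Graph:
    """Hypothetical implementation that precomputes seg.steps but hides that fact."""

    def __init__(self, R):
        self._R = set(R)
        self._seg_index = None  # snapshot of R grouped by segment; None means stale

    def add_step(self, path, index, direction, seg):
        self._R.add((path, index, direction, seg))
        # Naive housekeeping: throw the snapshot away on every write to R.
        self._seg_index = None

    def seg_steps(self, seg):
        if self._seg_index is None:
            # "Re-dip" into R: rebuild the whole per-segment index.
            index = {}
            for t in self._R:
                index.setdefault(t[3], set()).add(t)
            self._seg_index = index
        return set(self._seg_index.get(seg, set()))


def naive_seg_steps(R, seg):
    """Reference meaning of seg.steps: a live filter over R."""
    return {t for t in R if t[3] == seg}
```

The correctness condition is that `seg_steps` always returns what `naive_seg_steps` would return on the current `R`; how rarely the index is rebuilt is purely a performance question.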

Susan's concern (I think)

I think that Susan's concern falls into your "Is it still possible to do bad things" category. Susan still needs to opine, of course, but IIRC she was afraid of something like:

// A program that drops all of graph in_g's paths

graph out_g;
parset out_segs[Segment, out_g];
parset out_paths[Path, out_g];
parset out_steps[Step, out_g];

for seg in in_g.segments {
  emit seg to out_segs;
}
for step in in_g.steps {
  emit step to out_steps;
}

We held the user's hand in creating out_steps, but it's bogus because it talks about paths that don't exist in the graph out_g. I think you'll both agree that this is a legal program that creates an inconsistency of the second flavor. Is that accurate, Susan, or was your concern something else?

3 replies

sampsyo May 3, 2023
Maintainer Author

Cool!! Yes, this sums it up well:

A high-performance implementation of the relational model may very well pre-compute seg.steps, meaning that their seg.steps will not be a "live view" of $R$ but a "snapshot" of $R$ at the moment of pre-computation. They will have to occasionally recompute, and they can do whatever fanciness they want to recompute as infrequently as they can get away with.

In other words, you do have to do exactly the bookkeeping you were imagining—if you have chosen an implementation that does that "pre-computation." The way you decide what bookkeeping is necessary is: "what must I do to make sure the user always sees relational semantics, i.e., they can't tell I have chosen this implementation?"

sampsyo May 3, 2023
Maintainer Author

And good example of a simple kind of inconsistency. The main point I was attempting to make last time (poorly) was that this is still a thing to worry about, even if we have relational semantics. To break down the options for how to deal with that sort of inconsistency, there is:

1. ignore it (let people produce bad output)
2. dynamically catch it and throw an error
3. dynamically catch it and silently fix it, e.g., automatically emit all paths associated with all segments we emit
4. statically rule it out and reject programs that might do this

As ever, all of these are worthy options, but I think we should do the simplest possible one. That could be 1 or 3, probably.
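For a sense of how small the dynamic options are, here is a hedged Python sketch of options 1 through 3 applied to the "steps refer to missing paths" example. The function `finalize`, its `policy` argument, and the tuple encoding are invented for illustration. The "repair" branch is one possible reading of the silent fix: it adds every path that an emitted step refers to. Option 4 is a static analysis and is not sketched here.

```python
def finalize(out_paths, out_steps, policy="ignore"):
    """Return the (paths, steps) to write out, given a policy for dangling path references.

    Steps are (path, index, direction, segment) tuples.
    """
    dangling = {p for (p, _, _, _) in out_steps} - set(out_paths)

    if policy == "ignore" or not dangling:
        # Option 1: let the bad output through unchanged.
        return set(out_paths), set(out_steps)
    if policy == "error":
        # Option 2: catch it dynamically and complain.
        raise ValueError(f"steps refer to missing paths: {sorted(dangling)}")
    if policy == "repair":
        # Option 3 (one reading): silently add the paths the steps mention.
        return set(out_paths) | dangling, set(out_steps)
    raise ValueError(f"unknown policy: {policy}")
```

For example, `finalize({"p1"}, steps, "repair")` on steps that also mention `p2` returns a path set containing both `p1` and `p2`.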

anshumanmohan May 3, 2023
Maintainer

Agreed; I think we vaguely shook hands on option 1 last time!

sampsyo · 2023-06-08T21:50:19Z

sampsyo
Jun 8, 2023
Maintainer Author

Capturing one important outcome from today's synchronous discussion: I think the relational semantics is super important for defining the invariants that define a "valid graph." There are two reasons: (1) Relational semantics eliminate one category of possible weirdness, i.e., disagreements that stem from different "views" disagreeing on what data exists in a graph. (2) Thinking of the graph as a relation makes it easy to define the invariants that make a graph valid, i.e., that describe any remaining weirdnesses that are not ruled out by relational semantics. (It does not, however, by itself tell us how to enforce those invariants.)

Here are all the well-formed-graph invariants I can think of:

Basic identifier existence: every element $t \in R$ only contains members of the constituent sets, namely $\text{Path}$ and $\text{Segment}$, that actually exist in those sets.
Index uniqueness: for any pair of distinct tuples belonging to the same path, $(p, i, d, s) \in R$ and $(p, i', d', s') \in R$, they do not share the same index, i.e., $i \neq i'$.
Index contiguousness: for each path $p$, consider all the steps $(p, i, d, s) \in R$. Now project each $i$ from these tuples, yielding a set we will call $\text{indices}(p)$ that contains all the indices on all the steps for the path $p$. The size of this set is exactly $\text{max}(\text{indices}(p)) + 1$, i.e., it is the set of natural numbers from 0 through the maximum with no missing indices.
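The three invariants above translate fairly directly into a checker. This Python sketch assumes the same tuple encoding of $R$ used informally in this thread, assumes indices are natural numbers, and uses a made-up function name; it says nothing about when or whether such a check should run.

```python
def check_invariants(paths, segments, R):
    """Return a list of violation messages; an empty list means the graph is well formed.

    R is a set of (path, index, direction, segment) tuples.
    """
    problems = []

    # Basic identifier existence: every tuple names a real path and a real segment.
    for (p, _i, _d, s) in sorted(R):
        if p not in paths:
            problems.append(f"step refers to missing path {p!r}")
        if s not in segments:
            problems.append(f"step refers to missing segment {s!r}")

    # Collect the indices used by each path, keeping duplicates.
    indices = {}
    for (p, i, _d, _s) in sorted(R):
        indices.setdefault(p, []).append(i)

    for p, idxs in sorted(indices.items()):
        # Index uniqueness: no two steps on one path share an index.
        if len(idxs) != len(set(idxs)):
            problems.append(f"path {p!r} has duplicate indices")
        # Index contiguousness: the indices are exactly 0 through the maximum.
        if set(idxs) != set(range(max(idxs) + 1)):
            problems.append(f"path {p!r} has missing indices")

    return problems
```

A path with steps at indices 0 and 2 but not 1 is reported as having missing indices, while two steps that both claim index 0 are reported as duplicates.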

Defining these invariants is a separate question from how to enforce them, as discussed above in #75 (reply in thread).

2 replies

susan-garry Jul 13, 2023
Collaborator

Very well said! I think this addresses my concern regarding programmers being able to write Pollen code that generates invalid graphs. The possible solutions are listed above, and currently our plan seems to be to simply generate bad graphs, but we could use the relational semantics to dynamically throw an error when a bad graph is generated.

I can't think of any other invariants that must be satisfied for a graph to be valid under the relational semantics, though we will have to revisit these invariants if we ever add support for links/edges.

sampsyo Jul 13, 2023
Maintainer Author

Yeah, the whole "paths are consistent with links" issue, as enforced by odgi validate, is another interesting invariant to consider—one that is a bit "deeper" than the purely structural ones I listed above.
