Replies: 2 comments 5 replies
-
Thanks for this, Adrian! It is super helpful to see it all written out. By and large, this lines with my intuition nicely. This has clarified one misunderstanding of mine, and I think I can hazard an explanation of Susan's concern using this background. Those are below. My misunderstanding: pre-computations versus viewsI've been saying that
But of course the naïve version of this is to just always do housekeeping, no-op or not. Now I see that, living purely in relational semantics-land, this is a non-issue. A high-performance implementation of the relational model may very well pre-compute Susan's concern (I think)I think that Susan's concern falls into your "Is it still possible to do bad things" category. Susan still needs to opine, of course, but IIRC she was afraid of something like:
We held the user's hand in creating |
Beta Was this translation helpful? Give feedback.
-
Capturing one important outcome from today's synchronous discussion: I think the relational semantics is super important for defining the invariants that define a "valid graph." There are two reasons: (1) Relational semantics eliminate one category of possible weirdness, i.e., disagreements that stem from different "views" disagreeing on what data exists in a graph. (2) Thinking of the graph as a relation makes it easy to define the invariants that make a graph valid, i.e., that describe any remaining weirdnesses that are not ruled out by relational semantics. (It does not, however, by itself tell us how to enforce those invariants.) Here are all the well-formed-graph invariants I can think of:
Defining these invariants is a separate question from how to enforce them, as discussed above in #75 (reply in thread). |
Beta Was this translation helpful? Give feedback.
-
This is a philosophy-level about the design of our DSL. For simplicity of exposition, I'm wiring it in a style that's like "here's how it is" rather than "here's a humble proposal for how it might be." But I really do mean it as a way to set up a debate, start discussion, etc., and I don't mean it to be set in stone at all. (I think some of this may be painfully obvious and some of it may be more controversial, and I would love to find where that line is.)
What is the goal of the Pollen DSL?
We want a DSL that has abstract semantics and gets good performance. Specifically, the kind of abstract semantics we're imagining can be called "relational semantics," wherein every program appears to be directly manipulating the relations defined elsewhere in
relational.md
. The advantages of relational semantics should be:The problem that makes this research is that an "obvious" implementation of relational programs would be slow. That "obvious" implementation would, basically, be an interpreter that is implemented using only one, extremely basic kind of data structure: the big sets that make up the fundamental relational representation. The thing that makes this DSL design (and compiler design) problem both interesting and hard is that we want to exploit high-performance data structures without breaking the relational abstraction.
What does "relational semantics" really mean?
(For simplicity, this section omits "links"/"edges" altogether. But I think the principles generalize.)
From the programmer's perspective, the way programs work is indistinguishable from a situation where we implement Pollen with the dumbest, slowest possible data structures. That is, it should look to the user exactly like the only data structures that exist to represent a given GFA-style variation graph are:
In other words, the interesting stuff happens in one, big relation$R \subseteq \text{Path} \times \text{Index} \times \text{Direction} \times \text{Segment}$ .
Now, our DSL will of course not want to make people interact directly with this big relation$R$ all the time! It would be really inconvenient to write programs that have to iterate over the huge set $R$ just to get, for example, the sequence of steps that make up a given path or the set of paths that "cross" a given segment. So, we will have shortcuts—but these shortcuts are all semantically defined in terms of how they read or write the sets above. For example:
for step in path.steps {...}
. Each valuestep
in the body of that loop should look like a record with fields likepath
,index
, andhandle
. However, the idea is that—from a semantic point of view—the programmer shouldn't need to think of this as iterating over a special per-path step set data structure that actually, literally exists. Instead, iterating overpath.steps
should—again, semantically speaking—work as if it looked in the big relationpath
. A naive implementation would literally do this: it would have a big set written down calledR
, and to implement this loop, it would do something like the Python expressionsorted([Step(path, index, Handle(dir, seg) for (path_, index, dir, seg) in R where path_ == path], key=lambda s: s.index)
. That is, it would slowly walk over the entire setR
to pull out all the data it needs.seg.steps
. Again, in terms of semantics, the program should look as if it is computing this set on the fly by looking at the one, true canonical setIt's worth noting that, in proper odgi itself, the set
seg.steps
is not an illusion: odgi actually pre-computes a set of steps associated with every segment (which it calls a "node"). This is good for performance! But our goal here, in defining the relational semantics, is to divorce ourselves from that kind of implementation concern. We would really like it if high-performance implementations of the Pollen DSL would also actually reify thisseg.steps
set for quick access---but, according to our goals set out in the previous answer, we also want this reification to be invisible to the programmer. That is, looking only at the outputs of programs and not at performance, a programmer should not be able to tell apart different implementations of Pollen---for instance, the naive one that usesR
and one that literally uses odgi's data structures. No matter how clever a programmer might get with writing tricky programs, we would like to guarantee that any program produces the same answer in either implementation.How can you tell that we have relational semantics?
A good way to understand programming language semantics is through litmus tests: little programs with two plausible outputs, and whose output tells you which of two semantic worlds we're living in. So one thing we can do is write little litmus-test programs and say "under relational semantics, the output would be X; under some not-quite-right semantics, the output would be Y."
To establish a baseline, let's write a "no-op" Pollen program that re-emits an input graph unmodified. (This involves imagining a little bit about our syntax and features and stuff; please forgive any mismatches with current proposals or the current parser implementation.) Here's a no-op:
From here on out, I'm going to omit the
graph
andparset
declarations at the top for brevity. Here's a different program with identical (no-op) semantics:That is, walking over the steps from the "point of view" of individual segments should yield the same overall set of steps as iterating over the monolithic, aggregate set
graph.steps
. After all, semantically speaking, eachseg.steps
is not actually a separate little set; it is just a convenient way to pull out a little subset of the biggraph.steps
. In the same sense, this program is also (equivalently) a no-op:One last version of the program should also be a no-op, given our decision that
emit
is deduplicating:That is, we're redundantly emitting every single step twice: once from the segment's point of view, and once from the path's point of view. But the steps are always the same, so the output set is just identical to
graph.steps
.Here is an extremely contrived program that acts as a litmus test:
This program "modifies" the steps, in the first loop, to always cross the first segment. The relevant question about this program is: after it finishes, if you were to look at
path.steps
for everypath
, would you see those changes (i.e., would you see every step crossing the same segment)? Or would you not---meaning that we would have "inconsistent" views of all the steps depending on whether we looked at them usingpath.steps
for every path orseg.steps
for every segment?Under relational semantics, we would see the changes regardless of where we view them from. Semantically, there is only one big step relation; it is not possible for
path.steps
andseg.steps
to "disagree." You can imagine a different semantics---one in whichseg.steps
is actually a different set of data from what you see fromseg.steps
---under which this would not be true, i.e., "changing" the steps from the point of view of segments does not affect them from the point of view of paths, so they can become inconsistent.In other words, this kind of inconsistency, where the data looks different from different viewpoints, is impossible under relational semantics.
Is it still possible to do bad things, even within relational semantics?
Yes! While relational semantics prevents the aforementioned form of inconsistency, this kind of inconsistency is not the only bad thing that programs can do.
A big category of bad thing that programs can still do is to refer to objects that don't exist. For example, how about this program:
That
gaph.paths[:5]
is imaginary syntax that just iterates over the first 5 paths, arbitrarily chosen. So this program is like a no-op, except that it omits most of the paths. The result is that we will emit a bunch of steps that refer to paths that don't exist. This is bad, but it's not prevented by relational semantics.Beta Was this translation helpful? Give feedback.
All reactions