Incremental parsing of visibly pushdown languages? #36

modulovalue · 2023-11-14T14:45:57Z

I'm wondering, do you have any insights into incremental parsing where VPLs could offer some kind of advantage over, e.g., parsing context free languages using, e.g., LR/LL parsing?

Or, specifically, how would you go about making owl work in an incremental fashion, that is, if owl has parsed some source once, and only small parts of it have changed, subsequent parses should not have to reparse everything from scratch.

I'd appreciate any thoughts you might have on this.

ianh · 2023-11-14T20:04:13Z

I'm not really familiar with incremental parsing techniques for context free languages, but one issue with owl's style of parsing is that nondeterminism can be resolved arbitrarily far in the future. For example, in this grammar:

dots = even | odd
odd = '.' ('.' '.')*
even = ('.' '.')+

whether a particular dot is part of the "even" rule or "odd" rule depends on the total number of dots in the sequence.

Technically, the grammar is compiled into a deterministic automaton, but in order to make a parse tree, the path through the original nondeterministic automaton has to be recovered -- this is done in a backward pass over the recorded sequence of state transitions. (The parse tree construction actions are encoded as epsilon transitions of the nondeterministic automaton.)

So actually, now that I'm thinking about it, I believe if you record the full sequence of states/transitions/parse tree construction actions for both for the forward and backward pass, you could do something like this:

Start from the deterministic state recorded for the edit location, following the deterministic automaton until you reach a "sync point" where both the state and the remaining tokens are the same as those in the recorded sequence.
At this point, you know the forward and backward passes will proceed as before, starting from the sync point in the forward pass and returning to the sync point in the backward pass. So you can skip all that and just start the backward pass from the sync point, starting from the stored nondeterministic state at that point in the sequence.
Continue the backward pass until you reach another "sync point" where the nondeterministic state and remaining tokens are the same as in the recorded sequence.

The parse tree construction actions associated with the nondeterministic automaton transitions between these two sync points are the "new actions", replacing the "old actions" that were in the sequence between the two points before the edit. The parse tree construction actions before the first sync point and after the last sync point are unchanged. So the last thing to do is to incrementally update the tree itself based on the change to the sequence of construction actions. I'm not sure how to do this, but I'm sure someone has written about it somewhere.

For single-character changes in the "dots" example, the sync points would always be the start and end of the input (so incrementality wouldn't really gain you anything), but for more realistic examples, hopefully the sync points would bracket the changes more precisely.

As for getting owl itself to work incrementally, I'm not sure how hard that would be. At the very least, you'd need more API to handle the incremental changes. I could look into it more closely if you have a specific need for it.

modulovalue · 2023-11-15T19:56:53Z

You said:

Technically, the grammar is compiled into a deterministic automaton, but in order to make a parse tree, the path through the original nondeterministic automaton has to be recovered -- this is done in a backward pass over the recorded sequence of state transitions. (The parse tree construction actions are encoded as epsilon transitions of the nondeterministic automaton.)

This comment is unrelated to incremental parsing, but I'm wondering:

If I understood you correctly, you are doing a single forward pass during parsing to collect a sequence of transitions, and a backward pass to reconstruct the parse tree events from those transitions in reverse.

Wouldn't it be possible to skip the "backward pass over the recorded sequence of state transitions" by storing a set of actions for each state transition of the deterministic automaton, and to emit them during the forward pass? That is, to merge both passes into one?

ianh · 2023-11-15T20:46:51Z

It's not possible -- take the dots grammar from before, for example. You can't tell whether to emit a "begin odd" or "begin even" action after only the first ..

BTW, if you want to read more, this backward-pass technique for reconstructing the nondeterministic transitions from the deterministic ones comes from Danny Dubé's work on the SILex lexer generator. It's described more thoroughly in these two papers: http://www.iro.umontreal.ca/~feeley/papers/DubeFeeleyACTAINFORMATICA00.pdf http://www.schemeworkshop.org/2006/14-dube.pdf (these papers are where I learned about it). You run into the same issue parsing regular expressions (which makes sense, since dots is a regular grammar).

modulovalue mentioned this issue Nov 15, 2023

Deforested tree construction possible? #37

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incremental parsing of visibly pushdown languages? #36

Incremental parsing of visibly pushdown languages? #36

modulovalue commented Nov 14, 2023

ianh commented Nov 14, 2023

modulovalue commented Nov 15, 2023

ianh commented Nov 15, 2023 •

edited

Loading

Incremental parsing of visibly pushdown languages? #36

Incremental parsing of visibly pushdown languages? #36

Comments

modulovalue commented Nov 14, 2023

ianh commented Nov 14, 2023

modulovalue commented Nov 15, 2023

ianh commented Nov 15, 2023 • edited Loading

ianh commented Nov 15, 2023 •

edited

Loading