Safe memoization during parse graph flattening #2136

Merged 18 commits into feat/error-recovery on Feb 11, 2025

Conversation

@PieterOlivier (Contributor) commented Jan 30, 2025

This PR improves the memoization strategy used during parse graph flattening. The original strategy proved both incorrect and too slow in the context of the highly ambiguous and cyclic parse forests produced by error recovery.

Special thanks to @arnoldlankamp who came up with the following three heuristics specifying when nodes can be safely cached for memoization:

  1. Any non-zero length node that is part of a link with a non-zero length prefix.
  2. Any node that is part of a prefix for which the above holds true.
  3. Any nullable node that is part of a link for which all prefixes adhere to rule one at some point.

This PR implements a new caching strategy based on these heuristics.
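
To make these heuristics concrete, here is a minimal sketch of rule one as a predicate. The Node and Link shapes below are hypothetical and for illustration only; the PR's actual implementation (in Link.java, per the coverage report) differs in detail.

import java.util.List;

// Sketch only, not the PR's real classes: a node knows the length of its
// yield, and a link knows its node plus the alternative prefixes leading
// into that node.
interface Node { int length(); }

final class Link {
    final Node node;
    final List<Link> prefixes;

    Link(Node node, List<Link> prefixes) {
        this.node = node;
        this.prefixes = prefixes;
    }

    // Heuristic 1: a non-zero length node that is part of a link with at
    // least one non-zero length prefix is safe to cache.
    boolean safeByRuleOne() {
        return node.length() > 0
            && prefixes.stream().anyMatch(p -> p.node.length() > 0);
    }
}

Heuristics two and three then propagate this property: every node inside a prefix of such a link becomes cacheable as well, and a nullable node becomes cacheable once every prefix chain of its link eventually satisfies rule one.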

We have gone to great lengths to test the correctness and performance of the code in this PR.
We already had a test suite that tested error recovery on every character in every Rascal source file in the rascal repo using
two tests: delete the single character, and delete all characters until the end of the line.

We modified these tests to:

  • First try to do the error recovery parse without memoization during flattening
  • If this succeeded within two seconds, do the same with memoization
  • Compare the results

In all cases where the parse without memoization succeeded, the parse with memoization resulted in
exactly the same tree as the tree built without memoization. This is in contrast with the old memoization approach, where in the same test the result with memoization was often different from the result without memoization.
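
A sketch of this differential check (with a hypothetical RecoveryParser interface; the actual test suite is structured differently):

import java.time.Duration;
import java.util.Optional;

// Hypothetical parser interface for the sketch. Returning empty means
// parsing did not finish within the given time budget.
interface RecoveryParser {
    Optional<Object> parse(String input, boolean memoize, Duration budget);
}

final class MemoizationCheck {
    static void check(RecoveryParser parser, String damagedSource) {
        // First parse without memoization during flattening.
        Optional<Object> plain = parser.parse(damagedSource, false, Duration.ofSeconds(2));
        // Only if that succeeded within two seconds, parse again with
        // memoization and compare the resulting trees.
        plain.ifPresent(tree -> {
            Object memoized = parser.parse(damagedSource, true, Duration.ofSeconds(2))
                .orElseThrow(() -> new AssertionError("memoized parse did not finish"));
            if (!tree.equals(memoized)) {
                throw new AssertionError("memoization changed the parse result");
            }
        });
    }
}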

We have also set up a performance benchmark to compare the speed of "normal" parsing (no error recovery and no ambiguities)
of all Rascal source files in the rascal repo with the speed before this PR. We could not find any speed difference between
the two versions, so we are confident this PR does not degrade performance of "normal" parses.


codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 47.84314% with 133 lines in your changes missing coverage. Please review.

Project coverage is 49%. Comparing base (8c405fc) to head (d647f70).
Report is 273 commits behind head on feat/error-recovery.

Files with missing lines Patch % Lines
src/org/rascalmpl/library/util/ErrorRecovery.java 0% 66 Missing ⚠️
...c/org/rascalmpl/parser/gtd/result/struct/Link.java 71% 11 Missing and 9 partials ⚠️
...rg/rascalmpl/parser/util/ParseStateVisualizer.java 0% 15 Missing ⚠️
src/org/rascalmpl/parser/util/DebugUtil.java 0% 14 Missing ⚠️
src/org/rascalmpl/util/visualize/dot/DotGraph.java 0% 9 Missing ⚠️
...pl/parser/gtd/result/out/DefaultNodeFlattener.java 90% 2 Missing and 1 partial ⚠️
...ascalmpl/parser/gtd/result/out/INodeFlattener.java 50% 1 Missing and 1 partial ⚠️
...ser/gtd/result/out/ListContainerNodeFlattener.java 91% 2 Missing ⚠️
...pl/parser/gtd/result/out/SkippedNodeFlattener.java 0% 0 Missing and 1 partial ⚠️
...ser/gtd/result/out/SortContainerNodeFlattener.java 92% 1 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                   @@
##             feat/error-recovery   #2136    +/-   ##
======================================================
  Coverage                     49%     49%            
+ Complexity                  6619    6572    -47     
======================================================
  Files                        687     696     +9     
  Lines                      61218   61043   -175     
  Branches                    8874    8910    +36     
======================================================
+ Hits                       30369   30383    +14     
+ Misses                     28620   28410   -210     
- Partials                    2229    2250    +21     


@DavyLandman (Member)

Hi @arnoldlankamp, your comments on #2100 really helped out. Do you have some time to take a look at this? Thanks.

ps. @PieterOlivier I think this means you should close #2100?

@PieterOlivier (Contributor Author) commented Jan 31, 2025

> ps. @PieterOlivier I think this means you should close #2100?

Definitely, I have done so now.

@arnoldlankamp (Contributor)

> Hi @arnoldlankamp, your comments on #2100 really helped out. Do you have some time to take a look at this? Thanks.

I'll have a look at it as soon as I'm able.

@PieterOlivier (Contributor Author)

We will merge this PR shortly so we can continue work on error recovery. @arnoldlankamp, your feedback on this PR is still very welcome! If any changes are needed, we will implement them in a separate PR.

@PieterOlivier merged commit 54d50fb into feat/error-recovery on Feb 11, 2025
6 of 7 checks passed
@PieterOlivier mentioned this pull request on Feb 11, 2025
@arnoldlankamp (Contributor) commented Feb 16, 2025

Sorry for responding a bit slowly to this.

I had a look at the PR and haven't seen anything that would warrant any changes (yet 😉 ).

There are, however, some things I would have implemented differently. Maybe I'll add those as comments to this PR at a later date, if only for the purpose of sharing insight.

I still need to find some time to have a more thorough look at certain things to see if I can think of some edge cases that we may still need to cover.

What I'm thinking of is things like interactions with nesting restrictions (which are used to model priorities and associativity), where sorts potentially have a different subset of results associated with them for the same input location, depending on their parent sort. (Looking at the code, I don't really expect to find any issues related to this.)
The action executor functionality might also be impacted in some way, since it is context dependent (although I don't think anyone has ever used this feature anyway 🤷‍♂️ ). (It seems to be handled, but I'd like to double check.)

I'll let you know if I find anything.

@arnoldlankamp (Contributor) commented Feb 16, 2025

Oh and congratulations on successfully fixing this issue. I think you're the first person to make a non-trivial change to the parser's codebase.

I know the parser-related code can be quite challenging to comprehend, because of some choices I made back when I initially wrote it.

Honestly, I never expected it to still be in use after so many years.

If I had the time, I'd write a new version with cleaner code, a simpler grammar input mechanism, support for dynamic grammar rules (to provide users a clean and generic way to model things like the off-side rule, typedefs, and other context-sensitive language features, without having to resort to semantic actions, post-processing, or language-feature-specific parser 'addons'), and a bunch of other improvements.
Ah, too many ideas, too little time ... 😞 .

@PieterOlivier (Contributor Author)

@arnoldlankamp thanks for the praise and your time! And although this PR (and indeed the whole error recovery PR) is now merged in main, I share your sentiment that this will not be the last work on this subject. We are definitely interested in anything you want to share in the future!


if (cacheMode == CacheMode.CACHE_MODE_FULL) {
node.setTree(result);
} else if (cacheMode == CacheMode.CACHE_MODE_SHARING_ONLY) {
@arnoldlankamp (Contributor) commented on this diff:

This is a good way to maximize the sharing of results. Originally I would just ignore the sharing of results that were potentially modified by an action executor.

Note however that the action executor functionality is, as far as I'm aware, unused. It was originally intended as a hook for the potential future addition of semantic actions; to support non-context-free language constructs, for example.

Since this never seems to have happened, one could consider removing this (essentially dead) functionality (in a separate PR), as this would simplify the flattener code and maybe improve performance slightly. Or you could leave it for future use. I'm just sharing info here 😄 .


private boolean checkCacheable() {
int type = node.getTypeIdentifier();
if (type == CharNode.ID || type == LiteralNode.ID || type == SkippedNode.ID || type == EpsilonNode.ID) {
@arnoldlankamp (Contributor) commented on this diff:

I'm assuming type == EpsilonNode.ID is OK here, since epsilons should only occur as terminal leaf nodes and never as part of a multi-symbol production.
Although I have to mention that if you feed the parser a production of the form S ::= AεBC, it will just interpret the ε as empty and move on. Similarly for empty literals. They should never be encountered, but they will parse just fine if a user erroneously ends up feeding them to the parser.

So essentially there is nothing wrong with this condition and it doesn't need to be changed 😄 .

It just triggered me to share the 'user error' case, since it could hypothetically occur. But I don't think Rascal would ever generate grammars like this, so it's not an issue.

// Parse forest memoization during flattening
// Note that the type should technically be a type parameter, but this
// would entail more code changes than we are willing to make at this point.
private IConstructor tree;
@arnoldlankamp (Contributor) commented on this diff:

Disclaimer: The following comment is not a request for a change, but merely intended for the purpose of sharing insight. Maybe it's useful, maybe it's not, so feel free to ignore it 😉 .

Originally, my initial instinct was also to share the trees via a field in this class. It seems the easiest and most logical way to implement this, involving the least amount of code.

But ultimately I opted for a different solution, since conceptually this doesn't actually seem to be the right place to store the cached trees.
Interestingly enough, in a way, the code comment above echoes the point I'm going to try to make. So I might just end up putting words to the "it can't be right that I need to change all this stuff for this one little thing" feeling I seem to get from this comment 😁 .

I tend to think in phases. Each phase takes input data, applies a transformation, and produces output data. During each phase, modifications can be made to the output data structures, but at the end of each phase the produced result is final and immutable (to both provide opportunities for improving performance during a transformation and ultimately guarantee data safety).

In this case we can distinguish the following phases:

  1. Grammar + Input String -> Parser -> Binarized Parse Forest
  2. Binarized Parse Forest -> Flattener -> Flattened Parse Forest

Since the Flattener produces the Flattened Parse Forest, I would also consider the Flattener to be responsible for caching the intermediate results it produces.

By modifying the Binarized Parse Forest to store references to parts of the output of the Flattener phase, we break both phase encapsulation and the data immutability constraint.
The consequences of this are:

  • Modifications need to be made to an upstream phase and its output data to facilitate functionality of a downstream phase, while each phase's concerns should remain separate and data should only flow downstream.
  • It may not be safe to use the Binarized Parse Forest more than once as input for the Flattener. Since modifications are made to the input data, results might differ on subsequent calls.

It might not really be an issue in this specific case, but in general it's a pattern I would avoid, since it provides a potential source for bugs.
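
For illustration only (invented generic types, not a change request): a phase-respecting variant would keep the cache inside the flattener itself, so the binarized forest stays untouched and the cache dies with the phase.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a flattener-owned memoization cache (hypothetical types).
final class FlattenerCache<N, T> {
    private final Map<N, T> cache = new HashMap<>();

    // Explicit get/put rather than computeIfAbsent, because convert may
    // recursively flatten child nodes through this same cache.
    T flatten(N node, Function<N, T> convert) {
        T cached = cache.get(node);
        if (cached != null) {
            return cached;
        }
        T result = convert.apply(node);
        cache.put(node, result);
        return result;
    }
}

The cost is a map lookup per node, which is exactly the overhead the node-field approach eliminates.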

A counter-example, where adding "cache fields" to objects is OK, is lazy initialization (e.g. calculating an object's hash code when it is requested for the first time), since there the object clearly has ownership over the data.
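
For example (a generic illustration of that pattern, not code from this PR):

// Lazily computed hash code: the object owns the cached value, so a
// "cache field" is unproblematic here.
final class Symbol {
    private final String name;
    private int hash; // 0 means "not computed yet"

    Symbol(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Symbol s && s.name.equals(name);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = name.hashCode() | 1; // force non-zero so we compute only once
            hash = h;
        }
        return h;
    }
}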

For everything else I would go the more functional programming inspired route of immutable data -> transform -> immutable data and adjust the size of the phase depending on performance constraints (since going "immutable everything" can tank performance in certain cases 😉 ).

In fact the parser went the "one phase to rule them all" route (apart from flattening) for performance reasons, but paid a heavy price in ease of comprehension (while the underlying algorithm is actually rather trivial; similar to Earley, but with hidden-right-nullable support and a binarized parse forest like GLL).

Again, nothing needs to be changed. As it stands, the implementation is both correct and performant.
Your code comment just compelled me to share my perspective, that's all; once I start typing it's not easy for me to stop 😄 .

@arnoldlankamp (Contributor)

So I finally found some time to have a proper look at the code. Sorry for taking so long.

All the edge cases I wanted to check are covered. Everything looks to be correct 👍 .

I did, however, add a couple of long-winded comments to the code, just for the sake of sharing knowledge.

@arnoldlankamp (Contributor)

Oh and one more thing (sorry for the hijack btw). Our recent discussions triggered my brain into thinking about improvements and additions (my brain quite easily gets triggered into "solving puzzles" and doesn't want to stop, so thanks for that 🙄 ).

At first I was thinking in terms of code simplifications and adding new useful features. Then I considered how to work around one of the bigger performance limitations (the JVM's memory management and GC don't play all that well with modifying large interlinked graph-like data structures, i.e. building parse forests). Strangely enough, via this avenue, a whole new parser algorithm recently popped into my head.
Unfortunately I can't put it into code any time soon, which I really need to do to verify that what's in my head is correct. Maybe I'll try to fiddle some things on a piece of paper when I have some time, to check some things.

If my idea checks out you'll get this:

  • O(N^2) worst-case space requirements until parse completion, and O(N^3) after binarized tree construction.
  • Extremely low memory usage, near-zero garbage generation, and an implementation with high mechanical sympathy.
  • No parse forest construction until parse completion, so no tree nodes are built for parse traces that die off mid-parse.
  • O(N^3) worst-case performance (obviously, but with a way lower cost per N than even Rascal's current parser).
  • A binarized parse forest (O(N^3)) that is tail-shared instead of head-shared, making it a bit easier to work with and traverse. You may even get away with using projections instead of flattening the entire forest pre-emptively. This way flattening would realistically only incur a cost for ambiguous nodes (just flatten and cache them in place on first access), which would be a massive performance improvement.
  • Compatible with in-flight filtering (e.g. nesting restrictions and match filters like Rascal's current parser, and more, like dynamic data-based filtering).
  • The first phase is driven top-down, so parse errors should be easy to trace.

I'll refrain from going too much into detail, but basically the idea is to unroll all parse traces into a sparse bit-matrix-like data structure (think of it as a kind of journaled GSS for the entire input string) and follow all completed traces back through this data structure to build a binarized tree. This way you only need to store from and to for each recognized symbol on the way forward (and leave parent-child references implicit, by location). On the way back you can merge the recognized symbols with knowledge derived from the grammar to figure out what is a child of what and build the tree (i.e. if a production's location matches a potential parent's recognized symbol location, it's a child).
The first phase is pretty straightforward; the main challenge here is to store the parse traces as efficiently as possible (I have this pretty much figured out already). The actual logic is mainly in the second phase, which is where I still need to check some things to verify that this will work for all cases (especially in combination with in-flight filtering) and won't end up hallucinating extra results, for example.
But overall I'm feeling pretty optimistic about it.
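
A very rough sketch of what the forward phase's journal could look like (my reading of the idea above; the representation is invented, since the real design only exists on paper so far):

import java.util.Arrays;

// Flat struct-of-arrays journal (hypothetical): each recognized symbol is
// recorded as (symbolId, from, to); parent-child relations stay implicit
// and are reconstructed from the grammar when walking the journal backwards.
// Flat int arrays keep allocation and garbage generation near zero.
final class TraceJournal {
    private int[] symbols = new int[1024];
    private int[] froms = new int[1024];
    private int[] tos = new int[1024];
    private int size = 0;

    void record(int symbolId, int from, int to) {
        if (size == symbols.length) {
            symbols = Arrays.copyOf(symbols, size * 2);
            froms = Arrays.copyOf(froms, size * 2);
            tos = Arrays.copyOf(tos, size * 2);
        }
        symbols[size] = symbolId;
        froms[size] = from;
        tos[size] = to;
        size++;
    }
}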

Having said that, it'll probably take a very long while before I get around to doing anything with this idea, other than a bit of doodling on a piece of paper, to see if it could work.
In any case I just wanted to share the direction in which I was thinking. Maybe it'll serve as inspiration for someone else 😉 .

@PieterOlivier (Contributor Author)

I totally share your sentiment about caching the memoization result in the AbstractNode. I just could not resist the easy opportunity to eliminate all map operations during memoization, even though it clearly breaks the phase abstraction. What tipped me over was the fact that SGTDBF instances are explicitly protected from being reused, so there is no risk of reusing or re-flattening the binarized parse forest.
I will add an explicit comment to the code for innocent successors to warn them about this issue.

@PieterOlivier (Contributor Author)

Your ideas about a better implementation definitely sound interesting. I do not think we currently have the manpower to do anything with them. If you find the time to write your ideas down more thoroughly, ideally in the form of a scientific paper, that would be awesome! But I understand that doing so in your spare time would be a huge strain; I know I would not be able to manage it.
Maybe @jurgenvinju can find someone to pick up on this, as the potential gains in terms of raw parse speed and memory usage are clear.

@jurgenvinju (Member)

If any of us would spend time and energy on the algorithm side, we really need the kind of data dependence that is proposed by Ali and Anastasia in their theses on Iguana.

With data-dependent parsers we can elegantly solve indentation-sensitive grammars, symbol tables, and other scanner and parser "hacks" in real programming languages. Ali has made a version that integrates with Rascal. However, if we add it to Arnold's parser it may be less work to get it fully up and running?

The data comes top-down from parameters of nonterminals. You could say that E[i=0] is simply a different nonterminal from E[i=1]. And then there is some built-in data, like the left and right extent of already recognized symbols and their character yield. The extra constraints are any boolean predicate over the above data.

High-level disambiguation constructs like priorities and the offside rule or alignment rules can be "compiled" down to basic data-dependent predicates. See Ali's and Anastasia's theses.

Very very interesting 🤓

@arnoldlankamp (Contributor)

> I totally share your sentiment about caching the memoization result in the AbstractNode. I just could not resist the easy opportunity to eliminate all map operations during memoization, even though it clearly breaks the phase abstraction. What tipped me over was the fact that SGTDBF instances are explicitly protected from being reused, so there is no risk of reusing or re-flattening the binarized parse forest. I will add an explicit comment to the code for innocent successors to warn them about this issue.

I agree completely. And indeed, doing away with the map operations has its advantages, both in terms of performance and simplicity.

@arnoldlankamp (Contributor) commented Mar 9, 2025

> If any of us would spend time and energy on the algorithm side, we really need the kind of data dependence that is proposed by Ali and Anastasia in their theses on Iguana.

This is exactly what I meant by "support for dynamic data-based filtering". This is actually rather trivial to add. Just like you mentioned, you can include an environment in the GSS nodes and make it part of their identity.

I'm not 100% clear on the details of what Ali and Anastasia came up with. I know you once sent me one of their papers, but I have a hard time interpreting scientific literature. To me it's all weird notations and Greek symbols 🤷‍♂️ .
I think at least part of it is pretty much in line with what I came up with, but the handling of the offside rule looked like a kind of parser built-in, with the align thingy; I went a different route with that.

Personally, I have a strong preference for generic solutions, as opposed to adding support for special cases directly into the parser. Similar to the nesting restriction thing I came up with for supporting priority and associativity rules (which, in a way, is basically in-flight grammar factoring without actually modifying the grammar).

Anyway, my plan was to add support for two types of variables:

  1. A (copy-on-write) scoped global dictionary, which holds one-to-many key-value pairs. You can use this for all declare-before-use constructs, like typedefs. You could even get rid of the declare-before-use constraint by using placeholder patterns for variables that can potentially be declared downstream, and running a post-parse filtering pass to backfill, validate, and prune the parse forest.
  2. Local variables with shadowing. These can be used for recursive language structures, like the offside rule, where each time you descend you need to take more whitespace into account for the level you descend into.

When these variables should be filled by the parser can be indicated by annotating parts of a production with a variable name.

So all that we are left with is for the user of the parser to come up with an extension to the grammar formalism to support these two types of variables.
I was imagining something like:

  • Global appending: Typedef ::= "typedef" Type types += TypeAlias
  • Usage: Assignment ::= <types> VariableName "=" Expr
  • Recursive usage (literal): Expr ::= indent = (<indent="\t"> "\t" ) Expr
  • Recursive usage (patterned): Expr ::= indent := (<indent:=[\t ]*> [\t ]*) Expr

For recursive rules we use shadowing (and assign a default value). When we descend into the Expr on the rhs, we use the updated value for indent. Any other alternatives on the same level will naturally keep using the current value of indent.

Naturally you could also come up with a shorthand for common constructs (e.g.: Expr ::= offside Expr or something) and have the parser generator take care of the implementation details.

There is one case that is a bit more involved to get right, and that's the case where mixing tabs and spaces is acceptable. We can use pattern assignment here, instead of literal assignment. Basically we assign a clone of the actual matched grammar symbol(s) to the variable, including a length restriction. This length restriction is something we implicitly require: normally we assign variables based on matched input, and matched input has a definitive length. By maintaining the length constraint that matching literally would normally impose, but accommodating the matching of patterns, we widen the range of acceptable inputs within these bounds. Whitespace is a prime example of why this is a useful thing to support, and this should allow us to handle the offside rule in a generic way, without resorting to special-cased parser support or requiring extensions to the grammar formalism that only serve a single purpose. You essentially get a choice between exact matches and constrained matches.
One other thing to note is that, since the matching of patterns stored in variables is done by the parser itself and any restrictions imposed on this process are artificial, there is actually nothing stopping you from going full parser inception to allow for maximum creativity (but please don't 😅 ).

So in short, all your data dependence wishes can be granted 😁 .

And yes, this could also be added to Rascal's current parser implementation. Conceptually it's not that hard. Just carry along an environment in the nodes, take it into account during stack expansion by making it part of the node's equality check and add some code to deal with variable usage.
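
A minimal sketch of that idea (hypothetical classes, not Rascal's actual stack node implementation): an immutable environment is carried by each node and folded into its identity, so E with i=0 and E with i=1 become distinct nodes during stack expansion.

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Sketch only: a stack node whose identity includes its (immutable) environment.
final class StackNode {
    final int nonterminalId;
    final int inputOffset;
    final Map<String, String> env; // copied on write, never mutated in place

    StackNode(int nonterminalId, int inputOffset, Map<String, String> env) {
        this.nonterminalId = nonterminalId;
        this.inputOffset = inputOffset;
        this.env = Map.copyOf(env);
    }

    // Binding a variable yields a new node identity (this gives shadowing).
    StackNode bind(String name, String value) {
        Map<String, String> extended = new HashMap<>(env);
        extended.put(name, value);
        return new StackNode(nonterminalId, inputOffset, extended);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StackNode n
            && n.nonterminalId == nonterminalId
            && n.inputOffset == inputOffset
            && n.env.equals(env);
    }

    @Override
    public int hashCode() {
        return Objects.hash(nonterminalId, inputOffset, env);
    }
}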
However, I am a bit apprehensive about actually implementing this at the moment. As you're aware, the parser codebase is not that easy to comprehend, since all features and optimizations interact with each other in interesting ways. If I were to do this I would need to spend a couple of consecutive days full-time on it. It's not that hard, but it's not a project that's easily done in small iterations. Unfortunately this is not something I can afford to do right now 😞 .

@arnoldlankamp (Contributor)

> Ali has made a version that integrates with Rascal.

If this is the case, why don't you use Ali's parser instead? It sounds like the work has already been done.

@arnoldlankamp (Contributor)

> If you find the time to write your ideas down more thoroughly, ideally in the form of a scientific paper, that would be awesome!

That would indeed be a nice thing to do. Maybe some day, if time allows and my idea checks out 😉 .

@jurgenvinju (Member)

We have experimental status on that, and at the same time a very well tested, used, and supported version of the current parser. So that's why I'm exploring all angles. I'm not sure which way to go yet.

If we'd go data-dependent with your parser, then all filtering could be done on the level of data-dependent context-free grammars, so the specifics for priority and associativity and even "offside" would be removed from the parser runtime code. In their place there would be grammar-to-grammar transformations that add the filtering semantics by adding parameters and constraints based on the high-level disambiguation constructs. This we can learn from Nastya and Ali, but I want it as pure Rascal code. And of course there would be the environments for variables (immutable) and the generic constraints (I'm thinking lambdas that return a boolean).

@jurgenvinju (Member) commented Mar 10, 2025

Of course we want parser efficiency in the context of data dependence: linear parsers should stay linear, and context-free grammars should remain worst-case cubic. In general, DDCFG parsing is Turing-complete, but I think we can write down some clear restrictions that limit the number of unique variable assignments to cubic.

@arnoldlankamp (Contributor)

I think I understand your reasoning here. Simply put, if you take the option to keep using the current parser, you'd prefer to extract as much functionality as possible to the Rascal layer, to make it easier to modify, maintain, and experiment with. This is kind of the opposite direction from what I was thinking in, but it seems like a sensible thing to do if you choose to go this route.

On the other hand, it does seem like a lot of work as opposed to flipping the switch and using Ali's parser. If it compiles and all tests are green, just ship it 😉 .

@DavyLandman deleted the recovery/safe-memoization branch on March 10, 2025