Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dyno: fix query recursion bug #26459

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

DanilaFe
Copy link
Contributor

@DanilaFe DanilaFe commented Jan 3, 2025

This PR is not merge ready since it has a 300% performance hit on Dyno as things stand (debug and release). However, I can't think of a different ways to improve performance, and the "slow" parts appear fundamental to the issue.

This fixes a bug pointed out by @riftEmber in which we get a query error similar to the one fixed in #26061. Indeed, her intuition was right: this is another bug that has to do with isQueryRunning, but it's much more insidious, and much harder to fix performantly.

The Bug

The issue has to do, once again, with cycles of query dependencies guarded by isQueryRunning. Consider a cycle of three queries: A -> B -> C -> A. Each query runs an isQueryRunning before invoking its dependency to avoid cycles. As a result, we get:

A -> B -> C -> stop (A is running)

Importantly, in this case, the query A does not get added as a dependency to to C, since the isQueryRunning check doesn't add dependencies by itself, and a call to A is prevented. However, we do have edges A -> B and B -> C.

In a new generation, suppose we start the with B instead of calling A at the top level as before. Suppose additionally that another dependency of C has changed, necessitating its recomputation. Then, we get:

B -> C -> A -> stop (B is running)

At this point, A has been added as an edge to C, completing a dependency cycle in the graph. We will now get infinite recursion.

The Fix

What happened? At the surface, an issue is that we add dependency edges, but never remove them. So, as different query orderings occur, we get closer and closer to getting cycles.

A deeper issue is that we have hysteresis w.r.t. isQueryRunning. If it returns true during one query execution, we implicitly assume (when checking for recomputation) that it will always return true. Thus, it's possible to write queries that break the query system, since they don't react to the changing structure of the graph.

My solution was to track calls to isQueryRunning as edges. These are a new kind of edge that are not recursively traversed; rather, they use the same mechanism as isQueryRunning to check if the previous result has changed, and react accordingly. This resolves the problem, but adding these new edges and making isQueryRunning checks slows down dyno very significantly (testDomains goes from 6 seconds to 23 seconds). The delays are both due to the increased size of the dependencies vector (6 -> 8 seconds) and the checking itself (8 seconds -> 23 seconds).

I'm not sure how to proceed, since this is a clear correctness issue. Perhaps @mppf will have some advice?

Performance notes

My initial intuition was that the slowdown was caused due to invalidation of queries (i.e., we used to use a cached result, but it's not sound, so we no longer do). To test this hypothesis, I executed the following script which invokes testInteractive on two testDomains programs.

`domains.1.chpl`
module M {
  var d : domain(1);
  param rg = 1;
  type ig = int;
  type fullIndex = if rg == 1 then ig else rg*ig;

  param r = d.rank;
  type i = d.idxType;
  param s = d.strides;
  param rk = d.isRectangular();
  param ak = d.isAssociative();

  var p = d.pid;

  for loopI in d {
    var z = loopI;
  }

  proc generic(arg: domain) {
    type GT = arg.type;
    return 42;
  }

  proc concrete(arg: domain(1)) {
    type CT = arg.type;
    return 42;
  }

  var g_ret = generic(d);
  var c_ret = concrete(d);
}
`domains.2.chpl`
module M {
  var d : domain(1, strides=strideKind.one);
  param rg = 1;
  type ig = int;
  type fullIndex = if rg == 1 then ig else rg*ig;

  param r = d.rank;
  type i = d.idxType;
  param s = d.strides;
  param rk = d.isRectangular();
  param ak = d.isAssociative();

  var p = d.pid;

  for loopI in d {
    var z = loopI;
  }

  proc generic(arg: domain) {
    type GT = arg.type;
    return 42;
  }

  proc concrete(arg: domain(1, strides=strideKind.one)) {
    type CT = arg.type;
    return 42;
  }

  var g_ret = generic(d);
  var c_ret = concrete(d);
}

I made a change to testInteractive to persist query timing information between generations.

Diff to `testInteractive.cpp`
diff --git a/frontend/test/resolution/testInteractive.cpp b/frontend/test/resolution/testInteractive.cpp
index 7771893361a..eff8262b104 100644
--- a/frontend/test/resolution/testInteractive.cpp
+++ b/frontend/test/resolution/testInteractive.cpp
@@ -296,6 +296,8 @@ int main(int argc, char** argv) {
     return 0; // need this to return 0 for testing to be happy
   }
 
+  if (timing) ctx->beginQueryTimingTrace(timing);
+
   while (true) {
     ctx->advanceToNextRevision(gc);
 
@@ -304,7 +306,6 @@ int main(int argc, char** argv) {
     setupSearchPaths(ctx, enableStdLib, cmdLinePaths, files);
     typeForBuiltin(ctx, UniqueString::get(ctx, "int"));
     ctx->setDebugTraceFlag(trace);
-    if (timing) ctx->beginQueryTimingTrace(timing);
 
     CompilerFlags flags;
     flags.set(CompilerFlags::WARN_UNSTABLE, warnUnstable);
@@ -380,7 +381,6 @@ int main(int argc, char** argv) {
 
     printf("Ran %i queries to compute the above\n\n",
            ctx->numQueriesRunThisRevision());
-    if (timing) ctx->endQueryTimingTrace();
 
     if (gc) {
       ctx->collectGarbage();
@@ -408,5 +408,6 @@ int main(int argc, char** argv) {
     }
   }
 
+  if (timing) ctx->endQueryTimingTrace();
   return 0;
 }

I then executed the following script to gather timing data across generations:

Testing script
#!/bin/bash

# Create a pipe
pipe=/tmp/testpipe

if [[ ! -p $pipe ]]; then
    mkfifo $pipe
fi
echo "Pipe created"

exec 3<>$pipe
echo "Pipe opened"

$(find build -name "testInteractive") domains.main.chpl --std --time $1 < $pipe &
echo "y" >&3
sleep 10

cp domains.2.chpl domains.main.chpl
echo "y" >&3
sleep 5

cp domains.1.chpl domains.main.chpl
echo "y" >&3
sleep 5

echo "n" >&3
sleep 5

exec 3>&-
rm $pipe

I found that queries were run more often after the patch was applied, but not by a lot (28522 queries vs 25272 queries). However, the total time elapsed by each query was significantly larger (31476 vs 19109.8). This makes me think that my inital hypothesis was incorrect; rather, query execution simply got much slower.

In particular, my initial results suggest that the following queries got much slower: emitMultipleDefinedSymbolErrorsQuery and filterCandidatesInitialGatherRejectedQuery (a factor of at least 5x slower).

DanilaFe added a commit to DanilaFe/chapel that referenced this pull request Jan 3, 2025
This incurs a penalty up front, but prevents odd ordering properties
in the query system. It's a workaround for a query system bug
detailed in chapel-lang#26459.

Signed-off-by: Danila Fedorin <[email protected]>
@mppf
Copy link
Member

mppf commented Jan 6, 2025

I'm not sure how to proceed, since this is a clear correctness issue. Perhaps @mppf will have some advice?

Ideally, we should find a way to solve it that avoids the performance overhead here.

Here are a few ideas that you might be able to investigate

  • For queries that work with a cycle & isQueryRunning, we could arrange for them to always launch the query with the smallest element (by some definition). That would lead to A -> B -> C -> stop (A is running) even on the recomputation.
  • Related to that, but from an opposite direction. It looks like we only have 4 queries that use isQueryRunning today: 3 in scope resolution and 1 in resolveForwardingExprs. Maybe we can rewrite these in a different way. Conceptually, we might have a query that gathers what work needs to be done, detects and removes cycles in that work, and proceeds that way.
  • We might need the query framework to be aware of queries that can handle cycles. Today it isn't, really. The queries handle cycles by detecting the cycle and then handling the call that would loop it in a different way. Normally, we are visiting all of the things in a graph (potentially while creating that graph, at least at a conceptual level; as with type resolution), and we just don't re-visit the one that would cause a cycle. Potentially, we could have QUERY_BEGIN_CYCLE_AWARE or something, and the query system itself could do this instead of the query (and as a result, the query would not need isQueryRunning). The open question here is just whether or not that would help matters (i.e. would it just move the bug?)
  • Re the deeper issue with isQueryRunning, can we put constraints on what a correct query can do with it? Maybe I am being too optimistic, but I don't think it's necessarily a problem that it's possible to write incorrect queries (we can write core dumps as well...). The key question is, can we describe the rules a query working with isQueryRunning needs to follow for correctness, and make sure that our 4 queries that do it follow these rules? Of course it is a big problem if it's not possible to describe such rules or to have the queries follow them.

DanilaFe added a commit that referenced this pull request Jan 6, 2025
This incurs a penalty up front, but prevents odd ordering properties in
the query system. It's a workaround for a query system bug detailed in
#26459.

This way, cycles in the standard module dependency graph are always
visited in the same order (starting with `ChapelBase`). As a result, we
don't create cycles over multiple generations. This workaround does not
prevent the issue from occurring in other contexts, such as user
projects with circular dependencies.

Reviewed by @benharsh -- thanks!

## Testing
- [x] Anna's test branch doesn't crash
riftEmber pushed a commit to riftEmber/chapel that referenced this pull request Jan 6, 2025
This incurs a penalty up front, but prevents odd ordering properties
in the query system. It's a workaround for a query system bug
detailed in chapel-lang#26459.

Signed-off-by: Danila Fedorin <[email protected]>
riftEmber pushed a commit to riftEmber/chapel that referenced this pull request Jan 6, 2025
This incurs a penalty up front, but prevents odd ordering properties
in the query system. It's a workaround for a query system bug
detailed in chapel-lang#26459.

Signed-off-by: Danila Fedorin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants