Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the retry support for nondeterministic expressions #11789

Merged
merged 5 commits into from
Dec 23, 2024

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Nov 27, 2024

Contributes to #11649

This PR is trying to address some requirements described in issue #11649, but not all of them.

It introduces two new classes named GpuExpressionRetryable and RetryStateTracker to initially set up a fundamental support for the items 2 and 3 as below, and adds in the relevant unit tests.

  1. We need code changes so that if a retry happens on a non-deterministic expression that is outside of a checkpoint/restore, then we fail instead of retrying.
  2. We also want a way to detect a non-deterministic expression being run outside of a checkpoint/restore retry block and throw an error from the plan so that when we can have tests validate that we have this covered.

And it also adds the integration tests for the function rand() being used in HashAggregate, Generate, Projection, ArrowEvalPython and Filter. This is for the item 4, and it still does not cover all the cases where a nondeterministic expression can be used, but we are closer than before.

@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

build

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator Author

firestarman commented Nov 29, 2024

The failure in CI is a known issue, pls refer to #11790.

@firestarman
Copy link
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Dec 5, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great first step. I mostly want to understand the 3.5.1 issue around generate and if we need to file a follow on issue.

I would also love to see some rand tests around

  • join (we can restrict the range of rand to make it actually match)
  • hash aggregate with some distinct operators so expand is called
  • (Do we need/want rand in the key for an aggregate?)
  • window operations

Then I think we will have covered most of the major cases when this could be called.


# See https://github.com/apache/spark/commit/9c0b803ba124a6e70762aec1e5559b0d66529f4d
@ignore_order(local=True)
@pytest.mark.skipif(is_before_spark_351(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a follow on issue for us to handle this properly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally we don't need to do anything. It is a pure Spark bug.
Spark 3.5.0 will complain the exception as below when evaluating a rand() in Generate with the codegen disabled and fail this test.

E                   Caused by: java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
E                   	at scala.Predef$.require(Predef.scala:281)
E                   	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497)
E                   	at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495)
E                   	at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35)
E                   	at org.apache.spark.sql.catalyst.expressions.CreateArray.$anonfun$eval$1(complexTypeCreator.scala:95)
E                   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
...
E                   	at org.apache.spark.sql.catalyst.expressions.CreateArray.eval(complexTypeCreator.scala:95)
E                   	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375)
E                   	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)

And Spark before 3.5.0 will throw the following exception and fail the test too.

E               pyspark.errors.exceptions.captured.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found:
E               explode(array(rand(42L))),col
E               in operator Generate explode(array(rand(42))), false, [col#2].;
E               Project [col#2]
E               +- Generate explode(array(rand(42))), false, [col#2]
E                  +- LogicalRDD [a#0], false

While GPU supports Generate with rand for all the versions. So to make this test work we have to ignore it for Spark before 3.5.1.

@firestarman
Copy link
Collaborator Author

firestarman commented Dec 6, 2024

I would also love to see some rand tests around

  • join (we can restrict the range of rand to make it actually match)
  • hash aggregate with some distinct operators so expand is called
  • (Do we need/want rand in the key for an aggregate?)
  • window operations

Thx for review. Would it be ok to add them by a following PR ? And merging this PR will not close the issue #11649.
Here is some investigation on Window and Expand. #11649 (comment)

@firestarman
Copy link
Collaborator Author

@revans2 Could you help review this again? I'v addressed your comments. Thx in advance.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am still trying to get caught up on reviews. I am okay with finishing this in a follow on issue.

@firestarman firestarman merged commit 32aa3e1 into NVIDIA:branch-25.02 Dec 23, 2024
49 checks passed
@firestarman firestarman deleted the rand-retry branch December 23, 2024 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants