fix: Distribution error failing in the SanityCheck, for a specific influxql plan. #58
base: patched-df-dec25-ver44.0.0
Conversation
Force-pushed from af2856a to 27f2a3a.
* demonstrate the insertion of coalesce after the use of column estimates, and the removal of the test scenario's forcing of rr repartitioning
…the coalesce added in the EnforceDistribution
Force-pushed from 27f2a3a to 202860b.
* fix
* update cli
```diff
@@ -516,7 +516,7 @@ fn remove_bottleneck_in_subplan(
 ) -> Result<PlanWithCorrespondingCoalescePartitions> {
     let plan = &requirements.plan;
     let children = &mut requirements.children;
-    if is_coalesce_partitions(&children[0].plan) {
+    if is_coalesce_partitions(&children[0].plan) && !is_aggregation(plan) {
```
This is the fix, for now.
A proper fix (if we decide to fix at this conditional) should compare the partitioning needs of the parent of the coalesce vs. the children of the coalesce. My initial attempt to do so caused other DF tests to fail.
I didn't proceed further since I'm unsure of the correct solution at a higher level.
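The patched guard can be sketched with a toy model. This is an illustrative stand-in, not DataFusion's actual code: `PlanNode`, `may_remove_child_coalesce`, and the surrounding plumbing are simplified inventions; only the boolean condition mirrors the diff above.

```rust
// Toy model of the patched guard in `remove_bottleneck_in_subplan`.
// `PlanNode` and these free functions are illustrative stand-ins for
// DataFusion's ExecutionPlan machinery.

enum PlanNode {
    CoalescePartitions,
    Aggregate,
    Other,
}

fn is_coalesce_partitions(node: &PlanNode) -> bool {
    matches!(node, PlanNode::CoalescePartitions)
}

fn is_aggregation(node: &PlanNode) -> bool {
    matches!(node, PlanNode::Aggregate)
}

/// With the patch, a coalesce feeding an aggregate is preserved, since the
/// aggregate relies on the single-partition distribution it provides.
fn may_remove_child_coalesce(parent: &PlanNode, child: &PlanNode) -> bool {
    is_coalesce_partitions(child) && !is_aggregation(parent)
}

fn main() {
    // The coalesce under an aggregate is now kept (the fix).
    assert!(!may_remove_child_coalesce(
        &PlanNode::Aggregate,
        &PlanNode::CoalescePartitions
    ));
    // A coalesce under any other parent may still be removed, as before.
    assert!(may_remove_child_coalesce(
        &PlanNode::Other,
        &PlanNode::CoalescePartitions
    ));
    println!("guard ok");
}
```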
Should this be checking that it is an aggregation and that the aggregation mode is SinglePartitioned?
🤔
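The suggested narrower check might look like the sketch below. `AggregateMode` mirrors DataFusion's enum of that name, but `must_keep_coalesce` and the surrounding shape are hypothetical, not code from this PR:

```rust
// Sketch of the reviewer's suggestion: only preserve the coalesce when the
// parent aggregate actually runs in SinglePartitioned mode. The enum mirrors
// DataFusion's `AggregateMode`; the helper is illustrative.

#[allow(dead_code)]
enum AggregateMode {
    Partial,
    Final,
    FinalPartitioned,
    Single,
    SinglePartitioned,
}

/// Keep the child coalesce only when the parent is an aggregate whose mode
/// requires a single-partition input it cannot rebuild on its own.
/// `None` means the parent is not an aggregate at all.
fn must_keep_coalesce(parent_mode: Option<&AggregateMode>) -> bool {
    matches!(parent_mode, Some(AggregateMode::SinglePartitioned))
}

fn main() {
    assert!(must_keep_coalesce(Some(&AggregateMode::SinglePartitioned)));
    assert!(!must_keep_coalesce(Some(&AggregateMode::Partial)));
    assert!(!must_keep_coalesce(None));
    println!("mode check ok");
}
```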
We have a bug where the SanityCheck fails due to mismatched distribution needs between a union and an aggregate. After chasing it down (and recreating it in this PR), it appears that a change made in an earlier EnforceDistribution optimizer run is undone in the EnforceSorting optimizer run. Here is the plan before the EnforceDistribution:

Here is the plan after the EnforceDistribution. Note the insertion of the coalesce between the union and aggregate.

The partial and final aggregates are later combined in the CombinePartialFinalAggregate. The plan then passed to EnforceSorting is:

EnforceSorting removes the needed coalesce and replaces it with an SPM. But the SPM is inserted further up the plan (above the aggregate node):

This then fails the SanityCheck.

Possible solutions

This PR is about demonstrating how the error occurred (based upon changes in the 2 optimizer runs), and adds a temporary fix to unblock iox. As for the proper fix, I'm unclear on what to do. Ideas include, but are not limited to:

There could also be other solutions. I'll make an upstream PR and ping for advice.
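The failure sequence described above can be sketched with a toy plan tree. Everything here (the `Plan` enum, `partitions`, `sanity_check`) is a simplified model of the real ExecutionPlan tree and SanityCheck pass, written only to show why moving the SPM above the aggregate breaks the distribution invariant:

```rust
// Hedged sketch of the failure: a single-partition aggregate is valid while a
// coalesce sits between it and the union, and invalid once EnforceSorting
// removes the coalesce and puts an SPM above the aggregate instead.

#[allow(dead_code)]
enum Plan {
    Union(Vec<Plan>),
    Coalesce(Box<Plan>),
    SortPreservingMerge(Box<Plan>),
    Aggregate(Box<Plan>),
    Scan,
}

/// Number of output partitions each node produces in this toy model.
fn partitions(p: &Plan) -> usize {
    match p {
        Plan::Union(children) => children.iter().map(partitions).sum(),
        Plan::Coalesce(_) | Plan::SortPreservingMerge(_) => 1,
        Plan::Aggregate(child) => partitions(child),
        Plan::Scan => 1,
    }
}

/// The single-partition aggregate's input requirement, as the check sees it.
fn sanity_check(p: &Plan) -> Result<(), String> {
    if let Plan::Aggregate(child) = p {
        let n = partitions(child);
        if n != 1 {
            return Err(format!("aggregate requires 1 input partition, got {n}"));
        }
    }
    Ok(())
}

fn main() {
    // After EnforceDistribution: aggregate -> coalesce -> union. Valid.
    let good = Plan::Aggregate(Box::new(Plan::Coalesce(Box::new(Plan::Union(
        vec![Plan::Scan, Plan::Scan],
    )))));
    assert!(sanity_check(&good).is_ok());

    // After EnforceSorting: the coalesce is gone and the SPM sits *above* the
    // aggregate, so the aggregate sees the union's 2 partitions directly.
    let bad = Plan::Aggregate(Box::new(Plan::Union(vec![Plan::Scan, Plan::Scan])));
    assert!(sanity_check(&bad).is_err());
    let _spm_on_top = Plan::SortPreservingMerge(Box::new(bad));
    println!("reproduced");
}
```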
```rust
// Use a small batch size, to trigger RoundRobin in tests
config.execution.batch_size = 1;
```
Note the removal of the forced insertion of round robin repartitioning. This was preventing the coalesce insertion (as seen in our real world reproducer).
```rust
pub(crate) fn parquet_exec_with_stats() -> Arc<ParquetExec> {
    let mut statistics = Statistics::new_unknown(&schema());
    statistics.num_rows = Precision::Inexact(10);
    statistics.column_statistics = column_stats();

    let config =
        FileScanConfig::new(ObjectStoreUrl::parse("test:///").unwrap(), schema())
            .with_file(PartitionedFile::new("x".to_string(), 10000))
            .with_statistics(statistics);
    assert_eq!(config.statistics.num_rows, Precision::Inexact(10));

    ParquetExec::builder(config).build_arc()
}
```
Note that we are now providing statistics in the parquet exec.
The absence of stats was preventing the coalesce insertion (as seen in our real-world reproducer): without parquet stats, we always had a round-robin repartition because roundrobin_beneficial_stats always evaluated to true.
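The stats-driven decision can be modeled roughly as below. This is an illustrative approximation: `roundrobin_beneficial` and its threshold logic are invented for the sketch, and only the "no stats means always repartition" behavior reflects what the comment above describes. The `Precision` enum mimics DataFusion's statistics wrapper.

```rust
// Toy model of why missing parquet stats forced a RoundRobin repartition:
// with unknown row counts the optimizer must assume repartitioning helps.

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical stand-in for DataFusion's `roundrobin_beneficial_stats`:
/// repartition when the (estimated) row count exceeds the batch size, and
/// always when no estimate is available.
fn roundrobin_beneficial(num_rows: Precision, batch_size: usize) -> bool {
    match num_rows {
        Precision::Exact(n) | Precision::Inexact(n) => n > batch_size,
        Precision::Absent => true, // no stats: always repartition
    }
}

fn main() {
    // Without stats the optimizer always inserts a RoundRobin repartition,
    // which masked the coalesce-insertion path this test needs to exercise.
    assert!(roundrobin_beneficial(Precision::Absent, 8192));
    // With a small known row count (Inexact(10), as in `parquet_exec_with_stats`)
    // and a default batch size, repartitioning is skipped.
    assert!(!roundrobin_beneficial(Precision::Inexact(10), 8192));
    println!("stats gate ok");
}
```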
```rust
    .optimize(optimized, &Default::default())
    .unwrap_err();
assert!(err.message().contains(" does not satisfy distribution requirements: HashPartitioned[[a@0]]). Child-0 output partitioning: UnknownPartitioning(2)"));
let checker = checker.optimize(optimized, &Default::default());
```
I really like the idea of running the SanityChecker as part of the tests
I have one suggestion about making the check even more specific, but otherwise I think this is great
Thanks @wiedld
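The test pattern being praised (run the sanity-check pass over the optimized plan and assert on the resulting error) can be sketched as follows. `SanityChecker`, `PlanError`, and the partition-count signature are toy stand-ins; only the shape of the error message echoes the assertion in the diff above:

```rust
// Sketch of running a sanity-check pass inside a test and asserting on the
// distribution error it reports. Not DataFusion's SanityCheckPlan API.

struct SanityChecker;

#[derive(Debug)]
struct PlanError(String);

impl SanityChecker {
    /// Fails when a child's output partition count does not match what the
    /// parent node requires.
    fn check(&self, required: usize, actual: usize) -> Result<(), PlanError> {
        if required != actual {
            return Err(PlanError(format!(
                "does not satisfy distribution requirements: required \
                 {required} partition(s). Child-0 output partitioning: \
                 UnknownPartitioning({actual})"
            )));
        }
        Ok(())
    }
}

fn main() {
    let checker = SanityChecker;
    // A union coalesced to 1 partition satisfies a single-partition aggregate.
    assert!(checker.check(1, 1).is_ok());
    // With the coalesce removed, the aggregate sees 2 partitions and the
    // check fails, matching the mismatched-distribution error in this PR.
    let err = checker.check(1, 2).unwrap_err();
    assert!(err.0.contains("UnknownPartitioning(2)"));
    println!("sanity check ok");
}
```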
```rust
/// Same as [`repartitions_for_aggregate_after_sorted_union`], but adds a projection
/// as well between the union and aggregate. This changes the outcome:
///
/// * we no longer get repartitioning, and instead get coalescing.
#[test]
fn coalesces_for_aggregate_after_sorted_union_projection() -> Result<()> {
```
This is how we are getting coalesce inserted in the EnforceDistribution run.
Temporary fix for https://github.com/influxdata/influxdb_iox/issues/13310.
This PR branches off our current patched DF branch. It adds a few commits to handle the above issue.
Confirmed that it does fix this bug reproducer in iox.
Changes made
* Commit 1 = recreate the insertion of the coalesce in the EnforceDistribution optimization pass.
* Commit 2 = reproducer of the SanityCheck failure after EnforceSorting removes the added coalesce.
* Commit 3 = a special-cased fix based on checking for the AggregateExec. (A more general fix will occur upstream.)
* Commit 4 = unrelated. It fixes a wasm build CI error by cherry-picking over the upstream fix.