-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to DataFusion 43, fix a bug, add more tests #53
Conversation
@@ -99,10 +99,14 @@ impl QueryStage { | |||
/// Get the input partition count. This is the same as the number of concurrent tasks | |||
/// when we schedule this query stage for execution | |||
pub fn get_input_partition_count(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why in the case of a leaf node, the input partition is the same as the output partition of the plan, while in case of a plan with children it is the output partitioning of the first child? This is assuming that all children have the same partition count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in context.py
we have this logic:
# if the query stage has a single output partition then we need to execute for the output
# partition, otherwise we need to execute in parallel for each input partition
concurrency = stage.get_input_partition_count()
output_partitions_count = stage.get_output_partition_count()
This is based on the assumption that the query stage is a shuffle write, which perhaps was always true when running TPC-H, so the existing code worked.
With the new simple SELECT * FROM table
test that you added, we had a query stage where the plan was a CsvExec
and had no children so we had to handle this as a special case. There is no input partitioning in this case. We use the output partitioning because DataFusion will have already decided that based on the files that are available.
This code is all confusing and I would like to make it less so.
This is based on #50 and adds an extra test and fixes a bug