Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to DataFusion 43, fix a bug, add more tests #53

Merged
merged 17 commits into from
Dec 14, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Dec 14, 2024

This is based on #50 and adds an extra test and fixes a bug

@andygrove andygrove changed the title wip: bug fix & tests Upgrade to DataFusion 43, fix a bug, add more tests Dec 14, 2024
@andygrove andygrove marked this pull request as ready for review December 14, 2024 16:53
@@ -99,10 +99,14 @@ impl QueryStage {
/// Get the input partition count. This is the same as the number of concurrent tasks
/// when we schedule this query stage for execution
pub fn get_input_partition_count(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why in the case of a leaf node, the input partition is the same as the output partition of the plan, while in case of a plan with children it is the output partitioning of the first child? This is assuming that all children have the same partition count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in context.py we have this logic:

    # if the query stage has a single output partition then we need to execute for the output
    # partition, otherwise we need to execute in parallel for each input partition
    concurrency = stage.get_input_partition_count()
    output_partitions_count = stage.get_output_partition_count()

This is based on the assumption that the query stage is a shuffle write, which perhaps was always true when running TPC-H, so the existing code worked.

With the new simple SELECT * FROM table test that you added, we had a query stage where the plan was a CsvExec and had no children so we had to handle this as a special case. There is no input partitioning in this case. We use the output partitioning because DataFusion will have already decided that based on the files that are available.

This code is all confusing and I would like to make it less so.

@andygrove andygrove merged commit 151a0e2 into apache:main Dec 14, 2024
2 checks passed
@andygrove andygrove deleted the tests branch December 14, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants