Upgrade to DataFusion 43, fix a bug, add more tests #53

andygrove · 2024-12-14T15:58:51Z

This is based on #50. It adds an extra test and fixes a bug.

edmondop · 2024-12-14T17:58:54Z

src/query_stage.rs

@@ -99,10 +99,14 @@ impl QueryStage {
    /// Get the input partition count. This is the same as the number of concurrent tasks
    /// when we schedule this query stage for execution
    pub fn get_input_partition_count(&self) -> usize {


Can you explain why, for a leaf node, the input partition count is the same as the output partition count of the plan, while for a plan with children it is the output partition count of the first child? Does this assume that all children have the same partition count?

In context.py we have this logic:

    # if the query stage has a single output partition then we need to execute for the output
    # partition, otherwise we need to execute in parallel for each input partition
    concurrency = stage.get_input_partition_count()
    output_partitions_count = stage.get_output_partition_count()

This is based on the assumption that the query stage is a shuffle write, which perhaps was always true when running TPC-H, so the existing code worked.
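To make the comment in that snippet concrete, here is a minimal sketch of the decision it describes. It is only one reading of the comment, not the actual code in context.py: the function name `choose_concurrency` and its parameters are hypothetical, and the rule that a single output partition means a single task is an assumption.

```python
def choose_concurrency(input_partitions: int, output_partitions: int) -> int:
    """Pick how many tasks to run for a query stage (hypothetical helper).

    Assumption: a stage with exactly one output partition is executed once,
    for that output partition; any other stage is executed once per input
    partition.
    """
    if output_partitions == 1:
        # Single output partition: run one task for it.
        return 1
    # Otherwise run one task per input partition, in parallel.
    return input_partitions
```

Under this reading, a stage with 8 input partitions and 1 output partition runs as one task, while a stage with 8 input and 8 output partitions runs as 8 tasks.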

With the new simple SELECT * FROM table test that you added, we had a query stage whose plan was a CsvExec with no children, so we had to handle this as a special case. There is no input partitioning in this case, so we use the output partitioning, which DataFusion will have already decided based on the files that are available.
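The leaf-node special case can be sketched in a few lines. This is an illustration only: `PlanNode` is a hypothetical stand-in for a physical plan node, not DataFusion's API, and `input_partition_count` only mirrors the behavior described in this thread (the real method is `get_input_partition_count` in src/query_stage.rs).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical plan node; not a DataFusion type."""
    name: str
    output_partitions: int
    children: List["PlanNode"] = field(default_factory=list)


def input_partition_count(plan: PlanNode) -> int:
    """Input partition count as described in the discussion above."""
    if not plan.children:
        # Leaf (e.g. a CsvExec scan): there is no input partitioning,
        # so fall back to the plan's own output partitioning.
        return plan.output_partitions
    # Non-leaf: use the first child's output partitioning. This assumes
    # all children have the same partition count.
    return plan.children[0].output_partitions


# A leaf scan with 4 partitions, and a hypothetical writer stage on top of it.
scan = PlanNode("CsvExec", output_partitions=4)
writer = PlanNode("ShuffleWriter", output_partitions=1, children=[scan])
```

Here `input_partition_count(scan)` falls back to the scan's own 4 output partitions, while `input_partition_count(writer)` takes the 4 partitions of its first child rather than its own single output partition.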

This code is all confusing and I would like to make it less so.

edmondop and others added 15 commits November 29, 2024 19:52

Implementing Unit testing for Python

1b847e0

Installing all deps in CI

b4aab9a

Adding maturin develop

b3dddd7

Restoring correct input partitioning

b298923

Generated new plans

f07c38d

Restored test plans for ignored tests

4e60563

tests

ebe403a

fix

ff277de

fix

9809013

update expected plans

d2321b2

update expected plans

ef72e11

revert some changes

e371f07

remove comment

6a8976e

updated plans

546a4c0

upgrade to DF 43

64e46c1

andygrove changed the title ~~wip: bug fix & tests~~ Upgrade to DataFusion 43, fix a bug, add more tests Dec 14, 2024

andygrove marked this pull request as ready for review December 14, 2024 16:53

update deps, more tests

eb7000b

edmondop reviewed Dec 14, 2024

bug fix

4b3ccf3

andygrove merged commit 151a0e2 into apache:main Dec 14, 2024
2 checks passed

andygrove deleted the tests branch December 14, 2024 18:29

andygrove mentioned this pull request Dec 14, 2024

Single-node Python unit tests fail #52

Closed

edmondop mentioned this pull request Dec 14, 2024

Implementing Unit testing for Python #50

Closed
