Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the canonicalizing for GPU file scan #10137

Merged
merged 2 commits into from
Jan 3, 2024

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Jan 2, 2024

close #10136

The original output should be used when creating a canonicalized GPU file scan.

Because the rule prunePartitionForFileSourceScan will remove partition columns that are not
used by the first downstream ProjectExec for some patterns, leading to some partition columns
not exist in the finalized output. Then the AttributeReferences in the partitionFilters but excluded
from the finalized output will not be canonicalized.

The original output should be used when creating a canonicalized GPU file scan, because
the rule `prunePartitionForFileSourceScan` in Plugin may remove the partition columns
that are not used by the first downstream ProjectExec for some patterns, leading to some
partition columns not exist in the finalized output.

Then the `AttributeReference`s in some filters but not in the output will not be canonicalized.

---------

Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman changed the title Fix filescan canonical Fix the canonicalizing for GPU file scan Jan 2, 2024
@firestarman
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks okay to me, but I don't understand how this fixes any bug, unless for some reason Spark planed the two getDF calls in the test differently, or we only filtered out the predicate output in one of the two GpuFileSourceScanExec instances.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlowe explained the difference to me and I understand it now.

@firestarman firestarman merged commit ace4870 into NVIDIA:branch-24.02 Jan 3, 2024
38 of 39 checks passed
@firestarman firestarman deleted the fix-filescan-canonical branch January 3, 2024 01:24
@firestarman
Copy link
Collaborator Author

Thx all.

@sameerz sameerz added the bug Something isn't working label Jan 4, 2024
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Jan 18, 2024
nvliyuan pushed a commit to firestarman/spark-rapids that referenced this pull request Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal can be different
4 participants