Part 3 of expression based transform: Use computed transform #613

Merged (8 commits) on Feb 6, 2025

Collaborator

@nicklan nicklan commented Dec 20, 2024

What changes are proposed in this pull request?

Use the transform that has been computed rather than using transform_to_logical.

  1. Remove column_mapping_mode from GlobalScanState (it's not needed there anymore)
  2. Remove the old transform_to_logical code
  3. Add a new scan::state::transform_to_logical function that encapsulates the boilerplate of applying the transform expression
  4. Use the new function where needed.
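
As a rough illustration of item 3, the helper boils down to "evaluate the transform if there is one, otherwise hand the physical data back unchanged". The sketch below is not the kernel's code: `Expr`, `ExprRef`, `Batch`, and `evaluate` are hypothetical stand-ins for the real expression and engine-data types, and errors are plain strings.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the kernel's expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Pass the physical column at this index through unchanged.
    Column(usize),
    /// Emit the same constant for every row (e.g. a partition value).
    Literal(String),
}

// Hypothetical stand-ins: a shared transform and a batch of string rows.
type ExprRef = Arc<Vec<Expr>>;
type Batch = Vec<Vec<String>>;

/// Toy evaluator: each expression produces one output cell per input row.
fn evaluate(transform: &[Expr], physical: &Batch) -> Result<Batch, String> {
    physical
        .iter()
        .map(|row| {
            transform
                .iter()
                .map(|e| match e {
                    Expr::Column(i) => row
                        .get(*i)
                        .cloned()
                        .ok_or_else(|| format!("missing physical column {i}")),
                    Expr::Literal(v) => Ok(v.clone()),
                })
                .collect::<Result<Vec<_>, _>>()
        })
        .collect()
}

/// Apply the per-file transform when present; otherwise the physical data
/// already is the logical data and is returned as-is.
fn transform_to_logical(physical: Batch, transform: &Option<ExprRef>) -> Result<Batch, String> {
    match transform {
        Some(t) => evaluate(t, &physical),
        None => Ok(physical),
    }
}
```

Keeping the `None` branch a plain pass-through means files that need no fix-up pay nothing for the evaluation step.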

How was this change tested?

Existing tests, which cover this functionality extensively, pass.

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.08%. Comparing base (2240154) to head (d8a2355).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #613   +/-   ##
=======================================
  Coverage   84.08%   84.08%           
=======================================
  Files          77       77           
  Lines       17823    17777   -46     
  Branches    17823    17777   -46     
=======================================
- Hits        14986    14948   -38     
+ Misses       2120     2115    -5     
+ Partials      717      714    -3     

ffi/src/scan.rs Outdated
@@ -429,5 +427,12 @@ pub unsafe extern "C" fn visit_scan_data(
         callback,
     };
     // TODO: return ExternResult to caller instead of panicking?
-    visit_scan_files(data, selection_vec, &transforms.transforms, context_wrapper, rust_callback).unwrap();
+    visit_scan_files(
Collaborator

Is this formatting that should have been applied in an earlier PR?

Collaborator Author

It was applied earlier; it's included in part-2. Not quite sure why it's showing up here, but it'll go away once the earlier PRs merge.

&all_fields,
have_partition_cols,
);
let logical = if let Some(ref transform) = scan_file.transform {
Collaborator

aside: Is there some way to factor out some of the duplicated logic between this default engine, and the sync engine example above? (I'm guessing this new code just adds to existing duplication, so best addressed in a separate PR)

Collaborator Author

I've actually moved it to a function: state::transform_to_logical

But I suspect that's not the code you're referencing here. Which duplicate code did you mean?

Collaborator

Hmm, I've forgotten the details now, and I can't see the original code any more (loading the file link for this comment brings up a file without the comment). Maybe it was related to partition value parsing (a quick search turns up several uses of that function, generally falling into two categories). Otherwise... sorry no idea.

have_partition_cols: bool,
) -> DeltaResult<Box<dyn EngineData>> {
let physical_schema = global_state.physical_schema.clone();
if !have_partition_cols && global_state.column_mapping_mode == ColumnMappingMode::None {
Collaborator

IIRC, this code was annoying when I added column mapping support for expression eval. I think this was the only code left that specifically tracked or cared about column mapping mode, because of the new way logical -> physical transforms are performed.

I recommend auditing the caller chain to see what other code can be simplified, now that we don't need column mapping logic here any more.

Collaborator Author

Good call! I've removed it from GlobalScanState as we don't need it there anymore

)
.unwrap();
// to transform the physical data into the correct logical form
let logical = if let Some(ref transform) = scan_file.transform {
Collaborator

I think this is (at least) three places that do ~exactly the same thing. Is there a way to factor out a helper method that everyone can use?

Collaborator Author

Yep, factored out to state::transform_to_logical

@nicklan nicklan force-pushed the part-3-use-computed-transform branch from 6d72b75 to ec0d671 Compare January 10, 2025 23:55
nicklan added a commit that referenced this pull request Jan 23, 2025
…and return it. (#607)

## What changes are proposed in this pull request?

This is the initial part of moving to using expressions to express
transformations when reading data. What this PR does is:
- Compute a "static" transform, which is just a set of column
expressions that need to be passed directly through without change, or
enough metadata for lower levels to fill in a "fixup" expression
- The static transform is passed into the iterator that parses each
`Add` file
- When parsing the `Add` file, if there are needed fix-ups (just
partition columns today), the correct expression is created, and
inserted into a row indexed map
- This map is returned so the caller can find out for a given row what,
if any, expression needs to be applied when reading the specified row

Follow-up PRs:
* #612: Propagate this information through when using `visit_scan_files`
* #613: Actually use the data to do transformation and remove
`transform_to_logical` entirely
* #614: Make this work over ffi and use it
* (TODO): Clean up any existing code that's now over complicated in the
scan building

Each of those is more invasive and ends up touching significant code, so
I'm staging this as much as possible to make reviews easier.

## How was this change tested?

Unit tests, and inspection of resultant expressions when run on tables
@nicklan nicklan force-pushed the part-3-use-computed-transform branch from ec0d671 to 24a45c9 Compare January 23, 2025 00:53
@nicklan nicklan force-pushed the part-3-use-computed-transform branch from 24a45c9 to 2d598d8 Compare February 4, 2025 20:01
@nicklan nicklan marked this pull request as ready for review February 4, 2025 20:17
@@ -12,8 +12,7 @@ use delta_kernel::engine::arrow_data::ArrowEngineData;
 use delta_kernel::engine::default::executor::tokio::TokioBackgroundExecutor;
 use delta_kernel::engine::default::DefaultEngine;
 use delta_kernel::engine::sync::SyncEngine;
-use delta_kernel::scan::state::{DvInfo, GlobalScanState, Stats};
-use delta_kernel::scan::transform_to_logical;
+use delta_kernel::scan::state::{transform_to_logical, DvInfo, GlobalScanState, Stats};
Collaborator

Remove the old transform_to_logical code

??

Collaborator

Ah -- the partition values aware version of transform_to_logical_internal is what disappeared.

Comment on lines 111 to 122
    if let Some(ref transform) = transform {
        engine
            .get_expression_handler()
            .get_evaluator(
                physical_schema.clone(),
                transform.as_ref().clone(), // TODO: Maybe eval should take a ref
                logical_schema.clone().into(),
            )
            .evaluate(physical_data.as_ref())
    } else {
        Ok(physical_data)
    }
Collaborator

nit: possibly cleaner with a match?

Suggested change
-    if let Some(ref transform) = transform {
-        engine
-            .get_expression_handler()
-            .get_evaluator(
-                physical_schema.clone(),
-                transform.as_ref().clone(), // TODO: Maybe eval should take a ref
-                logical_schema.clone().into(),
-            )
-            .evaluate(physical_data.as_ref())
-    } else {
-        Ok(physical_data)
-    }
+    match transform {
+        Some(ref transform) => engine
+            .get_expression_handler()
+            .get_evaluator(
+                physical_schema.clone(),
+                transform.as_ref().clone(), // TODO: Maybe eval should take a ref
+                logical_schema.clone().into(),
+            )
+            .evaluate(physical_data.as_ref()),
+        None => Ok(physical_data),
+    }

Collaborator

@zachschuermann zachschuermann left a comment

just a couple nits, LGTM (and I think these small PRs in parts/sequences are great!)

-    _transform: Option<ExpressionRef>,
-    partition_values: HashMap<String, String>,
+    transform: Option<ExpressionRef>,
+    _: HashMap<String, String>,
Collaborator

should we be able to remove this or does this still prove useful elsewhere? (I know there are a number of spots where we use this callback and pass partition values - curious if we will eventually migrate everything to transform or if this is still actually needed)

Collaborator Author

We could. My idea was to keep it for now so existing systems could migrate more slowly. Basically, deprecate it
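
To make the migration idea concrete, here is a minimal sketch of how one callback signature can serve both kinds of consumer while the partition-values argument is kept around. Everything in it is hypothetical: `ExprRef` is a stand-in type, and the real callback takes more arguments (size, stats, deletion-vector info) than shown.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the kernel's expression reference type.
type ExprRef = Arc<String>;

/// Hypothetical, trimmed-down callback signature.
type ScanCallback<T> = fn(&mut T, &str, Option<ExprRef>, HashMap<String, String>);

#[derive(Default)]
struct Seen {
    entries: Vec<String>,
}

/// Legacy consumer: still reads partition values and ignores the transform.
fn legacy_callback(
    ctx: &mut Seen,
    path: &str,
    _transform: Option<ExprRef>,
    partition_values: HashMap<String, String>,
) {
    let mut keys: Vec<_> = partition_values.into_keys().collect();
    keys.sort();
    ctx.entries.push(format!("{path} partitions={keys:?}"));
}

/// Migrated consumer: uses the transform and ignores partition values.
fn migrated_callback(
    ctx: &mut Seen,
    path: &str,
    transform: Option<ExprRef>,
    _: HashMap<String, String>,
) {
    let desc = transform.map_or_else(|| "none".to_string(), |t| t.to_string());
    ctx.entries.push(format!("{path} transform={desc}"));
}

/// Invoke a callback once with sample data, as a scan would for one file.
fn drive(cb: ScanCallback<Seen>) -> Seen {
    let mut ctx = Seen::default();
    let parts = HashMap::from([("date".to_string(), "2024-01-01".to_string())]);
    let transform = Some(Arc::new("struct(id, '2024-01-01')".to_string()));
    cb(&mut ctx, "part-0.parquet", transform, parts);
    ctx
}
```

Because both callbacks share the signature, existing callers keep compiling and can switch to the transform at their own pace.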

Ok(physical_data)
}
}

Collaborator

doc nit: Line 142/143 below might benefit from listing the new transform field

Collaborator Author

yep, nice catch, added

@nicklan nicklan merged commit 68f4790 into delta-io:main Feb 6, 2025
21 checks passed
Labels
breaking-change Change that will require a version bump
3 participants