Part 4: read_table.c uses transform in ffi #614

Merged
merged 12 commits into from
Feb 21, 2025

Conversation

nicklan
Collaborator

@nicklan nicklan commented Dec 20, 2024

What changes are proposed in this pull request?

Use the new transform functionality to transform data over FFI. This lets us get rid of all the gross partition-adding code in C :)

In particular:

  • remove add_partition_columns in arrow.c; we don't need it anymore
  • expose FFI methods to get an expression evaluator and evaluate an expression from C
  • use the above to add an apply_transform function in arrow.c

How was this change tested?

  • existing tests


codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 49.54128% with 55 lines in your changes missing coverage. Please review.

Project coverage is 84.34%. Comparing base (e2bcd0b) to head (bbb9465).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ffi/src/scan.rs | 0.00% | 21 Missing ⚠️ |
| ffi/src/engine_funcs.rs | 73.23% | 18 Missing and 1 partial ⚠️ |
| ffi/src/expressions/kernel.rs | 0.00% | 13 Missing ⚠️ |
| kernel/src/engine/arrow_expression.rs | 33.33% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #614      +/-   ##
==========================================
- Coverage   84.54%   84.34%   -0.20%     
==========================================
  Files          75       75              
  Lines       17553    17657     +104     
  Branches    17553    17657     +104     
==========================================
+ Hits        14840    14893      +53     
- Misses       2005     2055      +50     
- Partials      708      709       +1     


@github-actions github-actions bot added the breaking-change Change that will require a version bump label Dec 20, 2024
Collaborator

@scovich scovich left a comment


What is the duckdb story with this new approach/PR? Have we explored that yet?

Asking because IIRC they push partition values down into their parquet reader, so they'll need to introspect the transforms and handle them differently than any kernel code we've written.

nicklan added a commit that referenced this pull request Jan 23, 2025
…and return it. (#607)


## What changes are proposed in this pull request?

This is the initial part of moving to using expressions to express
transformations when reading data. What this PR does is:
- Compute a "static" transform, which is just a set of column
expressions that need to be passed directly through without change, or
enough metadata for lower levels to fill in a "fixup" expression
- The static transform is passed into the iterator that parses each
`Add` file
- When parsing the `Add` file, if any fix-ups are needed (just
partition columns today), the correct expression is created and
inserted into a row-indexed map
- This map is returned so the caller can find out, for a given row, what
expression (if any) needs to be applied when reading the specified row
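
As a rough illustration of the row-indexed map described above, here is a minimal C sketch. The types `FixupExpr` and `TransformMap` and the function `transform_for_row` are hypothetical stand-ins invented for this example; the kernel's real expression type is opaque and its map lives on the Rust side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel expression; the real type is opaque. */
typedef struct {
  const char* partition_column; /* column the fix-up fills in */
  const char* partition_value;  /* constant value for that column */
} FixupExpr;

/* Hypothetical row-indexed map: slot i holds the fix-up for row i of the
 * scan metadata batch, or NULL when that file can be read as-is. */
typedef struct {
  const FixupExpr** slots;
  size_t len;
} TransformMap;

/* Look up the fix-up for a row. Returns NULL when the row is out of range
 * or when the row needs no transform. */
static const FixupExpr* transform_for_row(const TransformMap* map, size_t row) {
  assert(map != NULL);
  if (row >= map->len) {
    return NULL;
  }
  return map->slots[row];
}
```

A caller would check the result for each `Add` row and only apply an expression when the lookup returns a non-NULL entry.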

Follow-up PRs:
* #612: Propagate this information through when using `visit_scan_files`
* #613: Actually use the data to do transformation and remove
`transform_to_logical` entirely
* #614: Make this work over ffi and use it
* (TODO): Clean up any existing code that's now overcomplicated in the
scan building

Each of those is more invasive and ends up touching significant code, so
I'm staging this as much as possible to make reviews easier.



## How was this change tested?

Unit tests, and inspection of resultant expressions when run on tables
ffi/src/scan.rs Outdated
@@ -398,5 +429,5 @@ pub unsafe extern "C" fn visit_scan_data(
callback,
};
// TODO: return ExternResult to caller instead of panicking?
Collaborator Author

reminder: get to this TODO!

Collaborator Author

@OussamaSaoudi

This isn't actually quite as easy as it seems. To become an ExternResult you need an engine, and as the APIs are now you likely won't have an engine when calling this, because kernel_scan_data_next doesn't have one, so it can't pass it in when it calls this function.

We could take an engine arg in kernel_scan_data_next and thread it all the way through, but that's a bigger change. So I'm making an issue and punting for now :) #680
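
To make the dependency concrete, here is a small C sketch of a result type whose error variant has to be allocated through the engine. It assumes, for illustration only, that the engine is what supplies the error allocator; every name here (`SketchEngine`, `SketchResult`, `visit_with_engine`, `malloc_error`) is hypothetical and not the kernel's FFI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical error object owned by the engine. */
typedef struct {
  int code;
  const char* message;
} SketchError;

/* Hypothetical engine: the only thing that knows how to allocate errors. */
typedef struct {
  SketchError* (*allocate_error)(int code, const char* message);
} SketchEngine;

typedef enum { SKETCH_OK, SKETCH_ERR } SketchTag;

typedef struct {
  SketchTag tag;
  SketchError* err; /* set only when tag == SKETCH_ERR */
} SketchResult;

/* Example allocator an engine might register. */
static SketchError* malloc_error(int code, const char* message) {
  SketchError* e = malloc(sizeof *e);
  if (e != NULL) {
    e->code = code;
    e->message = message;
  }
  return e;
}

/* With an engine in hand, a failure can be returned as a value instead of
 * panicking. Without one there is nothing to allocate the error with. */
static SketchResult visit_with_engine(const SketchEngine* engine, int fail) {
  assert(engine != NULL && engine->allocate_error != NULL);
  SketchResult r = {SKETCH_OK, NULL};
  if (fail) {
    r.tag = SKETCH_ERR;
    r.err = engine->allocate_error(1, "visit failed");
  }
  return r;
}
```

Under that assumption, any function that wants to return such a result needs the engine threaded through to it, which is the larger change mentioned above.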

@nicklan nicklan force-pushed the part-4-read-table-uses-transform branch from b961220 to aaeda50 Compare February 6, 2025 01:11
@nicklan nicklan marked this pull request as ready for review February 6, 2025 01:32
@nicklan
Collaborator Author

nicklan commented Feb 6, 2025

What is the duckdb story with this new approach/PR? Have we explored that yet?

We haven't. I'm keeping the old partition map in there partly so we don't break them. But they will need to modify their extension to visit the expression, and notice that it's just a scalar expression adding a constant column, and that they can ignore that and just push it into their parquet reader.
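
A minimal sketch of the kind of check an engine could run while visiting the transform, assuming a heavily simplified expression model. `SketchExpr`, `SketchExprKind`, and `only_adds_constants` are hypothetical names for this example; the real expression visitor API is different and richer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of one output column in a transform. */
typedef enum { SKETCH_COLUMN, SKETCH_LITERAL } SketchExprKind;

typedef struct {
  SketchExprKind kind;
  const char* name;  /* output column name */
  const char* value; /* constant value; NULL for SKETCH_COLUMN */
} SketchExpr;

/* Returns 1 when every entry is either a plain column reference or a
 * literal, i.e. the transform only adds constant columns. In that case the
 * number of literal columns is written to *n_constants so the caller can
 * push them into its own reader. Returns 0 for anything this sketch does
 * not recognise. */
static int only_adds_constants(const SketchExpr* exprs, size_t len, size_t* n_constants) {
  assert(n_constants != NULL);
  size_t count = 0;
  for (size_t i = 0; i < len; i++) {
    switch (exprs[i].kind) {
      case SKETCH_COLUMN:
        break;
      case SKETCH_LITERAL:
        if (exprs[i].value == NULL) {
          return 0;
        }
        count++;
        break;
      default:
        return 0;
    }
  }
  *n_constants = count;
  return 1;
}
```

An engine that pushes partition values into its own reader could take that path when the check succeeds and fall back to evaluating the expression otherwise.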

@@ -6,6 +6,14 @@ Schema:
├─ number: long
└─ a_float: double

letter: [
Collaborator Author

new way of doing it ends up putting it first, which is more correct based on the schema

Collaborator

@scovich scovich left a comment


Very nice. Deleted code is always a bonus. A few nits to consider before merge.

@@ -50,86 +51,10 @@ static GArrowRecordBatch* get_record_batch(FFI_ArrowArray* array, GArrowSchema*
return record_batch;
}

// Add columns to a record batch for each partition. In a "real" engine we would want to parse the
Collaborator

Nice side bonus that we get to delete so much code!

Comment on lines 144 to 145
if (!transformed) {
// TODO: What?
Collaborator

Given that there's only one error possible, should we just propagate it and do the print+free here?

Collaborator Author

This seemed like a good idea, but then in apply_transform we can't do:

if (!context->arrow_context->cur_transform) {
  print_diag("  No transform needed");
  return data;
}

and we'd have to manually create the result type.

Doesn't feel great either way, but the flow at least will print the error and then exit. Since this is an "example" I feel like it's mostly okay.
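
The control flow being discussed can be sketched roughly as follows. This is not the code from arrow.c: `SketchData`, `SketchTransform`, and `apply_transform_sketch` are hypothetical, and the "transform" here just adds a column count so the example stays self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for engine data and a per-file transform. */
typedef struct {
  int n_columns;
} SketchData;

typedef struct {
  int extra_columns; /* columns the transform adds */
  int should_fail;   /* simulate an evaluation error */
} SketchTransform;

/* Consumes `data`. Returns the same pointer when no transform is needed, a
 * newly allocated batch when one is applied, or NULL after printing the
 * error. The caller must not use `data` again unless it is the value
 * returned. */
static SketchData* apply_transform_sketch(const SketchTransform* transform, SketchData* data) {
  assert(data != NULL);
  if (transform == NULL) {
    return data; /* no transform needed: hand the input straight back */
  }
  if (transform->should_fail) {
    fprintf(stderr, "Failed to transform data\n");
    free(data);
    return NULL;
  }
  SketchData* out = malloc(sizeof *out);
  if (out == NULL) {
    free(data);
    return NULL;
  }
  out->n_columns = data->n_columns + transform->extra_columns;
  free(data);
  return out;
}
```

Returning a plain pointer keeps the early "no transform" return trivial, at the cost of reporting the error inside the helper instead of at the call site.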

Comment on lines +151 to +152
// TODO: Make this a data_type, and give a way for c code to go between schema <-> datatype
output_type: Handle<SharedSchema>,
Collaborator

The TODO could be tricky because schemas are opaque to engine, introspected only by visitor methods?

Collaborator Author

Basically yes, and we just haven't spec'd it all out. It would be a lot of "duplicate" code, as we'd basically need all the same visitors, but just not including the extra bits a schema has (name, nullability, metadata). So we'd probably want to somehow abstract that and share between datatype and schema visiting.

Future work I think :)

Collaborator

@zachschuermann zachschuermann left a comment


awesome very excited for this :)

@@ -187,28 +107,59 @@ static GArrowBooleanArray* slice_to_arrow_bool_array(const KernelBoolSlice slice
return (GArrowBooleanArray*)ret;
}

static ExclusiveEngineData* apply_transform(
Collaborator

nit: maybe some docs that this consumes the data and hands back a (potentially) new one

@nicklan nicklan merged commit baa3fc3 into delta-io:main Feb 21, 2025
19 of 21 checks passed