Part 4: read_table.c uses transform in ffi #614

nicklan · 2024-12-20T22:45:23Z

What changes are proposed in this pull request?

Use the new transform functionality to transform data over FFI. This lets us get rid of all the gross partition-adding code in C :)

In particular:

- Remove `add_partition_columns` in arrow.c; we don't need it anymore
- Expose FFI methods to get an expression evaluator and evaluate an expression from C
- Use the above to add an `apply_transform` function in arrow.c

How was this change tested?

existing tests

codecov · 2024-12-20T22:49:53Z

Codecov Report

Attention: Patch coverage is 49.54128% with 55 lines in your changes missing coverage. Please review.

Project coverage is 84.34%. Comparing base (e2bcd0b) to head (bbb9465).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ffi/src/scan.rs | 0.00% | 21 Missing ⚠️ |
| ffi/src/engine_funcs.rs | 73.23% | 18 Missing and 1 partial ⚠️ |
| ffi/src/expressions/kernel.rs | 0.00% | 13 Missing ⚠️ |
| kernel/src/engine/arrow_expression.rs | 33.33% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #614      +/-   ##
==========================================
- Coverage   84.54%   84.34%   -0.20%     
==========================================
  Files          75       75              
  Lines       17553    17657     +104     
  Branches    17553    17657     +104     
==========================================
+ Hits        14840    14893      +53     
- Misses       2005     2055      +50     
- Partials      708      709       +1

scovich

What is the duckdb story with this new approach/PR? Have we explored that yet?

Asking because IIRC they push partition values down into their parquet reader, so they'll need to introspect the transforms and handle them differently than any kernel code we've written.

ffi/examples/read-table/arrow.c

…and return it. (#607)

## What changes are proposed in this pull request?

This is the initial part of moving to using expressions to express transformations when reading data. What this PR does is:
- Compute a "static" transform, which is just a set of column expressions that need to be passed directly through without change, or enough metadata for lower levels to fill in a "fixup" expression
- The static transform is passed into the iterator that parses each `Add` file
- When parsing the `Add` file, if there are needed fix-ups (just partition columns today), the correct expression is created, and inserted into a row indexed map
- This map is returned so the caller can find out for a given row what, if any, expression needs to be applied when reading the specified row

Follow-up PRs:
* #612: Propagate this information through when using `visit_scan_files`
* #613: Actually use the data to do transformation and remove `transform_to_logical` entirely
* #614: Make this work over ffi and use it
* (TODO): Clean up any existing code that's now over complicated in the scan building

Each of those are more invasive and end up touching significant code, so I'm staging this as much as possible to make reviews easier.

## How was this change tested?

Unit tests, and inspection of resultant expressions when run on tables
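
The "row indexed map" described above can be pictured with a small sketch. This is only an illustration under stated assumptions: `Expr`, `RowTransforms`, and `transform_for_row` are hypothetical names invented here, not types or functions from the kernel or its FFI headers, and the kernel's real map may be represented differently.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for a transform expression; not a kernel type.
typedef struct {
  const char* description;
} Expr;

// Hypothetical row-indexed map: one slot per row of scan data.
// A NULL slot means the file in that row needs no fix-up expression.
typedef struct {
  const Expr** by_row;
  size_t n_rows;
} RowTransforms;

// Look up the expression (if any) that must be applied when reading the
// file described by `row`. Out-of-range rows are treated as "no transform".
static const Expr* transform_for_row(const RowTransforms* map, size_t row) {
  assert(map != NULL);
  if (row >= map->n_rows) {
    return NULL;
  }
  return map->by_row[row];
}
```

A caller would consult `transform_for_row` once per selected row and only evaluate an expression when the result is non-NULL.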

nicklan · 2025-01-25T01:16:47Z

ffi/src/scan.rs

@@ -398,5 +429,5 @@ pub unsafe extern "C" fn visit_scan_data(
        callback,
    };
    // TODO: return ExternResult to caller instead of panicking?


reminder: get to this TODO!

@OussamaSaoudi

This isn't actually quite as easy as it seems. To become an ExternResult you need an engine, and as the APIs are now you likely won't have an engine when calling this, because kernel_scan_data_next doesn't have one, so it can't pass it in when it calls this function.

We could take an engine arg in kernel_scan_data_next and thread it all the way through, but that's a bigger change. So I'm making an issue and punting for now :) #680

nicklan · 2025-02-06T01:33:22Z

> What is the duckdb story with this new approach/PR? Have we explored that yet?

We haven't. I'm keeping the old partition map in there partly so we don't break them. But they will need to modify their extension to visit the expression, and notice that it's just a scalar expression adding a constant column, and that they can ignore that and just push it into their parquet reader.
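
A rough sketch of the check described above, i.e. an engine noticing that a transform only passes columns through or adds constant columns. Everything here is an assumption made for illustration: `ExprKind`, `FieldExpr`, `PartitionValue`, and `collect_constant_columns` are hypothetical and do not correspond to the kernel's expression visitor API, which an actual extension would have to use to build a view like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical, simplified view of one output field of a transform,
// as an engine might record it while visiting the expression.
typedef enum { EXPR_COLUMN, EXPR_LITERAL, EXPR_OTHER } ExprKind;

typedef struct {
  ExprKind kind;
  const char* name;   // output column name
  const char* value;  // literal rendered as a string; only set for EXPR_LITERAL
} FieldExpr;

// Hypothetical constant column an engine could push into its parquet reader.
typedef struct {
  const char* name;
  const char* value;
} PartitionValue;

// Returns true if every field is a pass-through column or a literal, and
// copies the literals into `out`. Returns false if any field needs real
// expression evaluation, or if `out` is too small to hold the literals.
static bool collect_constant_columns(const FieldExpr* fields, size_t n_fields,
                                     PartitionValue* out, size_t out_cap,
                                     size_t* n_out) {
  assert(n_out != NULL);
  *n_out = 0;
  for (size_t i = 0; i < n_fields; i++) {
    switch (fields[i].kind) {
      case EXPR_COLUMN:
        break;  // read straight from the file, nothing to record
      case EXPR_LITERAL:
        if (*n_out == out_cap) {
          return false;
        }
        out[*n_out].name = fields[i].name;
        out[*n_out].value = fields[i].value;
        (*n_out)++;
        break;
      default:
        return false;  // not a simple constant-column transform
    }
  }
  return true;
}
```

When the function returns true, the engine can skip evaluating the transform and hand the collected values to its own reader; otherwise it falls back to evaluating the expression.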

nicklan · 2025-02-06T01:35:08Z

ffi/tests/read-table-testing/expected-data/basic-partitioned.expected

@@ -6,6 +6,14 @@ Schema:
 ├─ number: long
 └─ a_float: double

+letter:  [


new way of doing it ends up putting it first, which is more correct based on the schema

scovich

Very nice. Deleted code is always a bonus. A few nits to consider before merge.

scovich · 2025-02-06T21:57:08Z

ffi/examples/read-table/arrow.c

@@ -50,86 +51,10 @@ static GArrowRecordBatch* get_record_batch(FFI_ArrowArray* array, GArrowSchema*
  return record_batch;
 }

-// Add columns to a record batch for each partition. In a "real" engine we would want to parse the


Nice side bonus that we get to delete so much code!

scovich · 2025-02-06T21:59:36Z

ffi/examples/read-table/arrow.c

+  if (!transformed) {
+    // TODO: What?


Given that there's only one error possible, should we just propagate it and do the print+free here?

This seemed like a good idea, but then in apply_transform we can't do:

    if (!context->arrow_context->cur_transform) {
      print_diag(" No transform needed");
      return data;
    }

and we'd have to manually create the result type.

Doesn't feel great either way, but the flow at least will print the error and then exit. Since this is an "example" I feel like it's mostly okay.
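
The flow being discussed (return the input untouched when there is no transform, otherwise consume it and hand back a new value, printing on failure) might look roughly like the sketch below. It is not the code from arrow.c: `Batch`, `Transform`, and `apply_transform_sketch` are hypothetical stand-ins chosen so the example compiles on its own, and the "transform" here merely adds a column count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical stand-ins for engine data and a transform; not kernel types.
typedef struct {
  int rows;
  int cols;
} Batch;

typedef struct {
  int extra_cols;  // number of constant columns the transform appends
} Transform;

// Consumes `data` and returns a (potentially) new batch.
// - No transform: returns `data` itself, ownership passes back to the caller.
// - Transform applied: `data` is freed and a new batch is returned.
// - Failure: an error is printed, `data` is still freed, and NULL is returned.
static Batch* apply_transform_sketch(Batch* data, const Transform* transform) {
  assert(data != NULL);
  if (!transform) {
    return data;  // nothing to do
  }
  Batch* out = malloc(sizeof *out);
  if (out) {
    out->rows = data->rows;
    out->cols = data->cols + transform->extra_cols;
  } else {
    fprintf(stderr, "Failed to apply transform\n");
  }
  free(data);  // free even on failure, the caller no longer owns `data`
  return out;
}
```

Because the function always takes ownership of its argument, a caller only ever has to free the returned pointer, and a NULL return means there is nothing left to clean up.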

scovich · 2025-02-06T22:01:22Z

ffi/src/engine_funcs.rs

+    // TODO: Make this a data_type, and give a way for c code to go between schema <-> datatype
+    output_type: Handle<SharedSchema>,


The TODO could be tricky because schemas are opaque to engine, introspected only by visitor methods?

Basically yes, and we just haven't spec'd it all out. It would be a lot of "duplicate" code, as we'd basically need all the same visitors, but just not including the extra bits a schema has (name, nullability, metadata). So we'd probably want to somehow abstract that and share between datatype and schema visiting.

Future work I think :)

zachschuermann

awesome very excited for this :)

zachschuermann · 2025-02-20T23:14:29Z

ffi/examples/read-table/arrow.c

@@ -187,28 +107,59 @@ static GArrowBooleanArray* slice_to_arrow_bool_array(const KernelBoolSlice slice
  return (GArrowBooleanArray*)ret;
 }

+static ExclusiveEngineData* apply_transform(


nit: maybe some docs that this consumes the data and hands back a (potentially) new one

github-actions bot assigned nicklan Dec 20, 2024

nicklan mentioned this pull request Dec 20, 2024

Part 1, Read transforms via expressions: Just compute the expression and return it. #607

Merged

github-actions bot added the breaking-change Change that will require a version bump label Dec 20, 2024

scovich reviewed Dec 30, 2024

nicklan commented Jan 25, 2025

nicklan added 3 commits February 5, 2025 16:39

use transform in read_table

90318eb

make things work again

985bc72

cleanups and comments

aaeda50

nicklan force-pushed the part-4-read-table-uses-transform branch from b961220 to aaeda50 Compare February 6, 2025 01:11

nicklan added 3 commits February 5, 2025 17:12

free even if failure

d99872f

free things, for miri

937bbaf

fmt

307a8d7

nicklan marked this pull request as ready for review February 6, 2025 01:32

nicklan requested review from scovich, zachschuermann and OussamaSaoudi February 6, 2025 01:33

nicklan commented Feb 6, 2025

scovich approved these changes Feb 6, 2025

nicklan added 3 commits February 13, 2025 12:53

Merge branch 'main' into part-4-read-table-uses-transform

6f79a37

Address comments

b5d8e10

Merge branch 'main' into part-4-read-table-uses-transform

88f7d33

zachschuermann approved these changes Feb 20, 2025

nicklan added 3 commits February 20, 2025 15:50

Merge branch 'main' into part-4-read-table-uses-transform

cd027fc

Merge branch 'main' into part-4-read-table-uses-transform

a4b7a90

add comment

bbb9465

nicklan merged commit baa3fc3 into delta-io:main Feb 21, 2025
19 of 21 checks passed
