-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support the v2Checkpoint
reader/writer feature
#685
base: main
Are you sure you want to change the base?
feat: support the v2Checkpoint
reader/writer feature
#685
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #685 +/- ##
==========================================
+ Coverage 84.19% 84.40% +0.20%
==========================================
Files 77 77
Lines 17960 18557 +597
Branches 17960 18557 +597
==========================================
+ Hits 15122 15663 +541
+ Misses 2121 2117 -4
- Partials 717 777 +60 ☔ View full report in Codecov by Sentry. |
25a6131
to
e405e69
Compare
v2Checkpoint
reader/writer feature
19cfe17
to
fc15f7d
Compare
This PR allows kernel to read tables with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the previous PR did all the heavy lifting, and this PR just does the final wiring up to accept uuid-named checkpoint files -- correct?
Yup 👍 this was an attempt to reduce the # of lines of the previous PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! - just some minor nits to maybe look at.
/// sidecar files contain the actual add/remove actions that would otherwise be | ||
/// stored directly in the checkpoint. The sidecar file batches are chained to the | ||
/// checkpoint batch in the top level iterator to be returned. | ||
fn create_checkpoint_stream( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i think the *_stream
naming is an artifact from the early days when this was infract a stream since then we moved to iterators. Maybe it makes sense to update function names accordingly?
kernel/src/log_segment.rs
Outdated
}) | ||
.flatten_ok() | ||
// Map converts Result<Result<Box<dyn EngineData>, _>,_> to Result<Box<dyn EngineData>, _> | ||
.map(|result| result?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this also be .flatten()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think so. Result
's IntoIter
impl drops the error case:
An iterator over the value in a Ok variant of a Result.
The iterator yields one value if the result is Ok, otherwise none.
So treating result as iter would lose the error information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we've been almost-burned by this sharp edge before... I honestly question whether Result
should even have an impl IntoIterator
. If somebody wants to iterate over the Ok
values while ignoring Err
, they can use ok()
and rely on impl IntoIterator for Option
which has unsurprising semantics.
fc15f7d
to
32a2b44
Compare
kernel/src/table_features/mod.rs
Outdated
/// A dummy variant used to represent an unsupported feature for testing purposes | ||
UnsupportedFeature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is only used for testing, this should be #[cfg(test)]
I think.
On the other hand, I wonder if this is a useful concept to have for when we parse Protocol and encounter an unknown feature. Consider this:
readerFeatures: [V2Checkpoint, newerDeltaFeature1, newerDeltaFeature2]
What we would currently do is fail on parsing newerDeltaFeature1
, and we don't tell the user that we also can't read newerDeltaFeature2
.
Now consider an alternative where all unrecognized reader features are UnrecognizedReaderFeature(String)
. We can first parse all reader features, then communicate to the user all the unrecognized features we found
ERROR: Found unsupported reader features: newerDeltaFeature1, newerDeltaFeature2
This idea is similar to parsers and compilers. They want to tell you all the things that went wrong, instead of failing on the first compilation error.
This is a bit beyond the scope of this PR, but I'd like your thoughts @zachschuermann, @nicklan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch with the #[cfg(test)]
!
I'm also a fan of the idea of collecting and reporting all unrecognized features instead of failing on the first one +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, collecting the unrecognized features makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, here's my proposal:
- Make this
UrecognizedReaderFeature(String)
, update your test as well - Make it cfgtest
- Make a followup PR to apply it in error reporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked here: #703
aside: We keep removing the |
Ah, it's because the bottom PR in the stack is a breaking change. |
I've been having issues with a couple of my PRs being labeled as breaking whilst also passing the semver checks. Including the bottom PR. |
433e8f5
to
3dcd085
Compare
What changes are proposed in this pull request?
Summary
This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for v2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment replay functionality to retrieve actions from sidecar files if need be.
This PR specifically adds support for Item 1.
This PR enables support for the
v2Checkpoints
reader/writer table feature for delta kernel rust byv2Checkpoints
feature to the list of supported reader features.This PR is stacked on Item 2 here. Golden table tests are included in this PR.
More integration tests will be introduced in a follow-up PR tracked here: Port over V2 checkpoints delta-spark tests and tables #671
This PR stacks changes on top of feat: extract & insert sidecar batches in
replay
's action iterator #679. For the correct file diff view, please only review these commitsresolves #688
Changes
We already have the capability to recognize UUID-named checkpoint files with the variant
LogPathFileType::UuidCheckpoint(uuid)
. This PR does the folllowing:LogPathFileType::UuidCheckpoint(_)
to the list of valid checkpoint file types that are collected during log listingReaderFeatures::V2Checkpoint
to the list of supported reader featuresv2Checkpoints
reader featureUnsupportedFeature
reader/writer feature for testing purposes.How was this change tested?
Test coverage for the changes required to support building log segments with V2 checkpoints:
test_uuid_checkpoint_patterns
(already exists, small update)test_v2_checkpoint_supported
ensure_read_supported()
func appropriately validates protocol withReaderFeatures::V2Checkpoint
build_snapshot_with_uuid_checkpoint_json
build_snapshot_with_uuid_checkpoint_parquet
(already exists)build_snapshot_with_correct_last_uuid_checkpoint
Golden table tests:
v2-checkpoint-json
v2-checkpoint-parquet
Potential todos: