Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support the v2Checkpoint reader/writer feature #685

Open
wants to merge 37 commits into
base: main
Choose a base branch
from

Conversation

sebastiantia
Copy link
Collaborator

@sebastiantia sebastiantia commented Feb 7, 2025

What changes are proposed in this pull request?

Summary

This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for v2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment replay functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 1.

This PR enables support for the v2Checkpoints reader/writer table feature for delta kernel rust by

  1. Allowing snapshots to now leverage UUID-named checkpoints as part of their log segment.
  2. Adding the v2Checkpoints feature to the list of supported reader features.

resolves #688

Changes

We already have the capability to recognize UUID-named checkpoint files with the variant LogPathFileType::UuidCheckpoint(uuid). This PR does the folllowing:

  • Adds LogPathFileType::UuidCheckpoint(_) to the list of valid checkpoint file types that are collected during log listing
    • This addition allows V2 checkpoints to be included in log segments.
  • Adds ReaderFeatures::V2Checkpoint to the list of supported reader features
    • This addition allows protocol & metadata validation to pass for tables with the v2Checkpoints reader feature
  • Adds the UnsupportedFeature reader/writer feature for testing purposes.

How was this change tested?

Test coverage for the changes required to support building log segments with V2 checkpoints:

  • test_uuid_checkpoint_patterns (already exists, small update)
    • Verifies the behavior of parsing log file paths that follow the UUID-naming scheme
  • test_v2_checkpoint_supported
    • Tests the ensure_read_supported() func appropriately validates protocol with ReaderFeatures::V2Checkpoint
  • build_snapshot_with_uuid_checkpoint_json
  • build_snapshot_with_uuid_checkpoint_parquet (already exists)
  • build_snapshot_with_correct_last_uuid_checkpoint

Golden table tests:

  • v2-checkpoint-json
  • v2-checkpoint-parquet

Potential todos:

  • is it worth introducing a preference for V2 checkpoints vs V1 checkpoints if both are present in the log for a version
  • what about a preference for checkpoints referenced by _last_checkpoint?

Copy link

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 89.27445% with 68 lines in your changes missing coverage. Please review.

Project coverage is 84.40%. Comparing base (72b585d) to head (3dcd085).

Files with missing lines Patch % Lines
kernel/src/log_segment/tests.rs 88.50% 5 Missing and 51 partials ⚠️
kernel/src/log_segment.rs 88.23% 0 Missing and 10 partials ⚠️
kernel/src/path.rs 66.66% 0 Missing and 1 partial ⚠️
kernel/src/scan/mod.rs 96.77% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #685      +/-   ##
==========================================
+ Coverage   84.19%   84.40%   +0.20%     
==========================================
  Files          77       77              
  Lines       17960    18557     +597     
  Branches    17960    18557     +597     
==========================================
+ Hits        15122    15663     +541     
+ Misses       2121     2117       -4     
- Partials      717      777      +60     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 7, 2025
@sebastiantia sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from 25a6131 to e405e69 Compare February 8, 2025 04:13
@sebastiantia sebastiantia changed the title Enable snapshot building with v2 checkpoints feat: support the v2Checkpoint reader/writer feature Feb 8, 2025
@sebastiantia sebastiantia removed the breaking-change Change that will require a version bump label Feb 8, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 8, 2025
@sebastiantia sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch 4 times, most recently from 19cfe17 to fc15f7d Compare February 10, 2025 19:37
@sebastiantia
Copy link
Collaborator Author

sebastiantia commented Feb 10, 2025

This PR allows kernel to read tables with the v2Checkpoint reader/writer feature as long as the v2Checkpoint feature is included in the table's protocol's reader features list. There is no check for the feature in the writer features list. Can I get a sanity check on this @scovich?

@sebastiantia sebastiantia marked this pull request as ready for review February 10, 2025 19:48
@sebastiantia sebastiantia requested review from zachschuermann, scovich, nicklan and OussamaSaoudi and removed request for scovich February 10, 2025 20:00
@sebastiantia sebastiantia removed the breaking-change Change that will require a version bump label Feb 10, 2025
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the previous PR did all the heavy lifting, and this PR just does the final wiring up to accept uuid-named checkpoint files -- correct?

@sebastiantia
Copy link
Collaborator Author

Looks like the previous PR did all the heavy lifting, and this PR just does the final wiring up to accept uuid-named checkpoint files -- correct?

Yup 👍 this was an attempt to reduce the # of lines of the previous PR

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! - just some minor nits to maybe look at.

/// sidecar files contain the actual add/remove actions that would otherwise be
/// stored directly in the checkpoint. The sidecar file batches are chained to the
/// checkpoint batch in the top level iterator to be returned.
fn create_checkpoint_stream(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i think the *_stream naming is an artifact from the early days when this was infract a stream since then we moved to iterators. Maybe it makes sense to update function names accordingly?

})
.flatten_ok()
// Map converts Result<Result<Box<dyn EngineData>, _>,_> to Result<Box<dyn EngineData>, _>
.map(|result| result?))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this also be .flatten()?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't think so. Result's IntoIter impl drops the error case:

An iterator over the value in a Ok variant of a Result.
The iterator yields one value if the result is Ok, otherwise none.

So treating result as iter would lose the error information.

Copy link
Collaborator

@scovich scovich Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we've been almost-burned by this sharp edge before... I honestly question whether Result should even have an impl IntoIterator. If somebody wants to iterate over the Ok values while ignoring Err, they can use ok() and rely on impl IntoIterator for Option which has unsurprising semantics.

@sebastiantia sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from fc15f7d to 32a2b44 Compare February 11, 2025 21:40
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 11, 2025
Comment on lines 51 to 52
/// A dummy variant used to represent an unsupported feature for testing purposes
UnsupportedFeature,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is only used for testing, this should be #[cfg(test)] I think.

On the other hand, I wonder if this is a useful concept to have for when we parse Protocol and encounter an unknown feature. Consider this:
readerFeatures: [V2Checkpoint, newerDeltaFeature1, newerDeltaFeature2]
What we would currently do is fail on parsing newerDeltaFeature1, and we don't tell the user that we also can't read newerDeltaFeature2.

Now consider an alternative where all unrecognized reader features are UnrecognizedReaderFeature(String). We can first parse all reader features, then communicate to the user all the unrecognized features we found
ERROR: Found unsupported reader features: newerDeltaFeature1, newerDeltaFeature2

This idea is similar to parsers and compilers. They want to tell you all the things that went wrong, instead of failing on the first compilation error.

This is a bit beyond the scope of this PR, but I'd like your thoughts @zachschuermann, @nicklan

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch with the #[cfg(test)]!

I'm also a fan of the idea of collecting and reporting all unrecognized features instead of failing on the first one +1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, collecting the unrecognized features makes sense

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, here's my proposal:

  1. Make this UrecognizedReaderFeature(String), update your test as well
  2. Make it cfgtest
  3. Make a followup PR to apply it in error reporting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked here: #703

@sebastiantia sebastiantia added merge hold Don't allow the PR to merge and removed breaking-change Change that will require a version bump labels Feb 12, 2025
@scovich
Copy link
Collaborator

scovich commented Feb 19, 2025

aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?

@scovich
Copy link
Collaborator

scovich commented Feb 19, 2025

aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?

Ah, it's because the bottom PR in the stack is a breaking change.

@sebastiantia
Copy link
Collaborator Author

sebastiantia commented Feb 19, 2025

aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?

Ah, it's because the bottom PR in the stack is a breaking change.

I've been having issues with a couple of my PRs being labeled as breaking whilst also passing the semver checks. Including the bottom PR.

@sebastiantia sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from 433e8f5 to 3dcd085 Compare February 19, 2025 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump merge hold Don't allow the PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support the v2Checkpoint reader/writer feature
4 participants