feat: support the `v2Checkpoint` reader/writer feature #685

sebastiantia · 2025-02-07T20:33:14Z

What changes are proposed in this pull request?

Summary

This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for v2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment replay functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 1.

This PR enables support for the v2Checkpoint reader/writer table feature for delta kernel rust by:

Allowing snapshots to use UUID-named checkpoints as part of their log segment.
Adding the v2Checkpoint feature to the list of supported reader features.

This PR is stacked on Item 2 here. Golden table tests are included in this PR.
More integration tests will be introduced in a follow-up PR tracked here: Port over V2 checkpoints delta-spark tests and tables #671
This PR stacks changes on top of feat: extract & insert sidecar batches in replay's action iterator #679. For the correct file diff view, please only review these commits

resolves #688

Changes

We already have the capability to recognize UUID-named checkpoint files with the variant LogPathFileType::UuidCheckpoint(uuid). This PR does the following:

Adds LogPathFileType::UuidCheckpoint(_) to the list of valid checkpoint file types that are collected during log listing
- This addition allows V2 checkpoints to be included in log segments.
Adds ReaderFeatures::V2Checkpoint to the list of supported reader features
- This addition allows protocol & metadata validation to pass for tables with the v2Checkpoint reader feature
Adds the UnsupportedFeature reader/writer feature for testing purposes.
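
As a rough illustration of the first change above, the sketch below classifies log file names and keeps only those that count as checkpoints, now including UUID-named ones. It is not the kernel's code: `FileType`, `classify`, `looks_like_uuid`, and `checkpoint_files` are hypothetical names, the enum is a simplified stand-in for `LogPathFileType`, and multi-part checkpoints are left out for brevity.

```rust
/// Simplified, hypothetical stand-in for the kernel's `LogPathFileType`.
#[derive(Debug, PartialEq)]
enum FileType {
    Commit,
    SinglePartCheckpoint,
    UuidCheckpoint(String),
    Unknown,
}

/// Assumption: a UUID is 36 ASCII characters, hex digits with hyphens at
/// positions 8, 13, 18 and 23.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// Classify a file name of the form `<version>.<rest>`.
fn classify(file_name: &str) -> FileType {
    let parts: Vec<&str> = file_name.split('.').collect();
    // A name whose first component is not a version number is not a log file.
    if parts.first().and_then(|v| v.parse::<u64>().ok()).is_none() {
        return FileType::Unknown;
    }
    match parts.as_slice() {
        [_, "json"] => FileType::Commit,
        [_, "checkpoint", "parquet"] => FileType::SinglePartCheckpoint,
        [_, "checkpoint", uuid, "json" | "parquet"] if looks_like_uuid(uuid) => {
            FileType::UuidCheckpoint(uuid.to_string())
        }
        _ => FileType::Unknown,
    }
}

/// Keep only the names that count as checkpoints during log listing.
/// Including `UuidCheckpoint(_)` here mirrors the change this PR describes.
fn checkpoint_files<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| {
            matches!(
                classify(name),
                FileType::SinglePartCheckpoint | FileType::UuidCheckpoint(_)
            )
        })
        .collect()
}
```

With this shape, a listing that contains both a classic checkpoint and a UUID-named JSON or Parquet checkpoint returns both, while commit files and `_last_checkpoint` are filtered out.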

How was this change tested?

Test coverage for the changes required to support building log segments with V2 checkpoints:

test_uuid_checkpoint_patterns (already exists, small update)
- Verifies the behavior of parsing log file paths that follow the UUID-naming scheme
test_v2_checkpoint_supported
- Tests that ensure_read_supported() correctly validates a protocol with ReaderFeatures::V2Checkpoint
build_snapshot_with_uuid_checkpoint_json
build_snapshot_with_uuid_checkpoint_parquet (already exists)
build_snapshot_with_correct_last_uuid_checkpoint

Golden table tests:

v2-checkpoint-json
v2-checkpoint-parquet

Potential todos:

Is it worth introducing a preference for V2 checkpoints over V1 checkpoints if both are present in the log for a version?
What about a preference for checkpoints referenced by _last_checkpoint?

codecov · 2025-02-07T20:36:54Z

Codecov Report

Attention: Patch coverage is 89.27445% with 68 lines in your changes missing coverage. Please review.

Project coverage is 84.40%. Comparing base (72b585d) to head (3dcd085).

Files with missing lines	Patch %	Lines
kernel/src/log_segment/tests.rs	88.50%	5 Missing and 51 partials ⚠️
kernel/src/log_segment.rs	88.23%	0 Missing and 10 partials ⚠️
kernel/src/path.rs	66.66%	0 Missing and 1 partial ⚠️
kernel/src/scan/mod.rs	96.77%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #685      +/-   ##
==========================================
+ Coverage   84.19%   84.40%   +0.20%     
==========================================
  Files          77       77              
  Lines       17960    18557     +597     
  Branches    17960    18557     +597     
==========================================
+ Hits        15122    15663     +541     
+ Misses       2121     2117       -4     
- Partials      717      777      +60

sebastiantia · 2025-02-10T19:41:17Z

This PR allows kernel to read tables with the v2Checkpoint reader/writer feature as long as the v2Checkpoint feature is included in the table's protocol's reader features list. There is no check for the feature in the writer features list. Can I get a sanity check on this @scovich?

scovich

Looks like the previous PR did all the heavy lifting, and this PR just does the final wiring up to accept uuid-named checkpoint files -- correct?

sebastiantia · 2025-02-10T21:02:25Z

> Looks like the previous PR did all the heavy lifting, and this PR just does the final wiring up to accept uuid-named checkpoint files -- correct?

Yup 👍 this was an attempt to reduce the # of lines of the previous PR

roeap

LGTM! - just some minor nits to maybe look at.

roeap · 2025-02-11T17:54:44Z

kernel/src/log_segment.rs

+    /// sidecar files contain the actual add/remove actions that would otherwise be
+    /// stored directly in the checkpoint. The sidecar file batches are chained to the
+    /// checkpoint batch in the top level iterator to be returned.
+    fn create_checkpoint_stream(


nit: I think the *_stream naming is an artifact from the early days when this was in fact a stream; since then we moved to iterators. Maybe it makes sense to update function names accordingly?

roeap · 2025-02-11T17:59:11Z

kernel/src/log_segment.rs

+            })
+            .flatten_ok()
+            // Map converts Result<Result<Box<dyn EngineData>, _>,_> to Result<Box<dyn EngineData>, _>
+            .map(|result| result?))


could this also be .flatten()?

Unfortunately I don't think so. Result's IntoIter impl drops the error case:

An iterator over the value in a Ok variant of a Result.
The iterator yields one value if the result is Ok, otherwise none.

So treating result as iter would lose the error information.

Yeah, we've been almost-burned by this sharp edge before... I honestly question whether Result should even have an impl IntoIterator. If somebody wants to iterate over the Ok values while ignoring Err, they can use ok() and rely on impl IntoIterator for Option which has unsurprising semantics.
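
A small standalone sketch of the sharp edge discussed above, using only the standard library. The function names and the `i32`/`String` types are made up for illustration; the kernel code works with `Box<dyn EngineData>` and its own error type.

```rust
/// Collapse a nested result, keeping whichever error occurred.
/// This is what the `.map(|result| result?)` step does for each item.
fn flatten_nested(r: Result<Result<i32, String>, String>) -> Result<i32, String> {
    r?
}

/// `Result` implements `IntoIterator`, yielding the `Ok` value or nothing,
/// so `Iterator::flatten` silently discards every `Err`.
fn with_iter_flatten(items: Vec<Result<i32, String>>) -> Vec<i32> {
    items.into_iter().flatten().collect()
}

/// Mapping each item explicitly keeps both inner and outer errors.
fn with_explicit_map(
    items: Vec<Result<Result<i32, String>, String>>,
) -> Vec<Result<i32, String>> {
    items.into_iter().map(flatten_nested).collect()
}
```

Feeding `[Ok(1), Err("boom"), Ok(3)]` to `with_iter_flatten` yields only the two `Ok` values with no trace of the failure, whereas `with_explicit_map` returns one entry per input with inner and outer errors still present.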

OussamaSaoudi · 2025-02-12T08:13:53Z

kernel/src/table_features/mod.rs

+    /// A dummy variant used to represent an unsupported feature for testing purposes
+    UnsupportedFeature,


If this is only used for testing, this should be #[cfg(test)] I think.

On the other hand, I wonder if this is a useful concept to have for when we parse Protocol and encounter an unknown feature. Consider this:
readerFeatures: [V2Checkpoint, newerDeltaFeature1, newerDeltaFeature2]
What we would currently do is fail on parsing newerDeltaFeature1, and we don't tell the user that we also can't read newerDeltaFeature2.

Now consider an alternative where all unrecognized reader features are UnrecognizedReaderFeature(String). We can first parse all reader features, then communicate to the user all the unrecognized features we found
ERROR: Found unsupported reader features: newerDeltaFeature1, newerDeltaFeature2

This idea is similar to parsers and compilers. They want to tell you all the things that went wrong, instead of failing on the first compilation error.
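
A minimal sketch of that alternative, assuming a hypothetical `ReaderFeature` enum with an `Unrecognized(String)` catch-all. The names `parse_feature` and `ensure_all_recognized`, and the two known features listed, are illustrative only and not the kernel's API.

```rust
/// Hypothetical feature enum with a catch-all for unknown names.
#[derive(Debug, PartialEq)]
enum ReaderFeature {
    V2Checkpoint,
    DeletionVectors,
    Unrecognized(String),
}

/// Parsing never fails: unknown names are preserved for later reporting.
fn parse_feature(name: &str) -> ReaderFeature {
    match name {
        "v2Checkpoint" => ReaderFeature::V2Checkpoint,
        "deletionVectors" => ReaderFeature::DeletionVectors,
        other => ReaderFeature::Unrecognized(other.to_string()),
    }
}

/// Parse every feature first, then report all unrecognized ones at once
/// instead of failing on the first.
fn ensure_all_recognized(names: &[&str]) -> Result<Vec<ReaderFeature>, String> {
    let parsed: Vec<ReaderFeature> = names.iter().map(|n| parse_feature(n)).collect();
    let unknown: Vec<String> = parsed
        .iter()
        .filter_map(|feature| match feature {
            ReaderFeature::Unrecognized(name) => Some(name.clone()),
            _ => None,
        })
        .collect();
    if unknown.is_empty() {
        Ok(parsed)
    } else {
        Err(format!(
            "Found unsupported reader features: {}",
            unknown.join(", ")
        ))
    }
}
```

Calling `ensure_all_recognized(&["v2Checkpoint", "newerDeltaFeature1", "newerDeltaFeature2"])` returns a single error that names both unknown features.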

This is a bit beyond the scope of this PR, but I'd like your thoughts @zachschuermann, @nicklan

Good catch with the #[cfg(test)]!

I'm also a fan of the idea of collecting and reporting all unrecognized features instead of failing on the first one +1

Yeah, collecting the unrecognized features makes sense

Cool, here's my proposal:

Make this UnrecognizedReaderFeature(String), update your test as well

Make it #[cfg(test)]

Make a followup PR to apply it in error reporting.

Tracked here: #703

scovich · 2025-02-19T19:49:26Z

aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?

scovich · 2025-02-19T19:53:05Z

> aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?

Ah, it's because the bottom PR in the stack is a breaking change.

sebastiantia · 2025-02-19T20:33:57Z

> aside: We keep removing the breaking-change label, and semver check keeps re-adding it. Are we sure it's not a breaking change?
>
> Ah, it's because the bottom PR in the stack is a breaking change.

I've been having issues with a couple of my PRs being labeled as breaking whilst also passing the semver checks. Including the bottom PR.

This reverts commit 3b0b2a0.

github-actions bot assigned sebastiantia Feb 7, 2025

github-actions bot added the breaking-change Change that will require a version bump label Feb 7, 2025

sebastiantia mentioned this pull request Feb 7, 2025

feat: extract & insert sidecar batches in replay's action iterator #679

Open

sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from 25a6131 to e405e69 Compare February 8, 2025 04:13

sebastiantia changed the title ~~Enable snapshot building with v2 checkpoints~~ feat: support the v2Checkpoint reader/writer feature Feb 8, 2025

sebastiantia removed the breaking-change Change that will require a version bump label Feb 8, 2025

github-actions bot added the breaking-change Change that will require a version bump label Feb 8, 2025

sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch 4 times, most recently from 19cfe17 to fc15f7d Compare February 10, 2025 19:37

sebastiantia marked this pull request as ready for review February 10, 2025 19:48

sebastiantia requested review from zachschuermann, scovich, nicklan and OussamaSaoudi and removed request for scovich February 10, 2025 20:00

sebastiantia removed the breaking-change Change that will require a version bump label Feb 10, 2025

scovich approved these changes Feb 10, 2025

roeap approved these changes Feb 11, 2025

sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from fc15f7d to 32a2b44 Compare February 11, 2025 21:40

github-actions bot added the breaking-change Change that will require a version bump label Feb 11, 2025

sebastiantia mentioned this pull request Feb 12, 2025

tests: add V2 checkpoint read support integration tests #690

Open

OussamaSaoudi reviewed Feb 12, 2025

sebastiantia added merge hold Don't allow the PR to merge and removed breaking-change Change that will require a version bump labels Feb 12, 2025

sebastiantia added 20 commits February 19, 2025 11:17

remove redundant .into_iter

4d4e601

handle errors from windows os

415c2f4

remove unnecessary empty path check

2547c42

typo

ea7349a

nits

39a1451

infer type

ab9ef11

review & nits

302efed

remove test iterator

06e5c92

review

0e57bae

clippy

d49f835

link issue

510dc35

nits

c49402c

nits

2fd8216

test review

f8defbe

nits

638387c

remove debug statements

4984655

review

a77333c

comments & review

b25bb58

typo

93971c5

typo

501c675

sebastiantia added 5 commits February 19, 2025 12:40

snapshot creation with v2checkpoints mvp

0b9dc43

This reverts commit 3b0b2a0.

file name change??

d92e141

fix merge conflict errors

e5748e1

review & nits

a035f8c

merge error

3dcd085

sebastiantia force-pushed the enable_snapshot_building_with_v2_checkpoints branch from 433e8f5 to 3dcd085 Compare February 19, 2025 20:40

sebastiantia requested a review from OussamaSaoudi February 21, 2025 17:24
