
feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties #644

Merged (33 commits) on Jan 28, 2025

Conversation

OussamaSaoudi (Collaborator) commented Jan 14, 2025

What changes are proposed in this pull request?

This PR introduces the TableConfiguration struct, which is used to perform feature support and feature enablement checks using the table's protocol, metadata, table properties, and schema.

Problem statement

To check that a feature is enabled, you often must check both that a certain reader/writer feature is supported and that a table property is set to true. For example, a writer must check the delta.enableDeletionVectors table property and also check that the deletionVectors reader/writer features are present in the table's Protocol. Probing two disparate structs to perform a single check is error-prone and may lead these metadata/protocol checks to become out of sync. Moreover, checks are performed in the CDF path, the snapshot scan path, and the read path, so there are many ways in which protocol and metadata checks can diverge from one another.

Put simply, the problems are:

  1. When checking feature support over multiple structs (like P&M), it is easy to forget one and violate correctness.
  2. Duplicate checks for the same feature may diverge among different code paths.

Solution

TableConfiguration consolidates all protocol and metadata checks in one place. It also keeps the logic for checking feature enablement consistent throughout the codebase. This addresses the problems outlined above.
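
As an editorial illustration, here is a minimal sketch of what such a consolidated check can look like. The types are simplified stand-ins, not the kernel's actual definitions:

    // Simplified stand-ins for the kernel's types; for illustration only.
    struct Protocol {
        reader_features: Vec<String>,
        writer_features: Vec<String>,
    }

    struct TableProperties {
        // Parsed from the delta.enableDeletionVectors table property.
        enable_deletion_vectors: Option<bool>,
    }

    struct TableConfiguration {
        protocol: Protocol,
        table_properties: TableProperties,
    }

    impl TableConfiguration {
        /// Feature *support*: the protocol lists deletionVectors in both
        /// feature lists.
        fn is_deletion_vector_supported(&self) -> bool {
            self.protocol.reader_features.iter().any(|f| f == "deletionVectors")
                && self.protocol.writer_features.iter().any(|f| f == "deletionVectors")
        }

        /// Feature *enablement*: supported by the protocol AND switched on via
        /// the table property. Callers no longer probe two structs themselves.
        fn is_deletion_vector_enabled(&self) -> bool {
            self.is_deletion_vector_supported()
                && self.table_properties.enable_deletion_vectors.unwrap_or(false)
        }
    }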

Closes: #571

How was this change tested?

We add a couple of tests to ensure that:

  1. Creating a TableConfiguration fails on tables for which reading is not supported.
  2. Deletion vector support and enablement checks work as expected.

/// See the documentation of [`TableChanges`] for more details.
///
/// [`TableChanges`]: crate::table_changes::TableChanges
pub fn can_read_cdf(&self) -> DeltaResult<()> {
OussamaSaoudi (author):

@zachschuermann @nicklan wdyt of this sort of API? The general pattern would be can_(read | write)_{table_feature}

OussamaSaoudi (author):

Note that we'd only create such functions if/when we need them, and if they make sense. For example, if a feature is only relevant to readers, there would be no can_write_{table_feature}

Collaborator:

Is there a reason it doesn't return DeltaResult<bool>? Seems like there are cases (most of them?) where we don't want to necessarily throw, we just want to learn (but if something goes wrong while trying to learn, an error response is still warranted).

Can always add a corresponding assert_can_XXX method if the error version is actually needed?

OussamaSaoudi (author):

I was using the pattern of Ok(()) => true, Err(_) => false, with an explanation why. We use a similar pattern here. Do you think we should rethink this approach?

Collaborator:

I think it depends entirely on whether "most" callers of a given method would treat false as an error. Protocol checks would of course treat it like an error, but other places in the code merely want to make control flow decisions based on the answer. I guess Result for control flow isn't as bad as exceptions for control flow, but in general it seems preferable to use Result for cases where something actually went wrong?

OussamaSaoudi (author):

Yeah I expect most of the callers to use this as a bool or have a custom error message. I'll switch to bool 👍
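
A sketch of the pattern being agreed on here: a plain bool query plus the optional asserting wrapper floated above. Names and the error type are illustrative, not the final API:

    struct TableConfiguration;

    impl TableConfiguration {
        // The common case: callers just want the answer for control flow.
        fn is_deletion_vector_enabled(&self) -> bool {
            true // stand-in for the real protocol + table-property check
        }

        // For call sites that do treat `false` as an error, per the
        // assert_can_XXX idea suggested above.
        fn assert_deletion_vector_enabled(&self) -> Result<(), String> {
            if self.is_deletion_vector_enabled() {
                Ok(())
            } else {
                Err("deletion vectors are not enabled on this table".into())
            }
        }
    }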

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 92.64706% with 15 lines in your changes missing coverage. Please review.

Project coverage is 84.20%. Comparing base (6751838) to head (3afaf66).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_configuration.rs 93.10% 9 Missing and 3 partials ⚠️
kernel/src/snapshot.rs 81.25% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #644      +/-   ##
==========================================
+ Coverage   84.08%   84.20%   +0.12%     
==========================================
  Files          76       77       +1     
  Lines       17526    17694     +168     
  Branches    17526    17694     +168     
==========================================
+ Hits        14736    14899     +163     
- Misses       2077     2080       +3     
- Partials      713      715       +2     


@@ -412,7 +412,7 @@ impl Scan {
partition_columns: self.snapshot.metadata().partition_columns.clone(),
Collaborator:

I would think that everything related to protocol or metadata should reside in table configuration now, which would include table schema, partition columns, etc. Idea being, the P&M are "raw" information (with significant cross-validation requirements), which the table configuration aims to encapsulate cleanly. Almost nobody should be accessing the P&M directly after that.

Collaborator:

One useful thought experiment (inspired by Delta Spark's SnapshotDescriptor trait) would be: how many parts of the code that currently take a Snapshot could/should take a TableConfiguration instead? (The distinction being that TableConfiguration accesses require neither I/O nor significant compute, while true snapshot operations like scans do.) Methods that work with a Snapshot but do not require an Engine are likely candidates.
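
A sketch of what the thought experiment implies at the signature level (illustrative names and stand-in types):

    struct TableConfiguration { partition_columns: Vec<String> }
    struct Snapshot { table_configuration: TableConfiguration }

    // Metadata-only helpers can take the TableConfiguration, signalling that
    // they need no I/O, no Engine, and no significant compute.
    fn partition_columns(config: &TableConfiguration) -> &[String] {
        &config.partition_columns
    }

    // True snapshot operations (e.g. building a scan) keep taking the Snapshot.
    fn describe_scan(snapshot: &Snapshot) -> String {
        format!("scan over {:?}", partition_columns(&snapshot.table_configuration))
    }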

OussamaSaoudi (Jan 17, 2025):

Hmm, meaning that we actually want to expose only table_configuration() from Snapshot instead of table_configuration() + protocol() + metadata()?

OussamaSaoudi (author):

^ expose in pub(crate) ofc

Collaborator:

yes, exactly. Pass a TC if you need to know things about the snapshot; pass a snapshot only if you actually need to scan it

OussamaSaoudi (Jan 17, 2025):

Going by the thought experiment, it also seems like this should hold a version for the table configuration 🤔. A schema for a table only makes sense given a fixed version.

Edit: same for the table_root. Now it's starting to feel like I'm blowing up the responsibilities of TableConfiguration. I'll sit on these thoughts for a bit more.

OussamaSaoudi (author):

It'll probably be important to find the dividing line between this and ResolvedTable once catalogs come into play 😵

Collaborator:

Yes, and yes

Collaborator:

tracking issue?

OussamaSaoudi (Jan 27, 2025):

I've got all of the tableconfig tasks here: #650

Comment on lines 102 to 111
 pub fn schema(&self) -> &Schema {
-    &self.schema
+    self.table_configuration.schema()
 }

 /// Table [`Metadata`] at this `Snapshot`s version.
 pub fn metadata(&self) -> &Metadata {
-    &self.metadata
+    self.table_configuration.metadata()
 }

/// Table [`Protocol`] at this `Snapshot`s version.
pub fn protocol(&self) -> &Protocol {
Collaborator:

How invasive would it be to remove these three methods from Snapshot?
(not in this PR)

Conceptually, P&M are Delta details that shouldn't be pub in the first place (and they're "raw" information that is easily misused). Schema is a border case, since we use it a lot and it's not a Delta-specific concept.

OussamaSaoudi (author):

Just tried it, and dev-visibility works fine. pub(crate) makes an integration test fail to compile since it reads P & M.

You've convinced me that P&M shouldn't be exposed to "regular users" of kernel. I think it fits in well with the spirit of developer-visibility to have them though. If dev-vis exposes all the actions, folks with that feature flag probably want to see these implementation details.

Given all that, I'd advocate for dev-vis for these methods :)

OussamaSaoudi (author):

Since you mentioned not to do it in this PR, I made an issue to keep track of it here: #648

Collaborator:

Agree that dev-visibility is reasonable. If the other PR would be tiny, we could consider folding it into this PR -- but I suspect it will be big enough (and not directly on topic here) to be worth splitting out.

OussamaSaoudi (Jan 17, 2025):

It is indeed a small change actually. I think I'd instead fold the issue above with consolidating all the protocol validation in TableConfiguration. This PR for introducing TC, that second PR for cementing it as the one-stop shop for validated P&M.

EDIT: To clarify, making methods dev-visible is small. Removing them outright is more work.

}

/// Table [`Protocol`] at this `Snapshot`s version.
pub fn protocol(&self) -> &Protocol {
-    &self.protocol
+    self.table_configuration.protocol()
}

/// Get the [`TableProperties`] for this [`Snapshot`].
pub fn table_properties(&self) -> &TableProperties {
Collaborator:

Should (the whole raw set of) table properties be pub? Many table properties are mandated by the Delta spec as part of specific features, and shouldn't be manipulated directly. Many others are mandated by delta-spark convention, but we don't actually support them all here. Should we hide the raw properties and instead surface the ones users should actually see/set via individual methods on the table configuration?

OussamaSaoudi (author):

This is another part where I imagine these details fit within the spirit of developer-visibility, imo. A use case I can think of is as follows:

Kernel doesn't support retention duration/vacuum, and IIRC we don't plan to support it. Suppose delta-rs wanted to see these properties; they'd need access to the table properties. Dev-visibility is a good approach to letting them access these details.
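
For reference, the developer-visibility pattern this PR uses elsewhere: the visibility crate promotes a pub(crate) item to pub when the feature flag is enabled. A fragment from inside an impl block (the method shown is just an example):

    // pub(crate) by default; promoted to pub when a consumer such as delta-rs
    // opts into the developer-visibility feature flag.
    #[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
    pub(crate) fn table_properties(&self) -> &TableProperties {
        &self.table_properties
    }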

Comment on lines 71 to 83
pub fn column_mapping_mode(&self) -> &ColumnMappingMode {
    &self.column_mapping_mode
}
pub fn schema(&self) -> &Schema {
    self.schema.as_ref()
}
pub fn protocol(&self) -> &Protocol {
    &self.protocol
}
pub fn metadata(&self) -> &Metadata {
    &self.metadata
}
pub fn table_properties(&self) -> &TableProperties {
Collaborator:

Can we make these pub(crate) (with developer-visibility) or even private? Ideally, few if any callers should need access to these raw objects because the various accessor methods cover their needs.

OussamaSaoudi (author):

To make them private, we'd have to remove the protocol/metadata methods from the snapshot. You'd mentioned that should be in another PR, but I could go ahead and remove them here :)

Collaborator:

If it doesn't bloat this PR, we could consider doing it here? Otherwise, it belongs in the other PR as you say.

@@ -25,11 +24,7 @@ const LAST_CHECKPOINT_FILE_NAME: &str = "_last_checkpoint";
pub struct Snapshot {
pub(crate) table_root: Url,
OussamaSaoudi (author):

aside: given that we have a fn table_root() and a fn _log_segment(), it seems we don't need these to be pub(crate)? Also: why did we go with the _log_segment naming convention? It seems Pythonic, and I haven't seen similar naming elsewhere in our code 🤔

Collaborator:

Agree that's weird. I wonder if somebody was unaware that field names and method names don't collide in Rust?
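
Indeed, fields and methods live in separate namespaces in Rust, so the underscore prefix is unnecessary; a quick demonstration:

    struct Snapshot {
        table_root: String,
    }

    impl Snapshot {
        // Compiles fine: a method may share its name with a field.
        fn table_root(&self) -> &str {
            &self.table_root
        }
    }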

@github-actions github-actions bot added the breaking-change label (Change that will require a version bump) Jan 17, 2025
@OussamaSaoudi OussamaSaoudi changed the title [wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and column mapping [wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properities Jan 17, 2025
@OussamaSaoudi OussamaSaoudi changed the title [wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properities [wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties Jan 17, 2025
Comment on lines +157 to +169
if snapshot.table_configuration().is_cdf_read_supported() {
Ok(())
} else {
Err(Error::change_data_feed_unsupported(snapshot.version()))
}
Collaborator:

aside: Intuitively, there should be a better way to express this in Rust... but I can't find anything :(

OussamaSaoudi (author):

agreed!! There is nice support for options, but I couldn't find anything. I'll give it some more thought 👍

OussamaSaoudi (author):

update: The only options seem to be:

  1. adding another crate
  2. turning the bool into an Option and then into a Result, which is more steps than it's worth :/

Looks like we're stuck with this.
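
For the record, the "option then result" route would look like this with std's bool::then_some and Option::ok_or_else (the error type here is a stand-in for the kernel's):

    #[derive(Debug)]
    struct Error(String); // stand-in
    type DeltaResult<T> = Result<T, Error>;

    fn check_cdf_read(is_supported: bool, version: u64) -> DeltaResult<()> {
        // bool -> Option<()> -> Result<(), Error>
        is_supported
            .then_some(())
            .ok_or_else(|| Error(format!("change data feed unsupported at version {version}")))
    }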

-    Error::unsupported("Change data feed not supported when column mapping is enabled")
-);
-Ok(())
+protocol_supported && cdf_enabled && column_mapping_disabled
Collaborator:

I assume CDF + CM is a TODO?

OussamaSaoudi (author):

Correct.


.protocol()
.has_writer_feature(&WriterFeatures::DeletionVectors)
&& self.protocol.min_writer_version() == 7;
read_supported && write_supported
Collaborator:

Is it normal to store RW features in both the reader and writer lists? I thought each feature appeared in only one list, and writers are required to check both? The spec says:

to write a table, writers must implement and respect all features listed in writerFeatures. Because writers have to read the table (or only the Delta log) before write, they must implement and respect all reader features as well.

At least, I would have interpreted the above to mean, "writer expects to find DV (a reader feature) in the reader list" rather than "writer checks both reader and writer lists and expects to find it in both"

OussamaSaoudi (author):

I just checked, and we do indeed need to check both lists. Taken from our table-with-dv-small:

{"protocol":{"minReaderVersion":3,"minWriterVersion":7,"readerFeatures": ["deletionVectors"],"writerFeatures":["deletionVectors"]}}

DV is in both lists, so we should check both lists.

Collaborator:

While that's true, the question remains, if a table had deletionVectors only in readerFeatures would we still want to consider them "supported"? I think yes...

OussamaSaoudi (author):

Hmm given what the protocol says, wouldn't that be considered an incorrect table?

Collaborator:

I missed this part of the spec before:

Reader features should be listed in both readerFeatures and writerFeatures simultaneously, while writer features should be listed only in writerFeatures. It is not allowed to list a feature only in readerFeatures but not in writerFeatures.

That part of the spec doesn't acknowledge it, but the duplication is needed because a different part of the spec states that readerFeatures is not guaranteed to exist:

when a new table is created with a Reader Version up to 2 and Writer Version 7, its protocol action must only contain writerFeatures. When a new table is created with Reader Version 3 and Writer Version 7, its protocol action must contain both readerFeatures and writerFeatures. Creating a table with a Reader Version 3 and Writer Version less than 7 is not allowed.

However, only legacy table features like column mapping can encounter that state, because most reader table features require reader version 3.

Based on all that, I would recommend we enforce strictly:

  • Most reader features must appear in both lists
  • As an exception, "legacy" reader features that do not require reader version 3 can still be specified by reader version 3, but still allow writer version < 7. In that case, the writerFeatures list does not even exist and so the reader feature cannot appear there.
  • The set of legacy reader features is small and unchanging (I think column mapping is the only one), so we should just hard-wire those exceptions.

NOTE: protocol (2, 4) is invalid. It would enable column mapping for reads but not for writes.
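
A rough sketch of one way to encode those three bullets (stand-in types; real enforcement would also validate the min reader/writer versions themselves):

    // Stand-in protocol shape, for illustration.
    struct Protocol {
        reader_features: Option<Vec<String>>,
        writer_features: Option<Vec<String>>,
    }

    // Hard-wired set of "legacy" reader features that can predate reader
    // version 3 (column mapping being the candidate named above).
    const LEGACY_READER_FEATURES: &[&str] = &["columnMapping"];

    fn reader_feature_placement_ok(p: &Protocol, feature: &str) -> bool {
        let contains = |list: &Option<Vec<String>>| {
            list.as_deref().is_some_and(|fs| fs.iter().any(|f| f == feature))
        };
        // Legacy exception: with writer version < 7 the writerFeatures list
        // does not exist, so the feature cannot be required to appear there.
        if LEGACY_READER_FEATURES.contains(&feature) && p.writer_features.is_none() {
            return true;
        }
        // Otherwise a reader feature must appear in both lists.
        contains(&p.reader_features) && contains(&p.writer_features)
    }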

Collaborator:

I agree, that makes sense :) @OussamaSaoudi could we capture those three bullets in a doc comment?

OussamaSaoudi (Jan 27, 2025):

I was planning a rework of how we handle reader/writer features when I port over protocol logic. Would it make sense to handle these cases and clearly document them in that followup PR?

Edit: Added the doc comment, but I'll implement it when we bring in the reader/writer feature validation to table configuration.


pub(crate) fn column_mapping_mode(&self) -> &ColumnMappingMode {
    &self.column_mapping_mode
}
/// The [`Schema`] of this table configuration's [`Metadata`]
OussamaSaoudi (author):

These docs are similar to the ones in Snapshot. Ex:

    /// Version of this `Snapshot` in the table.
    pub fn version(&self) -> Version {
        self.log_segment.end_version
    }

I'm not convinced these are useful doc comments. Anyone have thoughts?

Collaborator:

One-liners like this are always tricky... in theory the name is self-describing. On the other hand, a single short sentence to confirm that for the reader isn't hard to write and may do some good?

version: Version,
) -> DeltaResult<Self> {
// important! before a read/write to the table we must check it is supported
protocol.ensure_read_supported()?;
OussamaSaoudi (author):

TODO: We should be checking that writes are supported as well.

OussamaSaoudi (author):

We also need to check that the set of write features is a subset of the read features. I think I'll leave that to the future PR that ports all of the protocol checks to TableConfiguration.
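
A neutral sketch of that subset check (which list goes on which side follows from the spec language quoted earlier in the thread):

    use std::collections::HashSet;

    /// Returns true iff every feature in `subset` also appears in `superset`.
    fn features_subset(subset: &[String], superset: &[String]) -> bool {
        let superset: HashSet<&str> = superset.iter().map(String::as_str).collect();
        subset.iter().all(|f| superset.contains(f.as_str()))
    }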

Collaborator:

well - I think this needs to be configurable (we don't need to check write support if we only read the table)? this feels like more of a code smell that we might not be doing it in the right spot.

Collaborator:

Yeah, I suspect Scan should ensure_read_supported() and Transaction should ensure_write_supported()


Collaborator:

I suspect Scan should ensure_read_supported()

Problem is, Snapshot already did things that could be affected by reader features. It exposes the schema, for example. Could some future reader feature change the way we interpret schemas? Also, when table features themselves were introduced, it changed the way we interpret protocol actions.

Collaborator:

I just think there should be (either via types or just docs/etc.) some clearer communication for 'checking {read/write} is supported' - I don't think it's blindingly apparent that a TableConfiguration::try_new would do all of the read/write support checking. don't need to solve here - I think this PR is a step in the right direction but would love to have a follow-up issue so we don't forget :)

@OussamaSaoudi OussamaSaoudi changed the title [wip] feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties Jan 19, 2025
@OussamaSaoudi OussamaSaoudi changed the title feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties feat!: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties Jan 19, 2025
@OussamaSaoudi OussamaSaoudi marked this pull request as ready for review January 19, 2025 00:21
///
/// [`TableChanges`]: crate::table_changes::TableChanges
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) fn is_cdf_read_supported(&self) -> bool {
OussamaSaoudi (author):

I don't do unit tests here because TableChanges already tests this path. I think it's worth creating an issue to port some of that to unit tests here.

Note that the testing in TableChanges is more of an integration test that reads a delta log and fails for various reasons. I don't think it's worth keeping integration tests.

If all this sounds good, I'll go ahead and make the issue :)

Collaborator:

Not sure I quite followed... are you proposing that integration tests are unhelpful in general and should always be simplified to unit tests? Or that the specific existing TC tests don't need to be written as integration tests? Or something else?

OussamaSaoudi (author):

Sorry, I'm saying that the integration tests can be simplified to unit tests. Unit tests make it clear that we're hitting all the edge cases that we want to hit.

lmk if you think my judgement is off though :)

Collaborator:

I read that as: we have some TableChanges (integration) tests that basically need to be migrated to unit tests here, as they aren't specific to TableChanges and should just exercise this in isolation (in UT)

If that's the case - sounds good :)

OussamaSaoudi (author):

Also added this to the tracking issue: #650



use super::TableConfiguration;

#[test]
fn dv_supported_not_enabled() {
OussamaSaoudi (author):

I'm not sure how useful these tests are :/ This is a lot of lines to test pretty simple logic. With more *_is_supported, *_is_enabled, this gets unwieldy.

I could try to factor this out, but covering all cases would involve specifying reader/writer versions, and configurations. At that point, we're just creating P & M structs through function parameters.

Would appreciate some advice on this :)

Collaborator:

It seems like we want to verify that the information from the underlying P&M gets surfaced correctly by TC? But that gets annoying, because as you say we end up writing very specific tests that exactly fit the prod code they aim to exercise (such tests tend to be short-sighted). Combine that with a very large combination space for table features and it gets ugly fast.

OussamaSaoudi (author):

darn, okay. I think I'll leave more thorough testing for when we do more Metadata/Protocol validation. If we test that we correctly validated P&M + Schema + column mapping, the *_is_supported and *_is_enabled methods should be small enough that they're not worth testing.

Collaborator:

I'm not sure we have a way around the issue of "many combinations" that we want to test.. I would probably tackle this by introducing some framework (function/macro/etc.) for passing some large matrix of combinations and expected outputs? can help out more here if needed
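
One shape such a framework could take is a plain table-driven test. make_config below is a hypothetical helper (not in this PR) that builds a TableConfiguration from reader/writer versions and the deletion-vector table property:

    #[test]
    fn dv_support_matrix() {
        // (min_reader, min_writer, delta.enableDeletionVectors,
        //  expect_supported, expect_enabled)
        let cases = [
            (3, 7, Some(true), true, true),
            (3, 7, Some(false), true, false),
            (3, 7, None, true, false),
            (1, 2, None, false, false),
        ];
        for (reader, writer, dv_prop, supported, enabled) in cases {
            // Hypothetical helper: adds the deletionVectors feature to both
            // feature lists whenever the versions are (3, 7).
            let config = make_config(reader, writer, dv_prop);
            assert_eq!(config.is_deletion_vector_supported(), supported);
            assert_eq!(config.is_deletion_vector_enabled(), enabled);
        }
    }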

OussamaSaoudi (Jan 23, 2025):

Hmm I think this is something I'll have to think about more and address in a followup PR.

Edit: Added this to the tracking issue for improving TableConfiguration: #650

scovich (Collaborator) left a review:

LGTM

Comment on lines +171 to +172
check_table_config(&start_snapshot)?;
check_table_config(&end_snapshot)?;
Collaborator:

future looking: If we move to a ResolvedTable concept, do we actually need two snapshots? Or could we use just a start snapshot and an end resolved table?

Collaborator:

agree I think we just need bits from ResolvedTable at the end

OussamaSaoudi (author):

If ResolvedTable can hold the entire TableConfiguration, then we actually only need ResolvedTable for both start and end. CDF doesn't need the start_snapshot's log segment.

zachschuermann (Collaborator) left a review:

flushing comments

}

/// Get the [`TableProperties`] for this [`Snapshot`].
pub fn table_properties(&self) -> &TableProperties {
Collaborator:

did we effectively remove the table_properties API?


OussamaSaoudi (author):

Zach makes a good point tho, it's best to stay consistent. The plan is to eventually move all of these fields into the table config, but I think I'll revert this and match it up with the protocol, metadata, and other fields.


@@ -0,0 +1,262 @@
//! Provides a high level api to check feature support/enablement for a table.
Collaborator:

nit: can we expound on this? i.e. include that it encapsulates Protocol/Metadata/TableProperties and, since they are all intertwined, performs cross-validation in order to expose a nice API for bits like column mapping, etc.

OussamaSaoudi (author):

Addressed this. PTAL!

Comment on lines 39 to 47
/// another, and that the kernel supports reading from this table.
pub(crate) fn try_new(
Collaborator:

I wonder if there's a better way for us to communicate this? embedding the "can I read this table" into TableConfiguration::try_new() feels unintuitive?

Collaborator:

Perhaps it's easier to think about if we s/read/access/ -- if kernel encounters a reader feature it doesn't understand, we can't do anything with that table because interpretation of key fields could have changed, log replay algorithm could have updated, etc. Think e.g. how v2 checkpoints change which files we should be looking for in the _delta_log dir, or how in-commit timestamps can change which version of the table a given timestamp resolves to.

And then we have writer[-only] features, which only matter if we want to change the table in some way.

OussamaSaoudi (author):

I agree that it feels unintuitive. There are three questions here:

  1. Is this table configuration valid?
  2. Can we read from the table? (ensure_read_supported)
  3. Can we write to the table? (ensure_write_supported)

It feels like we ought to split these concepts.

The following are true. I provide my reasoning in [1]

  • Claim 1: For both reads and writes, we should check ensure_read_supported.
  • Claim 2: For reads, it is important that we don't check write support.

So while it would be fine to check ensure_read_supported in try_new (claim 1), we definitely can't check ensure_write_supported (claim 2).

I think we should have separate functions TableConfiguration::is_read_supported and TableConfiguration::is_write_supported. try_new will only be responsible for asserting that the table config is valid. These separately address the three questions above.

I'll go do that. Lmk if I'm missing something @zachschuermann @scovich

[1]:

Claim 1: Both reads and writes check ensure_read_supported

Reads of course check ensure_read_supported. When we want to write, we need to perform a read anyway. The protocol states:

to write a table, writers must implement and respect all features listed in writerFeatures. Because writers have to read the table (or only the Delta log) before write, they must implement and respect all reader features as well.

This is why it's not incorrect to check 1 & 2 in try_new.

Claim 2: Checking ensure_write_supported on the read path is incorrect

Suppose we don't support appendOnly, and we want to read a table. ensure_write_supported would fail on this. But this is a perfectly fine table for us to read. This is why it's incorrect to check ensure_write_supported in try_new.
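
A sketch of the division of responsibility this thread converges on (stand-in types; ensure_read_supported / ensure_write_supported mirror the methods discussed here):

    struct Protocol;
    type DeltaResult<T> = Result<T, String>; // stand-in

    impl Protocol {
        fn ensure_read_supported(&self) -> DeltaResult<()> { Ok(()) }
        fn ensure_write_supported(&self) -> DeltaResult<()> { Ok(()) }
    }

    struct TableConfiguration { protocol: Protocol, version: u64 }

    impl TableConfiguration {
        // try_new is the narrow waist: every access path constructs one, and
        // writers must pass the read check anyway (per the spec quote above).
        fn try_new(protocol: Protocol, version: u64) -> DeltaResult<Self> {
            protocol.ensure_read_supported()?;
            Ok(TableConfiguration { protocol, version })
        }

        // Checked separately, only on write paths (e.g. a transaction
        // builder), so a read-only engine can still read a table whose
        // writer features (like appendOnly) it does not support.
        fn is_write_supported(&self) -> bool {
            self.protocol.ensure_write_supported().is_ok()
        }
    }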

scovich (Jan 23, 2025):

The biggest reason to keep the read-supported check in try_new is that it's a narrow waist. Everyone who wants to access the table in any way needs to pass that check first, and putting it there catches them all.

For writes, a similar narrow waist would be the transaction builder -- tho that doesn't cover things like metadata cleanup and vacuum, which don't perform any commit.

We really don't want to forget those checks... Bad Things can happen.

OussamaSaoudi (author):

Alright, I'll revert the changes for read then.




@OussamaSaoudi OussamaSaoudi changed the title feat!: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties feat: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties Jan 23, 2025
nicklan (Collaborator) left a review:

A couple of small things that could be moved to issues and/or resolved here, but basically LGTM!



zachschuermann (Collaborator) left a review:

LGTM. Left a few comments, mostly to clean everything up and track follow-ups.


@@ -0,0 +1,275 @@
//! This module defines [`TableConfiguration`], a high level api to check feature support/enablement for a table.
//! This encapsulates [`Protocol`], [`Metadata`], and extracts the [`TableProperties`] and [`ColumnMappingMode`].
//! Protocol and Metadata are deeply intertwined when dealing with table features and
Collaborator:

unfinished?

Collaborator:

maybe add that it effectively consumes protocol/metadata but encapsulates all of metadata/protocol/schema/properties/features/etc.



@OussamaSaoudi OussamaSaoudi removed the breaking-change label (Change that will require a version bump) Jan 28, 2025
@OussamaSaoudi OussamaSaoudi merged commit 3305d3a into delta-io:main Jan 28, 2025
21 checks passed