Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle predicates on non-nullable columns without stats #700

Merged
merged 6 commits into from
Feb 19, 2025

Conversation

adamreeve
Copy link
Contributor

Fixes #698

What changes are proposed in this pull request?

Updates the DataSkippingFilter to treat all columns as nullable for the purpose of parsing stats, as suggested in #698 (comment).

This is particularly important for partition columns, which won't have values present in stats. But stats are also only usually stored for the first 32 columns, so we shouldn't rely on stats being present for non-partition fields either.

How was this change tested?

I've added a new unit test.

I've also tested building duckdb-delta with this change (cherry-picked onto 0.6.1) and verified that the code in #698 now works.

@adamreeve
Copy link
Contributor Author

But stats are also only usually stored for the first 32 columns, so we shouldn't rely on stats being present for non-partition fields either.

I haven't added a separate test for this, but I verified this is also a problem with DuckDB-Delta. The following code also crashes without this fix:

import duckdb
from deltalake import write_deltalake
import pyarrow as pa
import numpy as np


delta_dir = "./delta_table"
num_cols = 33

schema = pa.schema([
    pa.field(f"x{i}", pa.int32(), nullable=False) for i in range(num_cols)
])

rng = np.random.default_rng(0)
for p in range(10):
    num_rows = 1_000
    table = pa.Table.from_pydict({
        f'x{i}': pa.array(rng.integers(p * 100, (p + 1) * 100, num_rows), type=pa.int32()) for i in range(num_cols)
    })
    write_deltalake(
            table_or_uri=delta_dir,
            data=table,
            mode='append',
            schema=schema)

results = duckdb.query(f"SELECT * FROM delta_scan('{delta_dir}') WHERE x32 = 101")
print(results)

Copy link

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.26%. Comparing base (16d2557) to head (ba67db8).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/scan/data_skipping.rs 90.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #700      +/-   ##
==========================================
+ Coverage   84.22%   84.26%   +0.04%     
==========================================
  Files          77       77              
  Lines       17926    17950      +24     
  Branches    17926    17950      +24     
==========================================
+ Hits        15098    15126      +28     
+ Misses       2110     2109       -1     
+ Partials      718      715       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this, the approach looks good. Couple questions tho.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks for fixing this!

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for catching + fixing this!!

@@ -906,6 +908,129 @@ fn with_predicate_and_removes() -> Result<(), Box<dyn std::error::Error>> {
Ok(())
}

#[tokio::test]
async fn predicate_on_non_nullable_partition_column() -> Result<(), Box<dyn std::error::Error>> {
// Test for https://github.com/delta-io/delta-kernel-rs/issues/698
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks for adding test!

@zachschuermann zachschuermann merged commit 2e4f555 into delta-io:main Feb 19, 2025
21 checks passed
@adamreeve adamreeve deleted the partition-stats-fix branch February 19, 2025 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Encountered unmasked nulls" error parsing stats for a non-nullable partition column
3 participants