-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Handle predicates on non-nullable columns without stats #700
Conversation
I haven't added a separate test for this, but I verified this is also a problem with DuckDB-Delta. The following code also crashes without this fix: import duckdb
from deltalake import write_deltalake
import pyarrow as pa
import numpy as np
delta_dir = "./delta_table"
num_cols = 33
schema = pa.schema([
pa.field(f"x{i}", pa.int32(), nullable=False) for i in range(num_cols)
])
rng = np.random.default_rng(0)
for p in range(10):
num_rows = 1_000
table = pa.Table.from_pydict({
f'x{i}': pa.array(rng.integers(p * 100, (p + 1) * 100, num_rows), type=pa.int32()) for i in range(num_cols)
})
write_deltalake(
table_or_uri=delta_dir,
data=table,
mode='append',
schema=schema)
results = duckdb.query(f"SELECT * FROM delta_scan('{delta_dir}') WHERE x32 = 101")
print(results) |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #700 +/- ##
==========================================
+ Coverage 84.22% 84.26% +0.04%
==========================================
Files 77 77
Lines 17926 17950 +24
Branches 17926 17950 +24
==========================================
+ Hits 15098 15126 +28
+ Misses 2110 2109 -1
+ Partials 718 715 -3 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this, the approach looks good. Couple questions tho.
873d0b5
to
ba67db8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thanks for fixing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for catching + fixing this!!
@@ -906,6 +908,129 @@ fn with_predicate_and_removes() -> Result<(), Box<dyn std::error::Error>> { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn predicate_on_non_nullable_partition_column() -> Result<(), Box<dyn std::error::Error>> { | |||
// Test for https://github.com/delta-io/delta-kernel-rs/issues/698 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks for adding test!
Fixes #698
What changes are proposed in this pull request?
Updates the
DataSkippingFilter
to treat all columns as nullable for the purpose of parsing stats, as suggested in #698 (comment).This is particularly important for partition columns, which won't have values present in stats. But stats are also only usually stored for the first 32 columns, so we shouldn't rely on stats being present for non-partition fields either.
How was this change tested?
I've added a new unit test.
I've also tested building duckdb-delta with this change (cherry-picked onto 0.6.1) and verified that the code in #698 now works.