Add fuzz testing for UTF8 LIKE pruning #13253

alamb · 2024-11-04T21:58:43Z

Draft as it builds on #12978

Which issue does this PR close?

Part of #507

Rationale for this change

While working on #12978 with @adriangb and @findepi I am having nightmares about subtle bugs introduced by truncated statistics.

What changes are included in this PR?

Fuzz tests for pruning with truncated statistics / prefix values

Are these changes tested?

It is only tests

cargo test --test fuzz -- pruning

Are there any user-facing changes?

No, tests only

adriangb · 2024-11-04T22:57:23Z

datafusion/core/tests/fuzz_cases/pruning.rs

+/// Tests for `LIKE` with truncated statistics to validate incrementing logic
+///
+/// Create several 2 row batches and ensure that `LIKE` with the min and max value
+/// are correctly pruned even when the "statistics" are truncated.
+#[test]
+fn test_prune_like_truncated_statistics() {


I think it's also worth having tests for `=` and maybe other operators; it was not immediately obvious to me that there wasn't a bug with those as well.
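To make the concern concrete, here is a minimal sketch of why a truncated max needs incrementing, and why `=` is exposed to the same hazard as `LIKE`. It uses only the standard library. All helper names (`increment`, `truncate`, `safe_bounds`, `may_contain_eq`, `may_contain_prefix`) are hypothetical and are not taken from DataFusion or from #12978. It also assumes `LIKE 'p%'` can be modelled as a plain prefix match on a pattern with no other wildcards.

```rust
// Illustrative sketch only: these helper names are hypothetical and are not
// taken from DataFusion or from #12978.

/// Smallest string that sorts after every string starting with `prefix`,
/// or `None` if every character is already `char::MAX`.
fn increment(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        let next = match last {
            char::MAX => None,              // cannot increment, drop it and retry
            '\u{D7FF}' => Some('\u{E000}'), // skip the surrogate gap
            c => char::from_u32(c as u32 + 1),
        };
        if let Some(n) = next {
            chars.push(n);
            return Some(chars.into_iter().collect());
        }
    }
    None
}

/// Keep only the first `n` characters, as a truncating statistics writer might.
fn truncate(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// Truncated bounds that still enclose every value in [min, max].
fn safe_bounds(min: &str, max: &str, n: usize) -> (String, String) {
    let lo = truncate(min, n); // a prefix never sorts after the original
    let hi = truncate(max, n);
    let hi = if hi.len() < max.len() {
        // A bare prefix sorts before the original, so bump its last character.
        increment(&hi).unwrap_or_else(|| max.to_string())
    } else {
        hi
    };
    (lo, hi)
}

/// Could a container with values in [min, max] hold a row equal to `value`?
fn may_contain_eq(min: &str, max: &str, value: &str) -> bool {
    min <= value && value <= max
}

/// Could it hold a row matching `LIKE 'prefix%'` (modelled as a prefix match)?
fn may_contain_prefix(min: &str, max: &str, prefix: &str) -> bool {
    let below_upper = increment(prefix).map_or(true, |upper| min < upper.as_str());
    max >= prefix && below_upper
}
```

For a container holding `aaa` and `abcz` with statistics truncated to 3 characters, the naive max `abc` makes both `may_contain_eq` and `may_contain_prefix` reject `abcz`, so the container would be wrongly pruned. With `safe_bounds` the max becomes `abd` and both checks keep the container.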

adriangb · 2024-11-04T23:00:02Z

datafusion/core/tests/fuzz_cases/pruning.rs

+    // Make 2 row random UTF-8 strings
+    let mut rng = thread_rng();
+    let statistics = TestPruningStatistics::new(&mut rng, 100);


I imagine a lot of the bugs are going to be around edge cases: empty strings, non-ASCII characters, etc. Is there any way we could inject those into the randomness? Maybe what we need here, more than random fuzzing, is a matrix-style test:

Generate N full length values, including some random ones?

Arrange them into row groups in multiple orders, of multiple sizes

Truncate the stats to lengths between 1 and large

And make sure the results with and without pruning match?

This is a good idea
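The matrix-style test proposed above could be sketched roughly as follows. This is a standalone, standard-library-only illustration and not the test added in this PR. The helpers `increment`, `truncated_bounds`, `may_match`, and `run_matrix` are hypothetical names, row groups are modelled as plain slices, and `LIKE 'p%'` is modelled as `starts_with` on values assumed to contain no `%` or `_`.

```rust
// Illustrative sketch only: hypothetical helpers, not DataFusion APIs.

/// Smallest string that sorts after every string starting with `prefix`.
fn increment(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        let next = match last {
            char::MAX => None,
            '\u{D7FF}' => Some('\u{E000}'), // skip the surrogate gap
            c => char::from_u32(c as u32 + 1),
        };
        if let Some(n) = next {
            chars.push(n);
            return Some(chars.into_iter().collect());
        }
    }
    None
}

/// (min, max) statistics for one row group, truncated to `len` characters.
fn truncated_bounds(rows: &[&str], len: usize) -> (String, String) {
    let min = rows.iter().copied().min().unwrap();
    let max = rows.iter().copied().max().unwrap();
    let lo: String = min.chars().take(len).collect();
    let hi: String = max.chars().take(len).collect();
    // A truncated max is only a valid upper bound once incremented.
    let hi = if hi.len() < max.len() {
        increment(&hi).unwrap_or_else(|| max.to_string())
    } else {
        hi
    };
    (lo, hi)
}

/// Could a row group with bounds [lo, hi] hold a value starting with `prefix`?
fn may_match(lo: &str, hi: &str, prefix: &str) -> bool {
    let below_upper = increment(prefix).map_or(true, |upper| lo < upper.as_str());
    hi >= prefix && below_upper
}

/// Checks every (order, group size, truncation length, prefix) combination and
/// panics if pruning drops a matching row. Returns the number of cases checked.
fn run_matrix(values: &[&str], max_len: usize) -> usize {
    let mut cases = 0;
    let mut reversed = values.to_vec();
    reversed.reverse();
    for order in [values.to_vec(), reversed] {
        for group_size in 1..=order.len() {
            for len in 1..=max_len {
                for &prefix in values {
                    let mut expected: Vec<&str> = Vec::new();
                    let mut pruned: Vec<&str> = Vec::new();
                    for group in order.chunks(group_size) {
                        let (lo, hi) = truncated_bounds(group, len);
                        let hits: Vec<&str> = group
                            .iter()
                            .copied()
                            .filter(|v| v.starts_with(prefix))
                            .collect();
                        expected.extend(hits.iter().copied());
                        if may_match(&lo, &hi, prefix) {
                            pruned.extend(hits);
                        }
                    }
                    assert_eq!(expected, pruned, "prefix {prefix:?}, len {len}");
                    cases += 1;
                }
            }
        }
    }
    cases
}
```

Calling `run_matrix` with a hand-picked list such as `["", "a", "ab", "abc", "é", "日本語", "zzzz"]` exercises the empty-string and non-ASCII cases deterministically. Random values could be appended to the same list to keep some of the fuzzing coverage.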

adriangb and others added 5 commits November 4, 2024 16:56

Implement predicate pruning for like expressions

f09a38a

add function docstring

838d00d

re-order bounds calculations

798147e

fmt

ce0dd18

Add fuzz testing for UTF8 LIKE pruning

f896d5f

alamb changed the title ~~Alamb/like prune fuzz~~ Add fuzz testing for UTF8 LIKE pruning Nov 4, 2024

github-actions bot added the core Core DataFusion crate label Nov 4, 2024

alamb mentioned this pull request Nov 4, 2024

Implement predicate pruning for like expressions (prefix matching) #12978

Open

adriangb reviewed Nov 4, 2024


alamb Nov 6, 2024
