
Implement predicate pruning for like expressions (prefix matching) #12978

Open · wants to merge 5 commits into main

Conversation

@adriangb (Contributor) commented Oct 16, 2024

The idea is that we can push certain LIKE expressions down into statistics pruning.
For example, a filter like `url LIKE 'https://www.google.com%'` can (with some caveats and other tricks) be rewritten to filter on `url_min <= 'https://www.google.com' AND 'https://www.google.com' <= url_max`, such that a row group that only has `https://www.example.com` would be excluded.
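
As a hedged sketch of that check in plain Rust (illustrative only, not the PR's code; review below refines the exact upper bound):

// Row-group statistics for the `url` column (hypothetical values).
let (url_min, url_max) = ("https://www.example.com/a", "https://www.example.com/z");
let prefix = "https://www.google.com"; // from `url LIKE 'https://www.google.com%'`
// The row group can contain a match only if the stats range intersects
// the range of strings that start with `prefix`.
let upper = format!("{prefix}\u{10FFFF}");
let may_match = url_min <= upper.as_str() && prefix <= url_max;
assert!(!may_match); // this row group is pruned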

Closes #507

github-actions bot added the core (Core DataFusion crate) label Oct 16, 2024
@adriangb (Contributor Author):

cc @alamb

@alamb (Contributor) commented Oct 16, 2024

This is very clever -- I will review it tomorrow

Comment on lines 1499 to 1501
(false, true) => Operator::ILikeMatch,
(true, true) => Operator::NotILikeMatch,
Contributor:

I think this is dead code, as `if like_expr.case_insensitive() {` catches the case-insensitive case.

I think this code would be clearer if it just matched on `like_expr.negated()` (or alternatively returned `unhandled_hook.handle(expr);` directly for these last 2 cases).

Contributor Author:

Sure, will change it to do so. I think I was getting a bit ahead of myself implementing ILIKE support, which per the comment should be possible; maybe you can show me how to construct the physical expression to call lower() and upper() on another expression.

@alamb (Contributor) left a comment:

Thank you @adriangb -- this is very cool.

I think there are a few more tests needed but otherwise the implementation looks very solid. Thank you

@@ -1610,6 +1625,93 @@ fn build_statistics_expr(
Ok(statistics_expr)
}

fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Result<&String> {
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() {
if let ScalarValue::Utf8(Some(s)) = lit.value() {
Contributor:

I think you should probably also handle the cases ScalarValue::LargeUtf8 and ScalarValue::Utf8View as well.

Contributor Author:

And Dictionary!

if prefix.is_empty() {
return plan_err!("Empty prefix in LIKE expression");
}
Ok(Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
Contributor:

Do we also have to test if there are other occurrences of % in the string 🤔 (like foo%bar%)?

Contributor Author:

The logic is pretty simple (truncate at the first one) but I agree another test would be nice.

// column LIKE '%foo%' => min <= '' && '' <= max => true
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max

// I *think* that ILIKE could be handled by making the min lowercase and max uppercase
Contributor:

👍 I agree. Figuring out how to make those calls would be the trick.

fn build_like_match(
expr_builder: &mut PruningExpressionBuilder,
) -> Result<Arc<dyn PhysicalExpr>> {
// column LIKE literal => (min, max) LIKE literal split at % => min <= split literal && split literal <= max
Contributor:

I don't understand the `LIKE literal split at %` part.

`column LIKE literal` is the same as `column = literal` if there are no wildcards, so you should be able to use the same rules as equality, I think.

Contributor Author:

Right, that's the point: by splitting it at the first % we are able to apply the same rules as equality:

column LIKE literal -> (min, max) LIKE (literal split at %) -> min <= split literal && split literal <= max
vs
column = literal -> (min, max) = literal -> min <= literal && literal <= max

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor:

the comments say s1 LIKE 'A%' but the code builds a different expression

Contributor Author:

Okay, I've tried to get all of these comments right. Tbh I think they can be removed, since the expressions are pretty self-explanatory, but I've left them for now. Let me know if you'd prefer that I remove them, or, if you want to keep them, whether any are obviously wrong.

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor:

Can you also please add tests for other combinations:

  • s1 LIKE 'A'
  • s1 LIKE '%A%'

I think it is important that the matching is doing the right thing.

I also think it is important to cover cases for NOT LIKE as well.

let expected_ret = &[true, true, true, true, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Contributor:

Suggested change
// s1 LIKE 'A%'
// s1 LIKE '%A'

Contributor:

this one is still wrong

Contributor Author:

Yeah, I haven't pushed yet, pending discussion of the correctness of the general approach.

@adriangb (Contributor Author) commented Oct 17, 2024

@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is "B" could it be that the actual min value is "BA"? If so I think this approach may not work at all. Imagine we have a row group with data ["BA", "ZD"] which generates min/max stats ["B", "Z"]. Now we want to know if col LIKE '%A%' is possible. Clearly the answer should be yes but if we convert it to the predicate form we get 'B' <= '' AND '' <= 'Z' which gives false 😞. I think this could be resolved by truncating the stats column to be the same length as the prefix?

@alamb (Contributor) commented Oct 18, 2024

@adriangb in theory I think parquet statistics can be truncated.

Now we want to know if col LIKE '%A%'

I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching

so like col LIKE 'A%'

The predicate would be transformed into 'B' <= 'A' AND 'A' <= 'Z' which I do think is correct

@adriangb (Contributor Author):

Consider the values ["ABC", "XYZ"] with stats ["AB", "XY"] and the filter col like 'A%'. This becomes 'AB' <= 'A' AND 'A' <= 'XY' which is false, but we need true. To fix this we'd need to truncate the stats to the length of the filter to get 'A' <= 'A' AND 'A' <= 'XY' which then gives the right result. %A% is just an obvious case because you get '' as the prefix which gives obvious issues.
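
That counterexample can be checked with plain lexicographic &str comparisons (a hedged illustration, not the PR's code):

// Actual values ["ABC", "XYZ"], truncated stats ["AB", "XY"], filter col LIKE 'A%'.
let (stat_min, stat_max) = ("AB", "XY");
let prefix = "A";
// The naive range check against truncated stats wrongly prunes the group:
assert!(!(stat_min <= prefix && prefix <= stat_max)); // because 'AB' > 'A'
// Truncating the min stat to the prefix length repairs the check:
let truncated_min = &stat_min[..prefix.len().min(stat_min.len())];
assert!(truncated_min <= prefix && prefix <= stat_max);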

@adriangb (Contributor Author) commented Oct 19, 2024

Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the not like part; I'm thinking this is complex enough as is and most of the benefit (maybe even in ClickBench?) will come from like. We can tackle not like, ilike and not ilike in future PRs. Especially since it's going to be important to evaluate each test case carefully.

I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is "","". You basically know nothing about the data, there could be anything in there. Yet col = 'A' transforms into '' <= 'A' and 'A' <= '' which is false. Substitute in non-truncated stats and 'A' <= 'A' and 'A' <= 'Z' is true. I added a test case for this behavior on the existing code. It doesn't differentiate between "ABC" truncated to "" and "" actually being the min string but it shows the behavior which would be the same in both cases.

This is important because, for the case of a max stat of "A" and a filter col like 'A_', if the stat might have been truncated from "AB" I need to let it pass; if I know for a fact that "A" is the max string in the column, I can indeed reject the row group.

@adriangb (Contributor Author):

Argh, the only ClickBench query this could maybe improve is 23:

WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'

Since the like starts with a wildcard this won't help. Maybe we can get not like to do something smart there, tbd...

@Dandandan (Contributor) commented Oct 19, 2024

I am wondering if simple like patterns are not already converted to startswith etc., such that pruning is already applied, or whether we need to implement that case instead of like?

@adriangb (Contributor Author):

I am wondering if simple like patterns are not already converted to startswith etc, such that pruning already is applied.

Good question. Based on the performance of some queries I saw, I'd say no, but it's worth double checking. Any suggestions as to a definitive, easy way to check? I guess I can run datafusion-cli against a very large parquet file in object storage (high latency) with a query that should filter (col like 'A%') and one that can't?

I don't see where startswith or any other transformations (lower, upper, etc.) are handled in the pruning transformation.

@Dandandan (Contributor):

I would take a look at the physical plan of queries involving like first, to see if it still uses like or is transformed to another function.

@adriangb (Contributor Author) commented Oct 19, 2024

I made a big parquet file as follows:

import random
import string
import polars as pl

df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed')

This came out to ~1GB. I then uploaded it to a GCS bucket.

I ran queries col = 'Z' and col like 'Z' against it and got 2s and 23s respectively. IMO that means it's not getting pushed down.

The explain plans reflect that as well:

ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]

The like query also has:

FilterExec: col@0 LIKE Z, metrics=[output_rows=0, elapsed_compute=1.878551ms]

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

@adriangb changed the title from "Implement predicate pruning for LIKE and NOT LIKE" to "Implement predicate pruning for like expressions" on Oct 19, 2024
@alamb (Contributor) commented Oct 21, 2024

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

That certainly makes sense to me

Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs to rewrite LIKE into starts_with (https://docs.rs/datafusion/latest/datafusion/functions/expr_fn/fn.starts_with.html), though since it is a function, figuring out how to make the rewrite work might be tricky.

Then we can implement the rules in pruning predicate for starts_with 🤔
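
A hedged sketch of the classification such a rewrite would need (illustrative names, not a DataFusion API):

enum LikeClass {
    Exact(String),  // no wildcards: col LIKE 'A'  could become col = 'A'
    Prefix(String), // trailing %:   col LIKE 'A%' could become starts_with(col, 'A')
    Complex,        // anything else, e.g. 'A%B' or '%A%'
}

fn classify_like(pattern: &str) -> LikeClass {
    match pattern.find(['%', '_']) {
        None => LikeClass::Exact(pattern.to_string()),
        // a single trailing '%' (and no other wildcard) is a pure prefix match
        Some(i) if i == pattern.len() - 1 && pattern.ends_with('%') => {
            LikeClass::Prefix(pattern[..i].to_string())
        }
        Some(_) => LikeClass::Complex,
    }
}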

@adriangb (Contributor Author) commented Oct 21, 2024

At most we could simplify the col like 'A%' case, but we can't simplify 'A%B', so I think it's still worth it to implement pruning rewrites for both.

Do you have any thoughts on my concerns for possibly truncated stats, in particular how = may even be wrong as of today if the stats are truncated enough?

@Dandandan (Contributor):

I think it’s fine to support like for now and leave the further combination / optimization for future work. I see that only simplifying to starts_with won't get all the benefits.

Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for = pruning is incorrect.

@adriangb (Contributor Author) commented Oct 21, 2024

Well, there are no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves, with no indication of how those are meant to correspond with the original data. There were no unit tests of Utf8 filtering at all, as far as I can tell.

The current implementation of = is certainly not wrong in the real world, but I'm not sure if that's because it's not used in situations where stats are truncated, if truncation only happens at extremes like a 10MB value where practically it's not a problem, etc.

@alamb (Contributor) commented Oct 21, 2024

I suggest we (I can help tomorrow):

  1. File a ticket to simplify like to = and starts_with when possible (to help follow-on optimizations like this)
  2. File / find a ticket about testing with truncated statistics
  3. Determine what, if anything, is left to add directly to the pruning predicate

@adriangb (Contributor Author):

Sounds good.

All of that said, I think this PR is currently as correct as = and has pretty good test coverage. Does it need to wait on those tasks, or can it proceed in parallel?

let (min_lit, max_lit) = if let Some(wildcard_index) = first_wildcard_index {
let prefix = &s[..wildcard_index];
let prefix_min_lit = Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
format!("{prefix}\u{10ffff}"),
Member:

The 'highest character' should be appended to the max range, not the min.

Member:

When implementing a similar thing for Trino (trinodb/trino@6aea881), I decided to stay within ASCII characters to avoid any potential issues due to misinterpretation of "difficult" code points.

Contributor Author:

Hmm my intuition was that you want to add the highest character to the upper lower bound of the min value such that 'A%' can match 'AB'. Assuming a column with only 1 row "AB" and the query 'AB' like 'A%':

  • 'AB' <= 'A\u{10ffff}' and 'A' <= 'AB' -> t
  • 'AB' <= 'A' and 'A\u{10ffff}' <= 'AB' -> f

Right?

Contributor Author:

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

Contributor Author:

I'm wondering if comment is related to the confusing naming in https://github.com/apache/datafusion/pull/12978/files#r1810513072?

Member:

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

We can if we want to; the code will do whatever we ask it to do.
Then the question is whether we want to. If we apply the predicate locally in memory only, there is no need to be cautious, no need to "stay within ASCII". If we later interop with other systems (e.g. send a plan somewhere, or a TableProvider calls a remote system), then it might be beneficial to restrict ourselves.

Hmm my intuition was that you want to add the highest character to the upper lower bound

I agree with this.

... of the min value

min value of what?
All column values need to be in the range [like_constant_prefix, like_constant_prefix[0..-1] + \u10ffff)

Member:

... of the min value

min value of what?

I get it now.

So for a column we have stats: min value and max value. Let's call them col_min and col_max.
For like AB% we derive a lower and upper bound (AB and AB\u10ffff, which is actually incorrect; I will comment about this elsewhere).

For pruning we need to check whether [col_min, col_max] ∩ [lower_bound, upper_bound) is non-empty (note the upper_bound is non-inclusive).
It's empty when upper_bound <= col_min OR col_max < lower_bound.
It's non-empty when upper_bound > col_min AND col_max >= lower_bound.
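
In plain Rust that test could look like this (a hedged sketch with illustrative names; &str comparisons are lexicographic):

/// [col_min, col_max] ∩ [lower_bound, upper_bound) is non-empty iff
/// upper_bound > col_min AND col_max >= lower_bound.
fn stats_may_match(col_min: &str, col_max: &str, lower_bound: &str, upper_bound: &str) -> bool {
    upper_bound > col_min && col_max >= lower_bound
}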

Member:

For the correct upper bound, and why it is exclusive, see #12978 (comment).

// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
// ANSI SQL specifies two wildcards: % and _. % matches zero or more characters, _ matches exactly one character.
let first_wildcard_index = s.find(['%', '_']);
Member:

we do not support escape characters, right?

Contributor Author:

> SELECT '' LIKE '' ESCAPE '%';
Execution error: LIKE does not support escape_char

So I think not.
But just for reference, how would you suggest escape characters be handled? I've never used them in practice (I think at that point I'd just go for a regex).

Member:

But just for reference, how would you suggest escape characters be handled?

simple state automata
(eg https://github.com/trinodb/trino/blob/8f279a84bfe7f6f8301a44d26b01c981fc2156f9/core/trino-main/src/main/java/io/trino/type/LikeFunctions.java#L87-L102)
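
A hedged Rust sketch of that idea (illustrative, not the PR's code): extract the literal prefix of a LIKE pattern with a tiny state machine that honors an ESCAPE character.

/// Returns the literal prefix and whether a wildcard was seen,
/// or None for a malformed pattern (dangling escape).
fn like_prefix(pattern: &str, escape: Option<char>) -> Option<(String, bool)> {
    let mut prefix = String::new();
    let mut escaped = false;
    for c in pattern.chars() {
        if escaped {
            prefix.push(c); // an escaped character is a literal, even % or _
            escaped = false;
        } else if Some(c) == escape {
            escaped = true;
        } else if c == '%' || c == '_' {
            return Some((prefix, true)); // first unescaped wildcard ends the prefix
        } else {
            prefix.push(c);
        }
    }
    if escaped { None } else { Some((prefix, false)) }
}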

Comment on lines 1661 to 1664
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
Member:

This is important, but a bit difficult to follow.

Contributor Author:

I'm open to clarifications on the wording. The point I'm trying to make is why we have to append characters.

Member:

I'd remove this whole comment. If someone understands how LIKE works, they will get the code without the comment.
If someone doesn't understand how LIKE works, they won't understand the code even with the comment.

let min_expr = Arc::new(phys_expr::BinaryExpr::new(
min_column_expr.clone(),
Operator::LtEq,
min_lit,
Member:

min_lit actually represents the max value (upper bound); would you consider swapping the naming of the variables?

Also, the upper bound has a "Z" ('highest codepoint') added at the end, so it can be compared with Lt, without the Eq part.

Contributor Author:

min_lit actually represents the max value (upper bound); would you consider swapping the naming of the variables?

I'm open to any suggestions on naming, but I do think it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Also, the upper bound has a "Z" ('highest codepoint') added at the end, so it can be compared with Lt, without the Eq part.

👍🏻

Member:

it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Yes, it is.

I understand now that you're fitting this into the terminology of the existing code.
I am not sure what the right naming would be. Maybe something neutral: lower_bound and upper_bound?

Comment on lines 1678 to 1700
let prefix_lit =
Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(s.clone()))));
(prefix_lit.clone(), prefix_lit)
Member:

In such a case we should produce a single Eq check.

Contributor Author:

Not sure what you mean by a single eq check. We are basically saying col like 'constant' -> col = 'constant' -> col_min <= 'constant' and 'constant' <= col_max

let s = extract_string_literal(scalar_expr)?;
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
Member:

For the A% pattern the lower bound is A (obvious); what should be the upper bound?

A\u{10ffff} is not a correct upper bound, since A\u{10ffff}\u{10ffff} is even bigger but still matches the A% input.
The correct upper bound would be either:

  • A\u{10ffff}\u{10ffff}\u{10ffff}...\u{10ffff} (inclusive) -- up to the max length of the column, so potentially very, very long, and absolutely not practical
  • B (exclusive)

Thus to calculate the upper bound you need (pseudo-code):

let s = extract_string_literal(scalar_expr)?;
let first_wildcard_index = ...;
let prefix = &s[..first_wildcard_index];
let last_incrementable_character = /* find the last code point of `prefix` that can be incremented;
   if we choose to stay within ASCII, this will be a code point < 127,
   otherwise it will be any code point != the max code point (0x10FFFF) */;
if last_incrementable_character not found {
  // For `%`, or `\u{10ffff}...\u{10ffff}%` patterns, we cannot calculate an upper bound
  return None
}
let upper_bound =
   prefix[..last_incrementable_character] +  // take the part of the prefix before the last character that can be incremented
   str(prefix[last_incrementable_character] + 1) // then that character, incremented
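
A runnable Rust version of that pseudo-code, as a hedged sketch (the function name is illustrative; char::from_u32 conveniently skips the surrogate gap when incrementing):

/// Exclusive upper bound for `prefix%`, or None if none exists
/// (e.g. a prefix consisting solely of U+10FFFF).
fn upper_bound_for_prefix(prefix: &str) -> Option<String> {
    let chars: Vec<char> = prefix.chars().collect();
    // Walk backwards to find the last code point that can be incremented.
    for i in (0..chars.len()).rev() {
        let cp = chars[i] as u32;
        // Try successive code points; from_u32 rejects surrogates.
        if let Some(next) = (cp + 1..=0x10FFFF).find_map(char::from_u32) {
            let mut bound: String = chars[..i].iter().collect();
            bound.push(next);
            return Some(bound); // everything after position i is dropped
        }
    }
    None
}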

Contributor Author:

I'm a bit confused about this explanation. Maybe you can provide a failing test case that would help me understand? There is already a test case for 'A%' and it is as far as I can tell doing the correct thing.

Member:

Try adding a test case for 'A%' where the stats are min=AB, max=A\u{10ffff}\u{10ffff}\u{10ffff}.

Contributor Author:

Thanks for the suggestion. I added a test case in e29ed50. It worked as expected, let me know if I got the expected outcomes wrong or missed something.

Member:

See #12978 (comment) on how to make the test expose the problem

// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
Member:

This link isn't useful. That page conveniently avoids any details that are important. Please remove the link.

Contributor Author:

Do you have a better resource or name for this transformation? I'm happy to point at any documentation or prior art in Trino.

Member:

I am just not finding this link useful beyond being inspirational. But a source of inspiration doesn't need to be reflected in the code comment 🙂
I am not asking for it to be replaced with any other link.

I personally find trinodb/trino@6aea881 valuable because (1) I know this code, as its author, and (2) it might actually be correct. Once we get the code here correct, that link wouldn't be useful either.

@adriangb force-pushed the like-prune branch 2 times, most recently from a0a1c37 to ae3426d (October 26, 2024 21:46)
@adriangb (Contributor Author):

@alamb I re-arranged some of the comments on assertions in ae3426d, which I feel helped a lot with the readability of the tests. There are a couple of other tests with a similar pattern that I think could benefit.

I was also thinking about doing some more black-box testing: given any min, max you can always convert that into a RecordBatch with an array of the form [min, max], and the pruning should never say the array can be excluded when the array actually has matches. Does that make sense? Maybe this could even be fuzz tested?
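
A hedged sketch of that property, restricted to prefix patterns and with illustrative names (not the PR's API):

/// The pruning check at this revision: prefix plus the max code point
/// as the upper bound.
fn may_contain(min: &str, max: &str, prefix: &str) -> bool {
    let upper = format!("{prefix}\u{10FFFF}");
    min <= upper.as_str() && prefix <= max
}

/// Property: a container built from exactly [min, max] must never be
/// excluded when one of the two values actually matches the prefix.
fn check_property(min: &str, max: &str, prefix: &str) {
    let any_match = min.starts_with(prefix) || max.starts_with(prefix);
    assert!(may_contain(min, max, prefix) || !any_match);
}

// e.g. check_property("AB", "XY", "A"); check_property("BA", "ZD", "Q");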

@alamb (Contributor) commented Nov 1, 2024

This is one of the PRs I hope to push through this weekend. What I would really like to do, given the potential for very subtle bugs in this code, is to write some more tests. I keep hoping I will find time, but I haven't. Maybe this weekend will be better.

@findepi (Member) commented Nov 2, 2024

Given LIKE can get desugared to something else during planning (eg #13061), maybe it would be great to support this functionality with end-to-end tests.

@adriangb (Contributor Author) commented Nov 2, 2024

Given LIKE can get desugared to something else during planning (eg #13061), maybe it would be great to support this functionality with end-to-end tests.

I feel like we need both e2e and more focused tests: e2e tests risk depending on a combination of the optimizer and pruning rewrites that could be broken if e.g. someone customized the optimizer (which I believe is pluggable?).

findepi previously approved these changes Nov 2, 2024
@drauschenbach (Contributor):

This PR should possibly indicate that it Closes #507.

@alamb (Contributor) commented Nov 4, 2024

This PR should possibly indicate that it Closes #507.

Updated the description

@alamb (Contributor) left a comment:

Sorry for the delay in merging this one; the pruning logic is just so subtle that I don't want to introduce any unintended behaviors.

I wrote up some fuzz tests here: #13253

They don't yet pass -- I think there might be some bug in the test. I need to spend more time debugging it but I am out of time for today

@adriangb (Contributor Author) commented Nov 4, 2024

No worries, I'll try to take a look at #13253. I agree it's better to be cautious and make sure we get this right rather than create a bug that results in incorrect data.

@findepi (Member) commented Nov 5, 2024

I wrote up some fuzz tests here: #13253

They don't yet pass -- I think there might be some bug in the test.

Good tests should find #12637 and therefore shouldn't pass.

@alamb (Contributor) commented Nov 27, 2024

To follow up here -- what I think this PR needs prior to merging is fuzz testing, as I think it has the potential for hard-to-track-down / subtle bugs.

Some important corner cases:

  1. With truncated statistics (actual min/max different than stats)
  2. With truncated prefix matching -- as in min / max is actually foo but the query has a predicate like column='fo'

@alamb changed the title from "Implement predicate pruning for like expressions" to "Implement predicate pruning for like expressions (prefix matching)" on Dec 5, 2024
@alamb dismissed findepi’s stale review December 11, 2024 12:21

I think there are questions on correctness / testing, so let's not accidentally merge this before those are addressed.

Comment on lines 1719 to 1722
// Filter out non-characters (https://www.unicode.org/versions/corrigendum9.html)
if [0xFFFE, 0xFFFF].contains(&cp) || (0xFDD0..=0xFDEF).contains(&cp) {
return false;
}
Comment:

After reading the link, I'm not sure we need to filter out non-characters. They are allowed to appear in well-formed strings, and will sort properly AFAICT. Then this whole function could boil down to

  c as u32 < 0x110000

Member:

From the link above

Noncharacters in the Unicode Standard are intended for internal use and have no standard interpretation when exchanged outside the context of internal use.

whereas the filter condition we're gonna produce here may be exchanged between systems

Comment:

Ah, ok. I assumed this predicate was consumed internally.

Member:

Usually is, but AFAICT it can also reach TableProvider.scan and then ... 🤷‍♂️

Comment on lines 3613 to 3615
// Test surrogate pair range (0xD800..=0xDFFF)
assert_eq!(increment_utf8("a\u{D7FF}").unwrap(), "b");
assert!(increment_utf8("\u{D7FF}").is_none());
@etseidl commented Dec 11, 2024:

This is an interesting case. While trying to patch up parquet-rs, I found that a byte-by-byte approach will actually wind up incrementing U+D7FF to U+E000 (which is private use, but still a valid character).

Member:

why so? 0xD7FF + 1 would be 0xD800, not 0xE000

Contributor:

BTW @etseidl has a PR upstream in arrow-rs related to incrementing utf8:

Maybe we can focus our efforts on making sure we have a solid utf8 increment routine and then use it in both places

Member:

Thanks for the link.

@alamb I am not against reusing, but note that the business logic for Parquet stats and here is different.
For stats, it's enough to find some value bigger than the input.
Here, it's mandatory to find the next value bigger than the input.

I have a feeling that the routine in this PR is ready to be merged.
@adriangb can you please rebase the PR?
If a better version emerges, it's not a problem to improve it later, but there is no need to tie this PR to that.

Contributor Author:

@findepi, @alamb feels that this needs some fuzz testing before it can be merged, and I heavily agree.

I plan on doing this but have not had time.
Since we have a DataFusion meetup next week in Chicago, I've asked for 2 days of dedicated work to spend on this plus prepping for the meetup. So hopefully I can tackle that Monday and Tuesday.

Comment:

why so? 0xD7FF + 1 would be 0xD800, not 0xE000

Because 0xD800-0xDFFF are not valid code points. The next higher valid code point is 0xE000. It's a happy coincidence that this happens.
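
That gap can be checked directly in Rust (char::from_u32 rejects surrogate code points):

assert_eq!(char::from_u32(0xD7FF), Some('\u{D7FF}'));
assert_eq!(char::from_u32(0xD800), None); // surrogates 0xD800..=0xDFFF are rejected
assert_eq!(char::from_u32(0xDFFF), None);
assert_eq!(char::from_u32(0xE000), Some('\u{E000}')); // next valid code point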

github-actions bot added the optimizer (Optimizer rules) label Dec 13, 2024
@adriangb (Contributor Author) commented Dec 13, 2024

Hi @alamb, I took a stab at fuzz tests in f6c5314.
They're heavy and slow, so I had to restrict the search space a lot more than I would have liked. Maybe you or @tustvold can suggest ways to cut out the heavy parsing of Parquet metadata and such to speed these up? Ultimately I do think it's worth re-using whatever creates parquet stats from data, so that we use the "real" thing, but I don't think we need to test the serialization / deserialization repeatedly like this does.
Also happy to restrict the search space by being more deliberate about how we build the values, row groups, predicates, etc.

Labels: core (Core DataFusion crate), optimizer (Optimizer rules)

Successfully merging this pull request may close these issues: Support pruning on string columns using starts_with (#507)

6 participants