feat: partial predicate pushdown for LIKE on cache SQL layer #2249

abcpro1 · 2023-12-06T06:18:55Z

The speed improvement is significant: 380 ms with predicate pushdown versus 3 s 540 ms without (roughly 9x faster), on a 280 MB dataset of 230K records. Note, however, that the pushdown case incorrectly returns only 25699 records, while the non-pushdown case correctly returns 62813.
The lack of projection pushdown in the cache means that all 230K records, with all fields included, must be read from LMDB, converted to JSON, converted from JSON to Arrow format, then fed to Datafusion for filtering. Predicate pushdown for LIKE demonstrably reduces this overhead (although the cache doesn't do substring search, which makes this comparison inconclusive).

This pull request is marked as a draft because the implementation is only partially correct. The `$contains` filtering option of the cache currently doesn't support substring search, so it is insufficient for correctly pushing down LIKE '%word%' as `"$contains": "word"`.

The subset of LIKE patterns supported:
- %
- %word%

Notes: `%word%` only matches a full word because the cache's current implementation doesn't do substring search.
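As a rough illustration of the `%word%` case, here is a minimal Python sketch of how a LIKE pattern might be recognized and turned into a `$contains` filter. The function name `like_to_contains` and the dict-shaped filter are assumptions made for illustration, not the code in this PR. The bare `%` pattern is left out because the description doesn't say how it is expressed as a cache filter.

```python
from typing import Optional


def like_to_contains(pattern: str) -> Optional[dict]:
    """Return a hypothetical cache filter for a LIKE pattern, or None.

    None means "do not push down; let the SQL engine evaluate the LIKE".
    """
    # Only the shape %word% is handled: one leading and one trailing wildcard.
    if len(pattern) >= 3 and pattern.startswith("%") and pattern.endswith("%"):
        word = pattern[1:-1]
        # Any wildcard or escape character left inside makes the pattern
        # more complex than a plain "contains", so fall back.
        if not any(ch in word for ch in "%_\\"):
            return {"$contains": word}
    return None


print(like_to_contains("%word%"))  # a $contains filter
print(like_to_contains("word%"))   # not handled by this sketch
```

Returning `None` for everything else keeps the pushdown partial: unsupported patterns are still evaluated by the SQL engine after the scan.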

chubei

I think we can merge it because it's a marginal improvement.

abcpro1 · 2023-12-06T06:51:20Z

@chubei it returns incorrect results for a query string including the filter LIKE '%appl%'. Datafusion, slow as it is, will consider the word "apple" a valid result, while the cache currently will miss it.
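The mismatch can be reproduced outside the cache with a small Python sketch. `sql_like` approximates standard LIKE semantics (ignoring escape characters), and `full_word_contains` assumes the cache splits text into `\w+` tokens and compares whole tokens. That tokenization rule and both function names are assumptions for illustration.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def full_word_contains(value: str, word: str) -> bool:
    """Assumed cache behaviour: match only whole tokens."""
    return word in re.findall(r"\w+", value)


value = "a ripe apple"
print(sql_like(value, "%appl%"))           # True: "appl" is a substring
print(full_word_contains(value, "appl"))   # False: no token equals "appl"
print(full_word_contains(value, "apple"))  # True: whole-word match
```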

chubei · 2023-12-06T06:53:09Z

> @chubei it returns incorrect results for a query string including the filter LIKE '%appl%'. Datafusion, slow as it is, will consider the word "apple" a valid result, while the cache currently will miss it.

Right I missed this case.

abcpro1 · 2023-12-06T07:32:41Z

> Right I missed this case.

I updated the description a bit to add some more information.

Jesse-Bakker · 2023-12-06T12:40:00Z

There are multiple ways to support substring search in the cache. We can subdivide the problem into multiple distinct cases:

Single-word search
- Full-word (field.contains("word"))
  Simple inverted index using words as tokens
- Prefix (field.contains("wor\w*"))
  Range search (key_min=wor, key_max=wos) in inverted index
- In (field.contains("\w*wor\w*"))
  Index scan on inverted index
Phrase search
- Full phrase
  Full-string inverted index (may want to only index a fixed-length prefix and compare against primary index to confirm match)
- Prefix search
  Range search on full-string inverted index (see above)
- In
  Table scan OR tokenize query and look up in tokenized index (can be sped up by embedding token positions in index and using those).
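The three single-word cases above can be sketched with a toy in-memory inverted index in Python. A sorted list of tokens stands in for the ordered key space of the real index. The class and method names are hypothetical, and the `\w+` tokenization is an assumption.

```python
import bisect
import re
from collections import defaultdict


class InvertedIndex:
    """Toy word-level inverted index: token -> set of record ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._keys = []  # sorted list of distinct tokens, for range search

    def add(self, record_id, text):
        for token in re.findall(r"\w+", text):
            if token not in self._postings:
                bisect.insort(self._keys, token)
            self._postings[token].add(record_id)

    def full_word(self, word):
        # Exact token lookup.
        return set(self._postings.get(word, ()))

    def prefix(self, prefix):
        # Range search: seek to key_min = prefix, then walk forward while
        # keys still share the prefix (for "wor" this stops before "wos").
        result = set()
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result |= self._postings[key]
        return result

    def infix(self, fragment):
        # Index scan: every distinct token has to be tested.
        result = set()
        for key in self._keys:
            if fragment in key:
                result |= self._postings[key]
        return result


index = InvertedIndex()
index.add(1, "apple pie")
index.add(2, "pineapple juice")
index.add(3, "application form")
print(sorted(index.full_word("apple")))  # [1]
print(sorted(index.prefix("appl")))      # [1, 3]
print(sorted(index.infix("appl")))       # [1, 2, 3]
```

The infix case scans the set of distinct tokens rather than the records themselves, which is usually a much smaller set than the full table.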

For substring pushdown, we can specialize at these levels: prefix and In for single-word search, and full phrase, prefix, and In for phrase search.
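For the full-phrase case, the idea of indexing only a fixed-length prefix and confirming against the primary index might look like the following Python sketch. `PREFIX_LEN`, the class name, and the dict standing in for the primary index are all assumptions for illustration.

```python
from collections import defaultdict

PREFIX_LEN = 8  # hypothetical fixed key length for the secondary index


class FullStringIndex:
    """Toy full-string index keyed by a fixed-length prefix of the value."""

    def __init__(self):
        self._primary = {}  # record id -> full value (stand-in for the primary index)
        self._by_prefix = defaultdict(set)  # truncated value -> record ids

    def add(self, record_id, value):
        self._primary[record_id] = value
        self._by_prefix[value[:PREFIX_LEN]].add(record_id)

    def full_phrase(self, phrase):
        # Step 1: cheap lookup on the truncated key yields candidates.
        candidates = self._by_prefix.get(phrase[:PREFIX_LEN], set())
        # Step 2: confirm each candidate against the full stored value,
        # because distinct values can share the same truncated key.
        return {rid for rid in candidates if self._primary[rid] == phrase}


index = FullStringIndex()
index.add(1, "green apple pie")
index.add(2, "green apple tart")
index.add(3, "red apple")
print(index.full_phrase("green apple pie"))  # {1}
```

Truncating the key bounds the index size, at the cost of one extra primary lookup per candidate.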

feat: partial predicate pushdown for LIKE on cache SQL layer

1474c7e

The subset of LIKE patterns supported:
- %
- %word%

Notes: `%word%` only matches a full word because the cache's current implementation doesn't do substring search.

abcpro1 requested review from snork-alt, v3g42 and chubei December 6, 2023 06:33

chubei approved these changes Dec 6, 2023

v3g42 closed this Apr 1, 2024
