Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: partial predicate pushdown for LIKE on cache SQL layer #2249

Closed

Conversation

abcpro1
Copy link
Contributor

@abcpro1 abcpro1 commented Dec 6, 2023

Speed improvement is very significant. 380ms with predicate pushdown compared to 3s 540ms without, in a 280MB dataset of 230K records. (Note the predicate pushdown case, while faster, incorrectly returns 25699 records only, while the non-pushdown case correctly returns 62813 records).
The lack of projection pushdown in the cache means that all 230K records, with all fields included, must be read from LMDB, converted to JSON, converted from JSON to Arrow format, then fed to Datafusion for filtering. Predicate pushdown for LIKE demonstrably reduces this overhead (although the cache doesn't do substring search, which makes this comparison inconclusive).

This pull-request is marked as a draft because the implementation is only partially correct. The $contains filtering option of the cache currently doesn't support substring search, making it insufficient for correct pushdown of LIKE %word% as "$contains": "word".

The subset of LIKE patterns supported:
- %
- %word%

Notes:
`%word%` only matches a full word because the cache's current implementation doesn't do substring search.
@abcpro1 abcpro1 requested review from snork-alt, v3g42 and chubei December 6, 2023 06:33
Copy link
Contributor

@chubei chubei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can merge it because it's a marginal improvement.

@abcpro1
Copy link
Contributor Author

abcpro1 commented Dec 6, 2023

@chubei it returns incorrect results for a query string including the filter LIKE '%appl%'. Datafusion, slow as it is, will consider the word "apple" a valid result, while the cache currently will miss it.

@chubei
Copy link
Contributor

chubei commented Dec 6, 2023

@chubei it returns incorrect results for a query string including the filter LIKE '%appl%'. Datafusion, slow as it is, will consider the word "apple" a valid result, while the cache currently will miss it.

Right I missed this case.

@abcpro1
Copy link
Contributor Author

abcpro1 commented Dec 6, 2023

Right I missed this case.

I updated the description a bit to add some more information.

@Jesse-Bakker
Copy link
Contributor

There are multiple ways to support substring search in the cache. We can subdivide the problem into multiple distinct cases:

  • Single-word search
    • Full-word (field.contains("word"))
      Simple inverted index using words as tokens
    • Prefix (field.contains("wor\w*"))
      Range search (key_min=wor, key_max=wos) in inverted index
    • In (field.contains("\w*wor\w*"))
      Index scan on inverted index
  • Phrase search
    • Full phrase
      Full-string inverted index (may want to only index a fixed-length prefix and compare against primary index to confirm match)
    • Prefix search
      Range search on full-string inverted index (see above)
    • In
      Table scan OR tokenize query and look up in tokenized index (can be sped up by embedding token positions in index and using those).

For substring pushdown, we can specialize at the single-word prefix and IN and phrase full phrase, prefix and IN levels.

@v3g42 v3g42 closed this Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants