feat: partial predicate pushdown for LIKE on cache SQL layer #2249
The speed improvement is significant: 380 ms with predicate pushdown compared to 3 s 540 ms without (roughly 9x faster), on a 280 MB dataset of 230K records. Note, however, that the pushdown case is faster but incorrect: it returns only 25699 records, while the non-pushdown case correctly returns 62813.
Because the cache lacks projection pushdown, all 230K records, with all fields included, must be read from LMDB, converted to JSON, converted from JSON to Arrow format, and then fed to DataFusion for filtering. Predicate pushdown for LIKE reduces this overhead, but because the cache doesn't do substring search, the two runs return different result sets and the timing comparison is inconclusive.
This pull request is marked as a draft because the implementation is only partially correct. The `$contains` filtering option of the cache currently doesn't support substring search, making it insufficient for correct pushdown of `LIKE %word%` as `"$contains": "word"`.
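To make the mismatch concrete, here is a minimal Python sketch, not the actual cache code. It assumes, purely for illustration, that `$contains` matches whole tokens only; the description above says only that it doesn't support substring search, so `token_contains` is a hypothetical stand-in and the sample records are made up.

```python
import re


def like_contains(value: str, word: str) -> bool:
    """SQL `LIKE %word%` semantics: true if `word` occurs anywhere in `value`."""
    return word in value


def token_contains(value: str, word: str) -> bool:
    """Hypothetical stand-in for the cache's `$contains` (ASSUMPTION):
    matches only when `word` is a whole token of `value`."""
    return word in re.findall(r"\w+", value)


# Made-up sample records, for illustration only.
records = ["word", "a word here", "keyword", "wordy text", "swordfish", "nothing"]

# What the query should return (filter evaluated by the SQL engine).
like_hits = [r for r in records if like_contains(r, "word")]

# What comes back if the filter is pushed down as `"$contains": "word"`.
pushed_hits = [r for r in records if token_contains(r, "word")]

# Pushdown is only safe if the pushed filter keeps every row the LIKE would keep.
pushdown_is_safe = set(like_hits) <= set(pushed_hits)

print(len(like_hits), len(pushed_hits), pushdown_is_safe)
```

Under this assumption the pushed-down filter drops rows such as `keyword` and `swordfish` before the SQL engine ever sees them, so re-applying the exact `LIKE` afterwards cannot recover them. That is consistent with the pushdown run returning fewer records than the non-pushdown run.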