Dynamic pruning filters from TopK state (optimize ORDER BY LIMIT queries) #15037
Comments
There was a talk at CIDR this year that mentioned this: Sponsor Talk 3: The Fine Art of Work Skipping. It seems they wrote a blog about it too: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/
Nice to know I'm not totally off on the idea 😄
Not at all!
BTW I am pretty sure DuckDB is using this technique, and that it is why they are so much faster on ClickBench Q23.
Does anyone have a handle on how we might implement this? I was thinking we'd need to add a method to exec operators.
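For concreteness, a minimal sketch of what such a method might look like; the trait and method names here are illustrative assumptions, not DataFusion's actual API:

```rust
use std::sync::Arc;

/// Stand-in for a physical predicate like `timestamp > 20`
/// (in DataFusion this would be a physical expression).
struct Predicate;

/// Hypothetical source of a dynamically-tightening filter, owned by the
/// TopK operator and observed by scans.
trait DynamicFilterSource: Send + Sync {
    /// The current pruning predicate, if a bound has been established yet.
    fn current_filter(&self) -> Option<Predicate>;
}

/// Hypothetical method added to exec operators: during planning the operator
/// is handed a filter source, which it consults at execution time
/// (e.g. before opening each file or row group).
trait ExecOperatorExt {
    fn push_down_dynamic_filter(&mut self, source: Arc<dyn DynamicFilterSource>);
}
```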
To begin with, the trick will be to figure out how to share the TopK state with the scan, and then orchestrate concurrent access to it somehow.
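One way that shared state and concurrent access could be orchestrated, as a sketch (the `Arc<RwLock<Option<i64>>>` design and all names here are my assumption, not the actual implementation):

```rust
use std::sync::{Arc, RwLock};

/// Shared between the TopK operator (writer) and the scans (readers).
/// Holds the smallest value currently in the top-k heap, once the heap is full.
#[derive(Default)]
struct TopKThreshold(RwLock<Option<i64>>);

impl TopKThreshold {
    /// Called by the TopK operator whenever its heap's minimum changes.
    fn update(&self, new_min: i64) {
        let mut guard = self.0.write().unwrap();
        // Only ever tighten the bound: for ORDER BY ts DESC LIMIT k the
        // threshold can only increase as larger values displace smaller ones.
        match *guard {
            Some(cur) if cur >= new_min => {}
            _ => *guard = Some(new_min),
        }
    }

    /// Called by scans before opening a file / row group.
    fn read(&self) -> Option<i64> {
        *self.0.read().unwrap()
    }
}
```

Both the TopK operator and the scans would hold an `Arc<TopKThreshold>`; since the bound only ever tightens, a stale read is harmless, it just prunes less.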
Thanks @adriangb -- I will try and review it asap (hopefully tomorrow afternoon)
We already have `Statistics` on `PartitionedFile`, so we could potentially use dynamic filters to prune based on those before opening the file.
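For illustration, file-level pruning against those statistics might look like this (a sketch with assumed types; `max_timestamp` stands in for whatever per-file statistics are available):

```rust
/// Sketch only: illustrative types, not DataFusion's real ones.
struct FileStats { max_timestamp: Option<i64> }
struct FileToScan { path: String, stats: FileStats }

/// For ORDER BY timestamp DESC LIMIT k: a file whose maximum timestamp is
/// <= the current TopK minimum cannot contribute to the result, so it can
/// be skipped before doing any I/O.
fn prune_files(files: Vec<FileToScan>, threshold: Option<i64>) -> Vec<FileToScan> {
    let Some(t) = threshold else { return files };
    files
        .into_iter()
        .filter(|f| match f.stats.max_timestamp {
            Some(max) => max > t,
            None => true, // no statistics: must scan the file
        })
        .collect()
}
```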
@adriangb and I had a discussion about #15301; here are some notes.

Use cases:
Pros / Cons

The pros for merging this PR are:
Nice to haves
I believe Adrian is going to look into these, but I also think they could easily be done as a follow-on PR.
wrt waiting for filter pushdown to be enabled by default, I think we're just making our lives harder by coupling them, especially since we can already test them together under a feature flag. I also would like to leverage this work to justify a lot of other optimizations:
This is unfortunately a blocker for all of that.
I plan to spend a non-trivial amount of time working on this with @adriangb this week
Is your feature request related to a problem or challenge?
From discussion with @alamb yesterday the idea came up of optimizing queries like

select * from data order by timestamp desc limit 10

for the case where the data is not perfectly sorted by `timestamp` but mostly follows a sorted pattern. You can imagine this data gets created if multiple sources with clock skews, network delays, etc. are writing data and you don't do anything fancy to guarantee perfect sorting by `timestamp` (i.e. you naively write out the data to Parquet, maybe do some compaction, etc.). The point is that 99% of yesterday's files have a `timestamp` smaller than 99% of today's files, but there may be a couple seconds of overlap between files. To be concrete, let's say this is our data:

Currently DataFusion will exhaustively open each file, read the `timestamp` column and feed it into a `TopK`. I think we can do a lot better if we:

- Fill the `TopK` from the first files we scan. The only way more things would get added to our `TopK` is if they are greater than the smallest item already seen (let's say that's `20`, the smallest value in file 3).
- Push that bound into the scan as a filter like `timestamp > 20`, so that the remaining files are only read where they can actually beat it (see the sketch below).

Extrapolating this to scenarios where you have years' worth / TBs of data and want a `limit 5` would yield orders of magnitude improvement I think.

@alamb mentioned this sounds similar to Dynamic Filters. I assume this must be a known technique (or my analysis may be completely wrong 😆) but I don't know what it would be called.
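To make the mechanics above concrete, here is a minimal self-contained sketch (all names illustrative, not DataFusion's implementation): a min-heap keeps the k largest timestamps, and once full its minimum becomes the bound the scan can push down as `timestamp > bound`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Tracks the k largest timestamps seen so far (for ORDER BY timestamp DESC
/// LIMIT k). A min-heap via `Reverse`, so the smallest retained value is on top.
struct TopK {
    k: usize,
    heap: BinaryHeap<Reverse<i64>>,
}

impl TopK {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::new() }
    }

    fn insert(&mut self, ts: i64) {
        if self.heap.len() < self.k {
            self.heap.push(Reverse(ts));
        } else if ts > self.heap.peek().unwrap().0 {
            // Larger than the current minimum: displace it.
            self.heap.pop();
            self.heap.push(Reverse(ts));
        }
    }

    /// Once the heap is full, anything <= this bound cannot enter the result,
    /// i.e. the scan can apply `timestamp > bound`.
    fn prune_bound(&self) -> Option<i64> {
        if self.heap.len() == self.k {
            self.heap.peek().map(|r| r.0)
        } else {
            None
        }
    }
}

fn main() {
    let mut topk = TopK::new(3);
    for ts in [10, 20, 5, 30, 40] {
        topk.insert(ts);
    }
    // Heap now holds {20, 30, 40}; subsequent scans could push down
    // `timestamp > 20` and skip files whose max timestamp is <= 20.
    assert_eq!(topk.prune_bound(), Some(20));
}
```

In the multi-file scenario above, once the heap is full from the first files, the scan would compare this bound against each remaining file's max-timestamp statistic and skip whole files that cannot beat it.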
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response