-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize SELECT min/max queries with limit #7198
Comments
I am not sure if we can get a general purpose optimization that also handles queries with different aggregates SELECT tag, field, max(time), min(other_field)
FROM t
GROUP BY tag
ORDER BY max(time) DESC
LIMIT 10 Though maybe that is not so useful |
I don't think #7192 can handle your example. It works by "evicting" (nice term @tustvold ) groups from the accumulator unless they are the current min/max. So if we run your example: SELECT tag, field, max(time), min(other_field)
FROM t
GROUP BY tag
ORDER BY max(time) DESC
LIMIT 10 on:
We will:
Hopefully this example makes it clear why we can only accumulate values present in the |
The more I think about this, the more I like where @avantgardnerio is going with #7192, and I think we could use the same operator in #7192 for this ticket as well as #6899, and #7196. I hope we can use the same operator for all these queries because:
"Observation" -- No AggregatesOne key observation that @avantgardnerio made (perhaps implicitly) in #7192 is that even though the query in this ticket has aggregates ( ProposalThus, I think we could make the code in #7192 into a
Use for min/max queries with limit (this ticket)So for the SELECT tag, max(time)
FROM t
GROUP BY tag
ORDER BY max(time) DESC
LIMIT 10 We would use
General purpose ORDER BY limit query #7196SELECT c1, c2
FROM t
ORDER BY c3
LIMIT 10 We could use the same operator (though maybe it has a more optimized implementation when there are no groups, like we have for no group aggregate streams):
Queries that have a predicate on
|
@JayjeetAtGithub thank you! I just sent a request for access... |
Note that the dataset in the above example is in the form of an influxdb_iox catalog. If you prefer a Download traces.zip (240MB): In ❯ create external table traces stored as parquet location 'traces';
0 rows in set. Query took 0.030 seconds.
❯ SELECT trace_id, MAX(time) FROM traces GROUP BY trace_id ORDER BY MAX(time) DESC LIMIT 1; |
Completed in #7192 |
Is your feature request related to a problem or challenge?
The following query pattern shows up in many of our usecases:
There may also be predicates
In English this query returns the top 10 groups that had the most recent values
A more specific example, @JayjeetAtGithub found that the Jaeger tool issues this query to show the top ten most recent queries
Describe the solution you'd like
Implement some sort of optimization for this query
Describe alternatives you've considered
I believe #7191 / #7192 from @avantgardnerio is designed for this use case, so that may be sufficient. I did think it was worth documenting the actual end user effect of the change as a separate item which is why I filed this ticket
Additional context
No response
The text was updated successfully, but these errors were encountered: