
proxy: coalesce identical Series() requests #4290

Conversation

@GiedriusS (Member) commented May 31, 2021

I spent quite some time recently staring at the fan-out code and thought that implementing this would be a good idea: let's join multiple requests for the same data if they are being sent to the same StoreAPI nodes (:

Very often when a new dashboard gets opened, a lot of different queries come in that use the same metric data. Clicking around the classic node_exporter dashboard shows, via the newly implemented metric, that this assumption holds. It is very unlikely that the data retrieved from SeriesAPI nodes will differ if the requests arrive within a few milliseconds of each other. So, I think this is a reasonable optimization to make.

Unfortunately, because the upper layers assume sole ownership of the returned storage.SeriesSet, we have to make copies of the data before passing it on. There are potentially bigger savings to be had here; I added a TODO for that.

Finally, I think that caching here is unnecessary because caching of the end-user responses is already performed by query-frontend.

  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Added a broadcastingSeriesServer that waits for a response on a query and then fans it out to all of the "listeners".
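
For illustration, here is a minimal, self-contained sketch of the coalescing idea. The names (broadcaster, call, seriesResponse, fetch, copyResponses) are placeholders for this sketch only and are not taken from the actual diff:

```go
package proxy

import "sync"

// seriesResponse stands in for a StoreAPI Series() response frame.
// Sketch only: placeholder types, not the actual broadcastingSeriesServer from this PR.
type seriesResponse struct {
	data []byte
}

// broadcaster issues one upstream Series() call per request key and fans the
// buffered responses out to every caller waiting on the same key.
type broadcaster struct {
	mu       sync.Mutex
	inflight map[string]*call
}

// call tracks one in-flight upstream request.
type call struct {
	done      chan struct{}     // closed once the upstream call finishes
	responses []*seriesResponse // buffered upstream responses
	err       error
}

func newBroadcaster() *broadcaster {
	return &broadcaster{inflight: map[string]*call{}}
}

// series returns the responses for key, issuing the upstream call only once
// for concurrent identical requests. fetch performs the real Series() call.
func (b *broadcaster) series(key string, fetch func() ([]*seriesResponse, error)) ([]*seriesResponse, error) {
	b.mu.Lock()
	if c, ok := b.inflight[key]; ok {
		b.mu.Unlock()
		<-c.done // join the in-flight call and wait for it to finish
		return copyResponses(c.responses), c.err
	}
	c := &call{done: make(chan struct{})}
	b.inflight[key] = c
	b.mu.Unlock()

	c.responses, c.err = fetch()
	close(c.done)

	b.mu.Lock()
	delete(b.inflight, key) // later requests should fetch fresh data
	b.mu.Unlock()

	return copyResponses(c.responses), c.err
}

// copyResponses clones the buffered responses because the upper layers assume
// sole ownership of the returned data.
func copyResponses(in []*seriesResponse) []*seriesResponse {
	out := make([]*seriesResponse, 0, len(in))
	for _, r := range in {
		out = append(out, &seriesResponse{data: append([]byte(nil), r.data...)})
	}
	return out
}
```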

Verification

Tests still pass

Deduplicate identical Series() requests occurring at the same moment.
This is pretty crucial because query-frontend caches based on the query
itself, and dashboards such as node_exporter sometimes ask for the same
series but with different functions.

In the popular node_exporter dashboard I can count at least 9 different
queries that use the same `node_cpu_seconds_total` albeit with different
label selectors & functions.

My tests with this improvement show that there is indeed a lot of
overlap:

```
thanos_proxy_store_deduplicated_stream_requests_total 633
```

After opening the node_exporter dashboard a few times.

Signed-off-by: Giedrius Statkevičius <[email protected]>
@GiedriusS force-pushed the feature/deduplicate_same_series_requests branch from 90fa8c5 to 8d96c24 on May 31, 2021 20:57
@GiedriusS changed the title from "Feature/deduplicate same series requests" to "proxy: coalesce identical Series() requests" on May 31, 2021
@GiedriusS force-pushed the feature/deduplicate_same_series_requests branch 2 times, most recently from 821c762 to a42a3e0 on May 31, 2021 21:32
Signed-off-by: Giedrius Statkevičius <[email protected]>
@GiedriusS force-pushed the feature/deduplicate_same_series_requests branch from a42a3e0 to 2e0ef9f on May 31, 2021 21:36
@GiedriusS marked this pull request as ready for review on May 31, 2021 22:10
Set aggregations to a default value if only 0s-resolution data has been
requested: at that resolution we only have raw data, so it doesn't
matter which aggregations we ask for. This can potentially lead to even
more savings.

Signed-off-by: Giedrius Statkevičius <[email protected]>
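
The idea from that commit can be sketched as a normalization step applied to the request before the coalescing key is computed. This is only a sketch under the assumption that the request has storepb.SeriesRequest-style MaxResolutionWindow and Aggregates fields; it is not the exact code from this PR:

```go
package proxy

import "github.com/thanos-io/thanos/pkg/store/storepb"

// normalizeForCoalescing rewrites request fields that cannot change the
// response, so that otherwise-identical requests map to the same key.
// Sketch only: assumed storepb.SeriesRequest field names, not code from this PR.
func normalizeForCoalescing(req *storepb.SeriesRequest) {
	// With a 0s max resolution window only raw data is returned, so the
	// requested downsampling aggregations are irrelevant; pin them to a
	// single default value.
	if req.MaxResolutionWindow == 0 {
		req.Aggregates = []storepb.Aggr{storepb.Aggr_RAW}
	}
}
```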

@GiedriusS (Member, Author) commented Jun 1, 2021

The benchmarks aren't too happy:

```
benchmark                                                                                            old ns/op      new ns/op      delta
BenchmarkProxySeries/1000000SeriesWith1Samples/4_client_with_1_samples,_250000_series_each-16        1077205114     1256668659     +16.66%
BenchmarkProxySeries/1000000SeriesWith1Samples/single_client_with_1_samples,_1000000_series-16       1023182953     1295766170     +26.64%
BenchmarkProxySeries/100000SeriesWith100Samples/4_client_with_25_samples,_25000_series_each-16       106213070      129375959      +21.81%
BenchmarkProxySeries/100000SeriesWith100Samples/single_client_with_100_samples,_100000_series-16     100188783      126916986      +26.68%
BenchmarkProxySeries/1SeriesWith10000000Samples/4_client_with_2500000_samples,_1_series_each-16      37307          53264          +42.77%
BenchmarkProxySeries/1SeriesWith10000000Samples/single_client_with_10000000_samples,_1_series-16     22434          38284          +70.65%

benchmark                                                                                            old allocs     new allocs     delta
BenchmarkProxySeries/1000000SeriesWith1Samples/4_client_with_1_samples,_250000_series_each-16        5298639        5299245        +0.01%
BenchmarkProxySeries/1000000SeriesWith1Samples/single_client_with_1_samples,_1000000_series-16       6000189        6000345        +0.00%
BenchmarkProxySeries/100000SeriesWith100Samples/4_client_with_25_samples,_25000_series_each-16       516583         520291         +0.72%
BenchmarkProxySeries/100000SeriesWith100Samples/single_client_with_100_samples,_100000_series-16     540136         575343         +6.52%
BenchmarkProxySeries/1SeriesWith10000000Samples/4_client_with_2500000_samples,_1_series_each-16      143            184            +28.67%
BenchmarkProxySeries/1SeriesWith10000000Samples/single_client_with_10000000_samples,_1_series-16     90             128            +42.22%

benchmark                                                                                            old bytes     new bytes     delta
BenchmarkProxySeries/1000000SeriesWith1Samples/4_client_with_1_samples,_250000_series_each-16        213085944     258318696     +21.23%
BenchmarkProxySeries/1000000SeriesWith1Samples/single_client_with_1_samples,_1000000_series-16       269202688     314400880     +16.79%
BenchmarkProxySeries/100000SeriesWith100Samples/4_client_with_25_samples,_25000_series_each-16       20379208      25327196      +24.28%
BenchmarkProxySeries/100000SeriesWith100Samples/single_client_with_100_samples,_100000_series-16     22262598      29730443      +33.54%
BenchmarkProxySeries/1SeriesWith10000000Samples/4_client_with_2500000_samples,_1_series_each-16      8611          10207         +18.53%
BenchmarkProxySeries/1SeriesWith10000000Samples/single_client_with_10000000_samples,_1_series-16     4271          5826          +36.41%
```

Let me dig into this more. Any ideas or comments are welcome.

EDIT: it doesn't seem that much can be done. I have compared CPU profiles of the benchmarks, and all of the extra time seems to be spent on generating strings and locking/unlocking. Considering it's only a few milliseconds versus the potentially hundreds of milliseconds needed to perform a Series() call, I think this is still worthwhile.
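
For reference, the string generation mentioned above comes from building the coalescing key. A possible (hypothetical, not from this PR) way to trade the large text form for a fixed-size hash, assuming the gogo-generated Marshal() on storepb.SeriesRequest:

```go
package proxy

import (
	"strconv"

	"github.com/cespare/xxhash/v2"

	"github.com/thanos-io/thanos/pkg/store/storepb"
)

// requestKey builds a coalescing key for a Series() request. req.String()
// is the simplest choice but allocates a large string per request; hashing
// the marshalled bytes keeps the key small. Sketch only, not code from this PR.
func requestKey(req *storepb.SeriesRequest) string {
	b, err := req.Marshal()
	if err != nil {
		return req.String() // fall back to the text form
	}
	return strconv.FormatUint(xxhash.Sum64(b), 16)
}
```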

GiedriusS added 3 commits June 2, 2021 15:46
Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>

@bwplotka (Member) left a comment

Thanks for this attempt. This is generally known as singleflight and there might even be some gRPC libraries for this.

In the end I would worry about the real-world case:

  • How often are requests exactly the same? Often they are for a different millisecond, which already makes them "different".
  • If we do this on the querier side, we risk that the requests have already been load-balanced among many queriers.

I wonder, isn't this logic better added in query-frontend, which would allow us to plan ahead better? 🤔

@GiedriusS (Member, Author) commented Jun 3, 2021

> Thanks for this attempt. This is generally known as singleflight and there might even be some gRPC libraries for this.
>
> In the end I would worry about the real-world case:
>
>   • How often are requests exactly the same? Often they are for a different millisecond, which already makes them "different".
>   • If we do this on the querier side, we risk that the requests have already been load-balanced among many queriers.
>
> I wonder, isn't this logic better added in query-frontend, which would allow us to plan ahead better?

Thank you for your comment, and thanks for pointing me to the exact phrase I was looking for. https://pkg.go.dev/golang.org/x/sync/singleflight looks nice; I could reimplement this using that library if the overall approach here is OK with you.
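
For context, a minimal sketch of what a singleflight-based variant could look like. The fetchSeries/doSeries names are hypothetical helpers for this sketch, not part of this PR:

```go
package proxy

import (
	"golang.org/x/sync/singleflight"

	"github.com/thanos-io/thanos/pkg/store/storepb"
)

var group singleflight.Group

// fetchSeries coalesces identical Series() requests: concurrent callers that
// pass the same key share a single doSeries() invocation and all receive the
// shared result. Hypothetical helper, not part of this PR.
func fetchSeries(
	key string,
	req *storepb.SeriesRequest,
	doSeries func(*storepb.SeriesRequest) ([]*storepb.SeriesResponse, error),
) ([]*storepb.SeriesResponse, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return doSeries(req)
	})
	if err != nil {
		return nil, err
	}
	// Every coalesced caller gets the same underlying slice, so callers must
	// still copy it before mutating (the sole-ownership problem noted above).
	return v.([]*storepb.SeriesResponse), nil
}
```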

It seems some people even recommend groupcache as a way to implement this: https://www.reddit.com/r/golang/comments/98bzcr/team_exploring_how_to_cache_with_grpc_in_golang/e4g77i5/?utm_source=reddit&utm_medium=web2x&context=3. I'll research it a bit more. We have an issue about it; maybe this part of the code would be a good first user? :) However, let's discuss your other points.


Let's take a real-life case such as the node_exporter dashboard, which more or less everyone probably uses. It has a CPU Busy panel that uses a query such as:

```
(((count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))
```

Then there is the CPU Cores panel:

```
count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))
```

Then CPU System Load (1m):

```
avg(node_load1{instance=~"$node:$port",job=~"$job"}) /  count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)) * 100
```

And so on. Quite a few panels use node_cpu_seconds_total with identical matchers. So, once someone opens the dashboard, even with query-frontend we will happily go and pull all of those metrics a few times from remote object storage. This adds up when machines have 100+ cores. It's the same with some of the other metrics, and I imagine the situation is quite similar in other setups.

The initial dashboard load is obviously the slowest; afterward, asking for just the newest data via query-frontend is typically quite fast because it usually gets pulled from the Prometheus instances.

>   • If we do this on the querier side, we risk that the requests have already been load-balanced among many queriers.

Could you please elaborate on this? Even if the gRPC connections are load-balanced, it's still fine because we'd be sending one Series() call instead of 2 or more.

I don't think we should do this in query-frontend, because then essentially the whole expression evaluation (the PromQL engine part) would have to be pushed up to query-frontend, right? Unless I misunderstood you. In general, I don't think this is related to planning at all. Isn't planning about things such as QoS, i.e. assigning priorities to queries coming from different tenants and so on?

IMHO this is quite a simple optimization that has an orthogonal goal and can be in Thanos independently of query planning.

@yeya24 (Contributor) commented Jul 1, 2021

@GiedriusS I am looking at the same use case. Have you tried this implementation already and does it help improve the query performance?

Another question: when a Grafana dashboard loads, does it cache the results of identical queries, or does it simply issue them multiple times?

@yeya24 (Contributor) commented Jul 11, 2021

Looks like this PR works for us, and I can see that identical Series() requests are coalesced. This should solve #4407.

@GiedriusS (Member, Author) commented

Yes, I have tried it. I wouldn't have raised a pull request if the code didn't work 😄 The performance question depends on the queries: if a lot of them use the same metrics, there is a huge performance win. This still needs a bit of work on the benchmarks. I'll probably need to try out https://github.com/lestrrat-go/bufferpool 🤔

@GiedriusS (Member, Author) commented Jul 12, 2021

#4379 is also an optimization that implements a similar thing. @yeya24 could you PTAL at it?

EDIT: we are working on groupcache so that PR is kind of pointless.

@GiedriusS (Member, Author) commented

I have tried implementing this as a gRPC interceptor, but it is not viable there because the first RecvMsg() returns so quickly that the single-flight mechanism doesn't work properly at all. Here's my code: https://gist.github.com/GiedriusS/313b3be1543f543487df5f624f84d836.
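
To illustrate the structural problem (a sketch, not the code from the gist): a grpc.StreamClientInterceptor only wraps stream creation, while the series frames are pulled later via RecvMsg() in the caller, so a singleflight.Do() placed inside the interceptor can only deduplicate opening the stream, not the data transfer.

```go
package proxy

import (
	"context"

	"google.golang.org/grpc"
)

// coalescingStreamInterceptor shows where an interceptor-based approach runs
// out of road: by the time this function returns, no series data has been
// received yet, so there is nothing here to coalesce beyond stream creation.
// Sketch only, not the code from the gist linked above.
func coalescingStreamInterceptor() grpc.StreamClientInterceptor {
	return func(
		ctx context.Context,
		desc *grpc.StreamDesc,
		cc *grpc.ClientConn,
		method string,
		streamer grpc.Streamer,
		opts ...grpc.CallOption,
	) (grpc.ClientStream, error) {
		// Real coalescing would have to wrap the whole RecvMsg() loop, which
		// lives in the caller, not in the interceptor.
		return streamer(ctx, desc, cc, method, opts...)
	}
}
```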

So, I would like to come back to this pull request very soon.

…ame_series_requests

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>

stale bot commented Nov 25, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Nov 25, 2021
@stale stale bot closed this Jan 9, 2022