Remote path expression bulk fetch cache #2

Open
wants to merge 4 commits into base: datacratic

Conversation


@jraby jraby commented Dec 1, 2014

This is a proof of concept. It is working, but in a hacky state, so do not merge yet :-)

Note that, sadly, both the backend and the frontend need to run the new code to test this: I had to add a field to the backend response so that the frontend can map the series it returns back to the initial pathExpression (i.e. the target pattern, e.g. *.*.*.metricName).
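For illustration, the extra field might look something like this in one series entry of the backend's pickled /render response (the field name 'pathExpression' and the exact layout are assumptions for this sketch, not necessarily what the patch uses):

# Hypothetical shape of one series entry in the backend's pickled /render
# response; 'pathExpression' is the added field that lets the frontend map
# the series back to the original wildcard target it matched.
series_entry = {
    'name': 'east1.web03.cpu.cpuUsage',     # concrete metric that matched
    'pathExpression': '*.*.*.cpuUsage',     # original pattern from the request
    'start': 1417392000,                    # timerange and step of the datapoints
    'end': 1417478400,
    'step': 60,
    'values': [0.42, 0.40, None, 0.45],     # None marks a gap
}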

The goal of this is to reduce the number of queries between the frontend and the backends, as these round trips are believed to be the primary source of slowdown/latency in our setups.

Background

Here's an example of that latency in action.
When serving a query containing a few wildcards and a couple of metrics, for example target=*.*.*.cpuUsage&target=movingAverage(*.*.*.testMetrics), older versions of graphite (0.9.x, before November 2014) would query each backend with the following queries:

  • /metric/find : to discover which metrics live on which backend. That's 2 queries in our example.
    Then, for each matching metric, it would send a query like this:
  • /render?format=pickle&target=...
    In our example, let's say it matches 5 metrics.

So that's a total of 7 HTTP queries per backend for a simple graph request. If there's some latency between the frontend and the backends, let's say 80ms (east coast to west coast), that amounts to 560ms of minimum latency/service time, on top of which come the actual request processing time on the backend and the graph generation time.

In this fork, we added a thread pool to query the backends in parallel, so that the latency induced by the above process is somewhat limited (we wait in parallel, woohoo!).
This helps a bit, but the latency is still very noticeable.

Then, earlier this month, a patch was committed upstream which brought 'bulk querying' between the frontend and the backends. With that patch, the frontend skips the 'metric discovery' step mentioned above and does one request per target to each backend.
In the above example, instead of doing 7 queries it would only do 2, so we'd have a latency of ~160ms.
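As a back-of-the-envelope check, here is the minimum-latency arithmetic from the example above (queries assumed to be issued sequentially to a single backend, processing time ignored):

# Minimum per-backend latency for the example request, ignoring processing time.
rtt = 0.080  # ~80ms frontend <-> backend round trip (east coast to west coast)

queries_old  = 2 + 5   # 2 /metric/find + 5 /render (one per matched metric)
queries_bulk = 2       # one /render per target with the upstream bulk-querying patch
queries_this = 1       # a single prefetch /render for all pathExpressions (this PR)

for label, n in [('old', queries_old), ('bulk', queries_bulk), ('this PR', queries_this)]:
    print('%-8s %d round trip(s) -> >= %3.0f ms' % (label, n, n * rtt * 1000))
# old      7 round trip(s) -> >= 560 ms
# bulk     2 round trip(s) -> >= 160 ms
# this PR  1 round trip(s) -> >=  80 ms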

Patch description

The current pull request builds on that, but the approach is a bit more radical.

The idea is to fetch all the metrics in a single query per backend, reducing the latency to the minimum: 80ms in our example.

It sounds simple and obvious, but it is not exactly straightforward because of the way graphite does its parsing / querying. Basically, graphite parses the multiple targets recursively and takes the appropriate action for each element it encounters (fetch the data, or fetch the data for a math function and then apply that function, etc.).
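Roughly, that evaluation works like the following sketch (illustrative stand-ins only, not graphite's actual evaluator code):

# Illustrative sketch of the recursive evaluation described above:
# leaves of a parsed target are pathExpressions that trigger a fetch,
# inner nodes are functions applied to their evaluated arguments.

def fetch_data(request_context, path_expr):
    # stand-in for graphite's fetchData(); would hit local and remote backends
    print('fetch %s from %s to %s' % (path_expr,
                                      request_context['startTime'],
                                      request_context['endTime']))
    return []  # would return a list of TimeSeries

FUNCTIONS = {
    'movingAverage': lambda ctx, series, window: series,  # stand-in math function
}

def evaluate(request_context, node):
    if isinstance(node, str):              # leaf: a metric name / wildcard pattern
        return fetch_data(request_context, node)
    funcname, args = node                  # inner node: (function name, arguments)
    evaluated = [evaluate(request_context, a) if isinstance(a, (str, tuple)) else a
                 for a in args]
    return FUNCTIONS[funcname](request_context, *evaluated)

ctx = {'startTime': '-24h', 'endTime': 'now'}
evaluate(ctx, '*.*.*.cpuUsage')                               # plain fetch
evaluate(ctx, ('movingAverage', ('*.*.*.testMetrics', 10)))   # fetch, then apply the function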

I didn't want to get too deep into this and modify the request parsing, so I implemented the following:

  • Have a function extract all the pathExpressions (metric names/patterns) from the targets
  • Fetch all pathExpressions from all backends and store the results in a hash table keyed per backend and per metric name + timerange for easy lookups (see the sketch after this list):
    cache[remoteBackendName][originalTargetWithWildcards-startTime-endTime] = [series_from_remote_backend]
  • Store that hash table in the requestContext (which already carries the targets list, startTime, endTime, etc.)
  • Continue processing as before (via the recursive parser discussed above)
  • In the fetchData method (which, surprise!, fetches data from local and remote sources), do a hash table lookup for the requested data.
    If the data is there, or if there is no data but a prefetch call was made for that pathExpr, skip the remote fetch and use the data from the cache.
    If there is a cache miss, do a regular remote fetch. This case is pretty rare; it happens when a function needs data outside the timerange of the original query, for example timeShift, movingAverage, etc.
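Here is a minimal sketch of those steps. prefetchRemoteData and store.bulk_fetch are hypothetical names for this sketch; extractPathExpressions, requestContext and node.store.host are the names this patch touches, but the actual code differs in detail:

# Sketch only -- prefetchRemoteData and store.bulk_fetch are hypothetical.

def prefetchRemoteData(requestContext, targets):
    """Fetch every pathExpression from every backend, one bulk query per backend."""
    pathExpressions = extractPathExpressions(targets)
    startTime, endTime = requestContext['startTime'], requestContext['endTime']

    cache = {}
    for store in STORE.remote_stores:
        backendCache = cache.setdefault(store.host, {})
        # Register every prefetched key up front, so an empty entry later means
        # "a prefetch was made but the backend simply had no data".
        for pathExpr in pathExpressions:
            backendCache['-'.join([pathExpr, str(startTime), str(endTime)])] = []
        for series in store.bulk_fetch(pathExpressions, startTime, endTime):
            key = '-'.join([series['pathExpression'], str(startTime), str(endTime)])
            backendCache[key].append(series)

    requestContext['prefetched'] = cache   # travels alongside targets, startTime, endTime, ...


def fetchRemoteSeries(requestContext, node, pathExpr, startTime, endTime):
    """Called from fetchData / RemoteNode.fetch: use the prefetched series when possible."""
    backendCache = requestContext.get('prefetched', {}).get(node.store.host, {})
    key = '-'.join([pathExpr, str(startTime), str(endTime)])
    if key in backendCache:
        # Cache hit, or prefetched-but-empty: either way, skip the extra HTTP round trip.
        return backendCache[key]
    # Cache miss: a function asked for data outside the original timerange
    # (timeShift, movingAverage, ...), so fall back to a regular remote fetch.
    return node.fetch(startTime, endTime)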

So, here it is. It is a bit hacky in the sense that I hammered the bulk fetching into the RemoteNode.fetch function and it doesn't quite feel right, but other than that it seems to work.

I haven't done stress testing yet, so I do not know if it is leaking or anything, but I did get some good results.

Here are some numbers: 2 backends with <1ms latency, 5 different metrics with no wildcards.

Without prefetch:

Time per request:       245.265 [ms] (mean)
Time per request:       245.265 [ms] (mean, across all concurrent requests)
Transfer rate:          256.69 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   217  245  83.4    239    1518
Waiting:      217  245  83.4    239    1518
Total:        217  245  83.4    239    1518

Percentage of the requests served within a certain time (ms)
  50%    239
  66%    241
  75%    244
  80%    245
  90%    249
  95%    254
  98%    266
  99%    295

With prefetch:

Time per request:       223.962 [ms] (mean)
Time per request:       223.962 [ms] (mean, across all concurrent requests)
Transfer rate:          283.23 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   200  224  30.2    219     506
Waiting:      200  224  30.1    219     506
Total:        200  224  30.2    219     507

Percentage of the requests served within a certain time (ms)
  50%    219
  66%    224
  75%    227
  80%    229
  90%    235
  95%    239
  98%    245
  99%    437
 100%    507 (longest request)

I did not really expect an improvement in that case since this was on a LAN, but hey, I won't complain ;-)

Here are some numbers for a more real-world case:

  • 10 backend servers
    • 6 on east coast (~10ms latency)
    • 4 on west coast (~80ms latency)

Request containing 6 metrics and a few math functions:

http://192.168.168.52:4888/render?width=1350&
height=250&
from=-24 hours&
target=alias(maxSeries(*.*.*.*.x.a.*.mean),%27a%27)&
target=alias(maxSeries(*.*.*.*.x.b.mean),%27b%27)&
target=alias(maxSeries(*.*.*.*.x.c.mean),%27c%27)&
target=alias(maxSeries(*.*.*.*.x.d.mean),%27d%27)&
target=alias(maxSeries(*.*.*.*.x.e.mean),%27e%27)&
target=alias(scale(*.*.*.y.f.mean,0.01),%27f%27)&
vtitle=Percent%20%20%20%20&
title=Blah%20%20%20%20&
yMin=0.0&
yMax=1.0&
hideLegend=false&
noCache=1
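For reference, the prefetch step boils this request down to one bulk fetch per backend covering these six pathExpressions (the patterns extractPathExpressions pulls out of the targets; the alias/maxSeries/scale wrappers are applied afterwards by the evaluator):

# pathExpressions extracted from the six targets above
pathExpressions = [
    '*.*.*.*.x.a.*.mean',
    '*.*.*.*.x.b.mean',
    '*.*.*.*.x.c.mean',
    '*.*.*.*.x.d.mean',
    '*.*.*.*.x.e.mean',
    '*.*.*.y.f.mean',
]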

Result without patch:

Concurrency Level:      1
Time taken for tests:   600.826 seconds
Complete requests:      302
Failed requests:        294
   (Connect: 0, Receive: 0, Length: 294, Exceptions: 0)
Write errors:           0
Total transferred:      31300616 bytes
HTML transferred:       31212432 bytes
Requests per second:    0.50 [#/sec] (mean)
Time per request:       1989.490 [ms] (mean)
Time per request:       1989.490 [ms] (mean, across all concurrent requests)
Transfer rate:          50.87 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:  1920 1989 300.8   1952    7006
Waiting:     1920 1989 300.8   1952    7006
Total:       1920 1989 300.8   1952    7006
Percentage of the requests served within a certain time (ms)
  50%   1952
  66%   1960
  75%   1967
  80%   1977
  90%   2028
  95%   2136
  98%   2209
  99%   2290
 100%   7006 (longest request)

With patch:

Concurrency Level:      1
Time taken for tests:   600.778 seconds
Complete requests:      751
Failed requests:        677
   (Connect: 0, Receive: 0, Length: 677, Exceptions: 0)
Write errors:           0
Total transferred:      77768341 bytes
HTML transferred:       77549049 bytes
Requests per second:    1.25 [#/sec] (mean)
Time per request:       799.971 [ms] (mean)
Time per request:       799.971 [ms] (mean, across all concurrent requests)
Transfer rate:          126.41 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   740  800 171.4    790    5423
Waiting:      740  800 171.4    790    5423
Total:        740  800 171.4    790    5423
Percentage of the requests served within a certain time (ms)
  50%    790
  66%    798
  75%    803
  80%    805
  90%    813
  95%    829
  98%    863
  99%    889
 100%   5423 (longest request)

The improvement is pretty massive, more than 2x: mean request time drops from ~1990ms to ~800ms, and throughput goes from 0.50 to 1.25 requests per second.

Can I get a review of this? I'd appreciate ideas/feedback on the hacky parts so we can get this into better shape and send it upstream.

Thx

@nicolaskruchten @marccardinal

  - HACK for RemoteNode.name
  - fixup targets in query_params
  - replace hashData calls with a simple '-'.join
  - move call to extractPathExpressions in views.py to avoid circular dep
    between evaluator and datalib
  - s/node.host/node.store.host/