Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: implement single-flight caching in the inmemory cache #4379

Closed

Conversation

GiedriusS
Copy link
Member

Implement a single-flight mode for the inmemory cache. Since our code
goes and does something when Fetch() misses, let's implement the
single-flight mode based on this assumption - wait for the results of
the previous request and use it.

https://pkg.go.dev/golang.org/x/sync/singleflight is not used to avoid
complexity - there is no single function that we can call to get results
of some operation. Potentially it could be wrapped around the Bucket
interface but then the caching/remote object storage interfaces would be
muddled together and I really think we should avoid that.

Tested on a cluster and with unit tests.

Signed-off-by: Giedrius Statkevičius [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Implement a single-flight mode for the inmemory cache. Since our code
goes and does something when Fetch() misses, let's implement the
single-flight mode based on this assumption - wait for the results of
the previous request and use it.

https://pkg.go.dev/golang.org/x/sync/singleflight is not used to avoid
complexity - there is no single function that we can call to get results
of some operation. Potentially it could be wrapped around the `Bucket`
interface but then the caching/remote object storage interfaces would be
muddled together and I really think we should avoid that.

Tested on a cluster and with unit tests.

Signed-off-by: Giedrius Statkevičius <[email protected]>
bwplotka
bwplotka previously approved these changes Jun 28, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting approach. I think I like it generally. Have you tried run any benchmarks with it? (:

pkg/cache/inmemory.go Outdated Show resolved Hide resolved
pkg/cache/inmemory.go Outdated Show resolved Hide resolved
Signed-off-by: Giedrius Statkevičius <[email protected]>
@GiedriusS
Copy link
Member Author

GiedriusS commented Jun 30, 2021

I don't write benchmarks every day but this small benchmark that I wrote only shows a difference with high concurrency for the same keys:

BenchmarkInmemorySingleflight/inmemory_get_set_keys10000_conc1-16                    	     139	   7217798 ns/op	 4316280 B/op	   59475 allocs/op
BenchmarkInmemorySingleflight/inmemory_singleflight_get_set_keys10000_conc1-16       	     136	   7452180 ns/op	 4315249 B/op	   59473 allocs/op
BenchmarkInmemorySingleflight/inmemory_get_set_keys10000_conc5-16                    	      32	  34888344 ns/op	21581605 B/op	  297421 allocs/op
BenchmarkInmemorySingleflight/inmemory_singleflight_get_set_keys10000_conc5-16       	      32	  35949705 ns/op	21582620 B/op	  297430 allocs/op
BenchmarkInmemorySingleflight/inmemory_get_set_keys10000_conc10-16                   	      15	  72321593 ns/op	43164038 B/op	  594854 allocs/op
BenchmarkInmemorySingleflight/inmemory_singleflight_get_set_keys10000_conc10-16      	      14	  76983315 ns/op	43164684 B/op	  594859 allocs/op

This is expected since we now do some locking 🤔 But it potentially saves lots of requests and reduces costs so it's kind of like a no brainer 😄

@GiedriusS GiedriusS force-pushed the feature/inmemory_singleflight branch 2 times, most recently from 706f07b to 033ae20 Compare June 30, 2021 12:56
@GiedriusS GiedriusS force-pushed the feature/inmemory_singleflight branch from 033ae20 to af0c1e8 Compare June 30, 2021 12:57
Copy link
Contributor

@bill3tt bill3tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a good optimisation to make 👍

The main questions I have is:

  • Do we need this? Are we getting a big enough increase in performance to justify the non-trivial code overhead? I would hope that the benchmarks will tell us this concretely.

Aside from that it took me quite a while to figure out what is single-flighting and what does it do for us. In fact - I'm still not 100% sure 😅 It would be good to explicitly spell this out both in the PR and the code.

CHANGELOG.md Outdated Show resolved Hide resolved
added prometheus.Counter
current prometheus.Gauge
sfSaved prometheus.Counter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unintentional newline?


c.mu.Lock()

if c.subs[key] == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just checking for key's membership of map subs then let's be consistent with L300

if val, ok := c.subs[key]; !ok {
    continue
}

}

c.mu.Lock()
if c.subs[key] == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above RE membership tests

pkg/cache/inmemory.go Show resolved Hide resolved
g := &errgroup.Group{}

g.Go(func() error {
hits := c.Fetch(context.Background(), testKeys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hits := c.Fetch(context.Background(), testKeys)
// This blocks :(
hits := c.Fetch(context.Background(), testKeys)

To be consistent with the other test cases :)

return nil
})

time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need these sleeps? Magic sleeps always make me wonder what wasn't working before these were added 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To wait until the previous goroutine spawned and blocked on the Fetch() call. In practice, that will take only a few nano/milliseconds. It would be hard to implement some kind of synchronization primitive here because we are actually calling the caching code here hence a simple time.Sleep() has been added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time.Sleep(1 * time.Second)
// We need to wait until the previous goroutine has spawned and blocked on the Fetch() call. In practice, this will take only a few nano/milliseconds, so we wait for that to happen.
time.Sleep(1 * time.Second)

Ok - let's make that explicit then 👍


ctx, cancel := context.WithCancel(context.Background())

const testKey = "test"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does this need to be a const?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It's constant and doesn't change. Ideally this would be a const slice but no such things exist in Go (:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha fair point - on the whole I only really see consts defined at the package-scope instead of in the function-scope. Not a blocker ¯\_(ツ)_/¯

})
cancel()

time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto - are these magic sleeps needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, because we want to wait until c.Fetch() has been called. In practice, we should only wait for a millisecond or so but a second is here to be on the safe side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time.Sleep(1 * time.Second)
// We need to wait until c.Fetch() has been called. In practice, we should only wait for a millisecond or so but a second is here to be on the safe side.
time.Sleep(1 * time.Second)

continue
}

c.mu.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to be sharing this mutex across the read and write path? The naive argument would say - might this actually increase the time it takes to respond as we're waiting around a bunch.

Copy link
Member Author

@GiedriusS GiedriusS Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading and writing from a map from more than one goroutine panics the whole program hence this lock is needed. Yes, the benchmarks show a negligible increase in the time it takes to respond however those extra nanoseconds are much smaller than requesting data from the remote object storage

@GiedriusS
Copy link
Member Author

GiedriusS commented Jul 7, 2021

This feels like a good optimisation to make

The main questions I have is:

  • Do we need this? Are we getting a big enough increase in performance to justify the non-trivial code overhead? I would hope that the benchmarks will tell us this concretely.

Aside from that it took me quite a while to figure out what is single-flighting and what does it do for us. In fact - I'm still not 100% sure It would be good to explicitly spell this out both in the PR and the code.

From my tests, the in-memory cache blows away all other options with this improvement because Thanos Store will try harder to avoid expensive read operations which is what's causing the sluggishness. Of course, it doesn't work well with multiple nodes. Ideally, we'd have #2962 but the library there needs extra improvements. Also, you can see this in action by opening some dashboards with this feature. The counter jumps to a hundred and thousands very quickly.

In fact, single-flight functionality for caches (not just the simple, in-memory one) is so good that I would argue that this needs to be the default. But this PR adds a toggle switch for it and users who do not want this will have no impact on the performance. So, I would really like to see this get merged.

EDIT: there's also #4407 which is tangentially related so it's another statistical point in favor of this albeit it is anecdotal.

@GiedriusS GiedriusS requested a review from bwplotka July 7, 2021 12:56
Signed-off-by: Giedrius Statkevičius <[email protected]>
@GiedriusS
Copy link
Member Author

@thanos-io/thanos-maintainers could you PTAL? I'd love to get this into the next release :(

Copy link
Contributor

@bill3tt bill3tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a functional point of view - I'm confident that this change does what it says it does. I do still think we are missing some benchmarks to make it obvious beyond any reasonable doubt that the increase in complexity is worth the performance boost.

From a maintainability perspective, I think there are still some improvements to be made.

This is a non-trivial bit of code that has taken me ~30 mins on both reviews to understand and conceptualise how it fits into the wider read & write paths. If the code is easy to understand, people will maintain it and vice versa - what if Giedrus was hit by a bus?

Personally, I would like to see extra context in the comments and ideally a sequence diagram that explains why we are making this change.

I'm not going to block this PR as I'm still pretty inexperienced in the code base, but that is my 2c 🤗

return nil
})

time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time.Sleep(1 * time.Second)
// We need to wait until the previous goroutine has spawned and blocked on the Fetch() call. In practice, this will take only a few nano/milliseconds, so we wait for that to happen.
time.Sleep(1 * time.Second)

Ok - let's make that explicit then 👍


ctx, cancel := context.WithCancel(context.Background())

const testKey = "test"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha fair point - on the whole I only really see consts defined at the package-scope instead of in the function-scope. Not a blocker ¯\_(ツ)_/¯

})
cancel()

time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time.Sleep(1 * time.Second)
// We need to wait until c.Fetch() has been called. In practice, we should only wait for a millisecond or so but a second is here to be on the safe side.
time.Sleep(1 * time.Second)

pkg/cache/inmemory.go Show resolved Hide resolved
pkg/cache/inmemory.go Show resolved Hide resolved
@@ -289,6 +314,24 @@ func (c *InMemoryCache) reset() {
func (c *InMemoryCache) Store(ctx context.Context, data map[string][]byte, ttl time.Duration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this implemented on the InMemoryCache and not a remote cache? I would have thought that the benefits of this code would be much greater with a remote cache 🤔 Perhaps this is a 'test' implementation to get right before implementing the remote cache?

Copy link
Member Author

@GiedriusS GiedriusS Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer probably is that popular remote caches do not have such functionality - as far as I can tell from my research, it is impossible to block until a key appears on Redis and/or Memcached. For example, here is one implementation: https://github.com/sqlalchemy/dogpile.cache/blob/dbf57e4a7b8f4061b735be8b24bbb880fb75802f/dogpile/cache/backends/memcached.py#L42

As you can see, the "locking" works by creating another key and then sleeping on it. No bueno. Something like this https://github.com/mailgun/groupcache needs to be used but before that mailgun/groupcache#9 should be implemented. For now, let's implement the same functionality in memory - these two features are completely orthogonal :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-topic: My attempt to add groupcache failed because of the drastic changes that needs to be introduced to cache interfaces. The Get issue actually would help to solve this. It would be amazing to migrate groupcache and simplify what we have (we can drop memcached support when we move).

@GiedriusS
Copy link
Member Author

From a functional point of view - I'm confident that this change does what it says it does. I do still think we are missing some benchmarks to make it obvious beyond any reasonable doubt that the increase in complexity is worth the performance boost.

From a maintainability perspective, I think there are still some improvements to be made.

This is a non-trivial bit of code that has taken me ~30 mins on both reviews to understand and conceptualise how it fits into the wider read & write paths. If the code is easy to understand, people will maintain it and vice versa - what if Giedrus was hit by a bus?

Personally, I would like to see extra context in the comments and ideally a sequence diagram that explains why we are making this change.

I'm not going to block this PR as I'm still pretty inexperienced in the code base, but that is my 2c

Thanks for your review! I will try adding those things. Coming back to it, I still cannot see a way of implementing this in some other way besides working with context.Context unless we'd rework a whole lot of interfaces. It's nice ATM that they are generic, it means that we can plug them in wherever we want to. Also, all of this new logic is hidden from the caller's perspective so I would argue that this doesn't introduce a lot of complexity.

@GiedriusS
Copy link
Member Author

GiedriusS commented Aug 3, 2021

I have added a benchmark of Series() with 4 different setups:

  • Singleflight on, unlimited throughput
  • Singleflight off, unlimited throughput
  • Singleflight on, 2MiB/s throughput
  • Singleflight off, 2MiB/s throughput

The results are pretty good 😱 In some cases, the difference is 10-100x on small throughput benchmarks:

(/bin/go test -benchmem -run='^$' -benchtime 20x -timeout 9999s -bench '^(BenchmarkBucketSeries)$' github.com/thanos-io/thanos/pkg/store &>bench_data)

Benchmark results
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/store
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=0/1of1000000-16         	      20	  91872278 ns/op	62053467 B/op	    9697 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=0/10of1000000-16        	      20	  90943501 ns/op	62068092 B/op	    9827 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=0/1000000of1000000-16   	      20	 856642356 ns/op	1249098976 B/op	10045235 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=2.097152e+06/1of1000000-16         	      20	4459123241 ns/op	62086337 B/op	    9700 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=2.097152e+06/10of1000000-16        	      20	4455079743 ns/op	62085030 B/op	    9828 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=true,Throughput=2.097152e+06/1000000of1000000-16   	      20	5193180212 ns/op	1247463237 B/op	10045257 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=0/1of1000000-16                   	      20	  84673427 ns/op	62106584 B/op	    9768 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=0/10of1000000-16                  	      20	  87083260 ns/op	62115551 B/op	    9872 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=0/1000000of1000000-16             	      20	 841538409 ns/op	1274593827 B/op	10056803 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=2.097152e+06/1of1000000-16        	      20	4456800747 ns/op	62144327 B/op	    9786 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=2.097152e+06/10of1000000-16       	      20	4468411467 ns/op	62141646 B/op	    9896 allocs/op
BenchmarkBucketSeries/1000000SeriesWith1Samples,Singleflight=false,Throughput=2.097152e+06/1000000of1000000-16  	      20	7576618808 ns/op	1275337376 B/op	10060593 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=0/1of10000000-16                  	      20	   6576437 ns/op	 4872760 B/op	    1133 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=0/100of10000000-16                	      20	   6357284 ns/op	 4863187 B/op	    1164 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=0/10000000of10000000-16           	      20	  82506036 ns/op	128755490 B/op	 1010018 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=2.097152e+06/1of10000000-16       	      20	 483283161 ns/op	 4899822 B/op	    1146 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=2.097152e+06/100of10000000-16     	      20	 482696981 ns/op	 4911803 B/op	    1181 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=true,Throughput=2.097152e+06/10000000of10000000-16         	      20	 560929529 ns/op	123472229 B/op	 1009873 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=0/1of10000000-16                          	      20	   7243218 ns/op	 4898300 B/op	    1167 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=0/100of10000000-16                        	      20	   6538936 ns/op	 4904600 B/op	    1201 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=0/10000000of10000000-16                   	      20	  87463841 ns/op	150081864 B/op	 1021428 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=2.097152e+06/1of10000000-16               	      20	 499757455 ns/op	 4949073 B/op	    1189 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=2.097152e+06/100of10000000-16             	      20	 500387347 ns/op	 4949694 B/op	    1230 allocs/op
BenchmarkBucketSeries/100000SeriesWith100Samples,Singleflight=false,Throughput=2.097152e+06/10000000of10000000-16        	      20	2971774125 ns/op	152286798 B/op	 1025274 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=0/1of10000000-16                           	      20	    151952 ns/op	  225324 B/op	     229 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=0/100of10000000-16                         	      20	    141794 ns/op	  219999 B/op	     227 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=0/10000000of10000000-16                    	      20	  18142334 ns/op	41465792 B/op	  186320 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=2.097152e+06/1of10000000-16                	      20	  72945077 ns/op	  238615 B/op	     243 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=2.097152e+06/100of10000000-16              	      20	  73019216 ns/op	  242035 B/op	     243 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=true,Throughput=2.097152e+06/10000000of10000000-16         	      20	 114852266 ns/op	41475730 B/op	  186373 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=0/1of10000000-16                          	      20	    164981 ns/op	  258854 B/op	     269 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=0/100of10000000-16                        	      20	    161352 ns/op	  262310 B/op	     269 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=0/10000000of10000000-16                   	      20	  29896776 ns/op	151055664 B/op	  229318 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=2.097152e+06/1of10000000-16               	      20	  88654022 ns/op	  272788 B/op	     292 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=2.097152e+06/100of10000000-16             	      20	  88854959 ns/op	  284258 B/op	     291 allocs/op
BenchmarkBucketSeries/1SeriesWith10000000Samples,Singleflight=false,Throughput=2.097152e+06/10000000of10000000-16        	      20	8996025188 ns/op	156873714 B/op	  243499 allocs/op
PASS
ok  	github.com/thanos-io/thanos/pkg/store	1202.814s

Essentially, the get operation now has a small overhead and the overhead becomes small in comparison when lots of requests for the same data come in which, in my experience, is a common thing because Grafana dashboards commonly re-use the same metrics but with different PromQL operators, for example.

Being network-bound is the main limiting factor ATM in Thanos performance IMHO so this seems to me like a very important thing to have. Please see some of the rationales here: #4379 (comment). Given this, I'd like to merge this functionality for the in-memory cache and then we could move to the more advanced case where there are multiple Store instances.

@GiedriusS
Copy link
Member Author

BTW, don't merge this yet:

  • I will remove time.Sleep from tests;
  • I have also noticed an edge-case that isn't covered - what if Store() never comes? Then a key will be left dangling forever. Let me also improve the code to cover this case :)

@kakkoyun kakkoyun marked this pull request as draft August 12, 2021 09:45
@kakkoyun
Copy link
Member

@GiedriusS I have converted to draft, just in case to prevent accidental merges.

@stale
Copy link

stale bot commented Oct 16, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Oct 16, 2021
@stale stale bot closed this Oct 24, 2021
@matej-g
Copy link
Collaborator

matej-g commented Feb 25, 2022

What's the status of this @GiedriusS? Looks like this was nearly finished but then marked stale? Would be a shame to leave this unmerged it seems!

@GiedriusS
Copy link
Member Author

We have groupcache now which does "single-flight" so I don't think this is needed anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants