-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use GetTabletsByCell in healthcheck #14693
Conversation
Signed-off-by: deepthi <[email protected]>
Signed-off-by: deepthi <[email protected]>
…Cell Signed-off-by: deepthi <[email protected]>
Signed-off-by: deepthi <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on this! Makes for a great improvement.
… from stdlib Signed-off-by: deepthi <[email protected]>
Signed-off-by: deepthi <[email protected]>
I'm working on a test case as suggested by @mattlord, so we'll merge after that is done. |
Signed-off-by: deepthi <[email protected]>
Signed-off-by: deepthi <[email protected]>
…blet even if one or more fail Signed-off-by: deepthi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some minor comments. The most consequential ones -- although still relatively minor -- were related to using a semaphore and deferring mutex unlocking. Let me know what you think. I can come back to this quickly and re-approve at any time.
} | ||
if _, err := ts.UpdateTabletFields(context.Background(), tablet.Alias, func(t *topodatapb.Tablet) error { | ||
}) | ||
require.Nil(t, err, "UpdateTabletFields failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use require.Nil instead of require.NoError here and in a few other places? It's fine, just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the func has two return values, not just a single return value of type error. I'd have preferred the NoError
form, but it doesn't work for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that you can use this instead (in the other cases you're just using the error
return value directly from the function):
require.NoError(t, err, "UpdateTabletFields failed")
AFAIK they are equivalent, but it's a little more explicit. Not a problem at all either way.
tw.loadTablets() | ||
checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "GetTablet": 0, "RemoveTablet": 1}) | ||
checkOpCounts(t, counts, map[string]int64{"ListTablets": 1, "RemoveTablet": 1}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this new call is equivalent to the old one. I tend to prefer the previous as it's more explicit about what we expect (w/o looking at how checkOpCounts works), but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't call GetTablet
at all any more, so this change reflects that fact. I thought about deprecating and eventually deleting the GetTablet metric, but it didn't seem worth it.
Maybe it is worth changing the tests to explicitly expect 0 calls to GetTablet
as a way of catching regressions in this behavior.
…tch previous behavior, address review Signed-off-by: deepthi <[email protected]>
@mattlord I think I've addressed all the comments. This is ready for final review. If it looks good to you, please remove the "Do Not Merge" label. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some minor notes -- sorry to keep going back and forth on this. IMO it's worth being extra cautious in this case though.
tabletInfo, err := ts.GetTablet(ctx, tabletAlias) | ||
mutex.Lock() | ||
<-sem | ||
mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can get rid of this mutex usage here and replace the related returnError logic with sync.Once
?
The other option is to use a concurrency.FirstErrorRecorder
. Those both feel more standard to me.
I'm being paranoid (probably overly so) as it's easy to end up with deadlocks and panics. I guess we have unit test coverage though and thus -race
tests for the function?
tabletInfo, err := ts.GetTablet(ctx, tabletAlias) | ||
mutex.Lock() | ||
<-sem | ||
mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever we do end up locking it, we should unlock it in a defer here as we release it at the end of the closure/goroutine anyway.
|
||
for _, tabletAlias := range tabletAliases { | ||
wg.Add(1) | ||
go func(tabletAlias *topodatapb.TabletAlias) { | ||
defer wg.Done() | ||
if err := sem.Acquire(ctx, 1); err != nil { | ||
// Only happens if context is cancelled. | ||
mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should do this:
mu.Lock()
defer mu.Unlock()
...
return
We definitely want to return in this block as we didn't acquire the semaphore so we don't want to continue on in the goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be getting hard to follow at this point, so this is what I was thinking: https://gist.github.com/mattlord/089986f60bffe33d1d7b3faa21942e06
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah I was looking at the code again and realized I forgot a return 😮💨
Glad you caught it too. I think we actually need a lot more test coverage than we have, but I'm calling that debt at this point. We are slightly better off than before given that at least I added 4 unit tests.
Error conditions are usually the hardest to test, and that is where most the missing coverage tends to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use different functions in sync.Once.Do, it will only execute the Do method once. But it doesn't matter. It's not much different than what is there now.
go/vt/topo/tablet.go
Outdated
// Server.FindAllShardsInKeyspace. | ||
type GetTabletsByCellOptions struct { | ||
// Concurrency controls the maximum number of concurrent calls to GetTablet. | ||
// For backwards compatibility, concurrency of 0 is considered unlimited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this comment is right anymore, is it? I say that because we currently only use the value if it's > 0 (at least in GetTabletMap
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.. this has gone through 3 different implementations at this point, and somewhere in there we lost the infinite concurrency. I'm inclined to think 32 is fine as a default and we don't need to worry about infinite concurrency at all 🤷
Signed-off-by: deepthi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (again)! 🙂 I'm glad to see that we finally got this done. Thank you for that.
Signed-off-by: deepthi <[email protected]>
This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
* Backport Use GetTabletsByCell in healthcheck This backports upstram PR vitessio#14693, with a few minor changes to make it work with the Go version we are using and a small change to topology_watcher.go so that test cases reflect and test for the same behavior as the upstream code. The description of the original PR follows: VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once. This PR does a few more things: * GetTabletsForCell now handles the case where the response size violates gRPC limits by falling back to one tablet at a time in case of error. * Previously, the one tablet at a time method had unlimited concurrency. In this PR we introduce a configuration option for concurrency. * We pass topoReadConcurrency from healthcheck into GetTabletsForCell. * The behavior of --refresh_known_tablets flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known. The basic fix has already been tried in production and shown to reduce the number of Get calls from vtgate -> topo from O(n) to O(1). We can consider deprecating and deleting --refresh_known_tablets in a future release. The concerns that originally motivated adding that flag in vitessio#3965 are alleviated by fetching all tablets in one call to the topo.
Description
VTGate's healthcheck module currently calls GetTablet for each tablet alias that it discovers in a cell. Instead we can use GetTabletsForCell to fetch all tablets for a cell at once.
This PR does a few more things:
topoReadConcurrency
from healthcheck into GetTabletsForCell.--refresh_known_tablets
flag is different now. Previously we would not read those tablets at all, now we do read them, but ignore any changes if they are already known.The basic fix has already been tried in production and shown to reduce the number of
Get
calls from vtgate -> topo from O(n) to O(1).We can consider deprecating and deleting
--refresh_known_tablets
in a future release. The concerns that originally motivated adding that flag in #3965 are alleviated by fetching all tablets in one call to the topo.Related Issue(s)
Fixes #14277
Checklist
Deployment Notes