-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prevent vtctld from creating tons of S3 connections #15296
Conversation
Signed-off-by: Renan Rangel <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Signed-off-by: Renan Rangel <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15296 +/- ##
==========================================
- Coverage 67.41% 65.44% -1.98%
==========================================
Files 1560 1562 +2
Lines 192752 193914 +1162
==========================================
- Hits 129952 126906 -3046
- Misses 62800 67008 +4208 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Renan Rangel <[email protected]>
go/vt/mysqlctl/s3backupstorage/s3.go
Outdated
defaultS3Transport.TLSClientConfig = tlsClientConf | ||
} | ||
|
||
return defaultS3Transport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this concurrency safe / does it need to be? Since it doesn't seem like it is.
I'm also not a big fan of having a global for this, those tend to bite us later on. Can we move it at least into S3BackupStorage
instead then? That's still kinda global, but at least a tad better then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can then also initialize it already in the init()
to avoid the whole concurrency question / potential problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it should be, specially considering the default on the SDK already uses http.DefaultClient
, which in turn uses http.DefaultTransport
, this would have been a shared default in most projects that don't define their transport/client from scratch, see the Config
docs.
I thought the S3BackupStorage
would be created for each request being processed, but if that is not the case we could move it there. I will test it out and report back.
if the above is not possible and setting it up on init()
is preferred I can do that too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you added a unit test for this. What's the conclusion?
@dbussink time for a re-review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepthi I’m back next week but I don’t see a refactor of this or a reason why that couldn’t be done. I strongly prefer not having a global like this if that can reasonably be avoided.
Also happy to push up that refactor once I’m back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved this to inside the S3BackupStorage
and re-run our tests and seems the connections are getting re-used, so I have update the PR to reflect that, thanks for pointing it out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one.
Can you please take the description and split out the problem part of it into a new issue? How it is being fixed can be kept in the PR description.
We may want to backport this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rvrangel ! I only had a few small nits/suggestions. Please let me know what you think. I can come back quickly and review again and help to get it merged.
go/vt/mysqlctl/s3backupstorage/s3.go
Outdated
if defaultS3Transport == nil { | ||
tlsClientConf := &tls.Config{InsecureSkipVerify: tlsSkipVerifyCert} | ||
|
||
defaultS3Transport = http.DefaultTransport.(*http.Transport).Clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems correct after reading through: golang/go#26013
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should get rid of this getter and instead initialize the variable in init()
though. This is a more clear signal of our intent and a more standard pattern. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that makes sense. I will move it to the init()
function. I will updated the PR tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned in the other comment, I have moved this to the S3BackupStorage
struct, so we don't need to initialise a global anymore 🙌
@@ -276,3 +276,16 @@ func TestSSECustomerFileBase64Key(t *testing.T) { | |||
assert.Nil(t, sseData.customerKey, "customerKey expected to be nil") | |||
assert.Nil(t, sseData.customerMd5, "customerMd5 expected to be nil") | |||
} | |||
|
|||
func Test_getS3Transport(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but TestGetS3Transport
would be the typical name in the code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to TestGetTransport
to avoid the redundancy
go/vt/mysqlctl/s3backupstorage/s3.go
Outdated
@@ -83,6 +83,9 @@ var ( | |||
|
|||
// path component delimiter | |||
delimiter = "/" | |||
|
|||
// use a shared transport for all connections | |||
defaultS3Transport *http.Transport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but s3 is a bit redundant here. We can just call it defaultTransport
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got rid of the global variable 😌
Signed-off-by: Renan Rangel <[email protected]> Signed-off-by: 'Renan Rangel' <[email protected]>
go/vt/mysqlctl/s3backupstorage/s3.go
Outdated
// This creates a new transport based off http.DefaultTransport the first time and returns the same | ||
// transport on subsequent calls so connections can be reused as part of the same transport. | ||
func (bs *S3BackupStorage) getTransport() *http.Transport { | ||
if bs.transport == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as before though, does this get run concurrently?
I think we can create the transport whenever we create the struct instead to avoid that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is possible, considering that we will have multiple checks on different backups can happen at the same time. perhaps adding a mutex here would be better so it is always safe, regardless if it is set when we create the struct or when we first call this function, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, this is only called on S3BackupStorage.client()
which acquires a mutex when it is running, so we shouldn't see this running concurrently
vitess/go/vt/mysqlctl/s3backupstorage/s3.go
Lines 457 to 463 in eff2a92
func (bs *S3BackupStorage) client() (*s3.S3, error) { | |
bs.mu.Lock() | |
defer bs.mu.Unlock() | |
if bs._client == nil { | |
logLevel := getLogLevel() | |
httpClient := &http.Client{Transport: bs.getTransport()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvrangel i think we can then also move the initialization there too? No need to wrap it in another lazy loading function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that’s wrong. I mean where bs
is initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we can do that if you prefer - originally I didn't to make it easier to have a unit test around this (and the existing client()
function doesn't have a test, looks like it makes a call to S3 to validate it) and since we are not using a contructor function like New()
to create this struct - everything gets done in this function - it is harder to mock the S3 client that makes an actual HeadBucket()
call.
so my question is, would you be okay with moving this into client()
without a unit test, or do you want to make this into a bigger PR where we can mock that call and change how S3BackupStorage
gets created (or all structs that implement BackupStorage
in this case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is also the consideration that we might have a single S3BackupStorage
but we might call S3BackupStorage.client()
for each request - which then will use a different transport for each client we initialise, while the current approach uses the same transport for all clients generated by the same S3BackupStorage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a small refactor with a (private) constructor and then testing transport is fine?
I don’t see how we create different transports though? We can then set directly bs.transport
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking a better look at the code, we do that initialisation under init()
so it should be easy enough to update it with a new()
method instead of the current line, so I will update it tomorrow:
vitess/go/vt/mysqlctl/s3backupstorage/s3.go
Lines 512 to 513 in eff2a92
func init() { | |
backupstorage.BackupStorageMap["s3"] = &S3BackupStorage{params: backupstorage.NoParams()} |
but if we set bs.transport
directly, the problem is the functions in this struct were designed in a way to call bs.client()
every time, like here:
vitess/go/vt/mysqlctl/s3backupstorage/s3.go
Lines 292 to 294 in eff2a92
func (bs *S3BackupStorage) ListBackups(ctx context.Context, dir string) ([]backupstorage.BackupHandle, error) { | |
log.Infof("ListBackups: [s3] dir: %v, bucket: %v", dir, bucket) | |
c, err := bs.client() |
as an example take S3BackupStorage.ListBackups()
, every time we call it, it will in turn call S3BackupStorage.client()
which will set bs.transport =
again, making it a different transport each time. or unless you meant to lazy load it inside the client()
function as we are doing in the current chage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbussink I have moved this to a private constructor as we talked, and instead of setting bs.transport
inside client()
it is now only set on the constructor - and I have removed the lazy loading, but that will cause issues if there is any place where we use S3StorageBackup{}
without initialising the transport, but it doesn't seem like we do?
let me know if this looks good to you or if you want to add extra safety on the client()
function
Signed-off-by: Renan Rangel <[email protected]> Signed-off-by: 'Renan Rangel' <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this is now much simpler imho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, @rvrangel ! Thank you!
Signed-off-by: Renan Rangel <[email protected]> Signed-off-by: 'Renan Rangel' <[email protected]>
This is a really great contribution @rvrangel! Thank you. |
…15296) (#15401) Signed-off-by: Renan Rangel <[email protected]> Signed-off-by: 'Renan Rangel' <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Description
This fixes the issue described in #15335. It seems the bug was introduced in #4200.
The default transport was changed in favour of one were we could skip validating S3 certificates, but instead of cloning the default transport and adjusting the settings, we are creating a new transport from scratch but not setting any defaults (see current Go defaults for
DefaultTransport
). By doing so, we don't set keepalives, timeouts, max idle connections, etc and other values inTransport
and since some default to zero it basically means "no limit". To make matters worse, since we are creating a new transport for every client (in this case, for each request from what I understand), there is no sharing of connections between different requests, so the first call toGetBackups()
will open some S3 connections and keep them on its private transport instead of sharing it across different requests.After this fix, the number of established connections seems to be consistently below 5, and from initial testing it seems S3 connection errors seem to be gone.
Related Issue(s)
Checklist
Deployment Notes