Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent vtctld from creating tons of S3 connections #15296

Merged
merged 5 commits into from
Mar 4, 2024

Conversation

rvrangel
Copy link
Contributor

@rvrangel rvrangel commented Feb 20, 2024

Description

This fixes the issue described in #15335. It seems the bug was introduced in #4200.

The default transport was changed in favour of one were we could skip validating S3 certificates, but instead of cloning the default transport and adjusting the settings, we are creating a new transport from scratch but not setting any defaults (see current Go defaults for DefaultTransport). By doing so, we don't set keepalives, timeouts, max idle connections, etc and other values in Transport and since some default to zero it basically means "no limit". To make matters worse, since we are creating a new transport for every client (in this case, for each request from what I understand), there is no sharing of connections between different requests, so the first call to GetBackups() will open some S3 connections and keep them on its private transport instead of sharing it across different requests.

After this fix, the number of established connections seems to be consistently below 5, and from initial testing it seems S3 connection errors seem to be gone.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Feb 20, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Feb 20, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone Feb 20, 2024
Signed-off-by: Renan Rangel <[email protected]>
Copy link

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 65.44%. Comparing base (696fe0e) to head (13a3911).
Report is 68 commits behind head on main.

Files Patch % Lines
go/vt/mysqlctl/s3backupstorage/s3.go 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15296      +/-   ##
==========================================
- Coverage   67.41%   65.44%   -1.98%     
==========================================
  Files        1560     1562       +2     
  Lines      192752   193914    +1162     
==========================================
- Hits       129952   126906    -3046     
- Misses      62800    67008    +4208     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rvrangel rvrangel added the Component: General Changes throughout the code base label Feb 21, 2024
Signed-off-by: Renan Rangel <[email protected]>
@rvrangel rvrangel marked this pull request as ready for review February 21, 2024 13:08
defaultS3Transport.TLSClientConfig = tlsClientConf
}

return defaultS3Transport
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this concurrency safe / does it need to be? Since it doesn't seem like it is.

I'm also not a big fan of having a global for this, those tend to bite us later on. Can we move it at least into S3BackupStorage instead then? That's still kinda global, but at least a tad better then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can then also initialize it already in the init() to avoid the whole concurrency question / potential problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should be, specially considering the default on the SDK already uses http.DefaultClient, which in turn uses http.DefaultTransport, this would have been a shared default in most projects that don't define their transport/client from scratch, see the Config docs.

I thought the S3BackupStorage would be created for each request being processed, but if that is not the case we could move it there. I will test it out and report back.

if the above is not possible and setting it up on init() is preferred I can do that too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you added a unit test for this. What's the conclusion?
@dbussink time for a re-review?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepthi I’m back next week but I don’t see a refactor of this or a reason why that couldn’t be done. I strongly prefer not having a global like this if that can reasonably be avoided.

Also happy to push up that refactor once I’m back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved this to inside the S3BackupStorage and re-run our tests and seems the connections are getting re-used, so I have update the PR to reflect that, thanks for pointing it out!

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one.
Can you please take the description and split out the problem part of it into a new issue? How it is being fixed can be kept in the PR description.
We may want to backport this too.

@deepthi deepthi added Backport to: release-19.0 Needs to be back ported to release-19.0 and removed NeedsWebsiteDocsUpdate What it says NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Feb 22, 2024
@rvrangel
Copy link
Contributor Author

thanks @deepthi , I created #15335 and mentioned it here in the PR description

@systay systay mentioned this pull request Feb 26, 2024
38 tasks
@frouioui frouioui removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request labels Feb 26, 2024
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rvrangel ! I only had a few small nits/suggestions. Please let me know what you think. I can come back quickly and review again and help to get it merged.

if defaultS3Transport == nil {
tlsClientConf := &tls.Config{InsecureSkipVerify: tlsSkipVerifyCert}

defaultS3Transport = http.DefaultTransport.(*http.Transport).Clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems correct after reading through: golang/go#26013

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should get rid of this getter and instead initialize the variable in init() though. This is a more clear signal of our intent and a more standard pattern. Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that makes sense. I will move it to the init() function. I will updated the PR tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned in the other comment, I have moved this to the S3BackupStorage struct, so we don't need to initialise a global anymore 🙌

@@ -276,3 +276,16 @@ func TestSSECustomerFileBase64Key(t *testing.T) {
assert.Nil(t, sseData.customerKey, "customerKey expected to be nil")
assert.Nil(t, sseData.customerMd5, "customerMd5 expected to be nil")
}

func Test_getS3Transport(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but TestGetS3Transport would be the typical name in the code base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to TestGetTransport to avoid the redundancy

@@ -83,6 +83,9 @@ var (

// path component delimiter
delimiter = "/"

// use a shared transport for all connections
defaultS3Transport *http.Transport
Copy link
Contributor

@mattlord mattlord Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but s3 is a bit redundant here. We can just call it defaultTransport.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got rid of the global variable 😌

Signed-off-by: Renan Rangel <[email protected]>
Signed-off-by: 'Renan Rangel' <[email protected]>
// This creates a new transport based off http.DefaultTransport the first time and returns the same
// transport on subsequent calls so connections can be reused as part of the same transport.
func (bs *S3BackupStorage) getTransport() *http.Transport {
if bs.transport == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as before though, does this get run concurrently?

I think we can create the transport whenever we create the struct instead to avoid that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible, considering that we will have multiple checks on different backups can happen at the same time. perhaps adding a mutex here would be better so it is always safe, regardless if it is set when we create the struct or when we first call this function, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, this is only called on S3BackupStorage.client() which acquires a mutex when it is running, so we shouldn't see this running concurrently

func (bs *S3BackupStorage) client() (*s3.S3, error) {
bs.mu.Lock()
defer bs.mu.Unlock()
if bs._client == nil {
logLevel := getLogLevel()
httpClient := &http.Client{Transport: bs.getTransport()}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvrangel i think we can then also move the initialization there too? No need to wrap it in another lazy loading function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that’s wrong. I mean where bs is initialized.

Copy link
Contributor Author

@rvrangel rvrangel Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we can do that if you prefer - originally I didn't to make it easier to have a unit test around this (and the existing client() function doesn't have a test, looks like it makes a call to S3 to validate it) and since we are not using a contructor function like New() to create this struct - everything gets done in this function - it is harder to mock the S3 client that makes an actual HeadBucket() call.

so my question is, would you be okay with moving this into client() without a unit test, or do you want to make this into a bigger PR where we can mock that call and change how S3BackupStorage gets created (or all structs that implement BackupStorage in this case)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is also the consideration that we might have a single S3BackupStorage but we might call S3BackupStorage.client() for each request - which then will use a different transport for each client we initialise, while the current approach uses the same transport for all clients generated by the same S3BackupStorage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a small refactor with a (private) constructor and then testing transport is fine?

I don’t see how we create different transports though? We can then set directly bs.transport?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taking a better look at the code, we do that initialisation under init() so it should be easy enough to update it with a new() method instead of the current line, so I will update it tomorrow:

func init() {
backupstorage.BackupStorageMap["s3"] = &S3BackupStorage{params: backupstorage.NoParams()}

but if we set bs.transport directly, the problem is the functions in this struct were designed in a way to call bs.client() every time, like here:

func (bs *S3BackupStorage) ListBackups(ctx context.Context, dir string) ([]backupstorage.BackupHandle, error) {
log.Infof("ListBackups: [s3] dir: %v, bucket: %v", dir, bucket)
c, err := bs.client()

as an example take S3BackupStorage.ListBackups(), every time we call it, it will in turn call S3BackupStorage.client() which will set bs.transport = again, making it a different transport each time. or unless you meant to lazy load it inside the client() function as we are doing in the current chage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbussink I have moved this to a private constructor as we talked, and instead of setting bs.transport inside client() it is now only set on the constructor - and I have removed the lazy loading, but that will cause issues if there is any place where we use S3StorageBackup{} without initialising the transport, but it doesn't seem like we do?

let me know if this looks good to you or if you want to add extra safety on the client() function

Signed-off-by: Renan Rangel <[email protected]>
Signed-off-by: 'Renan Rangel' <[email protected]>
Copy link
Contributor

@dbussink dbussink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is now much simpler imho.

@mattlord mattlord self-requested a review March 4, 2024 19:25
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @rvrangel ! Thank you!

@mattlord mattlord merged commit 259d963 into vitessio:main Mar 4, 2024
102 checks passed
vitess-bot pushed a commit that referenced this pull request Mar 4, 2024
Signed-off-by: Renan Rangel <[email protected]>
Signed-off-by: 'Renan Rangel' <[email protected]>
@deepthi
Copy link
Member

deepthi commented Mar 4, 2024

This is a really great contribution @rvrangel! Thank you.

frouioui pushed a commit that referenced this pull request Mar 4, 2024
…15296) (#15401)

Signed-off-by: Renan Rangel <[email protected]>
Signed-off-by: 'Renan Rangel' <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backport to: release-19.0 Needs to be back ported to release-19.0 Component: General Changes throughout the code base Type: Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: vtctlds can create an excessive amount of connections when receiving concurrent GetBackups() calls
5 participants