-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
s3: add RecreateWhenSlow for S3 GetObject API #59342
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: lance6716 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #59342 +/- ##
================================================
+ Coverage 73.0960% 74.8231% +1.7271%
================================================
Files 1690 1736 +46
Lines 467135 477451 +10316
================================================
+ Hits 341457 357244 +15787
+ Misses 104710 97714 -6996
- Partials 20968 22493 +1525
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
case <-done: | ||
} | ||
|
||
if err = r.recreateConn(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might keep recreating if new connection is also slow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's expected. If new connection is slow, we still can't make progress, so still recreating to see if the network is getting better. Also I will double expectedDuration
to avoid the duration is set too small and always fails
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
/retest |
@lance6716: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
a little messy to workaround like this, I recommend we merge it unless we have to, and let the kernel fix it. the kernel bug not only affect TiDB, it affect every other applications require access S3 or other network endpoint, the user should better choose a distribution without the bug. Also I check some popular distribution of linux, kernel 6.x is mostly used in more recent version, it has bigger chance that it's not the mainstream version used. |
I have tested that without this fix, using Amazon Linux as OS (mimic EKS behaviour) the merge step will still stuck. This stuck is caused by S3 server deliberately choose not to send data after a TCP connection has been in a bad state for a long time. So we still need to reopen it. AWS support says this is because S3 is not designed to serve long connection. Malicious users can use long connections to exhaust the S3 server resources. But our merge step really need the long connection. |
What problem does this PR solve?
Issue Number: close #59341
Problem Summary:
What changed and how does it work?
We found a problem that in AWS Golang SDK when we don't read the HTTP connection of GetObject for a long time, the connection may be stuck. After inspecting packets from
tcpdump
and consulting AWS supports, we choose to reopen the HTTP connection for the slow connection as a temporary workaround. A better and future solution could be let multiple HTTP requests using the 1 TCP connection but it may need to manually enable http2 or choose AWS CRT client which is not availble in Golang.Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.