Unable to fetch files from S3 #1200
Comments
The above suggests that the SDK is failing to communicate with EC2's IMDS to get your credentials. This likely has nothing to do with calls to S3 at all, since the error is returned well before the service would actually be called. Are you sure that IMDS is correctly configured to provide credentials, and that it is the credential provider you want?
Here is some info on troubleshooting IMDS: https://repost.aws/knowledge-center/ec2-linux-metadata-retrieval
Here is our guide on credential retrieval and the order in which credentials are resolved: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credproviders.html
I don't think it's a credentials issue, because if it were, the very first request would fail and I wouldn't see successful responses in the server-side logs. I am seeing ~800-900 completed requests out of the 2369.
According to the offline discussion, this was due to requests to S3 being throttled. The solution was to add a semaphore around spawning the read tasks to limit concurrency, and to switch back to the default client config. IMDS and stalled-stream protection were irrelevant to the observed failure.
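The fix described above, bounding how many read tasks are in flight with a semaphore, follows a standard pattern. In the reporter's async code this would be `tokio::sync::Semaphore` (with `acquire_owned` on an `Arc<Semaphore>` before each `spawn`); the sketch below shows the same bounded-concurrency idea using only the standard library so it runs standalone. The permit count, task count, and simulated read are illustrative values, not taken from the issue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Minimal counting semaphore (std has none; tokio provides tokio::sync::Semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

// Run `tasks` simulated reads with at most `limit` in flight; returns the
// peak number of reads that were actually concurrent.
fn run_bounded(tasks: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire(); // blocks until a permit is free
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(1)); // stand-in for the S3 read
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    println!("peak in-flight reads: {}", run_bounded(200, 16));
}
```

Without the semaphore, all 2369 tasks would be spawned at once and race to open connections, which is consistent with the throttling behavior reported above.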
Comments on closed issues are hard for our team to see. |
Describe the bug
I am trying to read many small files from S3 in parallel but am experiencing timeout errors. My timeout configuration consists only of an operation timeout of 25 seconds, which is more than enough time to complete my workload (explanation below). The “timed out” requests never reach S3, as the keys are not present in the server-side logs. It seems the SDK is throttling requests, causing them to time out before they are executed.
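For context, an operation timeout like the one described can be set through the SDK's `TimeoutConfig`. This is a minimal configuration sketch, assuming recent `aws-config`/`aws-sdk-s3` crates; the function name is hypothetical and the snippet mirrors the reported setup rather than the reporter's actual code:

```rust
use std::time::Duration;
use aws_config::{timeout::TimeoutConfig, BehaviorVersion};

// Sketch: an S3 client whose only non-default setting is a 25s operation timeout.
async fn make_client() -> aws_sdk_s3::Client {
    let timeouts = TimeoutConfig::builder()
        .operation_timeout(Duration::from_secs(25))
        .build();
    let config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(timeouts)
        .load()
        .await;
    aws_sdk_s3::Client::new(&config)
}
```

Note that the operation timeout covers the entire operation, including any time the request spends queued in the client before a connection is available.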
Regression Issue
Expected Behavior
My workload is 2369 files where each file is <=8MB. My environment consists of a p4de EC2 instance in the same account as the bucket being accessed.
Reading a single 8MB file should take <=1ms. A single NIC on a p4de achieves 100 Gbps of throughput, which means:
8MB * 8 / 1000 = 0.064Gb
0.064Gb / 100Gbps = 0.00064s = 0.64ms = ~1ms
Furthermore, a p4de has 96 cores and the tokio runtime is configured to use all of them, so at most 96 requests are processed at a time. Theoretically, then, all 2369 requests should complete in about 25ms:
2369 tasks / 96 cores * 1ms = 24.677ms = ~25ms.
Operation timeouts of 25ms and 25s both produce timeout errors. I expect 25s to be more than enough to never experience a timeout.
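The back-of-the-envelope numbers above can be checked directly. The sketch below just encodes the same arithmetic (8MB file, 100 Gbps link, 2369 tasks over 96 cores at ~1ms each); the function names are illustrative:

```rust
// Transfer time in ms for one file of `file_mb` megabytes over a `link_gbps` link.
fn per_file_ms(file_mb: f64, link_gbps: f64) -> f64 {
    file_mb * 8.0 / 1000.0 / link_gbps * 1000.0
}

// Total time in ms for `files` tasks spread over `cores` workers at `ms_each` per task.
fn total_ms(files: f64, cores: f64, ms_each: f64) -> f64 {
    files / cores * ms_each
}

fn main() {
    println!("{:.2} ms per file", per_file_ms(8.0, 100.0));   // 0.64 ms
    println!("{:.2} ms in total", total_ms(2369.0, 96.0, 1.0)); // ~24.68 ms
}
```

Both figures agree with the estimates in the report, so even with generous slack a 25s operation timeout should never fire if requests were actually being dispatched.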
Current Behavior
Error from application layer
From the debug logs, nothing stands out prior to the error, except for a large number of these:
Reproduction Steps
Client initialization
Reading function
Driver code
Possible Solution
No response
Additional Information/Context
No response
Version
Environment details (OS name and version, etc.)
AL2 5.10.214-202.855.amzn2.x86_64
Logs
No response