Unable to fetch files from S3 #1200

Closed
1 task
pkasravi opened this issue Sep 30, 2024 · 4 comments
Labels
bug This issue is a bug.

Comments

@pkasravi

Describe the bug

I am trying to read many small files from S3 in parallel but am experiencing timeout errors. The only timeout I configure is an operation timeout of 25 seconds. This is more than enough time to complete my workload (explanation below). The “timed out” requests never reach S3, as the keys are not present in the server-side logs. It seems the SDK is throttling requests, causing them to time out before they’re executed.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

My workload is 2369 files where each file is <=8MB. My environment consists of a p4de EC2 instance in the same account as the bucket being accessed.

Reading a single 8MB file should take <=1ms. A single NIC on a p4de achieves 100 Gbps throughput, which means:
8MB * 8 / 1000 = 0.064Gb
0.064Gb / 100Gbps = 0.00064s = 0.64ms = ~1ms

Furthermore, a p4de has 96 cores and tokio-rs is configured to use all cores, so only 96 requests can be processed at a time. In theory, I should be able to complete all 2369 requests in about 25ms:
2369 tasks / 96 cores * 1ms = 24.677ms = ~25ms.
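
The two estimates above can be reproduced with a small sketch. The function names are hypothetical and exist only for this illustration; the figures (8 MB, 100 Gbps, 2369 requests, 96 cores, 1 ms per request) come from the text. It is a bandwidth-only bound and does not model per-request latency.

```rust
/// Seconds needed to move `megabytes` (decimal MB) over a link of `gbps` Gb/s.
fn transfer_seconds(megabytes: f64, gbps: f64) -> f64 {
    let gigabits = megabytes * 8.0 / 1000.0;
    gigabits / gbps
}

/// Milliseconds for `requests` requests run `parallelism` at a time,
/// each assumed to take `per_request_ms`.
fn batch_millis(requests: u32, parallelism: u32, per_request_ms: f64) -> f64 {
    requests as f64 / parallelism as f64 * per_request_ms
}

fn main() {
    let per_file_ms = transfer_seconds(8.0, 100.0) * 1000.0;
    println!("per file: {per_file_ms:.2} ms"); // prints "per file: 0.64 ms"

    let total_ms = batch_millis(2369, 96, 1.0);
    println!("whole batch: {total_ms:.3} ms"); // prints "whole batch: 24.677 ms"
}
```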

Operation timeouts of both 25ms and 25s produce timeout errors. I expect 25s to be more than enough to never experience any timeouts.

Current Behavior

Error from application layer

RuntimeError: Failed to fetch. Bucket: XXXX Key: XXXX

Caused by:
    0: dispatch failure
    1: other
    2: an error occurred while loading credentials
    3: an unexpected error occurred communicating with IMDS
    4: error trying to connect: HTTP connect timeout occurred after 1s
    5: HTTP connect timeout occurred after 1s
    6: timed out
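
The numbered "Caused by" list above resembles what is printed when an error's chain of underlying causes is followed via `std::error::Error::source`, outermost first, so the last entries are the root cause. A minimal sketch of that traversal follows; the `Layer` type and the `nest` and `chain` helpers are hypothetical and are not part of the SDK.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type used only for this illustration.
#[derive(Debug)]
struct Layer {
    msg: &'static str,
    source: Option<Box<Layer>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref().map(|e| e as &(dyn Error + 'static))
    }
}

/// Collects the message of `err` and of every underlying cause, outermost first.
fn chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = Some(err);
    while let Some(e) = current {
        out.push(e.to_string());
        current = e.source();
    }
    out
}

/// Builds a nested error from messages ordered outermost to innermost.
fn nest(msgs: &[&'static str]) -> Option<Layer> {
    msgs.iter().rev().fold(None, |inner, &msg| {
        Some(Layer { msg, source: inner.map(Box::new) })
    })
}

fn main() {
    let err = nest(&[
        "dispatch failure",
        "an error occurred while loading credentials",
        "an unexpected error occurred communicating with IMDS",
        "HTTP connect timeout occurred after 1s",
    ])
    .expect("at least one message");

    for (i, line) in chain(&err).iter().enumerate() {
        println!("{i}: {line}");
    }
}
```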

In the debug logs, nothing stands out prior to the error, except for many lines like this:

DEBUG aws_smithy_runtime::client::http::body::minimum_throughput::http_body_0_4_x: current throughput: 0 B/s is below minimum: 1 B/s

Reproduction Steps

Client initialization

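use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Only the overall operation timeout is set; all other timeouts keep their defaults.
let timeout_config = TimeoutConfig::builder()
    .operation_timeout(Duration::from_secs(25))
    .build();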

let config = aws_config::from_env()
    .region(aws_config::Region::new("us-west-2"))
    .timeout_config(timeout_config)
    .load()
    .await;
if cfg!(debug_assertions) {
    tracing_subscriber::fmt::init();
}
aws_sdk_s3::Client::new(&config)

Reading function

pub async fn read_from_s3(client: &Client, bucket: &str, key: &str) -> Result<Vec<u8>> {
    let mut res = client
        .get_object()
        .bucket(bucket.to_string())
        .key(key)
        .send()
        .await
        .with_context(|| format!("Failed to fetch. Bucket: {} Key: {}", bucket, key))?;
    let mut body_bytes = Vec::new();
    while let Some(chunk) = res.body.next().await {
        // Propagate body read errors to the caller instead of panicking.
        body_bytes.extend_from_slice(&chunk?);
    }
    Ok(body_bytes)
}

Driver code

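// `filenames`, `client`, and `bucket` are assumed to be defined earlier.
let mut tasks = vec![];
for filename in filenames {
    // Clone per task so each spawned future owns its own handles.
    let client_clone = client.clone();
    let bucket_clone = bucket.clone();
    tasks.push(tokio::spawn(async move {
        read_from_s3(&client_clone, &bucket_clone, &filename).await
    }));
}
let results = join_all(tasks).await;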

Possible Solution

No response

Additional Information/Context

No response

Version

aws-sdk-s3 v1.47.0
aws-config v1.5.5

Environment details (OS name and version, etc.)

AL2 5.10.214-202.855.amzn2.x86_64

Logs

No response

@pkasravi pkasravi added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 30, 2024
@landonxjames
Contributor

landonxjames commented Sep 30, 2024

3: an unexpected error occurred communicating with IMDS

The above suggests that the SDK is failing to communicate with EC2's IMDS to get your credentials. This likely has nothing to do with calls to S3 at all since the error is being returned way before the service would actually be called. Are you sure that IMDS is correctly configured to provide credentials or that it is the credential provider you want?

Here is some info on troubleshooting IMDS: https://repost.aws/knowledge-center/ec2-linux-metadata-retrieval

Here is our guide on credential retrieval and the order creds are resolved in: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credproviders.html

@pkasravi
Author

3: an unexpected error occurred communicating with IMDS

The above suggests that the SDK is failing to communicate with EC2's IMDS to get your credentials. This likely has nothing to do with calls to S3 at all since the error is being returned way before the service would actually be called. Are you sure that IMDS is correctly configured to provide credentials or that it is the credential provider you want?

Here is some info on troubleshooting IMDS: https://repost.aws/knowledge-center/ec2-linux-metadata-retrieval

Here is our guide on credential retrieval and the order creds are resolved in: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credproviders.html

I don't think it's a credentials issue because if it were, the very first request would fail and I wouldn't see successful responses in the server-side logs. I am seeing ~800-900 completed requests out of the 2369.

@Velfi Velfi removed the needs-triage This issue or PR still needs to be triaged. label Oct 3, 2024
@ysaito1001
Collaborator

According to the offline discussion, this was due to requests to S3 getting throttled. The solution was to add a semaphore around spawning the read tasks to limit concurrency, and to switch back to the default client config.

IMDS and stalled stream protection were irrelevant to the observed failure.

