Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Fix] Fix Excessive IMDS related error logging #1440

Merged
merged 11 commits into from
Dec 4, 2024
Merged

Conversation

zhihonl
Copy link
Contributor

@zhihonl zhihonl commented Nov 21, 2024

Description of the issue

    status code: 404, request id: D! should retry true for imds error : EC2MetadataError: failed to make EC2Metadata request
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

In version 1.300049.0 and above, the agent will log the above message regardless of log levels on a defined interval. This is because we recently enabled instance tags by default to retrieve ASG name and instance tag name for entity service names. This becomes an issue when instance metadata tags is not enabled which can be majority case since instance metadata tags is an opt-in feature. The issue is especially apparent in EKS since EKS does not support instance metadata tags

Description of changes

  1. Only log the error when debug mode is enabled
  2. Only enable instance tag related features in strictly EC2 mode
  3. Only retry up to 3 times for instance tag errors.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Unit tests

Manual Test

EKS Agent log

Before the fix, we will immediately see EC2MetadataError logs during agent start. In the below logs, we do not see any.

2024-11-21T22:05:45Z I! CWAGENT_LOG_LEVEL is set to "DEBUG"                                                                                                                                                   │
│ 2024-11-21T22:05:45Z I! Starting AmazonCloudWatchAgent CWAgent/Unknown (go1.22.7; linux; amd64) with log file  with log target lumberjack                                                                     │
│ 2024-11-21T22:05:45Z I! AWS SDK log level not set                                                                                                                                                             │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/service.go:115","msg":"Setting up own telemetry..."}                                                                                                      │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/service.go:156","msg":"Skipped telemetry setup.","address":"","level":"None"}                                                                             │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/exporter.go:280","msg":"Development component. May change in the future.","kind":"exporter","data_type":"traces","name":"debug/application_signals"}     │
│ 2024-11-21T22:05:45Z W! {"caller":"[email protected]/emf_exporter.go:99","msg":"the default value for DimensionRollupOption will be changing to NoDimensionRollupin a future release. See https://githu │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/exporter.go:280","msg":"Development component. May change in the future.","kind":"exporter","data_type":"metrics","name":"debug/application_signals"}    │
│ 2024-11-21T22:05:45Z W! {"caller":"[email protected]/emf_exporter.go:99","msg":"the default value for DimensionRollupOption will be changing to NoDimensionRollupin a future release. See https://githu │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/service.go:182","msg":"Starting CWAgent...","Version":"Unknown","NumCPU":2}                                                                               │
│ 2024-11-21T22:05:45Z I! {"caller":"extensions/extensions.go:34","msg":"Starting extensions..."}                                                                                                               │
│ 2024-11-21T22:05:45Z I! {"caller":"extensions/extensions.go:37","msg":"Extension is starting...","kind":"extension","name":"agenthealth/logs"}                                                                │
│ 2024-11-21T22:05:45Z I! {"caller":"extensions/extensions.go:52","msg":"Extension started.","kind":"extension","name":"agenthealth/logs"}                                                                      │
│ 2024-11-21T22:05:45Z I! {"caller":"extensions/extensions.go:37","msg":"Extension is starting...","kind":"extension","name":"server"}                                                                          │
│ 2024-11-21T22:05:45Z I! {"caller":"extensions/extensions.go:52","msg":"Extension started.","kind":"extension","name":"server"}                                                                                │
│ 2024-11-21T22:05:45Z D! Fallback shared config file(s): [/root/.aws/credentials]                                                                                                                              │
│ 2024-11-21T22:05:45Z D! Successfully created credential sessions                                                                                                                                              │
│ 2024-11-21T22:05:45Z D! Using credential ASIA5SCS6GJ6V4MGQ5HI from EC2RoleProvider                                                                                                                            │
│ 2024-11-21T22:05:45Z I! {"caller":"[email protected]/extension.go:39","msg":"X-Ray proxy server started on 0.0.0.0:2000","kind":"extension","name":"awsproxy/application_signals"}                            │
│ W1121 22:05:45.914346       1 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                                                        │
│ I1121 22:05:47.914827       1 shared_informer.go:313] Waiting for caches to sync for podWatcher                                                                                                               │
│ I1121 22:05:48.015389       1 shared_informer.go:320] Caches are synced for podWatcher                                                                                                                        │
│ 2024-11-21T22:05:48Z I! {"caller":"resolver/kubernetes.go:277","msg":"podWatcher: Cache synced","kind":"processor","name":"awsapplicationsignals","pipeline":"traces/application_signals"}                    │
│ I1121 22:05:48.015474       1 shared_informer.go:313] Waiting for caches to sync for serviceWatcher                                                                                                           │
│ I1121 22:05:48.015483       1 shared_informer.go:320] Caches are synced for serviceWatcher                                                                                                                    │
│ 2024-11-21T22:05:48Z I! {"caller":"resolver/kubernetes.go:324","msg":"serviceWatcher: Cache synced","kind":"processor","name":"awsapplicationsignals","pipeline":"traces/application_signals"}                │
│ 2024-11-21T22:05:48Z I! {"caller":"internal/resourcedetection.go:125","msg":"began detecting resource information","kind":"processor","name":"resourcedetection","pipeline":"traces/application_signals"}     │
│ 2024-11-21T22:05:48Z I! {"caller":"internal/resourcedetection.go:139","msg":"detected resource information","kind":"processor","name":"resourcedetection","pipeline":"traces/application_signals","resource": │
│ 2024-11-21T22:05:48Z I! {"caller":"cardinalitycontrol/metrics_limiter.go:59","msg":"creating metrics limiter with config","kind":"processor","name":"awsapplicationsignals","pipeline":"metrics/application_s │
│ 2024-11-21T22:05:48Z I! {"caller":"cardinalitycontrol/metrics_limiter.go:88","msg":"metrics limiter created.","kind":"processor","name":"awsapplicationsignals","pipeline":"metrics/application_signals"}     │
│ 2024-11-21T22:05:48Z W! {"caller":"[email protected]/warning.go:42","msg":"Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable t │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] [Server #1]Server created","grpc_log":true}                                                                                          │
│ 2024-11-21T22:05:48Z I! {"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp/application_signals","data_type":"metrics","endpoint":"0.0.0.0:4315"}       │
│ 2024-11-21T22:05:48Z W! {"caller":"[email protected]/warning.go:42","msg":"Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable t │
│ 2024-11-21T22:05:48Z I! {"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp/application_signals","data_type":"metrics","endpoint":"0.0.0.0:4316"}       │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] [Server #1 ListenSocket #2]ListenSocket created","grpc_log":true}                                                                    │
│ 2024-11-21T22:05:48Z I! {"caller":"host/ec2metadata.go:78","msg":"Fetch instance id and type from ec2 metadata","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics"}                 │
│ I1121 22:05:48.282329       1 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file  │
│ I1121 22:05:48.282947       1 factory.go:219] Registration of the docker container factory failed: failed to validate Docker info: failed to detect Docker info: Cannot connect to the Docker daemon at unix: │
│ I1121 22:05:48.282965       1 factory.go:221] Registration of the systemd container factory successfully                                                                                                      │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] original dial target is: \"unix:///run/containerd/containerd.sock\"","grpc_log":true}                                                │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] [Channel #3]Channel created","grpc_log":true}                                                                                        │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] [Channel #3]parsed dial target is: resolver.Target{URL:url.URL{Scheme:\"unix\", Opaque:\"\", User:(*url.Userinfo)(nil), Host:\"\", P │
│ 2024-11-21T22:05:48Z I! {"caller":"zapgrpc/zapgrpc.go:176","msg":"[core] [Channel #3]Channel authority set to \"localhost\"","grpc_log":true}

EC2 Agent Journalctl Log

[ec2-user@ip-172-31-39-228 ~]$ date
Thu Nov 21 23:13:33 UTC 2024
[ec2-user@ip-172-31-39-228 ~]$ sudo journalctl -u amazon-cloudwatch-agent --since "5 minutes ago" | cat
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal systemd[1]: Started amazon-cloudwatch-agent.service - Amazon CloudWatch Agent.
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: D! [EC2] Found active network interface
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: I! imds retry client will retry 1 timesI! Detected the instance is EC2
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: 2024/11/21 23:10:32 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: 2024/11/21 23:10:32 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_log.json ...
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: 2024/11/21 23:10:32 I! Valid Json input schema.
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: I! Detecting run_as_user...
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: I! Trying to detect region from ec2
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46875]: 2024/11/21 23:10:32 Configuration validation first phase succeeded
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46871]: I! Detecting run_as_user...
Nov 21 23:10:32 ip-172-31-39-228.us-west-2.compute.internal start-amazon-cloudwatch-agent[46871]: 2024-11-21T23:10:32Z I! CWAGENT_LOG_LEVEL is set to "DEBUG"

EC2 Agent local log

2024-11-21T23:11:32Z D! [logagent] open file count, 1
2024-11-21T23:11:32Z D! {"caller":"entitystore/retryer.go:75","msg":"attribute retrieval retry count","kind":"extension","name":"entitystore","retry":1}
2024-11-21T23:11:32Z D! {"caller":"entitystore/retryer.go:80","msg":"there was an issue when retrieving entity attributes but will not affect agent functionality","kind":"extension","name":"entitystore","error":"EC2MetadataError: failed to make EC2Metadata request\n<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n\t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n  <title>404 - Not Found</title>\n </head>\n <body>\n  <h1>404 - Not Found</h1>\n </body>\n</html>\n\n\tstatus code: 404, request id: "}
2024-11-21T23:11:32Z D! {"caller":"entitystore/retryer.go:75","msg":"attribute retrieval retry count","kind":"extension","name":"entitystore","retry":1}
2024-11-21T23:11:32Z D! {"caller":"entitystore/retryer.go:80","msg":"there was an issue when retrieving entity attributes but will not affect agent functionality","kind":"extension","name":"entitystore","error":"EC2MetadataError: failed to make EC2Metadata request\n<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n\t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n  <title>404 - Not Found</title>\n </head>\n <body>\n  <h1>404 - Not Found</h1>\n </body>\n</html>\n\n\tstatus code: 404, request id: "}
2024-11-21T23:11:32Z D! could not perform operation without imds v1 fallback enable thus enable fallback
2024-11-21T23:11:32Z D! {"caller":"entitystore/ec2Info.go:110","msg":"Failed to get AutoScalingGroup from instance tags. This is likely because instance tag is not enabled for IMDS but will not affect agent functionality.","kind":"extension","name":"entitystore"}
2024-11-21T23:11:32Z D! could not perform operation without imds v1 fallback enable thus enable fallback
2024-11-21T23:11:32Z D! {"caller":"entitystore/serviceprovider.go:243","msg":"Failed to get service name from instance tags. This is likely because instance tag is not enabled for IMDS but will not affect agent functionality.","kind":"extension","name":"entitystore"}

Requirements

Before commit the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Related Issue

  1. Metadata error logs in Cloudwatch agent - 404 EC2MetadataError: failed to make EC2Metadata request #1435

@zhihonl zhihonl requested a review from a team as a code owner November 21, 2024 21:17
go e.serviceprovider.startServiceProvider()
// Instance metadata tags is not usable for EKS nodes
// https://github.com/kubernetes/cloud-provider-aws/issues/762
if e.kubernetesMode == "" {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may seem drastic but based on the codebase serviceprovider isn't really used in Kubernetes scenario. It always gets overriden by either Instrumentation or K8sWorkload service name source.

return
// Instance metadata tags is not usable for EKS nodes
// https://github.com/kubernetes/cloud-provider-aws/issues/762
if ei.kubernetesMode == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to keep the retryer on native k8s on ec2 right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in new commit

internal/retryer/imdsretryer.go Outdated Show resolved Hide resolved
// Instance metadata tags is not usable for EKS nodes
// https://github.com/kubernetes/cloud-provider-aws/issues/762
if e.kubernetesMode == "" {
go e.serviceprovider.startServiceProvider()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does e.serviceprovider.startServiceProvider() do anything beneficial on EKS? Seems strange that we would have to do this just to avoid the IMDS check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. startServiceProvider scrapes IAM role or service name from instance tags so they can be used for service names. In Kubernetes environment neither of these will be used because they have lower priority than Instrumentation or K8sWorkload.

Code wise, I only see serviceprovider public function called in awsentity processor if we are strictly on EC2:

entityServiceName, entityServiceNameSource = getServiceNameSource()

@lisguo
Copy link
Contributor

lisguo commented Nov 21, 2024

Can you provide logs of the built agent on EC2 and EKS?

go e.serviceprovider.startServiceProvider()
// Instance metadata tags is not usable for EKS nodes
// https://github.com/kubernetes/cloud-provider-aws/issues/762
if e.kubernetesMode == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to the above -- shouldn't this be if e.kubernetesMode != modeEKS ? So that we can account for native k8s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled it for all kubernetes environment because serviceprovider isn't used for any Kubernetes related logics. GetMetricServiceNameAndSource is the only public function that is used for awsentity processor and that is only referenced in EC2 specific processor logic:

entityServiceName, entityServiceNameSource = getServiceNameSource()

Copy link
Contributor

github-actions bot commented Dec 2, 2024

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Dec 2, 2024
lisguo
lisguo previously approved these changes Dec 4, 2024
@lisguo
Copy link
Contributor

lisguo commented Dec 4, 2024

pr build failing

@zhihonl zhihonl merged commit a8ef6f7 into main Dec 4, 2024
7 checks passed
@zhihonl zhihonl deleted the imds-log-fix branch December 4, 2024 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants