Windows agent getting unhealthy with System integration and no data under kafka output. #6049

Open
amolnater-qasource opened this issue Nov 18, 2024 · 14 comments
Assignees
Labels
bug (Something isn't working), impact:high (Short-term priority; add to current release, or definitely next.), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@amolnater-qasource

amolnater-qasource commented Nov 18, 2024

Kibana Build details:

VERSION: 8.17.0 SNAPSHOT
BUILD: 80188
COMMIT: fdb16ae8cbdf4236db3696aa00d0bb98c943d864

Artifact Link: https://snapshots.elastic.co/8.17.0-7a041bf5/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip

Preconditions:

  1. 8.17.0-SNAPSHOT Kibana cloud environment should be available.

Steps to reproduce:

  1. Install Windows agent.
  2. Update output to kafka output.
  3. Observe that the agent becomes unhealthy and no data appears under the Kafka topic.

Expected Result:
Windows agent should remain healthy with System integration and data should be generated under kafka output.

Screenshot:
Image

Agent Logs:
elastic-agent-diagnostics-2024-11-18T10-38-10Z-00.zip

What's working fine:

  • Data for same kafka output is available for 8.16.0 agent.
@amolnater-qasource added the bug, impact:high, and Team:Elastic-Agent-Control-Plane labels on Nov 18, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@muskangulati-qasource Please review.

@muskangulati-qasource

Secondary review is done for this ticket.

@amolnater-qasource changed the title from "Windows agent getting unhealthy inconsistently with System integration and no data under kafka output." to "Windows agent getting unhealthy with System integration and no data under kafka output." on Nov 18, 2024
@cmacknz
Member

cmacknz commented Nov 18, 2024

            input-beat/metrics-monitoring-metrics-monitoring-beats:
                message: 'Error fetching data for metricset beat.stats: error making http request: Get "http://npipe/stats": open \\.\pipe\Q540iWXFlriVpKtDsr_d4ccTYz2N3a_K.sock: The system cannot find the file specified.'
                payload:
                    streams:
                        metrics-monitoring-filebeat:
                            error: ""
                            status: HEALTHY
                        metrics-monitoring-metricbeat:
                            error: 'Error fetching data for metricset beat.stats: error making http request: Get "http://npipe/stats": open \\.\pipe\Q540iWXFlriVpKtDsr_d4ccTYz2N3a_K.sock: The system cannot find the file specified.'
                            status: DEGRADED
                state: 3

This is #5332 again

@ycombinator
Contributor

#5332 was resolved last week. @amolnater-qasource would you mind re-testing this issue with the latest 8.17.0-SNAPSHOT? Thanks.

@amolnater-qasource
Author

@ycombinator Thank you for the update.

We have revalidated this issue on the latest available 8.17.0 snapshot and found it still reproducible.

  • No data under Kafka output is generated.

Artifact: https://snapshots.elastic.co/8.17.0-1c58bcd8/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip
Image

Logs:
elastic-agent-diagnostics-2024-11-26T04-52-15Z-00.zip

Please let us know if we are missing anything here.

Thanks!!

@cmacknz
Member

cmacknz commented Nov 26, 2024

Same problem as before: Get \"http://npipe/stats\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.

{"log.level":"warn","@timestamp":"2024-11-26T04:43:43.726Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://npipe/stats\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

The intended fix for this is in 8.17 (https://github.com/elastic/elastic-agent/commits/8.17), but it looks like it isn't helping here for some reason.

Assigning to @pchila to investigate.

@pchila
Member

pchila commented Nov 26, 2024

@cmacknz from the logs we can see that the error about opening the named pipe occurs more than once (with the fix for #5332 we trigger DEGRADED by default at the second consecutive error, on the assumption that the fetch will be retried in 60s).

The logs also show that the unit beat/metrics-monitoring-metrics-monitoring-beats is retrying the fetch much more quickly than expected...

{"log.level":"info","@timestamp":"2024-11-26T04:25:42.656Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-11-26T04:25:42.658Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://npipe/stats\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-26T04:25:42.657Z","message":"Beat ID: 779a746f-aa35-4946-8b9e-06ab08036161","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"winlog"},"log":{"source":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"service.name":"filebeat","ecs.version":"1.6.0","log.origin":{"file.line":1070,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-26T04:25:42.669Z","message":"Error fetching data for metricset http.json: error making http request: Get \"http://npipe/stats\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"log.origin":{"file.line":333,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).handleFetchError"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-26T04:25:42.671Z","message":"Error fetching data for metricset http.json: error making http request: Get \"http://npipe/inputs\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"service.name":"metricbeat","ecs.version":"1.6.0","log.origin":{"file.line":333,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).handleFetchError"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-26T04:25:42.671Z","message":"Error fetching data for metricset http.json: error making http request: Get \"http://npipe/stats\": open \\\\.\\pipe\\hC6H1faJ6uJdcqwMEc7XDxNvsCB7nGo1.sock: The system cannot find the file specified.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"log.origin":{"file.line":333,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).handleFetchError"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-26T04:25:42.677Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-11-26T04:25:42.677Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (HEALTHY->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://npipe/inputs\": open \\\\.\\pipe\\K5P2Tc74wcqesMl7DzGLkORtOcGrPO1a.sock: The system cannot find the file specified.","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-26T04:25:42.681Z","message":"Output reload is enabled, the beat will restart as needed on change of output config","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"winlog"},"log":{"source":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"log.logger":"centralmgmt","log.origin":{"file.line":204,"file.name":"management/managerV2.go","function":"github.com/elastic/beats/v7/x-pack/libbeat/management.NewV2AgentManagerWithClient"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-26T04:25:42.681Z","message":"Set gc percentage to: 100","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"winlog"},"log":{"source":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"log.origin":{"file.line":1124,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-26T04:25:42.684Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}

Probably for these cases the default of triggering the DEGRADED state on the second consecutive error is not enough... I will test to see if there's a more appropriate value for this input tomorrow.
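
The timing can be checked directly from the diagnostics logs. Below is a minimal sketch, assuming the log is newline-delimited JSON carrying the @timestamp, unit.id, unit.old_state and unit.state fields seen above. The sample lines are trimmed to those fields, and unit_transitions is a hypothetical helper, not part of any Elastic tooling.

```python
import json
from datetime import datetime

# Trimmed copies of the coordinator log lines above (only the fields used here).
SAMPLE_LOG = """\
{"@timestamp":"2024-11-26T04:25:42.656Z","unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","state":"HEALTHY","old_state":"CONFIGURING"}}
{"@timestamp":"2024-11-26T04:25:42.658Z","unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","state":"DEGRADED","old_state":"HEALTHY"}}
{"@timestamp":"2024-11-26T04:25:42.684Z","unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","state":"HEALTHY","old_state":"DEGRADED"}}
"""


def parse_ts(value):
    """Parse a UTC timestamp such as 2024-11-26T04:25:42.656Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def unit_transitions(log_text, unit_id):
    """Return (seconds since previous transition, old_state, new_state) for one unit."""
    result = []
    previous = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        unit = entry.get("unit")
        # Skip lines without unit information or for other units.
        if not unit or unit.get("id") != unit_id:
            continue
        ts = parse_ts(entry["@timestamp"])
        delta = (ts - previous).total_seconds() if previous is not None else 0.0
        result.append((delta, unit["old_state"], unit["state"]))
        previous = ts
    return result


if __name__ == "__main__":
    unit = "beat/metrics-monitoring-metrics-monitoring-beats"
    for delta, old, new in unit_transitions(SAMPLE_LOG, unit):
        print(f"+{delta:.3f}s {old} -> {new}")
```

On the sample lines, the unit goes DEGRADED 2 ms after becoming HEALTHY and recovers 26 ms later, far below the 60s retry interval assumed by the threshold.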

@cmacknz
Member

cmacknz commented Nov 26, 2024

The second instance of the problem may have been caused by the Beats restarting after an output change. Beats decide to do this on their own and don't communicate their intent to restart to the agent.

@pchila
Member

pchila commented Nov 27, 2024

@amolnater-qasource since you already have the setup for this test, could you try to increase the failure_threshold value for elastic-agent monitoring to see if we can find a reasonable default that would avoid agent going DEGRADED in this case?

To set a different failure_threshold we can use the Fleet overrides API. For example, to set the threshold to 5 consecutive errors (using the Dev Tools in Kibana):

PUT kbn:/api/fleet/agent_policies/<policy_id>
{
  "name": "<policy_name>",
  "namespace": "default",
  "overrides": {
    "agent": {
      "monitoring": {
        "failure_threshold": "5"
      }
    }
  }
}

Another interesting test would be to disable the failure threshold completely using "failure_threshold": "0", so that would be:

PUT kbn:/api/fleet/agent_policies/<policy_id>
{
  "name": "<policy_name>",
  "namespace": "default",
  "overrides": {
    "agent": {
      "monitoring": {
        "failure_threshold": "0"
      }
    }
  }
}

with 0 as failure threshold the monitoring input should never degrade no matter how many errors we get ;)
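
The semantics described in this comment can be sketched as a consecutive-error counter. This is only an illustration of the behaviour stated here (degrade at the Nth consecutive error, reset on success, 0 disables the check). FailureTracker is a hypothetical name, and this is not the actual agent or Metricbeat code.

```python
class FailureTracker:
    """Tracks consecutive fetch errors and reports a health state."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_errors = 0

    def record(self, ok):
        """Record one fetch result and return 'HEALTHY' or 'DEGRADED'."""
        if ok:
            # Any successful fetch resets the streak.
            self.consecutive_errors = 0
        else:
            self.consecutive_errors += 1
        # A threshold of 0 disables degrading entirely.
        if self.failure_threshold <= 0:
            return "HEALTHY"
        if self.consecutive_errors >= self.failure_threshold:
            return "DEGRADED"
        return "HEALTHY"


if __name__ == "__main__":
    # Two errors, one success, one more error.
    results = (False, False, True, False)
    for threshold in (0, 2, 5):
        tracker = FailureTracker(threshold)
        print(threshold, [tracker.record(ok) for ok in results])
```

With a threshold of 2, the second error in a row flips the state to DEGRADED regardless of how close together the two fetches were, which is why quick retries matter.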

In order to "reset" the value to the default (not exactly, but it will be close enough) you can set the failure_threshold back to "2":

PUT kbn:/api/fleet/agent_policies/<policy_id>
{
  "name": "<policy_name>",
  "namespace": "default",
  "overrides": {
    "agent": {
      "monitoring": {
        "failure_threshold": "2"
      }
    }
  }
}
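
The three request bodies differ only in the threshold value, so they could be generated by a small helper. Below is a sketch assuming the same body shape as the requests above. override_body is a hypothetical helper and "my-policy" a placeholder name; the code only builds the JSON and does not call Kibana.

```python
import json


def override_body(policy_name, failure_threshold, namespace="default"):
    """Build the agent policy body that overrides the monitoring failure threshold."""
    return {
        "name": policy_name,
        "namespace": namespace,
        "overrides": {
            "agent": {
                "monitoring": {
                    # Sent as a string, as in the requests above.
                    "failure_threshold": str(failure_threshold),
                }
            }
        },
    }


if __name__ == "__main__":
    # Print one body per threshold value used in the tests.
    for threshold in (5, 0, 2):
        print(json.dumps(override_body("my-policy", threshold), indent=2))
```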

Could you please run the test with these failure_threshold values to see if the issue reproduces?
Thank you,
Paolo

@amolnater-qasource
Author

Hi @pchila
Thank you for sharing the details.

Just an update: even with the tests from #6049 (comment), the agent remained Healthy on the Fleet UI and we observed no data under the Kafka topic.

Please find below the logs for today's tests:

Threshold: 5
elastic-agent-diagnostics-2024-11-27T09-23-22Z-00.zip

Threshold: 0
elastic-agent-diagnostics-2024-11-27T09-32-19Z-00.zip

Threshold: 2
elastic-agent-diagnostics-2024-11-27T09-39-00Z-00.zip

Observations:

  • Agent remained Healthy on Fleet UI.
  • No data under Kafka output is generated with any of the threshold values.

Please let us know if we are missing anything here.

Thanks!

@pchila
Member

pchila commented Nov 27, 2024

If the agent was displayed as healthy in the Fleet UI, it means that the fix for #5332 is working as intended (although in the logs we can see that some monitoring inputs become DEGRADED briefly, the agent doesn't stay DEGRADED for a long time).
I guess that @cmacknz referred to #5332 and assigned this to me thinking that the agent was becoming degraded because of a transient error while fetching metrics.

Looking into the logs, though, I noticed a lot of panics in metricbeat that look related to the kafka output publishing for component {"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"winlog-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"winlog"}:

{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"panic: runtime error: invalid memory address or nil pointer dereference","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"[signal 0xc0000005 code=0x0 addr=0x30 pc=0x1ebc014]","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"goroutine 2168 [running]:","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"github.com/elastic/beats/v7/libbeat/outputs/kafka.(*client).Publish(0xc000306c08, {0xc002d33f70?, 0x0?}, {0x8441540, 0xc00383a9c0})","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"github.com/elastic/beats/v7/libbeat/outputs/kafka/client.go:167 +0xf4","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*clientWorker).run(0xc0033fda40, {0x8427bc8, 0xc001c4b360})","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"github.com/elastic/beats/v7/libbeat/publisher/pipeline/client_worker.go:101 +0xc6","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"created by github.com/elastic/beats/v7/libbeat/publisher/pipeline.makeClientWorker in goroutine 2131","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-11-27T09:23:40.950Z","message":"github.com/elastic/beats/v7/libbeat/publisher/pipeline/client_worker.go:75 +0x1f0","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"system/metrics"},"log":{"source":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-27T09:23:40.964Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":645},"message":"Component state changed system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '6420' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-27T09:23:40.964Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b-system/metrics-system-667a20db-207d-4c4b-8901-8a7ac7c96640 (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '6420' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","state":"STOPPED"},"unit":{"id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b-system/metrics-system-667a20db-207d-4c4b-8901-8a7ac7c96640","type":"input","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-11-27T09:23:40.964Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '6420' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","state":"STOPPED"},"unit":{"id":"system/metrics-222f7bb5-918e-4247-87cc-d93c0f1d496b","type":"output","state":"STOPPED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

I am not really an expert on the Kafka output, so I guess that this could be investigated more efficiently by somebody else...

@cmacknz
Member

cmacknz commented Nov 27, 2024

(although in the logs we can see that some monitoring inputs become DEGRADED briefly, the agent doesn't stay DEGRADED for a long time).

The Fleet checkin doesn't consider history; it is a report of the current health at the time of the checkin. The source of truth is what happened in the logs and what elastic-agent status would output. Ideally we would have a watch on the agent status while this was happening to catch any momentary slip into the degraded state. This is what many of our tests do, which is why this is more prevalent there.
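
Such a watch could be sketched as a loop that samples the agent state at a short interval and records every change. In the sketch below the state source is passed in as a function so that it runs without an agent. watch_states is a hypothetical helper, and wiring get_state to the output of elastic-agent status is left as an assumption.

```python
import time


def watch_states(get_state, samples, interval=0.0, sleep=time.sleep):
    """Sample get_state() `samples` times and return the (index, old, new) changes."""
    changes = []
    previous = None
    for i in range(samples):
        current = get_state()
        # Record a change only when the state differs from the previous sample.
        if previous is not None and current != previous:
            changes.append((i, previous, current))
        previous = current
        if interval > 0:
            sleep(interval)
    return changes


if __name__ == "__main__":
    # Simulated sequence of states; a real watch would query the agent instead.
    simulated = iter(["HEALTHY", "HEALTHY", "DEGRADED", "HEALTHY", "HEALTHY"])
    print(watch_states(lambda: next(simulated), samples=5))
```

A periodic check-in that happened to land on any sample other than the third would report HEALTHY, while the watch still records the brief DEGRADED state.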

The Kafka output panic is separate and concerning. CC @pierrehilbert we'll want someone to look at that.

@cmacknz
Member

cmacknz commented Nov 28, 2024

I opened elastic/beats#41823 to track the Kafka panic separately.
