Agent gets unhealthy on assigning from policy with Elastic Defend integration to without Defend integration. #3617
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@manishgupta-qasource Please review.
Secondary review for this ticket is done.
"components": [
{
"id": "log-default",
"type": "log",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '128'",
"units": [
{
"id": "log-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "winlog-default",
"type": "winlog",
"status": "DEGRADED",
"message": "Degraded: pid '4900' missed 1 check-in",
"units": [
{
"id": "winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "winlog-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "system/metrics-default",
"type": "system/metrics",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '916'",
"units": [
{
"id": "system/metrics-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "filestream-monitoring",
"type": "filestream",
"status": "DEGRADED",
"message": "Degraded: pid '7040' missed 1 check-in",
"units": [
{
"id": "filestream-monitoring-filestream-monitoring-agent",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "filestream-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "beat/metrics-monitoring",
"type": "beat/metrics",
"status": "HEALTHY",
"message": "Healthy: communicating with pid '1552'",
"units": [
{
"id": "beat/metrics-monitoring-metrics-monitoring-beats",
"type": "input",
"status": "CONFIGURING",
"message": "Configuring"
},
{
"id": "beat/metrics-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
},
{
"id": "http/metrics-monitoring",
"type": "http/metrics",
"status": "DEGRADED",
"message": "Degraded: pid '6044' missed 1 check-in",
"units": [
{
"id": "http/metrics-monitoring-metrics-monitoring-agent",
"type": "input",
"status": "CONFIGURING",
"message": "Configuring"
},
{
"id": "http/metrics-monitoring",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
}
],
The diagnostics show everything as healthy, but the snapshot of the agent details taken while this was happening shows that multiple processes are missing check-ins.
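A quick way to spot the mismatch in a snapshot like the one above is to list every component or unit whose status is not HEALTHY. The following is a minimal sketch, assuming the snapshot is available as a JSON array with the same `id`, `status`, `message`, and `units` fields shown above; the function name `find_unhealthy` and the abbreviated sample are illustrative only and are not part of Elastic Agent.

```python
import json


def find_unhealthy(components):
    """Return (component_id, unit_id, status, message) for each non-HEALTHY entry.

    unit_id is None when the finding is about the component itself.
    """
    findings = []
    for comp in components:
        if comp.get("status") != "HEALTHY":
            findings.append((comp["id"], None, comp.get("status"), comp.get("message")))
        for unit in comp.get("units", []):
            if unit.get("status") != "HEALTHY":
                findings.append((comp["id"], unit["id"], unit.get("status"), unit.get("message")))
    return findings


# Abbreviated sample: two components copied from the snapshot above.
sample = json.loads("""
[
  {"id": "winlog-default", "type": "winlog", "status": "DEGRADED",
   "message": "Degraded: pid '4900' missed 1 check-in",
   "units": [
     {"id": "winlog-default", "type": "output", "status": "HEALTHY", "message": "Healthy"}
   ]},
  {"id": "beat/metrics-monitoring", "type": "beat/metrics", "status": "HEALTHY",
   "message": "Healthy: communicating with pid '1552'",
   "units": [
     {"id": "beat/metrics-monitoring-metrics-monitoring-beats", "type": "input",
      "status": "CONFIGURING", "message": "Configuring"}
   ]}
]
""")

for finding in find_unhealthy(sample):
    print(finding)
```

Note that a component can be DEGRADED while all of its units report HEALTHY (as with `winlog-default`), which is why the sketch checks both levels separately.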
Missed check-ins usually happen due to increased CPU usage on the machine. The policy reassignment would have caused every input in the policy to stop and restart, and the missed check-ins correlate with that event in the logs:
{"log.level":"debug","@timestamp":"2023-10-17T09:16:42.412Z","log.logger":"component.runtime.endpoint-default.service_runtime","log.origin":{"file.name":"runtime/service.go","file.line":192},"message":"context is done. exiting.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPED->STOPPING): Stopping","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPING","old_state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed filestream-monitoring-filestream-monitoring-agent (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"HEALTHY"},"unit":{"id":"filestream-monitoring-filestream-monitoring-agent","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8 (STOPPING->STOPPED): Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED","old_state":"STOPPING"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '7040' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '6044' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed winlog-default (HEALTHY->DEGRADED): Degraded: pid '4900' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"HEALTHY"},"unit":{"id":"system/metrics-default-system/metrics-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":550},"message":"Spawned new unit winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8: Stopped","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"DEGRADED"},"unit":{"id":"winlog-default-winlog-system-255f7854-c0d1-46f4-b861-021bac628fd8","type":"input","state":"STOPPED"},"ecs.version":"1.6.0"}
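To correlate the missed check-ins with the reassignment, it can help to pull only the component-level state transitions out of the newline-delimited JSON logs. The sketch below is one possible approach, assuming each log line has the `component` object with `id`, `state`, and `old_state` seen in the excerpt above and that unit-level entries carry a `unit` key; the helper name `component_transitions` is hypothetical. Lines that are not valid JSON are skipped rather than treated as errors.

```python
import json


def component_transitions(lines):
    """Return (timestamp, component_id, old_state, new_state) for component-level changes."""
    transitions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        comp = entry.get("component", {})
        # Unit-level entries carry a "unit" key; component changes carry "old_state".
        if "unit" in entry or "old_state" not in comp:
            continue
        transitions.append((entry["@timestamp"], comp["id"], comp["old_state"], comp["state"]))
    return transitions


# Two lines copied verbatim from the log excerpt above: one unit change, one component change.
sample_log = r"""
{"log.level":"info","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a (STARTING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"winlog-default","state":"HEALTHY"},"unit":{"id":"winlog-default-winlog-system-0a5c1691-2ce5-4a4f-8ead-f5b1e767f61a","type":"input","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-17T09:16:42.412Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '7040' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
"""

for transition in component_transitions(sample_log.splitlines()):
    print(transition)
```

Filtering on the structured `component` fields rather than on the `message` text keeps the sketch independent of the exact wording of the log message.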
Does this happen every time? If so, can it be reproduced on the latest 8.10.4 release? Are you able to check or record the CPU usage of the machine running the agent while the policy reassignment is happening?
Hi @cmacknz, thank you for looking into this. We have validated this issue on the released 8.10.4 production environment and found that it is not reproducible there. Further observations:
Screen Recordings:
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-19.10-28-25.mp4
Amol.Windows.-.ec2-18-212-196-2.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-10-19.10-28-57.mp4
8.11.0:
Amol.Windows.-.ec2-18-212-196-2.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-10-19.10-46-29.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-19.10-47-19.mp4
Please let us know if anything else is required from our end. Thanks!!
When the CPU reaches 100%, can you run
Hi @cmacknz, we have reproduced this issue and collected new logs with the shared command when the CPU reached 100%.
Logs:
Please let us know if anything else is required from our end. Thanks!!
The latest diagnostics aren't giving me any clues, but this is likely the same problem as #3654.
Pulling this into the sprint and assigning it to Lee, since this is the same as #3654, but we'll need to retest this scenario to confirm.
Hi Team, we have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it fixed now.
Observations:
Build details:
Hence, we are marking this issue as QA:Validated. Thanks!
Kibana Build details:
Host OS: All
Preconditions:
Steps to reproduce:
Screen Recording:
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-17.14-45-43.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-17.14-47-06.mp4
Agent Logs:
elastic-agent-diagnostics-2023-10-17T09-17-18Z-00.zip
Agent JSON:
ec2amaz-auvpg7n-agent-details.zip
Expected Result:
The agent shouldn't become unhealthy when it is reassigned from a policy with the Elastic Defend integration to a policy without it.
Impacted Testcase:
https://elastic.testrail.io/index.php?/tests/view/2166951
Similar issue:
#3507