Linux agents get unhealthy on enabling/disabling modules for System/Linux integration. #3654
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@manishgupta-qasource Please review.
Secondary review for this ticket is done.
From ip-172-31-66-47-agent-details.zip:
{
  "id": "beat/metrics-monitoring",
  "type": "beat/metrics",
  "status": "DEGRADED",
  "message": "Degraded: pid '1582' missed 1 check-in",
  "units": [
Same for ip-172-31-79-43-agent-details.zip:
{
  "id": "system/metrics-default",
  "type": "system/metrics",
  "status": "DEGRADED",
  "message": "Degraded: pid '996' missed 1 check-in",
  "units": [
    {
      "id": "system/metrics-default-system/metrics-system-a4cc1da8-f6cd-4ab6-b61b-045da2b42479",
      "type": "input",
      "status": "HEALTHY",
      "message": "Healthy"
    },
    {
      "id": "system/metrics-default",
      "type": "output",
      "status": "HEALTHY",
      "message": "Healthy"
    }
  ]
},
Seems similar to #3617
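For anyone triaging the agent-details files, here is a minimal Go sketch that lists the components that are not HEALTHY, given entries shaped like the ones quoted above. It assumes the entries are available as a plain JSON array, which may not match the real layout of the agent-details file. The `component` and `unit` types and the `degraded` helper are hypothetical names, and the sample data is illustrative: the third entry (a healthy `log-default`) is made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// unit mirrors the per-unit fields visible in the quoted JSON.
type unit struct {
	ID      string `json:"id"`
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// component mirrors the per-component fields visible in the quoted JSON.
type component struct {
	ID      string `json:"id"`
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
	Units   []unit `json:"units"`
}

// degraded parses a JSON array of components (an assumed layout) and
// returns those whose component-level status is not HEALTHY.
func degraded(raw []byte) ([]component, error) {
	var comps []component
	if err := json.Unmarshal(raw, &comps); err != nil {
		return nil, err
	}
	var out []component
	for _, c := range comps {
		if !strings.EqualFold(c.Status, "HEALTHY") {
			out = append(out, c)
		}
	}
	return out, nil
}

// sample is illustrative data; the last entry is hypothetical.
const sample = `[
  {"id": "beat/metrics-monitoring", "type": "beat/metrics", "status": "DEGRADED",
   "message": "Degraded: pid '1582' missed 1 check-in", "units": []},
  {"id": "system/metrics-default", "type": "system/metrics", "status": "DEGRADED",
   "message": "Degraded: pid '996' missed 1 check-in",
   "units": [{"id": "system/metrics-default", "type": "output",
              "status": "HEALTHY", "message": "Healthy"}]},
  {"id": "log-default", "type": "log", "status": "HEALTHY",
   "message": "Healthy", "units": []}
]`

func main() {
	comps, err := degraded([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range comps {
		fmt.Printf("%s (%s): %s\n", c.ID, c.Status, c.Message)
	}
}
```

Note that in the quoted data the units report HEALTHY while the component itself is DEGRADED, so the filter looks at the component-level status only.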
I can reproduce this on my Mac by adding and removing the system integration.
Likely also the same problem affecting tests run through elastic-package: https://github.com/elastic/ingest-dev/issues/2560
#3617 (comment) shows that policy reassignment (which stops and then restarts every input) causes CPU to spike to 100%, so this is probably resource utilization from somewhere. Diagnostics can capture CPU profiles, but for me they didn't show anything useful. It might be easier to use the /debug/pprof endpoint of the agent directly (and/or of the beats via their unix socket paths) to get more control over it. It is also possible this is a deadlock (or temporary deadlock) on either side of the protocol, but the CPU usage spike makes me think it is resource usage first.
Just an update on what I'm seeing so far. On Debian ARM Linux, I don't see a CPU spike. I see status go from:
to
That persists for 30s, and then status goes back to "normal"
It isn't always the same components that miss a check-in; I have seen beat/metrics-monitoring, log-default and system/metrics-default. But so far it has always been 2 PIDs that miss a check-in.
update with data so far. Using Steps to re-produce
Initial policy has the following components
Timeline
log-default processing was stuck for 30 seconds trying to write to the checkinObserved channel.
Hi Team, We have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it fixed now. Observations:
Build details:
Hence we are closing this issue and marking it as QA:Validated. Thanks!
Kibana Build details:
Host OS: Linux
Preconditions:
Steps to reproduce:
NOTE:
Screen Recording:
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-10-25.12-49-29.mp4
Expected Result:
Linux agents should remain Healthy on enabling/disabling modules for System/Linux integration.
Agent.json:
ip-172-31-66-47-agent-details.zip
ip-172-31-79-43-agent-details.zip
Logs:
elastic-agent-diagnostics-2023-10-25T07-20-34Z-00.zip
elastic-agent-diagnostics-2023-10-25T07-27-25Z-00.zip