ECK TestFleet* is failing #7790
Comments
@cmacknz could you help us assess what the problem might be? It is an older version of the stack, 8.1.3, that we are testing here, which might have known issues. If that is the case, we would also be glad for guidance on which stack versions we should mute our tests for.
From the logs posted here, this is what I see:
I suspect that Fleet Server restarting is going to be the root cause here, but I can't tell from the logs why it is shutting down. Fleet Server doesn't actually depend on Kibana at all; it depends on Elasticsearch, and information between Fleet Server and Kibana is exchanged via the .fleet indices. Was Elasticsearch working as expected during this test? Are there any more Fleet Server logs besides those posted here?
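To answer the Elasticsearch question, a quick health check against the test cluster would help. A minimal sketch, assuming the usual ECK naming conventions (the resource name quickstart and the credentials secret are placeholders):

```sh
# Placeholder Elasticsearch resource name "quickstart"; run from a pod inside the
# cluster or through kubectl port-forward so the -es-http service is reachable.
PASSWORD=$(kubectl get secret quickstart-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')

# Overall cluster health during the test window.
curl -sk -u "elastic:$PASSWORD" "https://quickstart-es-http:9200/_cluster/health?pretty"

# The hidden .fleet* system indices that Fleet Server and Kibana exchange state through.
curl -sk -u "elastic:$PASSWORD" "https://quickstart-es-http:9200/_cat/indices/.fleet*?v&expand_wildcards=all"
```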
@michel-laterman does anything jump out in these 8.1.3 agent logs?
The "Fleet Server - Waiting in policy..." message indicates that the policy self-monitor is waiting for the policy to be detected, but
So the problem may be Kibana after all.
yes, i think it's written by kibana, but i'm not sure why there are no error logs from fleet-server to indicate the reason for the shutdown.
I'll try to reproduce it in my dev env with the log level increased.
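In case it helps anyone else reproducing: for a standalone (non-Fleet-managed) Agent deployed through ECK, the log level can be raised via spec.config, roughly like the sketch below (placeholder names, and an assumption that agent.logging.level is honoured in this version; Fleet-managed agents like the ones in these tests take their logging level from the agent policy instead, which can be changed per agent in the Fleet UI):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: debug-agent              # placeholder name
spec:
  version: 8.1.3
  config:
    agent.logging.level: debug   # default is info
  # ...rest of the spec (daemonSet/deployment, elasticsearchRefs, ...) unchanged
```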
FWIW the last successful run was on Saturday.
I reproduced the problem in our CI and attempted to run the Agent diagnostic tool using https://github.com/elastic/eck-diagnostics/, but for some reason there is no Agent diagnostic data in the archive 😞 I also attempted to enable debug logs, without success: it does not seem to give us more logs. I'm a bit out of ideas...
We seem to hit a similar failure on 8.15.0-SNAPSHOT as well.
I think I was able to repro this on my dev cluster. I saw no container kills or pod restarts, but on the node where the agent pods were running I can see the agent process being continuously OOM killed.
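For anyone trying to confirm the same thing, this is roughly how to check (the namespace is a placeholder; container-level kills show up in the pod status, while process-level kills by the kernel are only visible on the node itself):

```sh
# Container-level OOMKills (none in this case) are recorded in the pod status.
kubectl get pods -n <agent-namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Process-level kills do not restart the container, so they only show up in the
# kernel log of the node running the agent pods (via SSH or a privileged debug pod).
dmesg -T | grep -iE 'out of memory|oom-killer'
```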
It seems like the jump in memory usage actually occurs in 8.14.0 and not 8.13.0. Summarizing some measurements from Slack:
8.13.0:
8.14.0+:
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@cmacknz the original issue here was related to Fleet Server restarting, while the issue I mentioned is related to the Elastic Agent. Could these be two separate issues? Just want to make sure.
Quite possibly, yes. Fleet Server could also be OOMing as the root cause there, but all we have confirmed so far is that Agent in 8.14+ seems to use more baseline memory (or more memory at startup) than it did before. I can fork off the second problem into a second issue if that makes things clearer.
Might be better to fork it off until we know for sure the two issues are related.
Looking at the issue description, at least the symptom
I created elastic/elastic-agent#4730 specifically about the agent memory usage increase; I am going to send this one back to the ECK repository to track the test failures. As discussed on Slack, the best initial mitigation would be to increase the memory limit and confirm it fixes the tests. We have detected a similar issue in MKI which probably has the same root cause: elastic/elastic-agent#4729
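For reference, the mitigation amounts to overriding the Agent container resources in the test manifests, roughly like the excerpt below (placeholder name and values; whether the Agent runs as a DaemonSet or a Deployment depends on the fixture):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: test-agent              # placeholder name
spec:
  version: 8.14.0
  # ...fleet / elasticsearch references unchanged
  daemonSet:
    podTemplate:
      spec:
        containers:
        - name: agent
          resources:
            requests:
              memory: 700Mi     # placeholder values; pick whatever the tests settle on
            limits:
              memory: 1Gi
```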
We have new failures; they have been pretty consistent over the last few days. The error message is slightly different, as it is referring to
It is affecting:
I had a quick look at one diag bundle and can't find anything obvious 😕: no OOMKilled Pods, for example...
Seeing this frequently on our nightly pipelines.
@pierrehilbert @rdner @belimawr does this error ring a bell on your end?
It does not ring a bell for me :/.
Nothing on my end either. @rdner any idea?
I don't recall us changing anything related to index creation lately. Also, the fact it started affecting multiple versions at once suggests it's either a backported change or something else changed in the environment.
It looks like OOM is at least responsible for some of the failures. In #8021 I bumped the memory to |
I think
@pkoutsovasilis and/or @swiatekm might be able to help answer this.
There shouldn't be anything in 8.16.0 specifically that would require increasing the memory limit. There was a major regression introduced in 8.15.0 involving the memory queue in Beats, but it was fixed in 8.15.4 and 8.16.0. If this test only cares about communication between Fleet and Agent, then maybe we could give the Agent a very small policy? Looking at the test code, I can't easily tell what it actually does.
I have to say that I am amazed that
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/530:
I had a closer look at TestFleetMode/Fleet_in_same_namespace_as_Agent:
- 2024-04-29T23:03:21.220427957Z - TestFleetMode/Fleet_in_same_namespace_as_Agent/ES_data_should_pass_validations starts
- 2024-04-29T23:18:21.222389355Z - Failure with {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default]
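A quick way to see whether the data stream behind that 404 exists at all (a sketch; the Elasticsearch host and credentials are placeholders for whatever the test environment uses):

```sh
# Check whether logs-elastic_agent-default was ever created as a data stream,
# and whether any backing indices exist for it (host and credentials are placeholders).
curl -sk -u "elastic:$PASSWORD" "https://<es-host>:9200/_data_stream/logs-elastic_agent-default?pretty"
curl -sk -u "elastic:$PASSWORD" "https://<es-host>:9200/_resolve/index/logs-elastic_agent-default*?expand_wildcards=all&pretty"
```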
Agents
The 3 Agents have the exact same final log lines:

Fleet Server
Fleet Server seems to restart quite often; the last start is 13 minutes after the Pod was created:

Looking at the logs of a previous container instance:

I think the question is: what is the root cause of this fleet-server failed: context canceled?

Kibana
On the Kibana side the container has been OOMKilled once but seems to be ready well before Agent, as expected:

In the Kibana logs we have a few 401s from 23:02:39.664+00:00 to 23:02:45.162+00:00, which might be related to the operator trying to call https://test-agent-fleet-qj5h-kb-http.e2e-fo9kn-mercury.svc:5601/api/fleet/setup (the last 401 in the operator is at 23:02:45.193Z). Seems to be OK otherwise.
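For completeness, the call the operator makes can be replayed by hand once Kibana reports ready, to check whether Fleet setup succeeds on its own. A sketch using the Kibana URL from this run; the elastic user password placeholder comes from wherever the test harness stores it:

```sh
# POST /api/fleet/setup is the same endpoint the operator calls; kbn-xsrf is required by Kibana.
curl -sk -u "elastic:$PASSWORD" \
  -H "kbn-xsrf: true" \
  -X POST "https://test-agent-fleet-qj5h-kb-http.e2e-fo9kn-mercury.svc:5601/api/fleet/setup"
```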