ECK TestFleet* is failing #7790
Comments
@cmacknz could you help us assess what the problem might be? It is an older version of the stack, 8.1.3, that we are testing here, which might have known issues. If that is the case, we would also be glad for guidance on which stack versions we should mute our tests for.
From the logs posted here, this is what I see:
I suspect that Fleet Server restarting is going to be the root cause here, but I can't tell from the logs why it is shutting down. Fleet Server doesn't actually depend on Kibana at all; it depends on Elasticsearch, and information between Fleet Server and Kibana is exchanged via the .fleet indices. Was Elasticsearch working as expected during this test? Are there any more Fleet Server logs besides those posted here?
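To answer the Elasticsearch question, a quick health check against the test cluster would help. A minimal sketch, assuming the usual ECK naming conventions (the resource name quickstart and the credentials secret are placeholders):

```sh
# Placeholder Elasticsearch resource name "quickstart"; run from a pod inside the
# cluster or through kubectl port-forward so the -es-http service is reachable.
PASSWORD=$(kubectl get secret quickstart-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')

# Overall cluster health during the test window.
curl -sk -u "elastic:$PASSWORD" "https://quickstart-es-http:9200/_cluster/health?pretty"

# The hidden .fleet* system indices that Fleet Server and Kibana exchange state through.
curl -sk -u "elastic:$PASSWORD" "https://quickstart-es-http:9200/_cat/indices/.fleet*?v&expand_wildcards=all"
```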
@michel-laterman does anything jump out in these 8.1.3 agent logs?
The "Fleet Server - Waiting in policy..." message indicates that the policy self-monitor is waiting for the policy to be detected, but
So the problem may be Kibana after all.
yes, i think it's written by kibana, but i'm not sure why there are no error logs from fleet-server to indicate the reason for the shutdown.
I'll try to reproduce it in my dev env with the log level increased.
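In case it helps anyone else reproducing: for a standalone (non-Fleet-managed) Agent deployed through ECK, the log level can be raised via spec.config, roughly like the sketch below (placeholder names, and an assumption that agent.logging.level is honoured in this version; Fleet-managed agents like the ones in these tests take their logging level from the agent policy instead, which can be changed per agent in the Fleet UI):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: debug-agent              # placeholder name
spec:
  version: 8.1.3
  config:
    agent.logging.level: debug   # default is info
  # ...rest of the spec (daemonSet/deployment, elasticsearchRefs, ...) unchanged
```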
FWIW the last successful run was on Saturday.
I reproduced the problem in our CI and attempted to run the Agent diagnostic tool using https://github.com/elastic/eck-diagnostics/, but for some reason there is no Agent diagnostic data in the archive 😞 I also attempted to enable debug logs, without success: it does not seem to give us more logs. I'm a bit out of ideas...
We seem to hit a similar failure on 8.15.0-SNAPSHOT as well.
I think I was able to repro this on my dev cluster. I saw no container kills or pod restarts, but on the node where the agent pods were running I can see the agent process being continuously OOM killed.
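For anyone trying to confirm the same thing, this is roughly how to check (the namespace is a placeholder; container-level kills show up in the pod status, while process-level kills by the kernel are only visible on the node itself):

```sh
# Container-level OOMKills (none in this case) are recorded in the pod status.
kubectl get pods -n <agent-namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Process-level kills do not restart the container, so they only show up in the
# kernel log of the node running the agent pods (via SSH or a privileged debug pod).
dmesg -T | grep -iE 'out of memory|oom-killer'
```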
It seems like the jump in memory usage actually occurs in 8.14.0 and not 8.13.0. Summarizing some measurements from Slack:
8.13.0:
8.14.0+:
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
@cmacknz the original issue here was related to Fleet Server restarting, while the issue I mentioned is related to the Elastic Agent. Could these be two separate issues? Just want to make sure.
Quite possibly, yes. Fleet Server could also be OOMing as the root cause there, but all we have confirmed so far is that Agent in 8.14+ seems to use more baseline memory (or more memory at startup) than it did before. I can fork off the second problem into a second issue if that makes things clearer.
Might be better to fork it off until we know for sure the two issues are related.
Looking at the issue description, at least the symptom
I created elastic/elastic-agent#4730 specifically about the agent memory usage increase; I am going to send this one back to the ECK repository to track the test failures. As discussed on Slack, the best initial mitigation would be to increase the memory limit and confirm it fixes the tests. We have detected a similar issue in MKI which probably has the same root cause: elastic/elastic-agent#4729
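For reference, the mitigation amounts to overriding the Agent container resources in the test manifests, roughly like the excerpt below (placeholder name and values; whether the Agent runs as a DaemonSet or a Deployment depends on the fixture):

```yaml
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: test-agent              # placeholder name
spec:
  version: 8.14.0
  # ...fleet / elasticsearch references unchanged
  daemonSet:
    podTemplate:
      spec:
        containers:
        - name: agent
          resources:
            requests:
              memory: 700Mi     # placeholder values; pick whatever the tests settle on
            limits:
              memory: 1Gi
```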
We have new failures; they have been pretty consistent over the last few days. The error message is slightly different, as it is referring to
It is affecting:
I had a quick look at one diag bundle and can't find anything obvious 😕: no OOMKilled Pods, for example...
Seeing this frequently on our nightly pipelines.
@pierrehilbert @rdner @belimawr does this error ring a bell on your end?
It does not ring a bell for me :/.
Nothing on my end either. @rdner any idea?
I don't recall us changing anything related to index creation lately. Also, the fact it started affecting multiple versions at once suggests it's either a backported change or something else changed in the environment.
It looks like OOM is at least responsible for some of the failures. In #8021 I bumped the memory to |
I think
@pkoutsovasilis and/or @swiatekm might be able to help answer this.
There shouldn't be anything in 8.16.0 specifically that would require increasing the memory limit. There was a major regression introduced in 8.15.0 involving the memory queue in Beats, but it was fixed in 8.15.4 and 8.16.0. If this test only cares about communication between Fleet and Agent, then maybe we could give the Agent a very small policy? Looking at the test code, I can't easily tell what it actually does.
I have to say that I am amazed that
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/530:
I had a closer look at TestFleetMode/Fleet_in_same_namespace_as_Agent:
- 2024-04-29T23:03:21.220427957Z - TestFleetMode/Fleet_in_same_namespace_as_Agent/ES_data_should_pass_validations starts
- 2024-04-29T23:18:21.222389355Z - Failure with {Status:404 Error:{CausedBy:{Reason: Type:} Reason:no such index [logs-elastic_agent-default]
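A quick way to see whether the data stream behind that 404 exists at all (a sketch; the Elasticsearch host and credentials are placeholders for whatever the test environment uses):

```sh
# Check whether logs-elastic_agent-default was ever created as a data stream,
# and whether any backing indices exist for it (host and credentials are placeholders).
curl -sk -u "elastic:$PASSWORD" "https://<es-host>:9200/_data_stream/logs-elastic_agent-default?pretty"
curl -sk -u "elastic:$PASSWORD" "https://<es-host>:9200/_resolve/index/logs-elastic_agent-default*?expand_wildcards=all&pretty"
```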
Agents
The 3 Agents have the exact same final log lines:

Fleet Server
Fleet Server seems to restart quite often; the last start is 13 minutes after the Pod was created:

Looking at the logs of a previous container instance:

I think the question is: what is the root cause of this fleet-server failed: context canceled?

Kibana
On the Kibana side the container has been OOMKilled once but seems to be ready well before Agent, as expected:

In the Kibana logs we have a few 401s from 23:02:39.664+00:00 to 23:02:45.162+00:00, which might be related to the operator trying to call https://test-agent-fleet-qj5h-kb-http.e2e-fo9kn-mercury.svc:5601/api/fleet/setup (the last 401 in the operator is at 23:02:45.193Z). Seems to be OK otherwise.
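For completeness, the call the operator makes can be replayed by hand once Kibana reports ready, to check whether Fleet setup succeeds on its own. A sketch using the Kibana URL from this run; the elastic user password placeholder comes from wherever the test harness stores it:

```sh
# POST /api/fleet/setup is the same endpoint the operator calls; kbn-xsrf is required by Kibana.
curl -sk -u "elastic:$PASSWORD" \
  -H "kbn-xsrf: true" \
  -X POST "https://test-agent-fleet-qj5h-kb-http.e2e-fo9kn-mercury.svc:5601/api/fleet/setup"
```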