Elastic-Agents unexpectedly unenrolled after update to 8.16.x #6213
Comments
You could collect a diagnostics bundle for the specific Agent. If you extract the bundle, you'll see a folder in it for logs. You could try to attach those files here or post the logs in a gist and link it from here.
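For reference, a minimal sketch of collecting that bundle on the agent host with the 8.x CLI (exact flags and output path can differ per version; a bundle can also be requested from Fleet in Kibana):

```bash
# Linux/macOS: run with root privileges from any directory.
sudo elastic-agent diagnostics

# Windows (elevated prompt):
#   "C:\Program Files\Elastic\Agent\elastic-agent.exe" diagnostics

# The command writes an archive such as elastic-agent-diagnostics-<timestamp>.zip
# into the current working directory; that zip is what gets attached to the issue.
```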
Here is the agent log from the day of the upgrade:
Hmmmm, not seeing anything about (un)enrollment in the upgrade-day logs. It looks like you have debug-level logging turned on for Agent. Do you see a "handlerUnenroll: action '... UNENROLL ...' received" message anywhere in your logs from November 25 through December 2nd?
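As a quick sketch, that search can be run against the extracted diagnostics bundle (the "logs" folder name follows the bundle layout mentioned above; the zip name is a placeholder):

```bash
# Unpack the bundle and search its logs for the unenroll handler message.
unzip elastic-agent-diagnostics-*.zip -d diag
grep -ri "handlerUnenroll" diag/logs/

# Broader net: any mention of an UNENROLL action.
grep -ri "UNENROLL" diag/logs/
```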
@ycombinator @michel-laterman can we make this one of our P0s please? I'd like to make sure that this is not a bug on our end, or if it is, get it fixed.
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@ycombinator: there's no occurrence of any such message. On the day of the host reboot (elastic-agent service shutdown) - Nov. 30th - we see this:
Dec. 2nd - first start of agent after host reboot:
In case it's relevant: some Windows hosts with agents in that strange state (active and unenrolled) seem to be unable to start the agent service at boot time anymore. I cannot find any log entry anywhere matching this error (maybe in kibana.log - I didn't check that one ...)
I think that upgrading from 8.16.0 -> 8.16.1 calls uninstall on 8.16.0, which makes an API call to the audit/unenroll endpoint. Can you provide a diagnostics bundle of a failing agent?
Doing a quick search of the codebase, I don't see a call to Uninstall anywhere within the upgrade handler, so my theory may be incorrect.
elastic-agent-diagnostics-2024-12-11T15-11-45Z-00.zip - this should be from the same agent that all the log snippets above originated from...
That diagnostics bundle does not have logs from the incident time. I've tried recreating on a Linux VM against an 8.16.1 cluster. In the issue description you stated:
I tried installing 8.16.0 and upgrading to 8.16.1. I specifically tried to test the timing of the restart. I also tried restarting the VM after an upgrade (stopping the elastic-agent service and watcher), but the upgrade seems to succeed, without any calls to audit/unenroll. |
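For reference, that reproduction attempt corresponds roughly to the following sequence. This is only a sketch for a Linux VM enrolled in Fleet; the Fleet Server URL and enrollment token are placeholders, and in the test above the upgrade would normally be issued from Fleet rather than from the CLI:

```bash
# Install 8.16.0 and enroll into Fleet (URL/token are placeholders).
curl -LO https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.16.0-linux-x86_64.tar.gz
tar xzf elastic-agent-8.16.0-linux-x86_64.tar.gz && cd elastic-agent-8.16.0-linux-x86_64
sudo ./elastic-agent install --url=https://<fleet-server>:8220 --enrollment-token=<token>

# Upgrade to 8.16.1 (for a Fleet-managed agent this is normally triggered from Fleet).
sudo elastic-agent upgrade 8.16.1

# Restart the service (or reboot the VM) to mimic the timing described in the issue.
sudo systemctl restart elastic-agent
```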
Below is the browser trace of the API query for the agent that shows up as "unhealthy". Noticeable, besides the already known wrong dates, is imho the following:
OK I am starting to see what is happening. The Fleet Server audit endpoint sets the unenrolled_at timestamp here. This is not a side-effect-free operation, and the UI looks at unenrolled_at.

The intent was for this API to be side effect free and only be used to show that the uninstall command or unenroll actions/commands were executed. Setting unenrolled_at makes this not true. https://github.com/search?q=repo%3Aelastic%2Fkibana%20unenrolled_at&type=code

There is an additional complication: the orphaned state was introduced for when Endpoint can't communicate with Agent, and that also triggers these side effects. It may also be that Endpoint's detection of the orphaned state is too sensitive during upgrades. This, and the fact that the operation is not side effect free, opens us up to situations where the unenrolled audit reason is sent incorrectly, causing the UI to think we are unenrolled when we aren't.

CC @intxgo how does this logic work on the endpoint side?
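For reference, a sketch of how to inspect the raw agent document for this state. Reading the .fleet-agents system index directly requires superuser (or equivalent Fleet) privileges; ${AGENT_ID}, ${ELASTICSEARCH_HOST} and ${SUPERUSER_PASS} are placeholders, and the field list only uses fields mentioned in this thread:

```bash
# Fetch just the relevant fields from the agent's .fleet-agents document.
curl -s -u elastic:${SUPERUSER_PASS} -H 'x-elastic-product-origin:fleet' \
  "https://${ELASTICSEARCH_HOST}/.fleet-agents/_doc/${AGENT_ID}?_source_includes=active,unenrolled_at,upgraded_at"

# An affected agent shows "active": true together with a non-null "unenrolled_at".
```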
Testing on Windows; I've also created elastic/fleet-server#4221 to address the issue @cmacknz brought up.
I tested with a policy that had the Defend integration and was unable to recreate when I didn't restart and followed the upgrade path. However, I manually stopped the elastic-agent process so that Defend would report its orphaned status, then restarted the agent to allow it to check in again.

To summarize, this bug affects agent upgrades (v8.16.0, v8.16.1, v8.16.2, v8.17.0) when the Defend integration is used. As a workaround you can use an update_by_query call to set unenrolled_at to null.

First, get a token to make the request:

curl -XPOST --user elastic:${SUPERUSER_PASS} -H 'x-elastic-product-origin:fleet' -H 'content-type:application/json' "https://${ELASTICSEARCH_HOST}/_security/service/elastic/fleet-server/credential/token/fix-unenrolled"

Next, make the query to set unenrolled_at to null:

curl -XPOST -H "Authorization: Bearer ${TOKEN}" -H 'x-elastic-product-origin:fleet' -H 'content-type:application/json' "https://${ELASTICSEARCH_HOST}/.fleet-agents/_update_by_query" -d '{"query": {"bool": {"must": [{ "exists": { "field": "unenrolled_at" } }],"must_not": [{ "term": { "active": "false" } }]}},"script": {"source": "ctx._source.unenrolled_at = null;","lang": "painless"}}'

The fix is going to target v8.16.3 and v8.17.1.
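As a hedged follow-up to the workaround above: ${TOKEN} is the token.value returned by the credential/token call, and a search like the one below (same filter as the update_by_query) can be used to check which agents are still in the affected state before and after running it. The _source field list is an assumption based on the .fleet-agents mapping:

```bash
# List active agents that still carry an unenrolled_at timestamp.
curl -s -XPOST -H "Authorization: Bearer ${TOKEN}" \
  -H 'x-elastic-product-origin:fleet' -H 'content-type:application/json' \
  "https://${ELASTICSEARCH_HOST}/.fleet-agents/_search" -d '{
    "_source": ["active", "unenrolled_at", "upgraded_at", "local_metadata.host.hostname"],
    "query": {
      "bool": {
        "must": [{ "exists": { "field": "unenrolled_at" } }],
        "must_not": [{ "term": { "active": "false" } }]
      }
    }
  }'
# An empty hit list after the workaround means all agents were reset.
```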
I can confirm the workaround removes the "unenrolled" state. Many thanks to everyone - especially @michel-laterman - for solving this issue! Keep up that great work!
If they are Fleet-managed and you are running Defend, upgrading from the CLI does not send a signed upgrade action to Defend, which is what allows Defend to upgrade without believing it is being tampered with. The tamper protection feature of Defend is based on all operations requiring a digital signature signed by a private key in Kibana. There is no way to provide a signed token for an upgrade on the CLI, so Defend legitimately believed it was being tampered with and orphaned itself from Agent here. This hit the bug Michel is fixing. Specifically, using the CLI skips this block: elastic-agent/internal/pkg/agent/application/actions/handlers/handler_action_upgrade.go, lines 58 to 74 (commit 54932dc).
For Fleet-managed agents, upgrading from the CLI isn't something we encourage. We are making it harder to do, but it is still possible as an escape hatch for when agents need to be upgraded but Fleet-managed upgrades aren't available or are accidentally broken for some unforeseen reason. To make the upgrade command work with tamper protection properly, we'd probably have to allow it to accept an uninstall token and otherwise just completely refuse the upgrade.
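For completeness, a sketch of triggering the supported, Fleet-managed upgrade path without the UI, via Kibana's Fleet API, so that Defend receives a properly signed upgrade action. Hosts, credentials and the agent id are placeholders:

```bash
# Issue a Fleet-managed upgrade action for a single agent through Kibana.
curl -s -XPOST -u elastic:${SUPERUSER_PASS} \
  -H 'kbn-xsrf: true' -H 'content-type: application/json' \
  "https://${KIBANA_HOST}/api/fleet/agents/${AGENT_ID}/upgrade" \
  -d '{"version": "8.16.1"}'
```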
This is exactly why we upgraded them locally: because they had already been "lost" to Fleet management by an unwanted and completely autonomous behaviour of the software/installation. What exactly caused elastic-agent to be "offline for too long" in the first place is still unclear.
Thanks, I created #6356 to track making the upgrade command work properly when Defend is installed with tamper protection enabled, for the "break glass in case of emergency" case.
Version:
8.16.0 and 8.16.1
Operating System:
Windows Server 2019 (64-bit), Microsoft Windows Server 2022 (64-bit)
Discuss Forum URL:
https://discuss.elastic.co/t/elastic-agent-unexpectedly-gets-unenrolled-version-8-16-x/371463
Related Links:
[Fleet] Agents erroneously showing "Agent not upgradeable: agent has been unenrolled" kibana#202440
[Fleet]: Unable to force unenroll multiple offline agents from bulk actions. kibana#197180
Steps to Reproduce:
1. Upgraded agent from 8.16.0 to 8.16.1 on Nov. 25th
2. Rebooted host on Nov. 30th
3. Started agent service on Dec. 2nd
The agent record (.fleet-agents) shows the timestamp of 2) as unenrolled_at and the timestamp of 3) as upgraded_at.
Example Logs and Screenshots:
Agent log from the day of the upgrade (I think the error happens here): too long to fit here -> could someone provide a hint how to export the relevant log entries from a specific agent?
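Regarding exporting the relevant log entries: one option, as a sketch, is to query the agent's monitoring logs data stream directly. The data stream pattern and the elastic_agent.id field are assumptions based on the default self-monitoring setup; adjust the agent id and time range:

```bash
# Pull the upgrade-day log entries for a single agent from the monitoring data stream.
curl -s -u elastic:${SUPERUSER_PASS} -H 'content-type: application/json' \
  "https://${ELASTICSEARCH_HOST}/logs-elastic_agent-*/_search" -d '{
    "size": 1000,
    "sort": [{ "@timestamp": "asc" }],
    "_source": ["@timestamp", "log.level", "message"],
    "query": {
      "bool": {
        "filter": [
          { "term": { "elastic_agent.id": "<agent-id>" } },
          { "range": { "@timestamp": { "gte": "2024-11-25T00:00:00Z", "lte": "2024-11-26T00:00:00Z" } } }
        ]
      }
    }
  }'
```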
Agent-Log from day of reboot & day of service-start (copy/paste from Kibana->Fleet->Agent->specific agent->Logs):
Nov 30, 2024
20:09:04.513 elastic_agent [elastic_agent][info] signal "terminated" received
20:09:04.513 elastic_agent [elastic_agent][info] Shutting down Elastic Agent and sending last events...
20:09:04.520 elastic_agent [elastic_agent][warn] Possible transient error during checkin with fleet-server, retrying
20:09:04.552 elastic_agent [elastic_agent][error] failed accept conn info connection: use of closed network connection
20:09:04.552 elastic_agent [elastic_agent][info] stopping endpoint service runtime
20:09:04.720 elastic_agent [elastic_agent][info] Shutting down completed.
20:09:04.728 elastic_agent [elastic_agent][info] Stopping monitoring server
20:09:04.728 elastic_agent [elastic_agent][info] Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection
Dec 2, 2024
09:25:50.042 elastic_agent [elastic_agent][info] Elastic Agent started
09:25:50.331 elastic_agent [elastic_agent][info] Starting upgrade watcher
09:25:50.365 elastic_agent [elastic_agent][info] Upgrade Watcher invoked
09:25:50.692 elastic_agent [elastic_agent][info] Upgrade Watcher started
09:25:50.708 elastic_agent [elastic_agent][info] Loaded update marker &{Version:8.16.1 Hash:b6da7f VersionedHome:data\elastic-agent-8.16.1-b6da7f UpdatedOn:2024-11-25 12:20:00.3690588 +0100 CET PrevVersion:8.16.0 PrevHash:3f07f2 PrevVersionedHome:data\elastic-agent-8.16.0-3f07f2 Acked:false Action:id: f0d5d0c4-b283-419e-b826-a8e830f755cc, type: UPGRADE Details:}
09:25:50.714 elastic_agent [elastic_agent][info] not within grace [updatedOn 2024-11-25 12:20:00.3690588 +0100 CET] 165h5m50.3458541s
09:25:50.714 elastic_agent [elastic_agent][info] Cleaning up upgrade
09:25:50.828 elastic_agent [elastic_agent][info] APM instrumentation disabled
09:25:50.838 elastic_agent [elastic_agent][info] Gathered system information
09:25:50.870 elastic_agent [elastic_agent][info] Detected available inputs and outputs
09:25:50.870 elastic_agent [elastic_agent][info] Capabilities file not found in C:\Program Files\Elastic\Agent\capabilities.yml
09:25:50.870 elastic_agent [elastic_agent][info] Determined allowed capabilities
09:25:50.870 elastic_agent [elastic_agent][info] Loading baseline config from C:\Program Files\Elastic\Agent\elastic-agent.yml
09:25:51.312 elastic_agent [elastic_agent][info] GRPC comms socket listening at localhost:6789
09:25:51.439 elastic_agent [elastic_agent][info] Parsed configuration and determined agent is managed by Fleet
09:25:51.439 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
09:25:51.652 elastic_agent [elastic_agent][info] GRPC control socket listening at npipe:///elastic-agent-system
09:25:51.656 elastic_agent [elastic_agent][info] updated upgrade details
09:25:51.660 elastic_agent [elastic_agent][info] Starting grpc control protocol listener on port 6789 with max_message_size 104857600
09:25:51.660 elastic_agent [elastic_agent][info] Docker provider skipped, unable to connect: protocol not available
09:25:51.879 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
09:25:52.986 elastic_agent [elastic_agent][info] restoring current policy from disk
09:25:53.030 elastic_agent [elastic_agent][info] Setting fallback log level from policy
09:25:53.067 elastic_agent [elastic_agent][info] Fleet gateway started
09:25:53.080 elastic_agent [elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
09:25:53.080 elastic_agent [elastic_agent][info] Starting monitoring server with cfg &config.MonitoringConfig{Enabled:true, MonitorLogs:true, MonitorMetrics:false, MetricsPeriod:"", LogMetrics:true, HTTP:(*config.MonitoringHTTPConfig)(0xc00067d5f0), Namespace:"default", Pprof:(*config.PprofConfig)(nil), MonitorTraces:false, APM:config.APMConfig{Environment:"", APIKey:"", SecretToken:"", Hosts:[]string(nil), GlobalLabels:map[string]string(nil), TLS:config.APMTLS{SkipVerify:false, ServerCertificate:"", ServerCA:""}, SamplingRate:(*float32)(nil)}, Diagnostics:config.Diagnostics{Uploader:config.Uploader{MaxRetries:10, InitDur:1000000000, MaxDur:600000000000}, Limit:config.Limit{Interval:60000000000, Burst:1}}}
09:25:53.083 elastic_agent [elastic_agent][info] creating monitoring API with cfg api.Config{Enabled:true, Host:"http://localhost:6791", Port:6791, User:"", SecurityDescriptor:"", Timeout:5000000000}
09:25:53.084 elastic_agent [elastic_agent][info] Starting stats endpoint
09:25:53.105 elastic_agent [elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
09:25:53.107 elastic_agent [elastic_agent][info] Updating running component model
09:25:54.355 elastic_agent [elastic_agent][info] Creating connection info server for endpoint service, address: npipe:///.eaci.sock
09:25:54.356 elastic_agent [elastic_agent][info] check if endpoint service is installed
09:25:54.552 elastic_agent endpoint-default [elastic_agent][info] Spawned new component endpoint-default: Starting: endpoint service runtime
09:25:54.552 elastic_agent endpoint-default [elastic_agent][info] Spawned new unit endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae: Starting: endpoint service runtime
09:25:54.553 elastic_agent endpoint-default [elastic_agent][info] Spawned new unit endpoint-default: Starting: endpoint service runtime
09:25:56.570 elastic_agent [elastic_agent][error] 2024-12-02 08:25:56: info: Main.cpp:569 Verifying existing installation
09:25:56.574 elastic_agent [elastic_agent][error] 2024-12-02 08:25:56: info: InstallLib.cpp:611 Running [C:\Program Files\Elastic\Endpoint\elastic-endpoint.exe] [version --log stdout]
09:25:56.574 elastic_agent [elastic_agent][error] 2024-12-02 08:25:56: debug: Service.cpp:804 PPL is supported. This process is unprotected. (TrustLevelSid: absent)
09:25:57.282 elastic_agent [elastic_agent][error] 2024-12-02 08:25:57: info: InstallLib.cpp:650 Installed endpoint is expected version (version: 8.16.1, compiled: Tue Nov 19 12:00:00 2024, branch: HEAD, commit: 7d50b182b0f0ddc7170095904dc1e341224bb1f4)
09:25:57.282 elastic_agent [elastic_agent][error] 2024-12-02 08:25:57: info: Util.cpp:2146 Endpoint Service is running.
09:25:57.286 elastic_agent [elastic_agent][info] after check if endpoint service is installed, err:
09:26:01.240 elastic_agent winlog-default [elastic_agent][info] Spawned new component winlog-default: Starting: spawned pid '20108'
09:26:01.241 elastic_agent winlog-default [elastic_agent][info] Spawned new unit winlog-default-winlog-system-85821b11-0064-11ee-b676-af36e033a9ae: Starting: spawned pid '20108'
09:26:01.241 elastic_agent winlog-default [elastic_agent][info] Spawned new unit winlog-default-winlog-windows-85821b12-0064-11ee-b676-af36e033a9ae: Starting: spawned pid '20108'
09:26:01.241 elastic_agent winlog-default [elastic_agent][info] Spawned new unit winlog-default: Starting: spawned pid '20108'
09:26:03.099 elastic_agent [elastic_agent][info] control checkin v2 protocol has chunking enabled
09:26:03.100 elastic_agent winlog-default [elastic_agent][info] Component state changed winlog-default (STARTING->HEALTHY): Healthy: communicating with pid '20108'
09:26:03.175 elastic_agent [elastic_agent][info] control checkin v2 protocol has chunking enabled
09:26:04.111 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default (STARTING->HEALTHY): Healthy
09:26:04.114 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-system-85821b11-0064-11ee-b676-af36e033a9ae (STARTING->HEALTHY): Healthy
09:26:04.114 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-windows-85821b12-0064-11ee-b676-af36e033a9ae (STARTING->HEALTHY): Healthy
09:26:10.278 elastic_agent endpoint-default [elastic_agent][info] Component state changed endpoint-default (STARTING->HEALTHY): Healthy: communicating with endpoint service
09:26:10.818 elastic_agent [elastic_agent][info] Removing marker file
09:26:10.822 elastic_agent [elastic_agent][info] Removing previous symlink path
09:26:10.822 elastic_agent [elastic_agent][error] clean up of prior watcher run failedextracting elastic-agent path relative to data directory from C:\Program Files\Elastic\Agent\data\elastic-agent-8.16.1-b6da7f: Rel: can't make C:\Program Files\Elastic\Agent\data\elastic-agent-8.16.1-b6da7f relative to data
09:26:15.763 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae (STARTING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
09:26:15.764 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default (STARTING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
09:26:21.526 elastic_agent [elastic_agent][info] component model updated
09:26:21.527 elastic_agent [elastic_agent][info] Updating running component model
11:16:50.539 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default (HEALTHY->CONFIGURING): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
11:16:50.539 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae (HEALTHY->CONFIGURING): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
11:17:10.534 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae (CONFIGURING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
11:17:10.534 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default (CONFIGURING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
13:28:44.169 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
13:28:44.170 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
13:28:44.185 elastic_agent [elastic_agent][info] Setting fallback log level from policy
13:28:44.218 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
13:28:44.236 elastic_agent [elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
13:28:44.237 elastic_agent [elastic_agent][info] Stopping monitoring server
13:28:44.237 elastic_agent [elastic_agent][info] Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection
13:28:44.238 elastic_agent [elastic_agent][info] Starting monitoring server with cfg &config.MonitoringConfig{Enabled:true, MonitorLogs:true, MonitorMetrics:false, MetricsPeriod:"", LogMetrics:true, HTTP:(*config.MonitoringHTTPConfig)(0xc00078af90), Namespace:"default", Pprof:(*config.PprofConfig)(nil), MonitorTraces:false, APM:config.APMConfig{Environment:"", APIKey:"", SecretToken:"", Hosts:[]string(nil), GlobalLabels:map[string]string(nil), TLS:config.APMTLS{SkipVerify:false, ServerCertificate:"", ServerCA:""}, SamplingRate:(*float32)(nil)}, Diagnostics:config.Diagnostics{Uploader:config.Uploader{MaxRetries:10, InitDur:1000000000, MaxDur:600000000000}, Limit:config.Limit{Interval:60000000000, Burst:1}}}
13:28:44.238 elastic_agent [elastic_agent][info] creating monitoring API with cfg api.Config{Enabled:true, Host:"http://localhost:6791", Port:6791, User:"", SecurityDescriptor:"", Timeout:5000000000}
13:28:44.240 elastic_agent [elastic_agent][info] Starting stats endpoint
13:28:44.242 elastic_agent [elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
13:28:44.267 elastic_agent [elastic_agent][info] component model updated
13:28:44.267 elastic_agent [elastic_agent][info] Updating running component model
13:28:44.274 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-windows-85821b12-0064-11ee-b676-af36e033a9ae (HEALTHY->CONFIGURING): Configuring
13:28:44.274 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-system-85821b11-0064-11ee-b676-af36e033a9ae (HEALTHY->CONFIGURING): Configuring
13:28:44.300 elastic_agent [elastic_agent][warn] SSL/TLS verifications disabled.
13:28:45.275 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-windows-85821b12-0064-11ee-b676-af36e033a9ae (CONFIGURING->HEALTHY): Healthy
13:28:45.275 elastic_agent winlog-default [elastic_agent][info] Unit state changed winlog-default-winlog-system-85821b11-0064-11ee-b676-af36e033a9ae (CONFIGURING->HEALTHY): Healthy
19:17:13.888 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae (HEALTHY->CONFIGURING): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
19:17:13.889 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default (HEALTHY->CONFIGURING): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
19:17:33.888 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default-85821b10-0064-11ee-b676-af36e033a9ae (CONFIGURING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}
19:17:33.888 elastic_agent endpoint-default [elastic_agent][info] Unit state changed endpoint-default (CONFIGURING->HEALTHY): Applied policy {85821b10-0064-11ee-b676-af36e033a9ae}