
Add ability to remotely restart an agent #144585

Open
5 tasks
jlind23 opened this issue Nov 4, 2022 · 27 comments
Labels: QA:Needs Validation (Issue needs to be validated by QA), Team:Elastic-Agent-Control-Plane, Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@jlind23
Contributor

jlind23 commented Nov 4, 2022

There are some cases where a simple restart of an Agent may resolve common problems. Currently there's no way to do this remotely.
In order to allow this action we should offer a new API endpoint that will be shipped under an experimental status for now.
This endpoint should accept one or multiple Agent IDs so that a bulk restart can be performed if needed (a possible shape is sketched below, after the task list).

Depends on

This is a two-step issue:

  • Allow this for a single Elastic Agent
  • Allow this for multiple Elastic Agents
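
For illustration only, here is a minimal sketch of what such an endpoint could look like, modelled on the existing Fleet agent action APIs (for example upgrade and bulk_upgrade). The restart/bulk_restart paths, the payload shape, and the credentials are assumptions, not an existing API:

```python
import requests

KIBANA_URL = "https://kibana.example.com"   # assumption: your Kibana host
HEADERS = {
    "kbn-xsrf": "true",                      # required by Kibana HTTP APIs
    "Authorization": "ApiKey <api-key>",     # placeholder credentials
}

def restart_agent(agent_id: str) -> None:
    """Hypothetical single-agent restart action."""
    resp = requests.post(
        f"{KIBANA_URL}/api/fleet/agents/{agent_id}/restart",  # endpoint does not exist yet
        headers=HEADERS,
    )
    resp.raise_for_status()

def bulk_restart_agents(agent_ids: list[str]) -> None:
    """Hypothetical bulk variant taking multiple agent IDs."""
    resp = requests.post(
        f"{KIBANA_URL}/api/fleet/agents/bulk_restart",        # endpoint does not exist yet
        headers=HEADERS,
        json={"agents": agent_ids},
    )
    resp.raise_for_status()
```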
jlind23 added the Team:Fleet (Team label for Observability Data Collection Fleet team) label on Nov 4, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor

Questions:

  • What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?
  • We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.
    • Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.
    • Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?

@nimarezainia
Contributor

Closing in favour of https://github.com/elastic/ingest-dev/issues/1221

@juliaElastic
Contributor

@nimarezainia Is this issue intentionally reopened?

@nimarezainia
Contributor

nimarezainia commented Jun 5, 2023

@nimarezainia Is this issue intentionally reopened?

@juliaElastic this is the public issue. I had mistakenly closed it in favor of the private one to reduce duplicates. We should close the public issue once the implementation is complete. Hope this makes sense. The private issue has the bulk of the prioritization and implementation discussions.

@joshdover
Contributor

joshdover commented Aug 15, 2023

I think we're not yet aligned on whether we want to support this at all. If we do support it, I think it should be an advanced action not exposed in the UI, and we should have telemetry to track usage, as ideally this isn't needed often.

@ThomSwiss

We currently have 1150 agents out in our environment.
Most of them send their data to one of two Logstash instances.

Each time after a Logstash restart, all agents appear to work fine, but some are no longer able to send data. They are still shown as healthy, and in Kibana I couldn't find anything wrong, but they don't send data anymore. If I restart the elastic-agent, it works fine again. That is the reason why I need this feature.

@jlind23
Contributor Author

jlind23 commented Aug 29, 2023

@amolnater-qasource Is this scenario included in the Logstash test cases you run? If not, it's worth adding.

@amolnater-qasource

Thank you for the update @jlind23

We have added a test case to the Fleet test suite where Logstash is restarted while connected to the elastic-agent, at link:

Please let us know if we are missing anything here.
Thanks!

@jlind23
Contributor Author

jlind23 commented Aug 29, 2023

@amolnater-qasource can you please check this as soon as possible? I want to know whether we have a really bad problem here.

@amolnater-qasource

@jlind23 We have revalidated this scenario on the latest 8.10.0 BC2 Kibana cloud environment and found the issue not reproducible there.

Observations:

  • On restarting the Logstash output, new data is generated for the connected agent within 10-15 seconds of Logstash coming back up.

To reconfirm, we tried several times to reproduce this, but the data resumed for the agent as soon as the Logstash service came back up.

A few other scenarios tried:

  • Restarted Elastic-Agent from services and then restarted Logstash.
  • Stopped the agent until it went offline, brought the host back up, and then restarted Logstash.
  • Set agent logs to debug level and then restarted Logstash.

This issue isn't reproducible this way either.

Screen Recording:
Before Restart:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-30-40.mp4

After Restart:

Data.streams.-.Fleet.-.Elastic.-.Google.Chrome.2023-08-29.19-35-05.mp4

Build details:
VERSION: 8.10.0
BUILD: 66107 BC2
COMMIT: fa3473f

Please let us know if anything else is required from our end.

Thanks!

@amitkanfer

@ThomSwiss please let us know if we're running the tests differently from you; we're unable to reproduce this. If it does reproduce for you, it would be great if you could share your agent diagnostics files, and we're happy to investigate further.

@nimarezainia
Contributor

@ThomSwiss also what version are you on?

@ThomSwiss

@amitkanfer, @nimarezainia
Thanks for your help!

We use the newest Agent version, 8.9.1. We had the same issue with older releases too. We did not try all releases, but I am sure this was a problem with the 8.7.x releases as well.

I am trying to run a query on the received data to find out which clients no longer send data. Then I can run diagnostics on them. I hope to have an answer in the next 1-2 days.

@ThomSwiss

ThomSwiss commented Aug 31, 2023

I did a lot of tests over the last 2 days. I can now tell you: Elastic Agent works correctly after a Logstash restart.

My test case:

  • With a PowerShell script, run many times:
    • Get all 984 Elastic Agents with status healthy, all Windows.
    • Count the number of records received per Agent during the last 30 minutes on the winlogbeat data view (includes logs-system.application, logs-system.security, logs-windows.powershell, logs-windows.powershell_operational and a few more); see the query sketch below this list.
    • List all Agents that sent fewer than 30 records during the last 30 minutes.
  • Compare these lists across many runs.
  • Restart the two Logstash instances that receive Elastic Agent input on port 5044.
  • Result: We still received data after restarting Logstash.
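
For reference, the per-agent record count above can be reproduced with a terms aggregation against the backing indices; a minimal sketch, assuming ECS fields (agent.id, @timestamp), an index pattern like logs-*, and placeholder credentials:

```python
import requests

ES_URL = "https://elasticsearch.example.com:9200"  # assumption: your Elasticsearch host
AUTH = ("elastic", "<password>")                    # placeholder credentials

# Count documents per agent over the last 30 minutes.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30m"}}},
    "aggs": {"per_agent": {"terms": {"field": "agent.id", "size": 2000}}},
}

resp = requests.post(f"{ES_URL}/logs-*/_search", json=query, auth=AUTH)
resp.raise_for_status()

buckets = resp.json()["aggregations"]["per_agent"]["buckets"]
# Agents that sent fewer than 30 records in the last 30 minutes.
quiet_agents = [b["key"] for b in buckets if b["doc_count"] < 30]
print(quiet_agents)
```

Note that the terms aggregation only lists agents that sent at least one document in the window, so completely silent agents still have to be found by diffing against the list of healthy agents from Fleet.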

I am sorry for my incorrect earlier post. But I am still unclear about when this happened in the past. I remember at least two occurrences in the last 2 years where we had to restart all agents to get them to send data correctly again. I think sometimes it also helped when we just changed the Fleet policy, for example by adding or disabling PowerShell logs in the Windows integration. I now have this script and also my logs. I will check carefully whether this appears again and will come back with details, including diagnostics, if I have them.

Thanks for your work! Elastic is a great product.

@jlind23
Contributor Author

jlind23 commented Sep 6, 2023

@pierrehilbert @blakerouse Does Elastic Agent have a restart command that can be sent down from Fleet, just like upgrade or any other action?

@pierrehilbert
Contributor

From what I know, we don't have an action handler to restart the Agent.
@blakerouse, can you keep me honest here?

@blakerouse

Correct. The Elastic Agent doesn't support that action.

@ThomSwiss

Today I had a problem with an Elastic Agent custom logs integration: I made an error in a processor field (the Kibana Fleet GUI didn't show me an error; I had two \ signs in a replace pattern). I saved it successfully. Later the Agent changed to unhealthy.

I corrected the error, but the Agent did not change back to healthy and the message did not disappear. I waited at least 15 minutes, then restarted the elastic-agent (I had to log in to the device). After the restart, everything was fine. If you are interested, I ran diagnostics before I restarted. This is a typical use case for a restart.

@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

@nimarezainia updated the issue description following the chat we had.
cc @kpollich for awareness

@allamiro

allamiro commented Sep 12, 2023

Questions:

  • What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?

  • We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.

  • Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.

  • Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?

This is my suggestion:
I believe restricting restarts to no more than 10 to 20 agents at a time could avoid the need to schedule a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

juliaElastic added the QA:Needs Validation (Issue needs to be validated by QA) label on Sep 13, 2023
@nimarezainia
Contributor

This is my suggestion:
I believe restricting restarts to no more than 10 to 20 agents at a time could avoid the need to schedule a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

Thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe be better accommodated by the user's code that invokes this API?
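
As an illustration of that point, the batching could live entirely in the caller; a minimal sketch, assuming the hypothetical bulk_restart endpoint from the issue description, placeholder credentials, and arbitrary batch size and pause values:

```python
import time

import requests

KIBANA_URL = "https://kibana.example.com"   # assumption: your Kibana host
HEADERS = {"kbn-xsrf": "true", "Authorization": "ApiKey <api-key>"}  # placeholder credentials

def restart_in_batches(agent_ids: list[str], batch_size: int = 20, pause_s: int = 300) -> None:
    """Restart agents in small batches to limit the post-restart ingest spike."""
    for i in range(0, len(agent_ids), batch_size):
        batch = agent_ids[i : i + batch_size]
        resp = requests.post(
            f"{KIBANA_URL}/api/fleet/agents/bulk_restart",  # hypothetical endpoint
            headers=HEADERS,
            json={"agents": batch},
        )
        resp.raise_for_status()
        time.sleep(pause_s)  # let the backlog drain before restarting the next batch
```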

@zez3

zez3 commented Dec 4, 2023

This would also help if Metricbeat or other Beats contain memory-leak bugs in the future.

@msecpim

msecpim commented Nov 10, 2024

We have an infrastructure of more than 30K agents, with agents also running in remote locations. Utilising such an API would be a great advantage. Do you have any update on when it will become available?

@ThomSwiss

We have more than 12K agents and sometimes have Agents that don't do anything. After the last Windows patch, we again had to restart some agents because they weren't running correctly; they just didn't send data. After the restart, all was fine.

@nimarezainia
Contributor

We have more than 12K agents and sometimes have Agents that don't do anything. After the last Windows patch, we again had to restart some agents because they weren't running correctly; they just didn't send data. After the restart, all was fine.

@ThomSwiss this shouldn't be happening and I consider it a bug. Could you open a support case with us if possible so the issue can be diagnosed? I'm not denying that this feature would be useful; I just want to ensure the primary problem is addressed. It would be great to obtain the diagnostics file or any errors the agents produce that could give us a clue.

@allamiro

allamiro commented Dec 26, 2024

This is my suggestion:
I believe restricting restarts to no more than 10 to 20 agents at a time could avoid the need to schedule a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

Thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe be better accommodated by the user's code that invokes this API?

While I understand that the logic could technically be implemented in the user's code when invoking the API, I think the goal is to streamline the user experience and provide a safeguard against potential misuse directly within the GUI. By integrating this functionality at the GUI level, we can ensure consistent enforcement of these rules, even for users who may lack the expertise or resources to handle such logic programmatically. This approach aligns with providing a more robust and user-friendly solution.
Would you agree this might better serve a broader range of use cases?
For instance, ArcSight Management Center (ArcMC) and many other SIEM solutions offer this capability to streamline the management of agents and connectors. It's unclear why such a critical feature would be delegated to the API alone rather than being managed by Fleet and made available through the GUI in Kibana.
