The division of responsibility between agent versions during an upgrade is incorrect #3639

Open
cmacknz opened this issue Oct 19, 2023 · 5 comments
Labels: Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@cmacknz (Member) commented Oct 19, 2023

Problem

Today, the majority of the work performed during an agent upgrade is done by the currently installed version of the agent, the version the user is upgrading from. This includes:

  1. Downloading the next version of the agent.
  2. Unpacking the downloaded new agent package and extracting the new build hash.
  3. Using the build hash to determine the data path of the next agent version, assuming the structure is the same as the current version. Today this path is data/elastic-agent-$hash, where $hash is the Git commit hash truncated to exactly 6 characters (a rough sketch of this path computation follows the list).
  4. Migrating the action store and runtime state directories into the new agent data path, assuming the structure is the same as the current version.
  5. Changing the symlink to the active agent version, assuming the executable is in the same location as it is in the current version.
  6. Starting the upgrade watcher to monitor the next version of the agent and roll it back if necessary. As of v8.10.0 the elastic-agent watch command is executed using the path to the next version of the agent, but the path to this version is assumed to match the current version.

These steps are currently implemented here, and once the agent artifact is downloaded and extracted, every one of them requires the current version of the agent to make assumptions about the structure of the next agent version.

Impact

This strategy has two major flaws:

  1. Any bugs in the upgrade logic cannot be fixed without reinstalling the agent. There is no way for us to deliver fixes that correct permanent failures of the upgrade if the upgrade is itself broken.
  2. It makes it extremely challenging to change the agent directory structure in a backwards-compatible way. This is actively impacting #2579 (Add the full version number to the installation directory name), which needs to change the path to the agent executable in the next version we would upgrade to.

Solution

The solution to this problem is to move the majority of the logic needed to perform an upgrade into the next version of the agent. The current version of the agent should be responsible for as little as possible. The minimum set of steps for the current version of the agent to perform is:

  1. Download the next version of the agent.
  2. Extract the next version of the agent.
  3. Invoke the entrypoint for the executable that performs the upgrade from a stable, well-known location, ideally outside of the versioned data path. This executable may or may not be the next version of the agent itself. The location of this entrypoint should be the only thing the current version needs to know about the next version (see the sketch after this list).
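
For illustration, here is a rough sketch of how small the current version's role could become under this proposal. The entrypoint name, its location, and its flag are assumptions made up for this sketch, not an existing elastic-agent interface:

```go
package main

import (
	"log"
	"os/exec"
	"path/filepath"
)

// handOffUpgrade sketches the minimal responsibility of the current agent:
// it assumes the new package has already been downloaded and extracted (steps
// 1 and 2), then launches an upgrade entrypoint that lives at a stable,
// well-known location outside the versioned data path. The next version owns
// everything from this point on (path migration, symlink switch, watcher).
func handOffUpgrade(topPath, archivePath string) error {
	entrypoint := filepath.Join(topPath, "upgrade-entrypoint") // hypothetical stable location
	cmd := exec.Command(entrypoint, "--archive", archivePath)  // hypothetical flag
	return cmd.Start()
}

func main() {
	if err := handOffUpgrade("/opt/Elastic/Agent", "/opt/Elastic/Agent/downloads/next.tar.gz"); err != nil {
		log.Fatal(err)
	}
}
```

The design point is that the entrypoint path becomes the only contract between versions; everything behind it can change freely in the next version.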

Implementing this solution will require significant changes to the upgrade process, and will also require changing the path to the executable invoked to start the upgrade. Changing the path to the agent executable is the same set of work already started in #2579, which has a preliminary backwards-compatible solution.

Note that all changes must be backwards compatible since we allow upgrading from any version of the agent to any later version of the agent, across both 7.17.x and 8.x.x. Making a breaking change to support these changes is currently not an option.

Next Steps

Create an RFC describing the changes we must make to the upgrade process to implement the solution above (or an alternate solution that accomplishes the same goals). Describe how the upgrade process will be tested and how we will achieve backwards compatibility. The changes made must unblock #2579 and any other future changes to the directory structure of the agent.

The proposed implementation should be broken into phases to allow us to make progress incrementally and reduce the risk of the implementation. Ideally the changes to the directory structure needed to address #2579 and implement a well known upgrade entrypoint can be made separately from the migration of most upgrade functionality into that entrypoint.

@cmacknz added the Team:Elastic-Agent label on Oct 19, 2023
@elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@leehinman (Contributor) commented

One thing that tripped me up in previous upgrade scenarios is the fact that the semantic version number can be greater, but could have been produced at an earlier point in time. For example, we might first release 10.9.0 and then a few days later release 10.8.5. A user may want to upgrade from 10.8.5 -> 10.9.0, but because 10.9.0 was produced before 10.8.5 it may be missing critical information about how to do that upgrade.
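
To illustrate the pitfall with made-up data (not agent code): a plain semantic-version comparison treats 10.9.0 as the newer release, even though it was built before 10.8.5 and therefore cannot know anything about it:

```go
package main

import "fmt"

// release pairs a semantic version with the date it was actually built.
// Both fields are hypothetical example data.
type release struct {
	version string
	built   string
}

func main() {
	target := release{version: "10.9.0", built: "2024-01-10"} // released first
	source := release{version: "10.8.5", built: "2024-01-15"} // released later

	// A pure version comparison says the target is strictly newer, yet it was
	// produced before the source and may lack information the source relies on
	// during the upgrade handoff.
	fmt.Printf("upgrading %s (built %s) -> %s (built %s): the target predates the source\n",
		source.version, source.built, target.version, target.built)
}
```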

@cmacknz (Member, Author) commented Jan 15, 2024

@pierrehilbert why did this get closed? This problem still exists.

@pierrehilbert (Contributor) commented

Sorry that's a misclick. Thx for catching it!

@pierrehilbert reopened this on Jan 15, 2024
@pchila mentioned this issue on Jan 24, 2024
@pierrehilbert added the Team:Elastic-Agent-Control-Plane label on May 21, 2024
@elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
