The division of responsibility between agent versions during an upgrade is incorrect #3639

Open
cmacknz opened this issue Oct 19, 2023 · 5 comments
Labels: Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@cmacknz (Member) commented Oct 19, 2023

Problem

Today, the majority of the work performed during an agent upgrade is done by the currently installed version of the agent, the version the user is upgrading from. This includes:

  1. Downloading the next version of the agent.
  2. Unpacking the downloaded new agent package and extracting the new build hash.
  3. Using the build hash to determine the data path of the next agent version, assuming the structure is the same as the current version. Today this path is data/elastic-agent-$hash, where $hash is the Git commit hash truncated to exactly 6 characters (a rough sketch of this path computation follows the list).
  4. Migrating the action store and runtime state directories into the new agent data path, assuming the structure is the same as the current version.
  5. Changing the symlink to the active agent version, assuming the executable is in the same location as it is in the current version.
  6. Starting the upgrade watcher to monitor the next version of the agent and roll it back if necessary. As of v8.10.0 the elastic-agent watch command is executed using the path to the next version of the agent, but the path to this version is assumed to match the current version.

These steps are currently implemented here, and once the agent artifact is downloaded and extracted, every one of them requires the current version of the agent to make assumptions about the structure of the next agent version.

Impact

This strategy has two major flaws:

  1. Any bugs in the upgrade logic cannot be fixed without reinstalling the agent. There is no way for us to deliver fixes that correct permanent failures of the upgrade if the upgrade is itself broken.
  2. It makes it extremely challenging to change the agent directory structure in a backwards-compatible way. This is actively impacting #2579 (Add the full version number to the installation directory name), which needs to change the path to the agent executable in the next version we would upgrade to.

Solution

The solution to this problem is to move the majority of the logic needed to perform an upgrade into the next version of the agent. The current version of the agent should be responsible for as little as possible. The minimum set of steps for the current version of the agent to perform is:

  1. Download the next version of the agent.
  2. Extract the next version of the agent.
  3. Invoke the entrypoint for the executable that performs the upgrade from a stable, well-known location, ideally outside of the versioned data path. This executable may or may not be the next version of the agent itself. The location of this entrypoint should be the only thing the current version needs to know about the next version (see the sketch after this list).
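
For illustration, here is a rough sketch of how small the current version's role could become under this proposal. The entrypoint name, its location, and its flag are assumptions made up for this sketch, not an existing elastic-agent interface:

```go
package main

import (
	"log"
	"os/exec"
	"path/filepath"
)

// handOffUpgrade sketches the minimal responsibility of the current agent:
// it assumes the new package has already been downloaded and extracted (steps
// 1 and 2), then launches an upgrade entrypoint that lives at a stable,
// well-known location outside the versioned data path. The next version owns
// everything from this point on (path migration, symlink switch, watcher).
func handOffUpgrade(topPath, archivePath string) error {
	entrypoint := filepath.Join(topPath, "upgrade-entrypoint") // hypothetical stable location
	cmd := exec.Command(entrypoint, "--archive", archivePath)  // hypothetical flag
	return cmd.Start()
}

func main() {
	if err := handOffUpgrade("/opt/Elastic/Agent", "/opt/Elastic/Agent/downloads/next.tar.gz"); err != nil {
		log.Fatal(err)
	}
}
```

The design point is that the entrypoint path becomes the only contract between versions; everything behind it can change freely in the next version.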

Implementing this solution will require significant changes to the upgrade process, and will also require changing the path to the executable invoked to start the upgrade. Changing the path to the agent executable is the same set of work already started in #2579, which has a preliminary backwards-compatible solution.

Note that all changes must be backwards compatible since we allow upgrading from any version of the agent to any later version of the agent, across both 7.17.x and 8.x.x. Making a breaking change to support these changes is currently not an option.

Next Steps

Create an RFC describing the changes we must make to the upgrade process to implement the solution above (or an alternate solution that accomplishes the same goals). Describe how the upgrade process will be tested and how we will achieve backwards compatibility. The changes made must unblock #2579 and any other future changes to the directory structure of the agent.

The proposed implementation should be broken into phases to allow us to make progress incrementally and reduce the risk of the implementation. Ideally the changes to the directory structure needed to address #2579 and implement a well known upgrade entrypoint can be made separately from the migration of most upgrade functionality into that entrypoint.

@cmacknz added the Team:Elastic-Agent label on Oct 19, 2023
@elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@leehinman (Contributor) commented

One thing that tripped me up in previous upgrade scenarios is the fact that the semantic version number can be greater, but could have been produced at an earlier point in time. For example, we might first release 10.9.0 and then a few days later release 10.8.5. A user may want to upgrade from 10.8.5 -> 10.9.0, but because 10.9.0 was produced before 10.8.5 it may be missing critical information about how to do that upgrade.
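
To illustrate the pitfall with made-up data (not agent code): a plain semantic-version comparison treats 10.9.0 as the newer release, even though it was built before 10.8.5 and therefore cannot know anything about it:

```go
package main

import "fmt"

// release pairs a semantic version with the date it was actually built.
// Both fields are hypothetical example data.
type release struct {
	version string
	built   string
}

func main() {
	target := release{version: "10.9.0", built: "2024-01-10"} // released first
	source := release{version: "10.8.5", built: "2024-01-15"} // released later

	// A pure version comparison says the target is strictly newer, yet it was
	// produced before the source and may lack information the source relies on
	// during the upgrade handoff.
	fmt.Printf("upgrading %s (built %s) -> %s (built %s): the target predates the source\n",
		source.version, source.built, target.version, target.built)
}
```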

@cmacknz (Member, Author) commented Jan 15, 2024

@pierrehilbert why did this get closed? This problem still exists.

@pierrehilbert (Contributor) commented

Sorry that's a misclick. Thx for catching it!

@pierrehilbert reopened this on Jan 15, 2024
@pchila mentioned this issue on Jan 24, 2024
@pierrehilbert added the Team:Elastic-Agent-Control-Plane label on May 21, 2024
@elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
