Upgrade verifiers do not retry if the download fails or times out. #5163

Closed
cllasyx opened this issue Jul 17, 2024 · 10 comments · Fixed by #6276
Labels: bug, Team:Elastic-Agent-Control-Plane

Comments

@cllasyx

cllasyx commented Jul 17, 2024

Hello, I deployed Elastic Agent with Fleet Server version 8.14.2 and tried to upgrade to 8.14.3 a few days later.

While watching the logs through Observability -> Logs -> Stream, I noticed some error messages from the elastic_agent dataset. The logs are provided below, along with a temporary fix.

Steps to reproduce:

  • Upgrade Elastic Agent (Fleet Server) from Kibana UI.

Log output:

12:28:50.843 elastic_agent [elastic_agent][info] download from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz completed in 48 seconds @ 7.16MBps
12:28:50.843 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] download from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:50.854 elastic_agent [elastic_agent][info] updated upgrade details
12:28:51.464 elastic_agent [elastic_agent][info] Default PGP appended
12:29:21.465 elastic_agent [elastic_agent][warn] Skipped remote PGP located at "https://artifacts.elastic.co/GPG-KEY-elastic-agent" because it's unavailable: 2 errors occurred:
	* Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": context deadline exceeded
	* Remote PGP download failed

12:29:21.468 elastic_agent [elastic_agent][warn] Skipped remote PGP located at "https://localhost:8221/api/agents/upgrades/8.14.3/pgp-public-key" because it's unavailable: 2 errors occurred:
	* Get "https://localhost:8221/api/agents/upgrades/8.14.3/pgp-public-key": x509: certificate is valid for myfleet.example.com, not localhost
	* Remote PGP download failed

12:29:21.468 elastic_agent [elastic_agent][info] Using 1 PGP keys
12:29:52.081 elastic_agent [elastic_agent][info] Cleaning up non-matching downloaded versions
12:29:52.114 elastic_agent [elastic_agent][error] upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
	* could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
	* fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

12:29:52.114 elastic_agent [elastic_agent][info] updated upgrade details

Bug fix (manual):

  • While the upgrade is in progress, go to the folder /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads on the Fleet Server.
  • From that folder, download the .asc file manually:
root@myfleet:/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads# curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc
  • If the upgrade failed the first time, start it again from the Kibana UI.
    • The data stream now shows "Using 2 PGP keys", because the second one required for the update archive was placed manually in the downloads directory.
  • Wait until the upgrade completes successfully.

Notes

My Fleet Server host listens on socket *:8220 under the domain name https://myfleet.example.com:8220. The host has another socket open on 127.0.0.1:8221, which is used for internal API operations. My firewall's OUTPUT chain accepts everything, and the INPUT chain has a rule that accepts all connections to the loopback adapter: iptables -A INPUT -i lo -j ACCEPT.

@cmacknz
Member

cmacknz commented Jul 17, 2024

12:29:52.114 elastic_agent [elastic_agent][error] upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
	* could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
	* fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

This context deadline exceeded for https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc is the source of the failure. It appears to be a timeout downloading the .asc file.

Since you could download it manually later, my first thought is this was a transient network error or problem with our artifacts CDN.

Is this still happening to your agents? Were you able to download the file while the agent was failing? This may indicate the problem is actually that our download timeout for this file needs to be longer.

@cmacknz cmacknz transferred this issue from elastic/fleet-server Jul 17, 2024
@dhanfocus

I'm getting the same error when upgrading from 8.14.1 to 8.14.3. All I did was apply the upgrade again through the Fleet UI.

upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
	* could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.1-1348b9/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.1-1348b9/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
	* fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

@cllasyx
Author

cllasyx commented Jul 18, 2024

12:29:52.114 elastic_agent [elastic_agent][error] upgrade to version 8.14.3 failed: failed verification of agent binary: 2 errors occurred:
	* could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.14.2-173817/downloads/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: no such file or directory
	* fetching asc file from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc: failed loading public key: Get "https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc": context deadline exceeded

This context deadline exceeded for https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc is the source of the failure. It appears to be a timeout downloading the .asc file.

Since you could download it manually later, my first thought is this was a transient network error or problem with our artifacts CDN.

Is this still happening to your agents? Were you able to download the file while the agent was failing? This may indicate the problem is actually that our download timeout for this file needs to be longer.

I could indeed download the .asc file manually while the upgrade was in progress. I tried upgrading twice in a row, retrying right after the first failure. The result was the same, so my only option was to download the file manually while the upgrade was running, to work around the timeout.

You're most likely right that the timeout period is too short.

As for the agents: I don't currently have any outdated agent on which I could test this again.

@ycombinator
Contributor

@cllasyx Would you mind timing your curl command from the same host as before, so we can get a sense of how long it's taking?

time curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.14.3-linux-x86_64.tar.gz.asc

Thanks.

@ycombinator ycombinator added the bug and good first issue labels Jul 18, 2024
@cmacknz
Member

cmacknz commented Jul 22, 2024

I don't think it matters how long the download takes on this system. I can see in our code that the .asc download does not share a context timeout with the agent package download and does not have retries.

path, err := downloaderFunc(ctx, factory, parsedVersion, &settings, upgradeDetails)
if err != nil {
	return "", errors.New(err, "failed download of agent binary")
}
if skipVerifyOverride {
	return path, nil
}
if verifier == nil {
	verifier, err = newVerifier(parsedVersion, u.log, &settings)
	if err != nil {
		return "", errors.New(err, "initiating verifier")
	}
}
if err := verifier.Verify(agentArtifact, *parsedVersion, skipDefaultPgp, pgpBytes...); err != nil {
	return "", errors.New(err, "failed verification of agent binary")
}
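
To illustrate the point about the context timeout, here is a minimal sketch (not the agent's code) of what sharing a context would mean. A per-request timeout derived from a parent context is capped by the parent's deadline, while one derived from context.Background() always gets its full duration. The helper names requestBudget, boundedByParent and detachedFromParent, as well as the 2-second parent deadline, are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// requestBudget returns how long a request context created with the given
// per-request timeout can live when it is derived from parent.
func requestBudget(parent context.Context, perRequest time.Duration) time.Duration {
	ctx, cancel := context.WithTimeout(parent, perRequest)
	defer cancel()
	deadline, _ := ctx.Deadline() // WithTimeout always sets a deadline
	return time.Until(deadline)
}

// boundedByParent derives a 30s request timeout from a parent that expires
// in 2s (a stand-in for an overall upgrade deadline) and reports whether the
// request is capped by the parent.
func boundedByParent() bool {
	parent, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return requestBudget(parent, 30*time.Second) <= 2*time.Second
}

// detachedFromParent derives the same 30s timeout from context.Background(),
// so no outer deadline can shorten it.
func detachedFromParent() bool {
	return requestBudget(context.Background(), 30*time.Second) > 29*time.Second
}

func main() {
	fmt.Println("capped by parent deadline:", boundedByParent())
	fmt.Println("full 30s when derived from Background:", detachedFromParent())
}
```

The same rule applies to cancellation: cancelling the parent also cancels every context derived from it, which a context built on context.Background() never sees.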

In the case of the HTTP verifier, we make a single attempt to get it with a 30s timeout and no retries, which is definitely wrong. 30s is fine as the timeout for an individual request, but we should keep retrying as long as the overall upgrade download timeout has not expired.

func (v *Verifier) getPublicAsc(sourceURI string) ([]byte, error) {
	ctx, cancelFn := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancelFn()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, sourceURI, nil)
	if err != nil {
		return nil, errors.New(err, "failed create request for loading public key", errors.TypeNetwork, errors.M(errors.MetaKeyURI, sourceURI))
	}
	resp, err := v.client.Do(req)
	if err != nil {
		return nil, errors.New(err, "failed loading public key", errors.TypeNetwork, errors.M(errors.MetaKeyURI, sourceURI))
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return nil, errors.New(fmt.Sprintf("call to '%s' returned unsuccessful status code: %d", sourceURI, resp.StatusCode), errors.TypeNetwork, errors.M(errors.MetaKeyURI, sourceURI))
	}
	return io.ReadAll(resp.Body)
}
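
As a rough sketch of the behaviour described above (a per-attempt timeout, with retries until an overall deadline expires), something along these lines could work. This is not the change that fixed the issue and it uses only the standard library rather than the agent's own errors package. The names fetchOnce, fetchWithRetry and demo, the doubling backoff, and the local test server are all assumptions made for illustration. For simplicity it retries on every error, whereas a real implementation would probably stop on non-transient responses such as 404.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// fetchOnce performs a single GET whose lifetime is bounded by perAttempt
// and by the deadline of the parent ctx, whichever comes first.
func fetchOnce(ctx context.Context, client *http.Client, url string, perAttempt time.Duration) ([]byte, error) {
	attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
	defer cancel()
	req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("creating request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetching %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching %s: unexpected status %d", url, resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// fetchWithRetry calls fetchOnce until it succeeds or ctx is done, waiting
// between attempts with a backoff that doubles each time.
func fetchWithRetry(ctx context.Context, client *http.Client, url string, perAttempt, backoff time.Duration) ([]byte, error) {
	for {
		body, err := fetchOnce(ctx, client, url, perAttempt)
		if err == nil {
			return body, nil
		}
		select {
		case <-ctx.Done():
			// Overall deadline expired: report it together with the last failure.
			return nil, errors.Join(ctx.Err(), err)
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}

// demo runs fetchWithRetry against a local test server that answers the
// first two requests with 503 and the third with a fake signature body.
func demo() (string, int, error) {
	var attempts atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if attempts.Add(1) < 3 {
			http.Error(w, "try again", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "fake-signature")
	}))
	defer srv.Close()

	// The 5s context stands in for the overall upgrade download timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	body, err := fetchWithRetry(ctx, srv.Client(), srv.URL, time.Second, 10*time.Millisecond)
	return string(body), int(attempts.Load()), err
}

func main() {
	body, attempts, err := demo()
	fmt.Printf("body=%q attempts=%d err=%v\n", body, attempts, err)
}
```

Because each attempt's context is derived from the caller's ctx, the overall deadline bounds both the individual requests and the waits between them.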

@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane label Jul 22, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz removed the good first issue label Jul 22, 2024
@cmacknz cmacknz changed the title from "Failed Elastic Agent 8.14.3 upgrade from Kibana UI with manual fix" to "Upgrade verifiers do not retry if the download fails or times out." Jul 22, 2024
@lucabelluccini
Contributor

Do we agree the problem here was the download of the asc file?

The PGP key download is not mandatory atm - as it will try anyway to use the one embedded in the binary itself?

@cllasyx
Author

cllasyx commented Aug 23, 2024

Do we agree the problem here was the download of the asc file?

The PGP key download is not mandatory atm - as it will try anyway to use the one embedded in the binary itself?

Yes, the problem was definitely the download of the .asc file used for PGP verification.

@lucabelluccini
Contributor

The asc is not the PGP key.
What I meant by the question is: the PGP warning is a red herring. Downloading the asc was the problem.

@cllasyx
Author

cllasyx commented Aug 26, 2024

I didn't say the asc file is the PGP key; I said it's used for verification, which is true.
And my response states that "the problem was definitely the download of the .asc file", which is the answer to your question.
