Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Integration Test] New Agent after upgrade fails to start #3085

Merged
merged 14 commits into from
Sep 28, 2023

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Jul 14, 2023

What does this PR do?

Relates #2176

This PR adds an integration test that ensures an Agent upgrade is rolled back because the new Agent (after upgrade) fails to start.

The test creates a fake Agent package containing a fake Agent binary that immediately exits with a non-zero status code. The test installs the locally-built Agent and then attempts to upgrade to the fake Agent package. The upgrade is expected to fail due to the new (fake) Agent binary failing to start successfully, and therefore the upgrade should be rolled back to the previous Agent version.

Why is it important?

To ensure the upgrade/rollback behavior is working as expected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

  1. Build Agent locally.

    DEV=true EXTERNAL=true SNAPSHOT=true PLATFORMS=linux/amd64,linux/arm64 PACKAGES=tar.gz mage package
    
  2. Run the test in this PR.

    SNAPSHOT=true mage integration:single TestStandaloneUpgradeFailsRestart
    

Related issues

@ycombinator ycombinator added Team:Elastic-Agent Label for the Agent team skip-changelog Testing backport-v8.9.0 Automated backport with mergify labels Jul 14, 2023
@elasticmachine
Copy link
Contributor

elasticmachine commented Jul 14, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-09-28T16:34:48.934+0000

  • Duration: 25 min 7 sec

Test stats 🧪

Test Results
Failed 0
Passed 6313
Skipped 59
Total 6372

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Jul 14, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.78% (81/82) 👍
Files 66.102% (195/295) 👍
Classes 65.693% (360/548) 👍
Methods 52.744% (1134/2150) 👍
Lines 38.197% (12877/33712) 👎 -0.006
Conditionals 100.0% (0/0) 💚

@ycombinator
Copy link
Contributor Author

The test in this PR is currently failing. That's because the Upgrade Watcher process seems to get killed off prematurely by the systemd service when it attempts to restart the failing fake Agent process. You can see this in action by starting the test, then SSH'ing into the VM where the test is running, and running the following command after the test attempts the upgrade.

watch -n1 systemctl status elastic-agent.service

For about a minute and a half, the above command will show that the fake Agent process is failing with a non-zero status code. It will also show the Upgrade Watcher process running.

Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:51:38 2023

● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
    Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
   Main PID: 97938 (code=exited, status=101)
      Tasks: 6 (limit: 4637)
     Memory: 105.8M
        CPU: 3.386s
     CGroup: /system.slice/elastic-agent.service
             └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Then, after about a minute and a half, systemd will attempt to restart the fake Agent process. You can tell this by seeing the Main PID change in the status command's output. However, as a result of this attempted restart, the Upgrade Watcher process also gets killed. You can see it disappearing from the output.

Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:52:32 2023

● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Fri 2023-07-14 18:52:20 UTC; 12s ago
    Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
   Main PID: 97938 (code=exited, status=101)
        CPU: 3.396s

Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98043 (elastic-agent) with signal SIGKILL.
Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98044 (elastic-agent) with signal SIGKILL.
Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98045 (elastic-agent) with signal SIGKILL.
Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Consumed 3.396s CPU time.

As an aside (likely a separate bug), looking through the Upgrade Watcher's log, I don't think the Crash Checker is retrieving the correct PID for the Agent process from systemd. Note the Main PID mentioned in the systemctl status output above (97938) and compare it with the PID seen in the Upgrade Watcher's log (0):

grep -i 'crash.*service pid' /opt/Elastic/Agent/data/elastic-agent-903287/logs/elastic-agent-watcher-20230714-1.ndjson
{"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:51:10.214Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:51:20.215Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:51:30.216Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:51:40.217Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:51:50.218Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:52:00.219Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:52:10.220Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-07-14T18:52:20.222Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}

@ycombinator
Copy link
Contributor Author

@mergify
Copy link
Contributor

mergify bot commented Aug 2, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b integ-test-upgrade-fails-restart upstream/integ-test-upgrade-fails-restart
git merge upstream/main
git push upstream integ-test-upgrade-fails-restart

@ycombinator
Copy link
Contributor Author

Currently blocked on #3274

@ycombinator
Copy link
Contributor Author

Currently blocked on #3274

#3724 was just merged. Rebasing this PR on main and resuming development on it...

@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-restart branch from 1d39efd to af0d877 Compare September 8, 2023 16:03
@mergify
Copy link
Contributor

mergify bot commented Sep 8, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b integ-test-upgrade-fails-restart upstream/integ-test-upgrade-fails-restart
git merge upstream/main
git push upstream integ-test-upgrade-fails-restart

@pierrehilbert pierrehilbert added backport-v8.10.0 Automated backport with mergify and removed backport-v8.9.0 Automated backport with mergify labels Sep 8, 2023
@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-restart branch 3 times, most recently from 640711c to 4304b73 Compare September 12, 2023 14:02
@ycombinator ycombinator marked this pull request as ready for review September 13, 2023 00:16
@ycombinator ycombinator requested a review from a team as a code owner September 13, 2023 00:16
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!


t.Logf("Restarting Agent via service to simulate crashing")
err = install.RestartService(topPath)
require.NoError(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I really like the simplicity of making it think that something is wrong with it.

@mergify
Copy link
Contributor

mergify bot commented Sep 20, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b integ-test-upgrade-fails-restart upstream/integ-test-upgrade-fails-restart
git merge upstream/main
git push upstream integ-test-upgrade-fails-restart

@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-restart branch from 6faf005 to 96dca69 Compare September 27, 2023 20:23
@ycombinator ycombinator force-pushed the integ-test-upgrade-fails-restart branch from a6fcd66 to eadf520 Compare September 28, 2023 16:34
@elastic-sonarqube
Copy link

SonarQube Quality Gate

Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@ycombinator ycombinator enabled auto-merge (squash) September 28, 2023 17:49
@ycombinator ycombinator merged commit 99bc6d7 into elastic:main Sep 28, 2023
11 checks passed
mergify bot pushed a commit that referenced this pull request Sep 28, 2023
* Adding skeleton

* WIP

* Finish writing test

* Fixing upgrade checker completion check function name

* Fix client.Upgrade() call

* Fix other errors

* Increasing context timeout

* Trying a different way of crashing the upgraded Agent

* Trying yet another (much simpler!) way to crash upgraded Agent

* Remove OS restriction

* Use newer checks

* Fix expected versions in checks

* Fix changed function name

* Excluding generated mocks from code coverage analysis

(cherry picked from commit 99bc6d7)

# Conflicts:
#	sonar-project.properties
@ycombinator ycombinator deleted the integ-test-upgrade-fails-restart branch September 28, 2023 18:20
ycombinator added a commit that referenced this pull request Oct 2, 2023
* Adding skeleton

* WIP

* Finish writing test

* Fixing upgrade checker completion check function name

* Fix client.Upgrade() call

* Fix other errors

* Increasing context timeout

* Trying a different way of crashing the upgraded Agent

* Trying yet another (much simpler!) way to crash upgraded Agent

* Remove OS restriction

* Use newer checks

* Fix expected versions in checks

* Fix changed function name

* Excluding generated mocks from code coverage analysis

(cherry picked from commit 99bc6d7)

# Conflicts:
#	sonar-project.properties
ycombinator added a commit that referenced this pull request Oct 2, 2023
* Adding skeleton

* WIP

* Finish writing test

* Fixing upgrade checker completion check function name

* Fix client.Upgrade() call

* Fix other errors

* Increasing context timeout

* Trying a different way of crashing the upgraded Agent

* Trying yet another (much simpler!) way to crash upgraded Agent

* Remove OS restriction

* Use newer checks

* Fix expected versions in checks

* Fix changed function name

* Excluding generated mocks from code coverage analysis

(cherry picked from commit 99bc6d7)

# Conflicts:
#	sonar-project.properties
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.10.0 Automated backport with mergify skip-changelog Team:Elastic-Agent Label for the Agent team Testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants