[Fleet]: Hosted fleet server gets unhealthy on 8.16 Snapshot. #5615

Closed
harshitgupta-qasource opened this issue Sep 24, 2024 · 14 comments · Fixed by #5616 or elastic/apm-server#14214
Labels
bug Something isn't working impact:critical Immediate priority; high value or cost to the product. QA:Validated Validated by the QA Team Team:Fleet Label for the Fleet team

Comments

@harshitgupta-qasource

Deployment Links:

Description:
The hosted Fleet Server becomes unhealthy on the 8.16 Snapshot, and we have observed that the APM integration shows an error.

Build details:
VERSION: 8.16.0 SNAPSHOT
BUILD: 78494
COMMIT: 156a76cb03e60a89792f905642817405002099a1

Screenshot
Image
Image

@harshitgupta-qasource harshitgupta-qasource added bug Something isn't working impact:critical Immediate priority; high value or cost to the product. labels Sep 24, 2024
@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource

Secondary Review for this ticket is Done.

@amolnater-qasource amolnater-qasource added the Team:Fleet Label for the Fleet team label Sep 24, 2024
@ycombinator
Contributor

I'm able to reproduce this simply by creating an 8.16.0-SNAPSHOT deployment on ESS production in the CFT region. I downloaded the diagnostic and checked for errors in the logs. Here's what I see:

$ cat elastic-agent-20240924-1.ndjson | grep error | jq '.message'
...
"Component state changed apm-es-containerhost (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Unit state changed apm-es-containerhost (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Unit state changed apm-es-containerhost-elastic-cloud-apm (STARTING->FAILED): Failed: pid '11978' exited with code '1'"
"Error: error loading config file: stat apm-server.yml: no such file or directory"
"Usage:"
"apm-server [flags]"
"apm-server [command]"
"Available Commands:"
"apikey      Manage API Keys for communication between APM agents and server (deprecated)"
"export      Export current config"
"help        Help about any command"
"keystore    Manage secrets keystore"
"run         Run APM Server"
"test        Test config"
"version     Show current version info"
"Flags:"
"-E, --E setting=value      Configuration overwrite"
"-N, --N                    Disable actual publishing for testing"
"-c, --c string             Configuration file, relative to path.config (default \"apm-server.yml\")"
"--cpuprofile string    Write cpu profile to file"
"-d, --d stringArray        Enable certain debug selectors"
"-e, --e                    Log to stderr and disable syslog/file output"
"--environment string   Set the environment in which the process is running (default \"default\")"
"-h, --help                 help for apm-server"
"--httpprof string      Start pprof http server"
"--memprofile string    Write memory profile to this file"
"--path.config string   Configuration path"
"--path.data string     Data path"
"--path.home string     Home path"
...

So it seems like the apm-server.yml file is missing.
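For anyone repeating this triage without jq, here is a minimal Python sketch of the same filter. It assumes each record carries a flat `log.level` key and a `message` key; the function name `error_messages` is made up for illustration. Unlike the `grep error` above, it only keeps records logged at error level rather than any line containing the word "error".

```python
import json
from typing import Iterable, Iterator


def error_messages(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "message" field of every NDJSON record logged at error level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if record.get("log.level") == "error" and "message" in record:
            yield record["message"]


# Made-up sample records, only to show the call shape.
sample = [
    '{"log.level":"info","message":"Spawned new unit"}',
    '{"log.level":"error","message":"Error: error loading config file"}',
    "not json at all",
]
print(list(error_messages(sample)))  # ['Error: error loading config file']
```

To run it over a diagnostic, pass an open file object, e.g. `error_messages(open(path))`, where `path` points at the .ndjson log.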

@cmacknz
Member

cmacknz commented Sep 24, 2024

Pinged APM server team in Slack for ideas.

@cmacknz
Member

cmacknz commented Sep 24, 2024

From the Slack discussion, this looks related to the group membership change in the Wolfi container, where the agent user is no longer in gid user. The permissions of the apm-server.yml file make it unreadable to the elastic-agent processes:

apm-server binary is 1000:1000 while apm-server.yml is 0:0
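To make the ownership argument concrete, here is a rough Python sketch of the classic owner/group/other read check. It is a simplification: it ignores ACLs and treats uid 0 as always allowed. The function name `can_read` is hypothetical, and the 0o640 mode in the example is an assumption, since the thread only gives the ownership (0:0), not the mode bits.

```python
import stat
from typing import Iterable


def can_read(mode: int, owner_uid: int, owner_gid: int,
             uid: int, gids: Iterable[int]) -> bool:
    """Approximate the read check for a file using only classic mode bits."""
    if uid == 0:
        return True  # assumption: root keeps its usual override capability
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in set(gids):
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# File owned by 0:0 with an assumed mode of 0o640, process running as uid 1000.
print(can_read(0o640, 0, 0, uid=1000, gids=[1000]))     # False
print(can_read(0o640, 0, 0, uid=1000, gids=[1000, 0]))  # True
```

The second call shows why dropping the user from the file's group matters: with group membership the group read bit applies, without it only the "other" bits do. If the file were world-readable (0o644), the group change alone would not block reads.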

@ycombinator
Contributor

Thanks @cmacknz. I'm guessing the fix here is on the APM Server end where the apm-server.yml file (and any others) need to be readable by the Elastic Agent user?

@ycombinator
Contributor

Chatting with the APM Server team in Slack, the file ownership fix needs to be made in the Agent packaging step, specifically here (thanks @kruskall for pinpointing):

RUN chmod 0777 {{ $beatHome }} && \
usermod -d {{ $beatHome}} {{ .user }} && \
find {{ $beatHome }}/data/elastic-agent-{{ commit_short }}/components -name "*.yml*" -type f -exec chown root:root {} \; && \
true
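As a read-only companion to the `find` expression in that packaging step, the sketch below lists the regular files matching `*.yml*` under a directory together with their numeric owner and group. It changes nothing on disk and is only meant for auditing an image; the function name `yml_file_owners` is made up.

```python
import fnmatch
import os
import stat
from typing import Iterator, Tuple


def yml_file_owners(root: str, pattern: str = "*.yml*") -> Iterator[Tuple[str, int, int]]:
    """Yield (path, uid, gid) for regular files under root whose name matches pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not fnmatch.fnmatchcase(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks, like find -type f
            if stat.S_ISREG(st.st_mode):
                yield path, st.st_uid, st.st_gid


if __name__ == "__main__":
    # "." is a placeholder; point it at the components directory of the image under test.
    for path, uid, gid in yml_file_owners("."):
        print(f"{uid}:{gid} {path}")
```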

@pkoutsovasilis
Contributor

pkoutsovasilis commented Sep 25, 2024

Since PR #4925 is merged for 8.16.0-SNAPSHOT, I am wondering if the only thing missing here is giving SETPCAP and CHOWN permissions to the elastic-agent pod?!

@ycombinator
Contributor

... I am wondering if the only thing missing here is giving CHOWN permission to the elastic-agent pod?!

@pkoutsovasilis Do we need to do this even after #5616 is merged?

@pkoutsovasilis
Contributor

It should be solved when this is merged. But I am wondering if merging it will cause issues in the opposite scenario, when somebody runs the image under root without SETPCAP and CHOWN perms 🙂
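One way to check which of these capabilities a running process actually holds is to decode the `CapEff` line of `/proc/<pid>/status` on Linux. The sketch below is a minimal decoder; the bit numbers come from `linux/capability.h` (CAP_CHOWN is 0, CAP_SETPCAP is 8), while the helper names and the sample mask are made up. It says nothing about whether the pod should be granted them.

```python
CAP_CHOWN = 0    # bit number from linux/capability.h
CAP_SETPCAP = 8  # bit number from linux/capability.h


def effective_caps_mask(status_text: str) -> str:
    """Extract the CapEff hex mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no CapEff line found")


def has_capability(cap_mask_hex: str, cap: int) -> bool:
    """Return True if bit `cap` is set in a hex capability mask."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)


# Made-up status excerpt with only bit 8 (CAP_SETPCAP) set.
sample_status = "Name:\tdemo\nCapEff:\t0000000000000100\n"
mask = effective_caps_mask(sample_status)
print(has_capability(mask, CAP_CHOWN))    # False
print(has_capability(mask, CAP_SETPCAP))  # True
```

Inside a container, `open("/proc/self/status").read()` would supply the real text for the current process.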

@pchila
Member

pchila commented Sep 30, 2024

Issue is still present for agent version 8.16.0-SNAPSHOT

build_time: 2024-09-27T20:51:03Z
commit: 601aaebb1893f1f470531b3f921dc6fb9d965306
snapshot: true
version: 8.16.0

From the diagnostics I pulled from an 8.16.0-SNAPSHOT deployment in CFT, we can see that apm-server is still looking for apm-server.yml:

{"log.level":"info","@timestamp":"2024-09-30T07:01:42.992Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":632},"message":"Spawned new unit apm-es-containerhost-elastic-cloud-apm: Starting: spawned pid '206'","log":{"source":"elastic-agent"},"component":{"id":"apm-es-containerhost","state":"STARTING"},"unit":{"id":"apm-es-containerhost-elastic-cloud-apm","type":"input","state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-30T07:01:43.089Z","message":"Error: error loading config file: stat apm-server.yml: no such file or directory","component":{"binary":"apm-server","dataset":"elastic_agent.apm_server","id":"apm-es-containerhost","type":"apm"},"log":{"source":"apm-es-containerhost"},"ecs.version":"1.6.0"}

@pchila
Member

pchila commented Sep 30, 2024

Rebuilt a new image from the 8.x branch; the ownership of apm-server.yml is now consistent with the docker image user:

bash-5.2$ ls -la /usr/share/elastic-agent/
total 3904
drwxrwxrwx    1 elastic- elastic-      4096 Sep 30 08:27 .
drwxr-xr-x    1 root     root          4096 Sep 30 08:27 ..
-rw-r--r--    1 elastic- elastic-        41 Sep 30 08:27 .build_hash.txt
-rw-r--r--    1 elastic- elastic-        41 Sep 30 08:27 .elastic-agent.active.commit
-rw-r--r--    1 elastic- elastic-      3860 Sep 30 08:27 LICENSE.txt
-rw-r--r--    1 elastic- elastic-   3906754 Sep 30 08:27 NOTICE.txt
-rw-r--r--    1 elastic- elastic-       360 Sep 30 08:27 README.md
drwxrwxrwx    1 elastic- elastic-      4096 Sep 30 08:27 data
lrwxrwxrwx    1 elastic- elastic-        64 Sep 30 08:27 elastic-agent -> /usr/share/elastic-agent/data/elastic-agent-110229/elastic-agent
-rw-r--r--    1 elastic- elastic-     14829 Sep 30 08:27 elastic-agent.reference.yml
-rw-r--r--    1 elastic- elastic-     10409 Sep 30 08:27 elastic-agent.yml
-rw-r--r--    1 elastic- elastic-       376 Sep 30 08:27 manifest.yaml
-rw-r--r--    1 elastic- elastic-       643 Sep 30 08:27 otel.yml
drwxr-xr-x    2 elastic- elastic-      4096 Sep 30 08:27 otel_samples
-rw-r--r--    1 elastic- elastic-        85 Sep 30 08:27 otelcol
bash-5.2$ ls -la /usr/share/elastic-agent/data/elastic-agent-110229/components/
total 1119252
drwxrwxrwx    1 elastic- elastic-      4096 Sep 30 08:27 .
drwxrwxrwx    1 elastic- elastic-      4096 Sep 30 08:27 ..
-rw-rw-rw-    1 elastic- elastic-        41 Sep 30 08:27 .build_hash.txt
-rw-rw-rw-    1 elastic- elastic-     13675 Sep 30 08:27 LICENSE.txt
-rw-rw-rw-    1 elastic- elastic-    377163 Sep 30 08:27 NOTICE.pf-elastic-collector.txt
-rw-rw-rw-    1 elastic- elastic-    549000 Sep 30 08:27 NOTICE.pf-elastic-symbolizer.txt
-rw-rw-rw-    1 elastic- elastic-    997964 Sep 30 08:27 NOTICE.pf-host-agent.txt
-rw-rw-rw-    1 elastic- elastic-     96850 Sep 30 08:27 NOTICE.txt
-rw-rw-rw-    1 elastic- elastic-       851 Sep 30 08:27 README.md
-rwxr-xr-x    1 elastic- elastic- 419405240 Sep 30 08:27 agentbeat
-rw-r--r--    1 elastic- elastic-     16530 Sep 30 08:27 agentbeat.spec.yml
-rwxr-xr-x    1 elastic- elastic-  56025240 Sep 30 08:27 apm-server
-rw-r--r--    1 elastic- elastic-       542 Sep 30 08:27 apm-server.spec.yml
-rw-r--r--    1 elastic- elastic-     39322 Sep 30 08:27 apm-server.yml
-rw-rw-rw-    1 elastic- elastic-    273830 Sep 30 08:27 bundle.tar.gz
drwxrwxrwx    2 elastic- elastic-      4096 Sep 30 08:27 certs
-rw-r--r--    1 elastic- elastic-      3142 Sep 30 08:27 checksum.yml
-rwxr-xr-x    1 elastic- elastic- 104601169 Sep 30 08:27 cloud-defend
-rw-r--r--    1 elastic- elastic-       442 Sep 30 08:27 cloud-defend.spec.yml
-rwxr-xr-x    1 elastic- elastic- 252076353 Sep 30 08:27 cloudbeat
-rw-r--r--    1 elastic- elastic-      2541 Sep 30 08:27 cloudbeat.spec.yml
-rw-r--r--    1 elastic- elastic-      6920 Sep 30 08:27 cloudbeat.yml
-rwxr-xr-x    1 elastic- elastic-  26417288 Sep 30 08:27 endpoint-security
-rw-rw-rw-    1 elastic- elastic-  26978766 Sep 30 08:27 endpoint-security-resources.zip
-rw-r--r--    1 elastic- elastic-      3608 Sep 30 08:27 endpoint-security.spec.yml
-rwxr-xr-x    1 elastic- elastic-  38035777 Sep 30 08:27 fleet-server
-rw-r--r--    1 elastic- elastic-       423 Sep 30 08:27 fleet-server.spec.yml
-rw-rw-rw-    1 elastic- elastic-   8621110 Sep 30 08:27 java-attacher.jar
drwxrwxrwx    2 elastic- elastic-     12288 Sep 30 08:27 lenses
drwxrwxrwx    1 elastic- elastic-      4096 Sep 30 08:27 module
-rwxr-xr-x    1 elastic- elastic-   6788339 Sep 30 08:27 osquery-extension.ext
-rwxr-xr-x    1 elastic- elastic-  86504168 Sep 30 08:27 osqueryd
-rwxr-xr-x    1 elastic- elastic-  20934888 Sep 30 08:27 pf-elastic-collector
-rw-r--r--    1 elastic- elastic-       283 Sep 30 08:27 pf-elastic-collector.spec.yml
-rwxr-xr-x    1 elastic- elastic-  21367848 Sep 30 08:27 pf-elastic-symbolizer
-rw-r--r--    1 elastic- elastic-       285 Sep 30 08:27 pf-elastic-symbolizer.spec.yml
-rwxr-xr-x    1 elastic- elastic-  75825808 Sep 30 08:27 pf-host-agent
-rw-r--r--    1 elastic- elastic-       406 Sep 30 08:27 pf-host-agent.spec.yml
bash-5.2$ whoami
elastic-agent

Launching apm-server from /usr/share/elastic-agent/data/elastic-agent-110229/components/ works; launching it from /usr/share/elastic-agent/ reproduces the issue, with apm-server erroring out because apm-server.yml is not found:

bash-5.2$ cd  /usr/share/elastic-agent/data/elastic-agent-110229/components/
bash-5.2$ ./apm-server
bash-5.2$ cd -
/usr/share/elastic-agent
bash-5.2$  /usr/share/elastic-agent/data/elastic-agent-110229/components/apm-server
Error: error loading config file: stat apm-server.yml: no such file or directory
Usage:
  apm-server [flags]
  apm-server [command]

Available Commands:
  apikey      Manage API Keys for communication between APM agents and server (deprecated)
  export      Export current config
  help        Help about any command
  keystore    Manage secrets keystore
  run         Run APM Server
  test        Test config
  version     Show current version info

Flags:
  -E, --E setting=value      Configuration overwrite
  -N, --N                    Disable actual publishing for testing
  -c, --c string             Configuration file, relative to path.config (default "apm-server.yml")
      --cpuprofile string    Write cpu profile to file
  -d, --d stringArray        Enable certain debug selectors
  -e, --e                    Log to stderr and disable syslog/file output
      --environment string   Set the environment in which the process is running (default "default")
  -h, --help                 help for apm-server
      --httpprof string      Start pprof http server
      --memprofile string    Write memory profile to this file
      --path.config string   Configuration path
      --path.data string     Data path
      --path.home string     Home path
      --path.logs string     Logs path
      --strict.perms         Strict permission checking on config files (default true)
  -v, --v                    Log at INFO level

Use "apm-server [command] --help" for more information about a command.

bash-5.2$ pwd
/usr/share/elastic-agent
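The difference between the two launches comes down to how a relative path is resolved: `stat apm-server.yml` is looked up against the process's current working directory, not against the directory holding the binary. A small self-contained Python sketch of that behavior, using temporary directories as stand-ins for the real paths (the helper name `config_visible_from` is made up):

```python
import os
import tempfile


def config_visible_from(workdir: str, config_name: str = "apm-server.yml") -> bool:
    """Return True if a relative stat of config_name succeeds from workdir."""
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        os.stat(config_name)  # relative path: resolved against the current directory
        return True
    except FileNotFoundError:
        return False
    finally:
        os.chdir(previous)  # always restore the original working directory


with tempfile.TemporaryDirectory() as components, tempfile.TemporaryDirectory() as elsewhere:
    # Stand-in for the components directory, which ships an apm-server.yml.
    open(os.path.join(components, "apm-server.yml"), "w").close()
    print(config_visible_from(components))  # True
    print(config_visible_from(elsewhere))   # False
```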

@kruskall it seems that the apm-server binary fails to start even when the ownership/permissions of apm-server.yml are consistent with the ownership/permissions of the binary itself... Do you have any idea if anything else changed recently?

Edit: It seems that apm-server does not honor -E management.enabled=true, which should delay configuration until it can be received from agent. apm-server uses libbeat to load the configuration from file, so I tried to reproduce the same symptom using agentbeat: the only way that binary exits dumping the help text is when -E management.enabled=true is not specified (I am assuming that apm-server behaves like other beats when loading configuration).

@pchila
Member

pchila commented Sep 30, 2024

After testing with apm-server that included a modified version of beats with additional logging:

  • The issue is due to the -E management.enabled=true CLI argument not being parsed correctly, so libbeat cfgfile.HandleFlags() calls fleetmode.SetAgentMode(false).
  • When cfgfile.Load() is called, it enters the if !fleetmode.Enabled() block, so it requires the file apm-server.yml to be present in the working directory (the working directory under agent is <elastic-agent install dir>/data/elastic-agent-<git sha>/run/<component-id>, which never contained such a file because it is not needed when apm is running under agent).
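The control flow described in these two points can be modeled in a few lines of Python. This is a simplified illustration, not libbeat's code: the names `parse_overrides`, `agent_mode` and `load_config` are hypothetical, and the lost flag is modeled by simply passing an empty argument list.

```python
import os
import tempfile
from typing import Dict, List


def parse_overrides(argv: List[str]) -> Dict[str, str]:
    """Collect `-E key=value` overrides from a command line."""
    overrides: Dict[str, str] = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-E" and i + 1 < len(argv):
            key, sep, value = argv[i + 1].partition("=")
            if sep:
                overrides[key] = value
            i += 2
        else:
            i += 1
    return overrides


def agent_mode(overrides: Dict[str, str]) -> bool:
    """Agent mode is on only when management.enabled=true was actually seen."""
    return overrides.get("management.enabled", "false").lower() == "true"


def load_config(argv: List[str], workdir: str, config_name: str = "apm-server.yml") -> str:
    """Require the config file only when not running in agent mode."""
    if agent_mode(parse_overrides(argv)):
        return "waiting for configuration from agent"
    path = os.path.join(workdir, config_name)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"error loading config file: stat {config_name}: no such file or directory"
        )
    return f"loaded {path}"


with tempfile.TemporaryDirectory() as run_dir:  # stand-in for the run/<component-id> directory
    print(load_config(["-E", "management.enabled=true"], run_dir))
    try:
        load_config([], run_dir)  # flag lost: falls through to the file check
    except FileNotFoundError as err:
        print(err)
```

With the flag recognized, the empty working directory is harmless. Once the flag is dropped, the same directory triggers the "no such file or directory" error seen in the diagnostics.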

The issue seems to have been introduced by commit elastic/apm-server@7f85cd6 on the apm-server side (up to commit elastic/apm-server@46a9a96, apm-server starts correctly under agent).

@harshitgupta-qasource
Author

harshitgupta-qasource commented Oct 3, 2024

Hi Team,

We have re-validated this issue on the latest 8.16.0 SNAPSHOT Kibana cloud environment and found it fixed now.

Observations:

  • Fleet Server becomes healthy upon deployment of the 8.16.0 Kibana build.

Build details:
VERSION: 8.16.0 SNAPSHOT
BUILD: 78786
COMMIT: 3d2d667e3d7a56d577f590581a19b599ab332b7b

Screen-Shot:
Image

Hence, we are marking this as QA:Validated.

Thanks

@harshitgupta-qasource harshitgupta-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Oct 3, 2024
@orestisfl orestisfl reopened this Oct 7, 2024
8 participants