
Extend agent container initialisation logic #4925

Conversation

@pkoutsovasilis pkoutsovasilis commented Jun 13, 2024

What does this PR do?

This PR affects only the DOCKER package of elastic-agent (the code path of elastic-agent container ...). Specifically, it does the following:

  • allows all capabilities as Permitted on the elastic-agent binary (file capabilities)

  • adds init logic to raise all capabilities present in the Permitted set when running as non-root

  • adds init logic to chown agent-related paths (state, config, logs, data, etc.) if either the CAP_CHOWN or the CAP_DAC_OVERRIDE capability is granted

Flowchart for visual aid

flowchart TD
    classDef default line-height:1.5,text-align:center;
    entrypoint((("elastic-agent<br>container #quot;$@#quot;")))
    style entrypoint fill:#222,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
    ifIsRoot{Is Root?}
    canChown{Has either CHOWN<br>or DAC_OVERRIDE<br>capability?}
    raiseEffective[raise all capabilities from Permitted<br>set to Effective and Inheritable sets]
    canPcap{Has SETPCAP<br>capability?}
    raiseAmbient[raise all capabilities from<br>Effective set to Ambient set]
    chownPaths["chown all agent-related paths"]
    existingCode[existing code]
    style existingCode fill:#222,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5

    entrypoint --> ifIsRoot
    ifIsRoot -->|True|canChown
    ifIsRoot --> |False|raiseEffective
    raiseEffective --> canPcap
    canPcap --> |True|raiseAmbient
    canPcap ---> |False|canChown
    raiseAmbient -->canChown
    canChown --> |True|chownPaths
    canChown --> |False|existingCode
    chownPaths --> existingCode
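The capability checks in the flowchart come down to testing bits in the process's capability sets. As a hedged illustration (not the PR's actual code, which would typically use a capability library), the hypothetical helper below tests capability bits in a hex mask of the form found in the CapPrm/CapEff lines of /proc/<pid>/status:

```go
package main

import (
	"fmt"
	"strconv"
)

// Capability bit numbers from linux/capability.h.
const (
	capChown       = 0
	capDacOverride = 1
	capSetpcap     = 8
)

// hasCap reports whether capability bit n is set in a hex capability
// mask, e.g. the value after "CapPrm:" in /proc/self/status.
func hasCap(hexMask string, n uint) (bool, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<n) != 0, nil
}

func main() {
	// Example mask with CAP_CHOWN (bit 0) and CAP_SETPCAP (bit 8) set.
	const mask = "0000000000000101"
	for _, c := range []struct {
		name string
		bit  uint
	}{
		{"CAP_CHOWN", capChown},
		{"CAP_DAC_OVERRIDE", capDacOverride},
		{"CAP_SETPCAP", capSetpcap},
	} {
		ok, err := hasCap(mask, c.bit)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s=%v\n", c.name, ok)
	}
}
```

With this mask, the "Has SETPCAP capability?" branch of the flowchart would be taken, while the CAP_DAC_OVERRIDE check would fail.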

Why is it important?

I consider this PR important for the following reasons:

  • it can allow detaching the elastic-agent user (inside the container) from the root group by relying on the CAP_CHOWN capability, which improves security
  • it allows a non-root invocation of elastic-agent to do pretty much everything, given the correct capabilities; e.g. if we want to read k8s container logs, we just add the DAC_READ_SEARCH capability, without the need to run as root or be part of the root group.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

  • If for any reason the elastic-agent container is invoked with a completely unknown user (e.g. uid=500 and gid=500), the chown phase of the initialisation can take some seconds to complete before normal elastic-agent operation starts. Based on an initial experiment, the chown phase in this particular case can take ~15 seconds.

How to test this PR locally

$ DEV=true SNAPSHOT=true PLATFORMS=linux/amd64 PACKAGES=docker mage -v package
$ docker run --entrypoint bash --user 1000:1000 -v $PWD/tmp_state:/etc/state --env STATE_PATH=/etc/state  --cap-drop ALL --cap-add CAP_BPF --cap-add CAP_CHOWN --cap-add CAP_SETFCAP --cap-add CAP_SETPCAP --rm -it docker.elastic.co/beats/elastic-agent-complete:8.16.0-SNAPSHOT
$ elastic-agent container

Related issues

@pkoutsovasilis pkoutsovasilis added the enhancement New feature or request label Jun 13, 2024

mergify bot commented Jun 13, 2024

This pull request does not have a backport label. Could you fix it @pkoutsovasilis? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.


@blakerouse blakerouse left a comment


Haven't made it through the whole PR. Just some early comments as I was going.

Review comments on internal/pkg/agent/cmd/container.go and internal/pkg/agent/cmd/container_init_linux.go (outdated, resolved)
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review June 13, 2024 16:14
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner June 13, 2024 16:14

@jlind23 jlind23 left a comment


Does this PR only relate to, or does it actually resolve, #4087 (comment) and https://github.com/elastic/ingest-dev/issues/3253?


pkoutsovasilis commented Jun 13, 2024

Does this PR only relate to, or does it actually resolve, #4087 (comment) and elastic/ingest-dev#3253?

Hello @jlind23 👋 This PR actually resolves #4087, as it removes elastic-agent from the root group. https://github.com/elastic/ingest-dev/issues/3253 essentially has two requirements, a rootless agent and a Helm chart for agent; this PR resolves the rootless requirement.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jun 13, 2024
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)


cmacknz commented Jun 17, 2024

What is the best way to add automated tests for this?

We don't have a great template for testing agent in a container yet.


pkoutsovasilis commented Jun 17, 2024

What is the best way to add automated tests for this?

We don't have a great template for testing agent in a container yet.

Our integration testing framework can give you a Linux VM you can do whatever you want on.
There is a very basic test using kind to ensure agent can deploy https://github.com/elastic/elastic-agent/blob/main/.buildkite/scripts/steps/k8s-tests.sh
The Buildkite agents can start containers but we don't do this with the agent container yet.

IMO, having integration tests that target solely k8s is the way to go for testing agent in a container. I have seen the k8s tests you mention, and from what I can tell they act as a means to validate the agent k8s manifest, nothing more. Along the same lines, I have seen this file here, which essentially sets up a standalone agent on k8s with the Kubernetes integration. I would expect an integration test to check that this k8s manifest produces the expected documents in the expected data streams. However, after a quick search I can't find such a test; @cmacknz do you know if something like that exists?


cmacknz commented Jun 17, 2024

No, the state of our k8s-specific testing is poor.

@blakerouse

I don't think we should block this PR on having an integration test. We are not in a worse place with this PR than we are today. I would really like to see it validated with an integration test, but we need to spend more time on enabling that. Blocking this PR because of that is not great.


cmacknz commented Jun 17, 2024

I don't think we should block this PR on having an integration test. We are not in a worse place with this PR than we are today. I would really like to see it validated with an integration test, but we need to spend more time on enabling that. Blocking this PR because of that is not great.

We don't need to block the PR on this, but we should:

  1. Define what the ideal test actually is, and what the minimum we need for it is, so we can actually build it.
  2. We do not need a complete functional test of agent on Kubernetes, only of the permissions-related logic added in this PR. Is there a way we can partially test this with what we have today? For example, in the very basic kind test today, can we at least make sure the agent will actually start as both root and non-root? Is there anything we can validate just by running the elastic-agent container on a Linux host that is not a container (probably not, given this is updating the Dockerfile, but you get the spirit of the idea).


cmacknz commented Jun 17, 2024

As another example, we may be able to get Buildkite to directly invoke the agent container as a build step, since Buildkite can run containers without modification. See elastic/elastic-agent-system-metrics#140 for an example Buildkite pipeline that has a step running a specific container and a defined command to run on that container.

Edit: this uses pre-defined Buildkite agent variants, so it is perhaps not that simple, but getting CI to start a Docker container is not an advanced use of CI.

@pkoutsovasilis

We don't need to block the PR on this, but we should:

  1. Define what the ideal test actually is, and what the minimum we need for it is, so we can actually build it.
  2. We do not need a complete functional test of agent on Kubernetes, only of the permissions-related logic added in this PR. Is there a way we can partially test this with what we have today? For example, in the very basic kind test today, can we at least make sure the agent will actually start as both root and non-root? Is there anything we can validate just by running the elastic-agent container on a Linux host that is not a container (probably not, given this is updating the Dockerfile, but you get the spirit of the idea).

An ideal test IMO would be something similar to this. That particular test does exactly what we need: it deploys an Elasticsearch cluster through the ECK operator, deploys agent with the Kubernetes integration enabled through a Helm chart (spoiler alert 🚨), and checks that the expected events are delivered to the expected data_streams. Some ways forward that come to mind:

  • utilise the ECK operator in the kind clusters that are part of the k8s tests to spawn an Elasticsearch, and then do similar testing there [I think this is what ECK is doing]
  • since this is a kind-based k8s cluster, maybe we can utilise elastic-package to spawn a stack that is reachable from k8s workloads and, once again, do similar testing
  • spawn an Elasticsearch on Elastic Cloud through the terraform here (I think I've seen some tf in the repo doing such provisioning) and do our testing there.


mergify bot commented Jul 23, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b pkoutsovasilis/container_capabilities_chown upstream/pkoutsovasilis/container_capabilities_chown
git merge upstream/main
git push upstream pkoutsovasilis/container_capabilities_chown

@pkoutsovasilis

@cmacknz I introduced support for k8s inner tests, which are essentially Go unit tests that run inside the agent pod.

Some background on the design choices behind the k8s inner tests:

  • tests require the build tag kubernetes_inner, as we don't want them to be part of the existing elastic-agent unit tests
  • the compiled test binary is copied by the k8s provisioner into the agent image that is about to be tested, for the following reasons:
    • the provisioner knows the architecture of the k8s cluster, so it can correctly compile the test binary
    • an equivalent of kubectl cp ... isn't suitable, as it doesn't preserve the file mode, and you have to chmod inside the pod to make it executable. However, chmoding inside the pod isn't always allowed 🙂
  • introduced the first k8s inner test, which checks that all agent-related paths have the same uid and gid as the elastic-agent process

Elastic-agent standalone on k8s is now tested to run without issues in the following scenarios:

  • rootful - default deployment [no inner tests]
  • rootful - drop ALL capabilities [no inner tests]
  • rootful - drop ALL add CHOWN, SETPCAP capabilities [inner tests included]
  • rootless - drop ALL add CHOWN, SETPCAP capabilities [inner tests included]
  • rootless - drop ALL add CHOWN, SETPCAP capabilities with random uid and gid [inner tests included]
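As a hedged illustration (not the PR's actual test manifests), the last rootless scenario above corresponds to a pod securityContext along these lines:

```yaml
# Hypothetical securityContext for the agent container: drop every
# capability, add back only CHOWN and SETPCAP, and run with a random
# non-root uid and gid.
securityContext:
  runAsNonRoot: true
  runAsUser: 500
  runAsGroup: 500
  capabilities:
    drop: ["ALL"]
    add: ["CHOWN", "SETPCAP"]
```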


@blakerouse blakerouse left a comment


Overall I like the initialization logic.

The whole inner-testing approach of executing a separate binary inside the container seems complex and error-prone to debug. Would it be possible to change this to run commands directly via something like kubectl exec, instead of running a Go binary inside the container?

Review comment on internal/pkg/agent/cmd/common.go (resolved)
@ycombinator ycombinator removed the request for review from pchila July 24, 2024 23:29

pkoutsovasilis commented Jul 25, 2024

Overall I like the initialization logic.

The whole inner-testing approach of executing a separate binary inside the container seems complex and error-prone to debug. Would it be possible to change this to run commands directly via something like kubectl exec, instead of running a Go binary inside the container?

I see your point, but given the limitations of testing inside a pod mentioned here, this is the best approach I could come up with. However, I'm more than happy to remove the Kubernetes inner tests from this PR and isolate them in a separate one to give it more thought. Does that sound like a good plan? 🙂


pkoutsovasilis commented Jul 29, 2024

@elastic/obs-cloudnative-monitoring @MichaelKatsoulis @constanca-m could you please review the files that you own? 🙂 ty 🙏 @MichaelKatsoulis


@blakerouse blakerouse left a comment


I am good with this change, and I really appreciate all the testing you added. I think it's better to get this in and iterate from there than to sit on it any longer.

Labels: backport-skip, enhancement, Team:Elastic-Agent-Control-Plane
Development

Successfully merging this pull request may close these issues.

osquery won't install when deployed via Elastic Agent integrations on k8s
9 participants