
Extend agent container initialisation logic #4925

Conversation

@pkoutsovasilis pkoutsovasilis commented Jun 13, 2024

What does this PR do?

This PR affects only the DOCKER package of elastic-agent (the code path of elastic-agent container ...). Specifically, it does the following:

  • allows all capabilities as Permitted on the elastic-agent binary (file capabilities)

  • adds init logic to raise all capabilities present in the Permitted set when running as non-root

  • adds init logic to chown agent-related paths (state, config, logs, data, etc.) if either the CAP_CHOWN or the CAP_DAC_OVERRIDE capability is granted

Flowchart for visual aid

flowchart TD
    classDef default line-height:1.5,text-align:center;
    entrypoint((("elastic-agent<br>container #quot;$@#quot;")))
    style entrypoint fill:#222,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
    ifIsRoot{Is Root?}
    canChown{Has either CHOWN<br>or DAC_OVERRIDE<br>capability?}
    raiseEffective[raise all capabilities from Permitted<br>set to Effective and Inheritable sets]
    canPcap{Has SETPCAP<br>capability?}
    raiseAmbient[raise all capabilities from<br>Effective set to Ambient set]
    chownPaths["chown all agent-related paths"]
    existingCode[existing code]
    style existingCode fill:#222,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5

    entrypoint --> ifIsRoot
    ifIsRoot -->|True|canChown
    ifIsRoot --> |False|raiseEffective
    raiseEffective --> canPcap
    canPcap --> |True|raiseAmbient
    canPcap ---> |False|canChown
    raiseAmbient -->canChown
    canChown --> |True|chownPaths
    canChown --> |False|existingCode
    chownPaths --> existingCode
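The capability checks in the flowchart come down to testing bits in the process's capability sets. As a hedged illustration (not the PR's actual code, which would typically use a capability library), the hypothetical helper below tests capability bits in a hex mask of the form found in the CapPrm/CapEff lines of /proc/<pid>/status:

```go
package main

import (
	"fmt"
	"strconv"
)

// Capability bit numbers from linux/capability.h.
const (
	capChown       = 0
	capDacOverride = 1
	capSetpcap     = 8
)

// hasCap reports whether capability bit n is set in a hex capability
// mask, e.g. the value after "CapPrm:" in /proc/self/status.
func hasCap(hexMask string, n uint) (bool, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<n) != 0, nil
}

func main() {
	// Example mask with CAP_CHOWN (bit 0) and CAP_SETPCAP (bit 8) set.
	const mask = "0000000000000101"
	for _, c := range []struct {
		name string
		bit  uint
	}{
		{"CAP_CHOWN", capChown},
		{"CAP_DAC_OVERRIDE", capDacOverride},
		{"CAP_SETPCAP", capSetpcap},
	} {
		ok, err := hasCap(mask, c.bit)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s=%v\n", c.name, ok)
	}
}
```

With this mask, the "Has SETPCAP capability?" branch of the flowchart would be taken, while the CAP_DAC_OVERRIDE check would fail.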

Why is it important?

I consider this PR important for the following reasons:

  • it can allow detaching the elastic-agent user (inside the container) from the root group by relying on the CAP_CHOWN capability, which improves security
  • it allows a non-root invocation of elastic-agent to do pretty much everything, given the correct capabilities; e.g. if we want to read k8s container logs, we just add the DAC_READ_SEARCH capability, without the need to run as root or be part of the root group.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

  • If for any reason the elastic-agent container is invoked with a completely unknown user (e.g. uid=500 and gid=500), the chown phase of the initialisation can take some seconds to complete before normal elastic-agent operation starts. Based on an initial experiment, the chown phase in this particular case can take ~15 seconds.

How to test this PR locally

$ DEV=true SNAPSHOT=true PLATFORMS=linux/amd64 PACKAGES=docker mage -v package
$ docker run --entrypoint bash --user 1000:1000 -v $PWD/tmp_state:/etc/state --env STATE_PATH=/etc/state  --cap-drop ALL --cap-add CAP_BPF --cap-add CAP_CHOWN --cap-add CAP_SETFCAP --cap-add CAP_SETPCAP --rm -it docker.elastic.co/beats/elastic-agent-complete:8.16.0-SNAPSHOT
$ elastic-agent container

Related issues

@pkoutsovasilis pkoutsovasilis added the enhancement New feature or request label Jun 13, 2024

mergify bot commented Jun 13, 2024

This pull request does not have a backport label. Could you fix it @pkoutsovasilis? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.


@blakerouse blakerouse left a comment


Haven't made it through the whole PR. Just some early comments as I was going.

Review comments on internal/pkg/agent/cmd/container.go and internal/pkg/agent/cmd/container_init_linux.go (outdated, resolved)
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review June 13, 2024 16:14
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner June 13, 2024 16:14

@jlind23 jlind23 left a comment


Does this PR only relate to, or does it actually resolve, #4087 (comment) and https://github.com/elastic/ingest-dev/issues/3253?


pkoutsovasilis commented Jun 13, 2024

Does this PR only relate to, or does it actually resolve, #4087 (comment) and elastic/ingest-dev#3253?

Hello @jlind23 👋 This PR actually resolves #4087, as it removes elastic-agent from the root group. https://github.com/elastic/ingest-dev/issues/3253 essentially has two requirements, a rootless agent and a Helm chart for agent; this PR resolves the rootless requirement.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jun 13, 2024
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)


cmacknz commented Jun 17, 2024

What is the best way to add automated tests for this?

We don't have a great template for testing agent in a container yet.


pkoutsovasilis commented Jun 17, 2024

What is the best way to add automated tests for this?

We don't have a great template for testing agent in a container yet.

Our integration testing framework can give you a Linux VM you can do whatever you want on.
There is a very basic test using kind to ensure agent can deploy https://github.com/elastic/elastic-agent/blob/main/.buildkite/scripts/steps/k8s-tests.sh
The Buildkite agents can start containers but we don't do this with the agent container yet.

IMO, having integration tests that target solely k8s is the way to go for testing agent in a container. I have seen the k8s tests you mention, and from what I can tell they act as a means to validate the agent k8s manifest, nothing more. Along the same lines, I have seen this file here, which essentially sets up a standalone agent on k8s with the Kubernetes integration. I would expect an integration test to check that this k8s manifest produces the expected documents in the expected data streams. However, after a quick search I can't find such a test; @cmacknz do you know if something like that exists?


cmacknz commented Jun 17, 2024

No, the state of our k8s-specific testing is poor.

@blakerouse

I don't think we should block this PR on having an integration test. We are not in a worse place with this PR than we are today. I would really like to see it validated with an integration test, but we need to spend more time on enabling that. Blocking this PR because of that is not great.


cmacknz commented Jun 17, 2024

I don't think we should block this PR on having an integration test. We are not in a worse place with this PR than we are today. I would really like to see it validated with an integration test, but we need to spend more time on enabling that. Blocking this PR because of that is not great.

We don't need to block the PR on this, but we should:

  1. Define what the ideal test actually is, and what the minimum we need for it is, so we can actually build it.
  2. We do not need a complete functional test of agent on Kubernetes, only of the permissions-related logic added in this PR. Is there a way we can partially test this with what we have today? For example, in the very basic kind test today, can we at least make sure the agent will actually start as both root and non-root? Is there anything we can validate just by running the elastic-agent container on a Linux host that is not a container (probably not, given this is updating the Dockerfile, but you get the spirit of the idea).


cmacknz commented Jun 17, 2024

As another example, we may be able to get Buildkite to directly invoke the agent container as a build step, since Buildkite can run containers without modification. See elastic/elastic-agent-system-metrics#140 for an example Buildkite pipeline that has a step running a specific container and a defined command to run on that container.

Edit: this uses pre-defined Buildkite agent variants, so it is perhaps not that simple, but getting CI to start a Docker container is not an advanced use of CI.

@pkoutsovasilis

We don't need to block the PR on this, but we should:

  1. Define what the ideal test actually is, and what the minimum we need for it is, so we can actually build it.
  2. We do not need a complete functional test of agent on Kubernetes, only of the permissions-related logic added in this PR. Is there a way we can partially test this with what we have today? For example, in the very basic kind test today, can we at least make sure the agent will actually start as both root and non-root? Is there anything we can validate just by running the elastic-agent container on a Linux host that is not a container (probably not, given this is updating the Dockerfile, but you get the spirit of the idea).

An ideal test IMO would be something similar to this. That particular test does exactly what we need: it deploys an Elasticsearch cluster through the ECK operator, deploys agent with the Kubernetes integration enabled through a Helm chart (spoiler alert 🚨), and checks that the expected events are delivered to the expected data_streams. Some ways forward that come to mind:

  • utilise the ECK operator in the kind clusters that are part of the k8s tests to spawn an Elasticsearch, and then do similar testing there [I think this is what ECK is doing]
  • since this is a kind-based k8s cluster, maybe we can utilise elastic-package to spawn a stack that is reachable from k8s workloads and, once again, do similar testing
  • spawn an Elasticsearch on Elastic Cloud through the terraform here (I think I've seen some tf in the repo doing such provisioning) and do our testing there.


mergify bot commented Jul 23, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b pkoutsovasilis/container_capabilities_chown upstream/pkoutsovasilis/container_capabilities_chown
git merge upstream/main
git push upstream pkoutsovasilis/container_capabilities_chown

@pkoutsovasilis

@cmacknz I introduced support for k8s inner tests, which are essentially Go unit tests that run inside the agent pod.

Some background on the design choices behind the k8s inner tests:

  • tests require the build tag kubernetes_inner, as we don't want them to be part of the existing elastic-agent unit tests
  • the compiled test binary is copied by the k8s provisioner into the agent image that is about to be tested, for the following reasons:
    • the provisioner knows the architecture of the k8s cluster, so it can correctly compile the test binary
    • an equivalent of kubectl cp ... isn't suitable, as it doesn't preserve the file mode, and you have to chmod inside the pod to make it executable. However, chmoding inside the pod isn't always allowed 🙂
  • introduced the first k8s inner test, which checks that all agent-related paths have the same uid and gid as the elastic-agent process

Elastic-agent standalone on k8s is now tested to run without issues in the following scenarios:

  • rootful - default deployment [no inner tests]
  • rootful - drop ALL capabilities [no inner tests]
  • rootful - drop ALL add CHOWN, SETPCAP capabilities [inner tests included]
  • rootless - drop ALL add CHOWN, SETPCAP capabilities [inner tests included]
  • rootless - drop ALL add CHOWN, SETPCAP capabilities with random uid and gid [inner tests included]
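As a hedged illustration (not the PR's actual test manifests), the last rootless scenario above corresponds to a pod securityContext along these lines:

```yaml
# Hypothetical securityContext for the agent container: drop every
# capability, add back only CHOWN and SETPCAP, and run with a random
# non-root uid and gid.
securityContext:
  runAsNonRoot: true
  runAsUser: 500
  runAsGroup: 500
  capabilities:
    drop: ["ALL"]
    add: ["CHOWN", "SETPCAP"]
```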


@blakerouse blakerouse left a comment


Overall I like the initialization logic.

The whole inner-testing approach of executing a separate binary inside the container seems complex and error-prone to debug. Would it be possible to change this to run commands directly via something like kubectl exec, instead of running a Go binary inside the container?

Review comment on internal/pkg/agent/cmd/common.go (resolved)
@ycombinator ycombinator removed the request for review from pchila July 24, 2024 23:29

pkoutsovasilis commented Jul 25, 2024

Overall I like the initialization logic.

The whole inner-testing approach of executing a separate binary inside the container seems complex and error-prone to debug. Would it be possible to change this to run commands directly via something like kubectl exec, instead of running a Go binary inside the container?

I see your point, but given the limitations of testing inside a pod mentioned here, this is the best approach I could come up with. However, I'm more than happy to remove the Kubernetes inner tests from this PR and isolate them in a separate one to give it more thought. Does that sound like a good plan? 🙂


pkoutsovasilis commented Jul 29, 2024

@elastic/obs-cloudnative-monitoring @MichaelKatsoulis @constanca-m could you please review the files that you own? 🙂 ty 🙏 @MichaelKatsoulis


@blakerouse blakerouse left a comment


I am good with this change, and I really appreciate all the testing you added. I think it's better to get this in and iterate from there than to sit on it any longer.

Labels: backport-skip, enhancement, Team:Elastic-Agent-Control-Plane
Development

Successfully merging this pull request may close these issues.

osquery won't install when deployed via Elastic Agent integrations on k8s
9 participants