Update Elastic Agent docker image for azure containers runtime #82
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
The actual requirement in OpenShift (and other secured Kubernetes environments) is that it should be possible to run the image with arbitrary users (in |
One thing to try, suggested by @eedugon, would be to make the default user the owner of the files. This way, when running in an environment that runs with the default user but without the root group, as seems to be the case in Azure, the user will have permissions because it is the owner. When running in an environment that randomizes uids, such as OpenShift, the user will have permissions because it belongs to the root group. This would require adding This could also allow removing |
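To make the reasoning above concrete, here is a minimal sketch of how the classic POSIX permission bits decide write access. It is a simplification (no ACLs, no capabilities, no special case for uid 0), and `can_write` is a hypothetical helper written for this illustration, not something that exists in the image.

```shell
# can_write MODE FILE_UID FILE_GID PROC_UID "PROC_GIDS"
# MODE is a three-digit octal permission string such as 770.
# Returns 0 if a process with the given uid and gids may write, else 1.
# Simplified on purpose: ignores the uid 0 override, ACLs and capabilities.
can_write() (
  mode=$1 file_uid=$2 file_gid=$3 proc_uid=$4 proc_gids=$5

  owner_bits=$(printf '%s' "$mode" | cut -c1)
  group_bits=$(printf '%s' "$mode" | cut -c2)
  other_bits=$(printf '%s' "$mode" | cut -c3)

  # Exactly one class applies: owner first, then group, then other.
  bits=$other_bits
  if [ "$proc_uid" = "$file_uid" ]; then
    bits=$owner_bits
  else
    for gid in $proc_gids; do
      if [ "$gid" = "$file_gid" ]; then
        bits=$group_bits
        break
      fi
    done
  fi

  # The write bit has value 2 in each octal digit.
  [ $(( bits & 2 )) -ne 0 ]
)

# Azure-like case: default uid 1000 without the root group, files owned by 1000:0.
can_write 770 1000 0 1000 "1000" && echo "owner can write"
# OpenShift-like case: arbitrary uid that belongs to gid 0.
can_write 770 1000 0 123456 "0" && echo "root group member can write"
```

With files owned by `root:root` and mode `755`, neither case gets write access, which is the situation the ownership change is meant to avoid.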
Spent too much time on this one, just to circle back, when running:
it fails creating the container with:
I tried to create a small reproduction to validate
# Prepare home in a different stage to avoid creating additional layers on
# the final image because of permission changes.
FROM ubuntu:20.04 AS home
RUN mkdir -p /usr/share/elastic-agent/data /usr/share/elastic-agent/data/elastic-agent-4a0299/logs
RUN touch /usr/share/elastic-agent/test-file.txt
FROM ubuntu:20.04
COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent
# Elastic Agent needs group permissions in the home itself to be able to
# create fleet.yml when running as non-root.
# RUN chmod 0770 /usr/share/elastic-agent
RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent
USER elastic-agent
WORKDIR /usr/share/elastic-agent
ENTRYPOINT ["/usr/bin/sleep", "1d"]
Running
There does not seem to be anything special about the
az container create --resource-group control-plane --name ubuntu2 --image ubuntu:20.04 --debug --restart-policy Never --command-line "/usr/bin/getent group root"
returns
One interesting observation is that the order of the COPY of the home directory with --chown and the creation of the user and group matters.
COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent
RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent
elastic-agent@SandboxHost-637895294345371045:~$ ls -al
total 12
drwxr-xr-x 3 root root 4096 May 30 17:42 .
drwxr-xr-x 1 root root 4096 May 30 17:42 ..
drwxr-xr-x 3 root root 4096 May 30 17:11 data
-rw-r--r-- 1 root root 0 May 30 17:11 test-file.txt
Swapping these around though will update the ownership to
RUN groupadd --gid 1000 elastic-agent
RUN useradd -M --uid 1000 --gid 1000 --groups 0 --home /usr/share/elastic-agent elastic-agent
RUN usermod -a -G root elastic-agent
COPY --chown=elastic-agent:root --from=home /usr/share/elastic-agent /usr/share/elastic-agent
total 12
drwxr-xr-x 3 elastic-agent root 4096 May 30 17:48 .
drwxr-xr-x 1 root root 4096 May 30 17:48 ..
drwxr-xr-x 3 elastic-agent root 4096 May 30 17:11 data
-rw-r--r-- 1 elastic-agent root 0 May 30 17:11 test-file.txt
I will try to recreate a new image to test the swap, will update the issue soon.
Just adding Regarding the few workarounds I've tested (successfully) here:
|
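Since the listings above show that the COPY order silently changes who owns the files, a quick ownership check inside the built image can catch the problem early. The following is only a sketch: `not_owned_by` is a hypothetical helper, and the scratch directory stands in for /usr/share/elastic-agent.

```shell
# not_owned_by DIR UID
# Prints every path under DIR whose owner is not the numeric UID.
# Empty output means the expected ownership was applied everywhere.
not_owned_by() (
  dir=$1 uid=$2
  find "$dir" ! -user "$uid" -print
)

# Example on a scratch directory created by the current user.
scratch=$(mktemp -d)
mkdir -p "$scratch/data"
touch "$scratch/test-file.txt"
not_owned_by "$scratch" "$(id -u)"
```

Inside the image, the same function could be pointed at the agent home with uid 1000 (the uid used in the reproduction above) to confirm that the swap took effect.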
This is interesting, this might give a clue about how to reproduce this environment in a normal local docker. It seems they are using pause containers in Azure. These " In Kubernetes, when you start a pod, a pause container is started first, and then all the containers in the pod are attached to the network namespace of this container.
I didn't manage to reproduce the "not found" errors though, they are probably related to permissions. |
Update: I managed to reproduce the Azure errors with a local Docker, also using random uids and gids:
I don't have a solution, but this may make it easier to test this kind of environment locally and in CI. |
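The exact command used for this reproduction is not shown in the thread. As a rough sketch of the general idea, the `--user` flag of `docker run` accepts a numeric `uid:gid` pair that does not need to exist in the image. The helper name and the image tag `elastic-agent-test:local` below are placeholders, and the function only prints the command rather than running it.

```shell
# arbitrary_id_run IMAGE UID GID
# Prints (does not execute) a docker run command that starts IMAGE with a
# numeric uid and gid that the image does not define.
arbitrary_id_run() (
  image=$1 uid=$2 gid=$3
  printf 'docker run --rm --user %s:%s %s\n' "$uid" "$gid" "$image"
)

# Pick ids that are unlikely to exist in the image's /etc/passwd.
# The image tag is a placeholder for a locally built test image.
uid=$(( 100000 + $$ % 50000 ))
gid=$(( 100000 + $$ % 50000 ))
arbitrary_id_run "elastic-agent-test:local" "$uid" "$gid"
```

Using a gid other than 0 is what removes the root group from the picture, which is the part that seems to match the Azure behavior described above.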
@jsoriano, thanks. I am trying to test a potential solution, but the Azure container create APIs are down(?); I am getting service unavailable. Hopefully they will be back soon so I can test. |
The good news: the Azure container service is back online, and the workaround here, #493, would have worked for the Azure containers and OpenShift as well. The bad news: due to #398, which adds a check that the user is root (see https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/vault/seed.go#L33 ), we get:
This is because the check verifies that both the user owner and the group owner are root. elastic-agent/internal/pkg/agent/vault/seed.go Lines 47 to 51 in 82c1010
However, with my PR #493, and also with the workaround that was initially suggested by @eedugon, the ownership will get updated to drwxr-xr-x 3 elastic-agent root 4096 May 30 17:48 . Can we be more lenient here and just ensure the files are owned by the root group and not the root user? cc @aleksmaus @ph any feedback here is welcome |
When removing only the check on the root user, it still fails. I logged the group id and user id inside the Azure container
both user and group are Removing the entire check Conclusion: Can we safely remove |
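To illustrate the difference between the strict check (owner and group must both be root) and the more lenient variants discussed above, here is a small sketch. It is not the Go code in seed.go; `check_owner` is a hypothetical helper, and it assumes GNU coreutils `stat` with the `-c` flag.

```shell
# check_owner PATH EXPECTED_UID EXPECTED_GID
# Pass "-" for either expectation to skip that half of the check.
# Assumes GNU coreutils stat (-c); BSD/macOS stat uses different flags.
check_owner() (
  path=$1 want_uid=$2 want_gid=$3
  have_uid=$(stat -c '%u' "$path") || exit 2
  have_gid=$(stat -c '%g' "$path") || exit 2

  if [ "$want_uid" != "-" ] && [ "$have_uid" != "$want_uid" ]; then
    echo "uid mismatch: have $have_uid, want $want_uid" >&2
    exit 1
  fi
  if [ "$want_gid" != "-" ] && [ "$have_gid" != "$want_gid" ]; then
    echo "gid mismatch: have $have_gid, want $want_gid" >&2
    exit 1
  fi
)

f=$(mktemp)
# Strict variant: both owner and group must be root.
if check_owner "$f" 0 0; then echo "strict check passed"; else echo "strict check failed"; fi
# Lenient variant: only the owning uid is compared, the group is ignored.
if check_owner "$f" "$(id -u)" -; then echo "lenient check passed"; fi
```

A group-only variant would be `check_owner "$f" - 0`, which is roughly what the question about root group ownership amounts to.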
My two cents:
|
For the vault check @aleksmaus has relaxed the check. IMHO:
Maybe we could have different runtime profiles: one for desktop, one for containers, others? |
If it's a check that should always be relaxed when running in a container, the Elastic Agent already knows that it's running in a container, because it's started with the |
@blakerouse good point, this subcommand already "implies" the container profile I mentioned before, and run would be the default environment. |
hi folks, The We do still have the initial issue that in Azure we need to update the default file permissions so that the
My PR #493 fixes this but updates the file ownership completely in the docker container. As I mentioned above, that would mean either we create a dedicated docker image for the Azure containers or we further investigate the effects these changes will have on the other platforms and we introduce them in the official docker image. The majority seems against making major changes to the docker image at the moment or maintaining a separate image, can we confirm that these are the conclusions for now? |
@narph I do confirm. Let's document this behavior for now and let users tweak their configuration. |
I think that the changes in ownership in this PR make sense, and should be fine. They are in line with what @eedugon suggested. And files continue being owned by the root group, following OpenShift recommendations. |
Could be a little unrelated. But might help someone. Mine was removing volumes Host details: Not completely sure why it works tho |
I tried the workaround on an Azure container instance and it didn't work
Any other suggestions? |
@111andre111 found a similar issue when using elastic-package in Docker for Mac. When enabling the experimental containerd image store, elastic-agent doesn't start, with:
Kibana, Elasticsearch and package-registry containers start in this environment, so at least this part seems to be an issue specific to the Elastic Agent images. Applying the permission changes in Mariana's PR (#493) may solve this issue. Maybe Azure is also using containerd under the hood. @jlind23 could this be re-prioritized? Maybe this affects all containerd-based environments and not only Azure, and it would break all developer environments on Mac and Windows if this gets enabled by default. |
@jsoriano I added this to one of our next sprints in order to fix this. |
@elastic/fleet-qasource-external can you please test installing Elastic Defend in a Docker container to see if it continues to install correctly after this change? We likely only need to test one of these integrations to see if this change had any consequences since the requirement to be root is the same for all of them. |
Hi @cmacknz Thank you for the update. We have re-validated this on latest 8.9.0 BC5 Kibana cloud environment and had below observations: Observations:
Build details: Please let us know if anything else is required from our end. Thanks |
Thanks, I think that is a strong hint that what was suggested in #82 (comment) is right: inputs that require root in containers no longer work after this change. Elastic Defend isn't supported in containers, but the beta Defend for Containers integration is, and it also has this requirement. @elastic/fleet-qasource-external can you test installing the Defend for Containers integration as well? |
Hi @cmacknz Thank you for the update. We have re-validated this on latest 8.9.0 release build and had below observations: Observations:
Build details: Please let us know if anything else is required from our end. Thanks |
Thank you, Cloud Defend is broken by this change then. It also seems to have affected synthetics. @michalpristas we need to revert #3084 while we sort this out. |
@cmacknz @michalpristas The nature of what has been changed would not really be an issue for a privileged docker container. I'm not really sure what exactly has been tested by @harshitgupta-qasource. The PR itself doesn't really change anything from a user perspective. It just changes the rights to /usr/share/elastic-agent appropriately, and would only be relevant for non-privileged containers. In my opinion it's somewhat expected that Defend would not run in a non-privileged scenario. And here That could probably be changed to the primary group. But long story short, it is probably expected that Elastic Defend doesn't work in the unprivileged case, and I should have tested on a privileged container, which I didn't. Unprivileged the user would be always |
I did some tests without containerd, and I think the test with a pure docker environment should be excluded from this ticket. I don't know if a pure docker fleet installation is even supported with Elastic Endpoint, or if some capabilities are needed. I think that should be investigated with the Endpoint, Profiler and Synthetics teams. At least under Kubernetes, as mentioned in the last comment, the deployment stays healthy when rolling out Elastic Defend and Elastic Defend for Containers (for the latter I needed to add some capabilities, but that's it). In Kubernetes, for Elastic Endpoint, that is solved with some sidecar containers. So this scenario could probably be tested with the new image. |
Hi @111andre111 We have re-validated this on latest 8.9.0 release build and please find below detailed steps we have followed for this test: Steps to reproduce:
Build details: Agent Diagnostic logs: Please let us know if we are missing anything here. |
yeah i did the revert at the end of my day for quick remedy but |
@harshitgupta-qasource Tbh I don't think the docker statement would even work with Elastic Defend; like I mentioned earlier, it needs privileged rights. |
CC @andrewvc since I believe this change broke the synthetics container, or at least reverting it fixed one of their E2E tests. I don't believe we can bring this change back without confirming that everything that worked before still works afterwards, and ensuring our documentation reflects any changes to the way the containers need to be started. |
can we get failures from e2e? |
Sent the link privately. |
@michalpristas @pierrehilbert what's the latest here? Were you able to look at the E2E failures and see what the problem was? |
no progress |
Re-opening this as the fix needed to be reverted due to problems with ECE deployments. See #3711. |
Hi, |
Describe the enhancement:
The Elastic Agent docker image currently does not work out of the box on the Azure container runtime.
Docker-entrypoint is executed, but it cannot launch elastic-agent because it doesn't find the file.
And it does not find the file because it does not have sufficient permissions to do so.
The Elastic Agent user relies on being part of the root user group for access, but the Azure container runtime seems to apply a security restriction that prevents this.
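One way to confirm this kind of diagnosis is to print the effective identity from inside the running container and check whether gid 0 is really among the process's groups. The snippet below is a minimal sketch; `in_group` is a hypothetical helper, and the only commands it relies on are `id -u`, `id -g` and `id -G`.

```shell
# in_group GID "GID_LIST"
# Succeeds if GID appears in the space-separated GID_LIST.
in_group() (
  want=$1 list=$2
  for gid in $list; do
    [ "$gid" = "$want" ] && exit 0
  done
  exit 1
)

# Report the identity the process actually runs with.
echo "uid=$(id -u) gid=$(id -g) groups=$(id -G)"
if in_group 0 "$(id -G)"; then
  echo "process is in the root group (gid 0)"
else
  echo "process is NOT in the root group (gid 0)"
fi
```

If the second message is printed even though the image adds the user to the root group, the runtime is probably not applying the supplementary groups defined in the image, and group permissions on the agent home will not help.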
This is the solution provided so far by @eedugon :
We should ensure that this fix wouldn't break anything related to openshift.
More information can be found here: #147
Describe a specific use case for the enhancement or feature: