resources stanza for Docker driver task somehow causes permission issues within the container #24774

Closed
efstajas opened this issue Jan 5, 2025 · 7 comments


efstajas commented Jan 5, 2025

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

Linux pi-cluster-5-01 6.6.51+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt3 (2024-10-08) aarch64

Docker version 27.4.0, build bde2b89

Issue

I'm trying to deploy an image lscr.io/linuxserver/nextcloud:latest. Here's my jobspec:

job "nextcloud" {
  region = "global"
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "service"
  
  group "nextcloud" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "nextcloud" {
      driver = "docker"

      config {
        image = "lscr.io/linuxserver/nextcloud:latest"
        ports = ["http"]
      }

      env {
        TZ = "Etc/UTC"
        PGID = "1000"
        PUID = "1000"
      }

      resources {
        cpu    = 2000
        memory = 5000
      }
    }
  }
}

The env values are all the defaults for that image, including the standard PUID and PGID values. The problem is that the container fails to initialise because it gets "Operation not permitted" errors when trying to chown dirs within the container:

chown: changing ownership of '/app': Operation not permitted
chown: changing ownership of '/config': Operation not permitted
chown: changing ownership of '/defaults': Operation not permitted
mkdir: cannot create directory ‘/var/lib/nginx’: Permission denied
s6-rc: warning: unable to start service init-folders: command exited 1
chown: changing ownership of '/etc/crontabs/abc': Operation not permitted
crontab: setegid: Operation not permitted

... and indeed, when I sh into the container via Docker CLI and try to chown one of those dirs, I confusingly get permission denied even though I'm root with id 0:0.

I then tried deploying the image directly through the Docker CLI with the same config, and to my further confusion, everything worked fine. I sh into that container too, I'm also root there, but I can chown all the dirs just fine, and the init script also works. The output of id is 100% identical between the two containers.

So, I try to prepare a minimum reproducible example, and discover that it seems to be related to the resources stanza in the task, somehow. When I remove it, the nomad-orchestrated container has no permission issues. When I add it back, they're back. This seems to be reproducible on my end 100% of the time. I have no idea what could be going on here.

Reproduction steps

  • Deploy the above jobspec, once with resources stanza and once without
  • SSH into the hosts
  • sudo docker ps to find the container IDs of the two containers
  • sudo docker exec -it <container ID> sh to open a shell in each container
  • Attempt chown 1000:1000 on app/ (or any other dir)

Expected Result

  • Since the containers are theoretically identical apart from one being resource-constrained, and the app/ dir in both is owned by root and the user is root, the chown should work on both containers

Actual Result

  • chown fails with Permission denied on the resource-constrained container, but works as expected on the one that's not.
@pkazmierczak
Contributor

Hi @efstajas, thanks for reporting the issue. Sadly, I cannot reproduce. I deployed the jobspec with the resources block on ubuntu with nomad 1.9.3 with no issues, and was able to chown 1000:1000 the app directory inside the nextcloud container. Can you tell me more about your Nomad cluster setup?

@efstajas
Author

Hey @pkazmierczak, thank you for looking into it.

I was trying some more things, different images etc., and unfortunately things got even weirder.

First up, I have no issues at all with some images, and they're running on the same hosts: I'm successfully running nginx, jenkins and traefik, among a few others. But with several other images I now see the same strange permission errors within the containers:

  • gitea/gitea
  • joxit/docker-registry-ui
  • linuxserver/nextcloud

...all with the same symptoms as described in my original post. Except that with gitea, it doesn't even work without resources stanza 😵‍💫

Here's the job for that:

job "gitea" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  group "gitea" {
    count = 1

    network {
      port "http" {
        to = 3000
      }

      port "ssh" {
        to     = 22
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "gitea/gitea:latest"
        ports = ["ssh", "http"]
      }

      env {
        APP_NAME                        = "Gitea: Git with a cup of tea"
        RUN_MODE                        = "prod"
        SSH_PORT                        = "$NOMAD_PORT_ssh"
        GITEA__server__START_SSH_SERVER = "true"
      }
    }
  }
}

After deploying it through Nomad, it fails with these errors:

chown: /data/gitea/conf/app.ini: Operation not permitted
chown: /data/gitea/conf/app.ini: Operation not permitted
chown: /data/gitea/conf: Operation not permitted
chown: /data/gitea/conf: Operation not permitted
chown: /data/gitea/log: Operation not permitted
chown: /data/gitea/log: Operation not permitted
chown: /data/gitea: Operation not permitted
chown: /data/gitea: Operation not permitted
chown: /app/gitea/gitea: Operation not permitted
chown: /app/gitea: Operation not permitted
chown: /app/gitea: Operation not permitted
chown: /data/git/.ssh/environment: Operation not permitted
chown: /data/git/.ssh: Operation not permitted
chown: /data/git/.ssh: Operation not permitted
chown: /data/git: Operation not permitted
chown: /data/git: Operation not permitted
su-exec: setgroups: Operation not permitted

And as before, it works just fine when I deploy it directly through Docker on the same host like this:

docker run -d \
  -p 3000:3000 \
  -p 22:22 \
  --name gitea \
  -e APP_NAME="Gitea: Git with a cup of tea" \
  -e RUN_MODE="prod" \
  -e SSH_PORT="22" \
  -e GITEA__server__START_SSH_SERVER="true" \
  gitea/gitea:latest

... which makes me think that it has to have something to do with Nomad, somehow.

Some info about my cluster as requested. All the nodes are Raspberry Pi 5s with identical setups:

  • 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64
  • Docker version 27.4.1, build b9d17ea
  • Nomad v1.9.4 BuildDate 2024-12-18T15:16:22Z Revision 5e49fcdb7be26941b6c7ad3ed6661bd37e70a9d8+CHANGES

Docker, Nomad and Consul were installed on the nodes with this Ansible playbook, which should in theory work on any (Debian) host.

The nomad config on the node:

data_dir = "/opt/nomad"

client {
  enabled = true

  host_volume "docker-sock" {
    path = "/var/run/docker.sock"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_privileged = true
    allow_caps = ["NET_ADMIN","NET_BROADCAST","NET_RAW"]
  }
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Pretty close to giving up and just re-imaging all the nodes, because clearly something is messed up somewhere. But I'm struggling to think of what it could be, given really nothing much was done on these nodes other than standard Docker, Nomad and Consul installs...

@efstajas
Author

efstajas commented Jan 15, 2025

Just completely re-imaged a host from scratch, with a fresh image of Raspberry Pi OS 64 Bit (Lite).

I installed Docker using curl -ssL https://get.docker.com | sh now to rule out any problems with the Ansible Docker role I was using.

Problem persists :(

Here are all the exact steps that were run on the host, on top of said fresh OS image: https://gist.github.com/efstajas/7de8b79d0d0206013e9928b560f55f4b

And here's the exact config file now: https://gist.github.com/efstajas/d10fb646d82597709e2376bf23ebfda2

In theory, running the Ansible playbook on any Raspberry Pi 5 should reproduce the problem..? At least I can't think of any other variables that it doesn't cover.

@efstajas
Author

Ugh, @pkazmierczak, apologies, I accidentally closed the issue. It'd be amazing if you could re-open it.

jrasell reopened this on Jan 15, 2025
github-project-automation bot moved this from Done to Needs Triage in Nomad - Community Issues Triage on Jan 15, 2025
@efstajas
Author

Thanks for re-opening!

I did some more troubleshooting. It turns out that setting privileged to true in the task's config fixes the issues. Not a real solution of course, but it might help pin it down. If I understand correctly what privileged does, it being false should not be causing these types of permission errors, or am I wrong?
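
For reference, the workaround described above would look roughly like the fragment below. It is a sketch of the task block only, not a complete jobspec, and it reuses the nextcloud task from the original post. It only works because the client's Docker plugin config already sets allow_privileged = true.

```hcl
# Fragment: belongs inside the group "nextcloud" block of the original jobspec.
task "nextcloud" {
  driver = "docker"

  config {
    image = "lscr.io/linuxserver/nextcloud:latest"
    ports = ["http"]

    # Workaround only: a privileged container gets every capability,
    # which hides the underlying problem rather than fixing it.
    privileged = true
  }
}
```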

@Juanadelacuesta
Member

Hi @efstajas! We were able to reproduce your problem and it comes down to the configuration of the docker driver in the clients:

plugin "docker" {
  config {
    allow_privileged = true
    allow_caps = ["NET_ADMIN","NET_BROADCAST","NET_RAW"]
  }
}

By setting allow_caps to those three, you are removing all the other capabilities that are enabled by default for Docker, including CHOWN.
Some of the images that were working, like nginx, can run perfectly well with the capabilities you defined, but others can't.
Running the containers as privileged overrides the allow_caps configuration in the client, which is why your workloads could execute that way.

Make sure the client configuration allows all the capabilities your workloads need; that avoids the undesired consequences of running them as privileged, and you should be fine.
I hope this helps you!
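
A minimal sketch of a client configuration along those lines, assuming the goal is to keep Docker's usual capabilities and add the three network ones. The first thirteen entries are what the Nomad Docker driver docs list as the allow_caps default at the time of writing; treat that list as an assumption and check it against the docs for your Nomad version.

```hcl
plugin "docker" {
  config {
    # Kept from the original node config.
    allow_privileged = true

    # allow_caps replaces the default allowlist rather than extending it,
    # so the defaults have to be repeated explicitly.
    allow_caps = [
      # Assumed Nomad defaults (verify for your version).
      "audit_write", "chown", "dac_override", "fowner", "fsetid", "kill",
      "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid",
      "sys_chroot",
      # Extra capabilities from the original config.
      "net_admin", "net_broadcast", "net_raw",
    ]
  }
}
```

With chown, setgid and setuid back in the allowlist, images that chown their data directories and drop privileges at startup (such as the nextcloud and gitea images above) should no longer need privileged = true.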

@efstajas
Author

efstajas commented Jan 23, 2025

@Juanadelacuesta Oh man, I would've never spotted that. When I set that line I was assuming that it adds them to the defaults, but the docs actually clearly state that it replaces them. Thank you, problem solved!
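
A related sketch, under the assumption that the extra network capabilities were allowed because some task needs them: as far as the Docker driver docs describe it, allow_caps only controls what a task may obtain, so a task that needs a capability outside Docker's defaults still requests it with cap_add. The task name and image below are hypothetical.

```hcl
# Fragment: a hypothetical task that needs NET_ADMIN.
task "vpn" {
  driver = "docker"

  config {
    image = "example/vpn:latest" # hypothetical image name

    # Must be a subset of the client's allow_caps, otherwise the
    # driver rejects the task.
    cap_add = ["net_admin"]
  }
}
```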

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage on Jan 23, 2025