`resources` stanza for Docker driver task somehow causes permission issues within the container #24774

efstajas · 2025-01-05T11:09:25Z

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

Linux pi-cluster-5-01 6.6.51+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt3 (2024-10-08) aarch64

Docker version 27.4.0, build bde2b89

Issue

I'm trying to deploy the image lscr.io/linuxserver/nextcloud:latest. Here's my jobspec:

job "nextcloud" {
  region = "global"
  datacenters = ["dc1"]
  namespace   = "default"
  type        = "service"
  
  group "nextcloud" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "nextcloud" {
      driver = "docker"

      config {
        image = "lscr.io/linuxserver/nextcloud:latest"
        ports = ["http"]
      }

      env {
        TZ = "Etc/UTC"
        PGID = "1000"
        PUID = "1000"
      }

      resources {
        cpu    = 2000
        memory = 5000
      }
    }
  }
}

The env values, including PUID and PGID, are all the default values for that image. The problem is that the container fails to initialise because it gets permission denied errors trying to chown dirs within the container:

chown: changing ownership of '/app': Operation not permitted
chown: changing ownership of '/config': Operation not permitted
chown: changing ownership of '/defaults': Operation not permitted
mkdir: cannot create directory ‘/var/lib/nginx’: Permission denied
s6-rc: warning: unable to start service init-folders: command exited 1
chown: changing ownership of '/etc/crontabs/abc': Operation not permitted
crontab: setegid: Operation not permitted

... and indeed, when I sh into the container via Docker CLI and try to chown one of those dirs, I confusingly get permission denied even though I'm root with id 0:0.

I then tried deploying the image directly through the Docker CLI with the same config, and to my further confusion, everything worked fine. I sh into that container too, I'm also root there, but I can chown all the dirs just fine, and the init script also works. The output of id is 100% identical between the two containers.

So, I try to prepare a minimum reproducible example, and discover that it seems to be related to the resources stanza in the task, somehow. When I remove it, the nomad-orchestrated container has no permission issues. When I add it back, they're back. This seems to be reproducible on my end 100% of the time. I have no idea what could be going on here.

Reproduction steps

1. Deploy the above jobspec, once with the resources stanza and once without
2. SSH into the hosts
3. Run sudo docker ps to find the container IDs of the two containers
4. Run sudo docker exec -it <container ID> sh to open a shell in each container
5. Attempt chown 1000:1000 on app/ (or any other dir)

Expected Result

Since the containers are theoretically identical apart from one being resource-constrained, and the app/ dir in both is owned by root and the user is root, the chown should work in both containers.

Actual Result

chown fails with Permission denied on the resource-constrained container, but works as expected on the one that's not.

pkazmierczak · 2025-01-07T16:01:11Z

Hi @efstajas, thanks for reporting the issue. Sadly, I cannot reproduce it. I deployed the jobspec with the resources block on Ubuntu with Nomad 1.9.3 with no issues, and was able to chown 1000:1000 the app directory inside the nextcloud container. Can you tell me more about your Nomad cluster setup?

efstajas · 2025-01-14T21:46:27Z

Hey @pkazmierczak, thank you for looking into it.

I was trying some more things, different images etc., and unfortunately things got even weirder.

First up, I have no issues at all with some images, and they're running on the same hosts. I'm successfully running nginx, jenkins, and traefik, among a few others. But I now see the same strange permission errors within the containers for several images:

gitea/gitea
joxit/docker-registry-ui
linuxserver/nextcloud

...all with the same symptoms as described in my original post. Except that with gitea, it doesn't even work without the resources stanza 😵‍💫

Here's the job for that:

job "gitea" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  group "gitea" {
    count = 1

    network {
      port "http" {
        to = 3000
      }

      port "ssh" {
        to = 22
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "gitea/gitea:latest"
        ports = ["ssh", "http"]
      }

      env {
        APP_NAME                        = "Gitea: Git with a cup of tea"
        RUN_MODE                        = "prod"
        SSH_PORT                        = "$NOMAD_PORT_ssh"
        GITEA__server__START_SSH_SERVER = "true"
      }
    }
  }
}

After deploying it through Nomad, it fails with these errors:

chown: /data/gitea/conf/app.ini: Operation not permitted
chown: /data/gitea/conf/app.ini: Operation not permitted
chown: /data/gitea/conf: Operation not permitted
chown: /data/gitea/conf: Operation not permitted
chown: /data/gitea/log: Operation not permitted
chown: /data/gitea/log: Operation not permitted
chown: /data/gitea: Operation not permitted
chown: /data/gitea: Operation not permitted
chown: /app/gitea/gitea: Operation not permitted
chown: /app/gitea: Operation not permitted
chown: /app/gitea: Operation not permitted
chown: /data/git/.ssh/environment: Operation not permitted
chown: /data/git/.ssh: Operation not permitted
chown: /data/git/.ssh: Operation not permitted
chown: /data/git: Operation not permitted
chown: /data/git: Operation not permitted
su-exec: setgroups: Operation not permitted

And as before, it works just fine when I deploy it directly through Docker on the same host like this:

docker run -d \
  -p 3000:3000 \
  -p 22:22 \
  --name gitea \
  -e APP_NAME="Gitea: Git with a cup of tea" \
  -e RUN_MODE="prod" \
  -e SSH_PORT="22" \
  -e GITEA__server__START_SSH_SERVER="true" \
  gitea/gitea:latest

... which makes me think that it has to have something to do with Nomad, somehow.

Some info about my cluster as requested. All the nodes are Raspberry Pi 5s with identical setups:

6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64
Docker version 27.4.1, build b9d17ea
Nomad v1.9.4 BuildDate 2024-12-18T15:16:22Z Revision 5e49fcdb7be26941b6c7ad3ed6661bd37e70a9d8+CHANGES

Docker, Nomad, and Consul were installed on each node with this Ansible playbook, which should in theory work on any (Debian) host.

The nomad config on the node:

data_dir = "/opt/nomad"

client {
  enabled = true

  host_volume "docker-sock" {
    path = "/var/run/docker.sock"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_privileged = true
    allow_caps = ["NET_ADMIN","NET_BROADCAST","NET_RAW"]
  }
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

Pretty close to giving up and just re-imaging all the nodes, because clearly something is messed up somewhere. But I'm struggling to think of what it could be, given really nothing much was done on these nodes other than standard Docker, Nomad and Consul installs...

efstajas · 2025-01-15T16:49:41Z

Just completely re-imaged a host from scratch, with a fresh image of Raspberry Pi OS 64 Bit (Lite).

This time I installed Docker using curl -ssL https://get.docker.com | sh, to rule out any problems with the Ansible Docker role I was using.

Problem persists :(

Here are all the exact steps that were run on the host, on top of said fresh OS image: https://gist.github.com/efstajas/7de8b79d0d0206013e9928b560f55f4b

And here's the exact config file now: https://gist.github.com/efstajas/d10fb646d82597709e2376bf23ebfda2

In theory, running the Ansible playbook on any Raspberry Pi 5 should reproduce the problem..? At least I can't think of any other variables that it doesn't cover.

efstajas · 2025-01-15T16:50:31Z

Ugh, @pkazmierczak, apologies, I accidentally closed the issue. It'd be amazing if you could re-open it.

efstajas · 2025-01-15T18:18:27Z

Thanks for re-opening!

I did some more troubleshooting. It turns out that setting privileged to true in the task's config fixes the issues. Not a real solution of course, but it might help pin it down. If I understand correctly what privileged does, it being false should not cause these types of permission errors, or am I wrong?

Juanadelacuesta · 2025-01-22T16:31:48Z

Hi @efstajas! We were able to reproduce your problem and it comes down to the configuration of the docker driver in the clients:

plugin "docker" {
  config {
    allow_privileged = true
    allow_caps = ["NET_ADMIN","NET_BROADCAST","NET_RAW"]
  }
}

By setting allow_caps to those 3, you are removing all the other ones that are enabled by default for Docker, including chown.
Some of the images that were working, like nginx, can run perfectly well with the capabilities you defined, but others can't.
Allowing the containers to run as privileged overrides the allow_caps configuration in the client and allows your workloads to execute.

Make sure you allow all the capabilities you need in the client configuration to avoid undesired consequences when running your workloads as privileged and you should be fine.
I hope this helps you!
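
For later readers, here is a minimal sketch of a client plugin block that keeps the default capabilities and also allows the three network ones from the original configuration. The default list is my reading of the Nomad Docker driver documentation for allow_caps (Docker's defaults without net_raw); treat it as an assumption and verify it against the docs for your Nomad version.

```hcl
plugin "docker" {
  config {
    allow_privileged = true

    # allow_caps replaces the default allow-list rather than extending it,
    # so the defaults are repeated here (assumed list, verify for your
    # version), followed by the three capabilities from the original
    # configuration: net_admin, net_broadcast, net_raw.
    allow_caps = [
      "audit_write", "chown", "dac_override", "fowner", "fsetid", "kill",
      "mknod", "net_bind_service", "setfcap", "setgid", "setpcap",
      "setuid", "sys_chroot",
      "net_admin", "net_broadcast", "net_raw"
    ]
  }
}
```

Note that allow_caps only controls what tasks are permitted to request. A task that actually needs one of the extra capabilities, such as net_admin, still has to ask for it with cap_add in its own Docker config block.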

efstajas · 2025-01-23T20:55:27Z

@Juanadelacuesta Oh man, I would've never spotted that. When I set that line I was assuming that it adds them to the defaults, but the docs actually clearly state that it replaces them. Thank you, problem solved!

efstajas added the type/bug label Jan 5, 2025

jrasell added this to Nomad - Community Issues Triage Jan 7, 2025

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Jan 7, 2025

pkazmierczak moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jan 7, 2025

pkazmierczak added the stage/waiting-reply label Jan 7, 2025

efstajas closed this as completed Jan 15, 2025

github-project-automation bot moved this from Triaging to Done in Nomad - Community Issues Triage Jan 15, 2025

jrasell reopened this Jan 15, 2025

github-project-automation bot moved this from Done to Needs Triage in Nomad - Community Issues Triage Jan 15, 2025

efstajas mentioned this issue Jan 15, 2025

/usr/local/bin/docker-entrypoint.sh: line 122: start.sh: No such file or directory flobernd/docker-minecraft-ftb#8

Closed

pkazmierczak added stage/needs-investigation and removed stage/waiting-reply labels Jan 20, 2025

pkazmierczak moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jan 20, 2025

pkazmierczak moved this from Triaging to Needs Triage in Nomad - Community Issues Triage Jan 20, 2025

Juanadelacuesta added stage/waiting-reply type/question and removed stage/needs-investigation type/bug labels Jan 22, 2025

Juanadelacuesta self-assigned this Jan 22, 2025

Juanadelacuesta moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 22, 2025

efstajas closed this as completed Jan 23, 2025

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Jan 23, 2025
