`resources` stanza for Docker driver task somehow causes permission issues within the container #24774
Hi @efstajas, thanks for reporting the issue. Sadly, I cannot reproduce. I deployed the jobspec with the resources block on Ubuntu with Nomad 1.9.3 with no issues, and was able to …
Hey @pkazmierczak, thank you for looking into it. I was trying some more things, different images etc., and unfortunately things got even weirder. First up, some images I have no issues with at all, and they're running on the same hosts. I'm successfully running nginx, jenkins, traefik, among a few others. But with now multiple images, I see the same strange permission errors within the containers:
...all with the same symptoms as described in my original post. Except that with … Here's the job for that:

```hcl
job "gitea" {
region = "global"
datacenters = ["dc1"]
type = "service"
group "gitea" {
count = 1
network {
port "http" {
to = 3000
}
port "ssh" {
to = 22
}
}
task "app" {
driver = "docker"
config {
image = "gitea/gitea:latest"
ports = ["ssh", "http"]
}
env {
APP_NAME = "Gitea: Git with a cup of tea"
RUN_MODE = "prod"
SSH_PORT = "$NOMAD_PORT_ssh"
GITEA__server__START_SSH_SERVER = "true"
}
}
}
}
```

After deploying it through Nomad, it fails with these errors:
And as before, it works just fine when I deploy it directly through Docker on the same host like this:
... which makes me think that it has to have something to do with Nomad, somehow. Some info about my cluster as requested. All the nodes are Raspberry Pi 5s with identical setups:
Docker, Nomad, and Consul were installed on the node with this Ansible playbook, which should in theory work on any (Debian) host. The Nomad config on the node:

```hcl
data_dir = "/opt/nomad"
client {
enabled = true
host_volume "docker-sock" {
path = "/var/run/docker.sock"
read_only = false
}
}
plugin "docker" {
config {
allow_privileged = true
allow_caps = ["NET_ADMIN","NET_BROADCAST","NET_RAW"]
}
}
plugin "raw_exec" {
config {
enabled = true
}
}
```

I'm pretty close to giving up and just re-imaging all the nodes, because clearly something is messed up somewhere. But I'm struggling to think of what it could be, given that nothing much was done on these nodes other than standard Docker, Nomad, and Consul installs...
I just completely re-imaged a host from scratch with a fresh image of Raspberry Pi OS 64-bit (Lite). I installed Docker using … Problem persists :(

Here are all the exact steps that were run on the host, on top of said fresh OS image: https://gist.github.com/efstajas/7de8b79d0d0206013e9928b560f55f4b

And here's the exact config file now: https://gist.github.com/efstajas/d10fb646d82597709e2376bf23ebfda2

In theory, running the Ansible playbook on any Raspberry Pi 5 should reproduce the problem..? At least I can't think of any other variables that it doesn't cover.
Ugh, @pkazmierczak, apologies, I accidentally closed the issue. It'd be amazing if you could re-open it.
Thanks for re-opening! I did some more troubleshooting. It turns out that setting …
Hi @efstajas! We were able to reproduce your problem and it comes down to the configuration of the docker driver in the clients:
By setting the `allow_caps` option you replace the default list of allowed capabilities rather than adding to it. Make sure you allow all the capabilities you need in the client configuration to avoid undesired consequences when running your workloads as privileged, and you should be fine.
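For reference, here's a sketch of what a corrected client configuration could look like. Because `allow_caps` replaces the defaults instead of extending them, the original three-entry list dropped the `chown` capability (among others), and the chown syscall needs it even when running as `root`, which matches the permission errors reported above. The default list below is the one documented for the Nomad Docker driver and should be double-checked against the version you run; this is an illustration, not the exact config used in this thread.

```hcl
plugin "docker" {
  config {
    allow_privileged = true

    allow_caps = [
      # Nomad's documented default capability set, restated explicitly
      # because allow_caps replaces the defaults instead of adding to them.
      "audit_write", "chown", "dac_override", "fowner", "fsetid",
      "kill", "mknod", "net_bind_service", "setfcap", "setgid",
      "setpgid", "setuid", "sys_chroot",

      # The extra capabilities this cluster's workloads actually need.
      "net_admin", "net_broadcast", "net_raw"
    ]
  }
}
```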
@Juanadelacuesta Oh man, I would've never spotted that. When I set that line I was assuming that it adds them to the defaults, but the docs actually clearly state that it replaces them. Thank you, problem solved!
Nomad version
Operating system and Environment details
Linux pi-cluster-5-01 6.6.51+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt3 (2024-10-08) aarch64
Docker version 27.4.0, build bde2b89
Issue
I'm trying to deploy the image `lscr.io/linuxserver/nextcloud:latest`. Here's my jobspec:

The env values are all the default values for that container. The PUID and GUID env vars are the standard values for that image. The problem is that the container fails to initialise because it gets permission denied errors trying to `chown` dirs within the container:

... and indeed, when I `sh` into the container via the Docker CLI and try to `chown` one of those dirs, I confusingly get permission denied even though I'm `root` with id `0:0`.

I then tried deploying the image directly through the Docker CLI with the same config, and to my further confusion, everything worked fine. I `sh` into that container too, I'm also `root` there, but I can `chown` all the dirs just fine, and the init script also works. The output of `id` is 100% identical between the two containers.

So, I tried to prepare a minimum reproducible example, and discovered that it seems to be related to the `resources` stanza in the task, somehow. When I remove it, the Nomad-orchestrated container has no permission issues. When I add it back, they're back. This seems to be reproducible on my end 100% of the time. I have no idea what could be going on here.
Reproduction steps

1. Deploy the jobspec twice, once with the `resources` stanza and once without
2. `sudo docker ps` to find the container IDs of the two containers
3. `sudo docker exec -it <container ID>` to enter a shell in both containers
4. `chown 1000:1000` on `app/` (or any other dir)

Expected Result
Since the `app/` dir in both is owned by `root` and the user is `root`, the `chown` should work on both containers.
Actual Result

`chown` fails with `Permission denied` on the resource-constrained container, but works as expected on the one that's not.