idea: eliminate the use of "pause" containers for docker networking #15086
Comments
One rather creative idea, where I have no idea if it works (I didn't do this for networking yet), is to provide a custom docker runtime that simply wraps `runc`. The runtime hooks might allow for something similar: https://github.com/opencontainers/runtime-spec/blob/main/config.md#createruntime-hooks

Sorry, I am out of clever ideas, I can only offer the crazy ideas above ;)
Another option might be to work towards directly using containerd instead of dockerd. This would probably allow for plenty of things :) That said, the simpler option might be to make the podman plugin a first-class citizen and simply promote that?
Crosslinking #19962 for visibility.
So I have been looking a bit into this and I think it would actually be feasible. For quick tests there is https://github.com/awslabs/oci-add-hooks which can be used as a wrapper for `runc`. To add the required information to the config, annotations can be used: https://github.com/opencontainers/runtime-spec/blob/main/config.md#annotations (the Docker CLI can set these via `--annotation`). So assuming nomad would ship something like a small wrapper runtime that picks the network namespace out of such an annotation, this should be doable.

Thoughts on the viability of this @tgross, @shoenig? I'd really love to see something like this in nomad since this is a rather annoying issue.
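To make the shape of that concrete, a tiny sketch of the consumer side: a wrapper or hook only has to read the annotation back out of the bundle's `config.json`. The key name `network_ns` is just an assumption at this point.

```python
#!/usr/bin/env python3
# Sketch: read the value passed via `docker run --annotation network_ns=...`
# back out of the OCI bundle config. The key name "network_ns" is an assumed,
# illustrative name, not an established interface.
import json
import sys


def read_network_ns(bundle_dir):
    with open(f"{bundle_dir}/config.json") as f:
        config = json.load(f)
    return config.get("annotations", {}).get("network_ns")


if __name__ == "__main__":
    print(read_network_ns(sys.argv[1]))
```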
Hi @apollo13! I don't think injecting the hooks is too bad... it's more like "ok, what is the hook supposed to do once it's injected" that is underspecified. For example, that excerpt from the docs specifically says what may be done, but it leaves unsaid how this would be done. We could certainly create a network namespace here, but how that network namespace ID actually gets propagated to the rest of the Docker container lifetime is the missing bit we'd need to figure out.
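For reference, a createRuntime hook does at least get a handle on the namespace: it receives the container's OCI state JSON on stdin, which includes the container's pid, so the freshly created network namespace is reachable at `/proc/<pid>/ns/net`. A minimal sketch of the hook side (not Docker- or Nomad-specific code):

```python
#!/usr/bin/env python3
# Sketch of what a createRuntime hook receives: the OCI state document on
# stdin. The container's network namespace is reachable via /proc/<pid>/ns/net.
# What to do with it (and how to tell Docker about it) is the open question.
import json
import sys

state = json.load(sys.stdin)
pid = state["pid"]
netns_path = f"/proc/{pid}/ns/net"

# e.g. log it somewhere a supervisor could pick it up
print(netns_path, file=sys.stderr)
```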
Hi @tgross, I think this is described here: https://blog.quarkslab.com/digging-into-runtimes-runc.html, which has an example repo here: https://github.com/mihailkirov/runc-article/blob/main/scripts-hooks/createRuntime/runtimeCreate.sh So basically it seems one is supposed to move the runc process into the correct namespace? Let me check that against some running containers.
I think we can make it even easier: we can write a shim that just alters the namespace in the config.json: https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config-linux.md#namespaces -- I will see if I can get that working with an example docker container :)
Okay @tgross, I have got something for you. Put this in your `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "nomad": {
      "path": "/usr/local/bin/nomad-runc"
    }
  }
}
```

With this we can supply a custom runc and don't need to implement a container shim (which is probably much more work).

EDIT: The upside of a container shim would be that we would not need to modify `daemon.json`.
And here is the `/usr/local/bin/nomad-runc` wrapper itself (proof of concept):

```python
#!/usr/bin/env python3
import json
import os
import sys


def log(*data):
    with open("/tmp/log", "a") as log_file:
        print(*data, file=log_file)


def modify_bundle(bundle):
    config_file = f"{bundle}/config.json"
    with open(config_file) as f:
        config = json.load(f)
    # passed in via `--annotation` to docker run
    annotations = config["annotations"]
    namespaces = config["linux"]["namespaces"]
    for ns in namespaces:
        # If there is a network namespace, modify it.
        # From what I have seen it is always there, even if no path is set.
        if ns["type"] == "network":
            ns["path"] = annotations["network_ns"]
            break
    else:
        # Better safe than sorry: if there is no network namespace, create one.
        namespaces.append({"type": "network", "path": annotations["network_ns"]})
    with open(config_file, "w") as f:
        json.dump(config, f)


def main():
    # hook into `runc create --bundle /the/bundle/dir`
    if "create" in sys.argv and "--bundle" in sys.argv:
        bundle = sys.argv[sys.argv.index("--bundle") + 1]
        modify_bundle(bundle)
    # exec back to runc and let it do the heavy lifting
    os.execvp("runc", ["runc"] + sys.argv[1:])


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        log(e)
        raise
```
raise When you now run:
you will see:
that the sleep process is properly attached to the correct namespace. A few details:
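One way to double-check that attachment, as a sketch assuming the namespace is bind-mounted somewhere like `/var/run/netns/` (path and pid are placeholders): a process is in a given named network namespace exactly when `/proc/<pid>/ns/net` and the netns file refer to the same nsfs inode.

```python
#!/usr/bin/env python3
# Sketch: verify that a process joined a given named network namespace by
# comparing nsfs inodes. Usage: verify_netns.py <pid> <netns-path>
# (a path like /var/run/netns/example is an example, not from Nomad).
import os
import sys


def same_netns(pid, netns_path):
    proc_ns = os.stat(f"/proc/{pid}/ns/net")
    named_ns = os.stat(netns_path)
    return (proc_ns.st_dev, proc_ns.st_ino) == (named_ns.st_dev, named_ns.st_ino)


if __name__ == "__main__":
    print(same_netns(int(sys.argv[1]), sys.argv[2]))
```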
Now, is that viable? :D
Now that my runc wrapper is working I have further questions:
EDIT: I guess
I have a working (!) PR at #20017. The only thing that seems likely to break for now is nvidia and other runtimes; there we probably need a way to relay to the original runtime (probably doable via annotations as well, so we know what to exec instead of runc, but it might mean reading /etc/docker/daemon.json to get that information).
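That relaying could stay inside the same wrapper. A sketch, under the assumption that a (hypothetical) `original_runtime` annotation names the real runtime to exec, falling back to plain runc:

```python
#!/usr/bin/env python3
# Sketch only: choose which real runtime to exec based on a (hypothetical)
# "original_runtime" annotation in the bundle config, falling back to runc.
# The annotation key and behaviour are assumptions, not an existing API.
import json
import os
import sys


def pick_runtime(bundle):
    with open(f"{bundle}/config.json") as f:
        config = json.load(f)
    return config.get("annotations", {}).get("original_runtime", "runc")


def main():
    runtime = "runc"
    if "--bundle" in sys.argv:
        bundle = sys.argv[sys.argv.index("--bundle") + 1]
        runtime = pick_runtime(bundle)
    # hand everything off to the real runtime (e.g. nvidia-container-runtime)
    os.execvp(runtime, [runtime] + sys.argv[1:])


if __name__ == "__main__":
    main()
```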
So basically we're just editing the container's config.json file and then exec'ing into `runc`? Can the runtime take arguments? Normally the way we'd like to do this kind of thing is to add a hidden subcommand, similar to what we've done for other internal helpers. More importantly, do we really need to wrap the runtime, instead of adding this config-modifying script to the hooks discussed above?
The answer to both questions can be found in
Right! We return an error!
I guess so, but as my code shows in the PR the behavior can be made optional. Users could opt into this (and I guess users losing their network in containers might love to get it stable).
Yes, see https://docs.docker.com/reference/cli/dockerd/#configure-runc-drop-in-replacements
You'd still need to put the full path into daemon.json though if it is not on the path.
I haven't found a way to modify the
Yeah, figured that out after posting (my PR utilizes this to make it conditional)
👍
Relevant? moby/moby#47828
@zackherbert Kinda :) I mean, in the end the best solution would be if docker "simply" allowed joining an existing namespace. I am not sure though if it actually helps. For it to work in nomad we would need each container to be able to join a different namespace (sure, we could take your alternative number 3 and create a new bridge for every container, but that probably doesn't scale). I still wonder if a networking plugin might do the trick. But as you noted, the docs are rather sparse.
The goal is to put everything in the same network namespace, isn't it? Networks are configured at the group level.
@tgross Yes, but containers from different jobs would need to be in different namespaces, no? Or even within a job, different groups would have different namespaces.
Right. Each allocation of a job gets its own network namespace (an allocation is defined by the `group` block).

So effectively we'd be removing a lot more code than we'd be adding 😁
Well yes, but that is still a "maybe" even in a distant future :) FWIW I looked into the network plugin docs and it doesn't look like a networking plugin would help. The join method (https://github.com/moby/moby/blob/master/libnetwork/docs/remote.md#join) reads as if libnetwork wants to move the iface into the namespace itself.
When a task group is using "bridge" networking mode, Nomad needs a network namespace shared by all tasks in the group. These tasks are not necessarily all using the same task driver (e.g. could be `exec` + `docker`, etc.).

Unfortunately, once the Docker task driver gets involved this becomes tricky, because Docker explicitly does not support creating a container in a pre-defined network namespace (moby/moby#7455). So unlike with the `exec` driver, where we use normal Linux tooling to manage and join tasks to a shared network namespace, we are at the mercy of what Docker tooling enables.

The `docker network create` tool AFAICT has no support for creating a network namespace. Instead, each container in Docker always gets created with its own network namespace, and then you can "link" one container to another afterwards. This is where the `pause` container comes in. Because Docker ties the network namespace to a container (i.e. a running process), Nomad needs a container that will not be stopped/replaced throughout the lifespan of the task group.

The problem, of course, is that Bad Things can still happen to that pause container, and that is believed to be the source of most occurrences of #6385. While we can make improvements around cleaning up those orphaned resources, there's a lingering desire to just eliminate this whole class of problems outright.

I'm not sure what that means for users of the `docker` driver, unless we find a clever solution to start Docker containers in an existing network namespace, one that is created and managed by Nomad like we do for other task drivers. One idea would be to better invest in and promote the podman driver, which can also run Docker containers while being compatible with normal Linux tools and conventions.
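For contrast with the Docker situation described above, joining a pre-created network namespace with plain Linux facilities is straightforward: open the bind-mounted netns file and call setns(2) before exec'ing the task. A rough sketch of that mechanism (not Nomad's actual implementation; the namespace path is an example):

```python
#!/usr/bin/env python3
# Sketch: join an existing named network namespace (as created by
# `ip netns add <name>`) and then exec the task. The path below is an
# example placeholder; requires CAP_SYS_ADMIN.
import ctypes
import os
import sys

CLONE_NEWNET = 0x40000000  # from <sched.h>

libc = ctypes.CDLL("libc.so.6", use_errno=True)

fd = os.open("/var/run/netns/example", os.O_RDONLY)
if libc.setns(fd, CLONE_NEWNET) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))
os.close(fd)

cmd = sys.argv[1:] or ["sleep", "600"]
os.execvp(cmd[0], cmd)
```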