-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get rid of docker pause containers with a custom runtime #20017
base: main
Are you sure you want to change the base?
Conversation
a180f30
to
a48ea61
Compare
hostConfig.NetworkMode = netMode | ||
if d.config.NewNetworking { | ||
// "host" is not actually true here, it will cause joining the existing namespace | ||
hostConfig.NetworkMode = "host" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might still be better to use "none" here and test if it still works? But the differences are probably marginal. docker network inspect host/none
will show the containers either way, even if they are not really attached.
a48ea61
to
9b306f1
Compare
Hi @tgross, I see you added a |
Ah yeah... I marked this (and all other open PRs at the time) as As far as this proposal goes, I really like the idea of dropping the pause containers but I'm still fuzzy on whether this particular implementation is viable (especially with having to deal with the group-level networks). There's not really enough for me to go on here and unfortunately I haven't had time to dig in further to make sure we understand all the implications of the design. |
@tgross Any chance that the team looks at this and we end up with some sort of plan? Even if this is not supported for nvidia runtimes, it would be a massive win for everyone else. |
As I mentioned above, I'm still a little fuzzy as to the viability of this plan. The PR doesn't have a working implementation in place, so it's hard to reason about without going thru it from scratch. Unfortunately we haven't had time to do so. |
Mkay, combined with #15086 (comment) it is a working implementation (or at least was last time I checked). I rather not put work into it if there is no chance on getting this in, so I also would like to keep the python script for now and not rewrite it into go :) |
9b306f1
to
3790f03
Compare
3790f03
to
34a9e0f
Compare
34a9e0f
to
93c0eee
Compare
@tgross this PR now contains a fully working implementation. Start nomad with this config:
and configure docker like this (/etc/docker/daemon.json):
Adjust the path to the runtime executable as needed. Let me know what you think! |
Thanks @apollo13! I'll take a detailed look thru in the next couple days. |
@tgross Thanks, no rush though. I probably cannot finish it anyways this year. That said I think it would be really valuable to have and maybe it can act as a starting point. Don't hesitate to ask if anything is unclear though! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @apollo13! With the work you've done here I've got a pretty good handle on how this is supposed to work, and it seems really promising. I demo'd your work here at our weekly internal demos and there was general consensus this would be a good path forward.
It's a little unfortunate that it requires extra dockerd
configuration, because I'd love for this to be the default behavior, but we'll have to live with that for ease of use and backwards compatibility. For the folks for whom this matters, it'll be great.
I've left some overall architectural comments for when you return to this, but we can get into the details then. If you can't get back to it because of outside forces (totally understandable), just let us know and we'll look into carrying the PR forward (with credit to you, of course!).
drivers/docker/config.go
Outdated
@@ -310,6 +310,8 @@ var ( | |||
// task containers. If true, nomad doesn't start docker_logger/logmon processes | |||
"disable_log_collection": hclspec.NewAttr("disable_log_collection", "bool", false), | |||
|
|||
"new_networking": hclspec.NewAttr("new_networking", "bool", false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should bikeshed over this field name some. But "no pause container" with a default of false makes for an awkward double-negative so I appreciate you're trying to avoid that here. 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You read my mind… Also I am not sure if the default (if omitted) wouldn't be always false and then it would be no pause container = false
which would be the default we don't want? Or can we express that better in hcl (bool reference to allow checking for null?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to nomad_native_networking
for the lack of a better name, feel free to change at will.
drivers/docker/driver.go
Outdated
if containerRuntime == "" { | ||
containerRuntime = "nomad" | ||
} else { | ||
return c, fmt.Errorf("new-style networking not compatible with custom runtimess") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this a bit and I'm not sure that's inevitable? What if the Nomad runtime function took an argument that was the binary to call next (defaulting to runc
) along with args to transparently pass-thru? We likely need that anyway because we can't rely on the location of runc
being at a specific path like /usr/bin/runc
as we've done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that is just me being lazy. In practice, the docker api supports querying the registered runtimes (run docker info
on your box and you will see them, so it must be part of the API). We could query that and then pass the "sub"-runtime through via annotations like with the network ns (we have to use annotations since it is docker calling the runtime and we cannot pass arbritrary args)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay we cannot pass on the binary to call next via annotations since the oci runtime operations like state
& delete
only pass in a container ID. So the only thing we can do is force the enduser to specify something like this in docker daemon.json:
{
"runtimes": {
"nomad": {"path": "/home/florian/sources/nomad/bin/nomad",
"runtimeArgs": ["runc", "/usr/bin/runc"]},
"nomad-nvidia": {"path": "/home/florian/sources/nomad/bin/nomad",
"runtimeArgs": ["runc", "/usr/bin/nvidia-runtime"]},
}
}
and then we can just prefix nomad-
to the requested runtime if the "new" networking is enabled. Not really nice but probably good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a commit that prefixes other runtimes with nomad-*
. Should we consider the interaction with allow_runtimes
? I am leaning towards a no since those nomad-*
runtimes have to be configured explicitly anyways?
Will see what I can do, but will most likely have to be in my free time since at work we are probably moving to k8s after the recent license changes and the general move to hide more and more behind the enterprise license (and other stuff like good CNI plugins with network policy basically not being existing for nomad). Would love you folks to talk me out of it though :þ What timeframe are you thinking about for getting that in? If you want it this year, it might be easier if you continue it. If you are not in a rush, then I might be able to get something done, no promises. |
No real rush. As you might imagine things slow down a bit going into the end of the year anyways. Our next major release 1.10 LTS isn't until the spring, but this seems like the sort of thing we could land in a minor release no problem. |
Found some time today to clean up the PR, made the cmd hidden and renamed it to
|
ff67f45
to
fa14919
Compare
fa14919
to
ea9f363
Compare
ea9f363
to
c3a5780
Compare
8993b1d
to
76a5f44
Compare
So I have a deal for you @tgross. Would you mind taking this over and writing the tests? I'll adjust everything else you want (or you do it by your own if you are faster) but getting this tested will take me a fair amount of time to even get a basic test setup running (I'd happily improve them though -- assuming time permits it -- if you can show me an example of how to write those -- ie how should docker daemon.json get configured etc). |
Sure thing, no problem. I'll look to getting it landed sometime in this major cycle. |
Closes #15086