Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of docker pause containers with a custom runtime #20017

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

apollo13
Copy link
Contributor

@apollo13 apollo13 commented Feb 20, 2024

Closes #15086

hostConfig.NetworkMode = netMode
if d.config.NewNetworking {
// "host" is not actually true here, it will cause joining the existing namespace
hostConfig.NetworkMode = "host"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might still be better to use "none" here and test if it still works? But the differences are probably marginal. docker network inspect host/none will show the containers either way, even if they are not really attached.

drivers/docker/driver.go Outdated Show resolved Hide resolved
@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
@apollo13
Copy link
Contributor Author

Hi @tgross, I see you added a needs-rebase label. I'd happily rebase and work with you on that if there is any consensus on how to move this forward. Personally I think it would be a massive stability win for nomad.

@tgross
Copy link
Member

tgross commented Jul 19, 2024

Hi @tgross, I see you added a needs-rebase label. I'd happily rebase and work with you on that if there is any consensus on how to move this forward. Personally I think it would be a massive stability win for nomad.

Ah yeah... I marked this (and all other open PRs at the time) as needs-rebase just because of a CI change around backports and our new LTS workflow that wouldn't work on any open PR that wasn't rebased on main after those changes landed. Sorry, I should have posted a note too. 😀

As far as this proposal goes, I really like the idea of dropping the pause containers but I'm still fuzzy on whether this particular implementation is viable (especially with having to deal with the group-level networks). There's not really enough for me to go on here and unfortunately I haven't had time to dig in further to make sure we understand all the implications of the design.

@apollo13
Copy link
Contributor Author

@tgross Any chance that the team looks at this and we end up with some sort of plan? Even if this is not supported for nvidia runtimes, it would be a massive win for everyone else.

@tgross
Copy link
Member

tgross commented Oct 29, 2024

As I mentioned above, I'm still a little fuzzy as to the viability of this plan. The PR doesn't have a working implementation in place, so it's hard to reason about without going thru it from scratch. Unfortunately we haven't had time to do so.

@apollo13
Copy link
Contributor Author

Mkay, combined with #15086 (comment) it is a working implementation (or at least was last time I checked). I rather not put work into it if there is no chance on getting this in, so I also would like to keep the python script for now and not rewrite it into go :)

@apollo13
Copy link
Contributor Author

apollo13 commented Nov 4, 2024

@tgross this PR now contains a fully working implementation. Start nomad with this config:

plugin "docker" {
  config {
    new_networking = true
  }
}

and configure docker like this (/etc/docker/daemon.json):

{
	"runtimes": {"nomad": {"path": "/home/florian/sources/nomad/bin/nomad", "runtimeArgs": ["runc"]}}
}

Adjust the path to the runtime executable as needed. Let me know what you think!

@tgross
Copy link
Member

tgross commented Nov 5, 2024

Thanks @apollo13! I'll take a detailed look thru in the next couple days.

@apollo13
Copy link
Contributor Author

apollo13 commented Nov 5, 2024

@tgross Thanks, no rush though. I probably cannot finish it anyways this year. That said I think it would be really valuable to have and maybe it can act as a starting point. Don't hesitate to ask if anything is unclear though!

@tgross tgross changed the title Get rid of docker pause containers with a custom runtime. Closes #15086 Get rid of docker pause containers with a custom runtime Nov 8, 2024
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @apollo13! With the work you've done here I've got a pretty good handle on how this is supposed to work, and it seems really promising. I demo'd your work here at our weekly internal demos and there was general consensus this would be a good path forward.

It's a little unfortunate that it requires extra dockerd configuration, because I'd love for this to be the default behavior, but we'll have to live with that for ease of use and backwards compatibility. For the folks for whom this matters, it'll be great.

I've left some overall architectural comments for when you return to this, but we can get into the details then. If you can't get back to it because of outside forces (totally understandable), just let us know and we'll look into carrying the PR forward (with credit to you, of course!).

command/runc.go Outdated Show resolved Hide resolved
command/runc.go Outdated Show resolved Hide resolved
command/commands.go Outdated Show resolved Hide resolved
command/runc.go Outdated Show resolved Hide resolved
@@ -310,6 +310,8 @@ var (
// task containers. If true, nomad doesn't start docker_logger/logmon processes
"disable_log_collection": hclspec.NewAttr("disable_log_collection", "bool", false),

"new_networking": hclspec.NewAttr("new_networking", "bool", false),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should bikeshed over this field name some. But "no pause container" with a default of false makes for an awkward double-negative so I appreciate you're trying to avoid that here. 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You read my mind… Also I am not sure if the default (if omitted) wouldn't be always false and then it would be no pause container = false which would be the default we don't want? Or can we express that better in hcl (bool reference to allow checking for null?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to nomad_native_networking for the lack of a better name, feel free to change at will.

if containerRuntime == "" {
containerRuntime = "nomad"
} else {
return c, fmt.Errorf("new-style networking not compatible with custom runtimess")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this a bit and I'm not sure that's inevitable? What if the Nomad runtime function took an argument that was the binary to call next (defaulting to runc) along with args to transparently pass-thru? We likely need that anyway because we can't rely on the location of runc being at a specific path like /usr/bin/runc as we've done here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that is just me being lazy. In practice, the docker api supports querying the registered runtimes (run docker info on your box and you will see them, so it must be part of the API). We could query that and then pass the "sub"-runtime through via annotations like with the network ns (we have to use annotations since it is docker calling the runtime and we cannot pass arbritrary args)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay we cannot pass on the binary to call next via annotations since the oci runtime operations like state & delete only pass in a container ID. So the only thing we can do is force the enduser to specify something like this in docker daemon.json:

{
	"runtimes": {
          "nomad": {"path": "/home/florian/sources/nomad/bin/nomad",
                             "runtimeArgs": ["runc", "/usr/bin/runc"]},
          "nomad-nvidia": {"path": "/home/florian/sources/nomad/bin/nomad",
                             "runtimeArgs": ["runc", "/usr/bin/nvidia-runtime"]},
       }
}

and then we can just prefix nomad- to the requested runtime if the "new" networking is enabled. Not really nice but probably good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit that prefixes other runtimes with nomad-*. Should we consider the interaction with allow_runtimes? I am leaning towards a no since those nomad-* runtimes have to be configured explicitly anyways?

@apollo13
Copy link
Contributor Author

apollo13 commented Nov 8, 2024

If you can't get back to it because of outside forces (totally understandable), just let us know and we'll look into carrying the PR forward (with credit to you, of course!).

Will see what I can do, but will most likely have to be in my free time since at work we are probably moving to k8s after the recent license changes and the general move to hide more and more behind the enterprise license (and other stuff like good CNI plugins with network policy basically not being existing for nomad). Would love you folks to talk me out of it though :þ

What timeframe are you thinking about for getting that in? If you want it this year, it might be easier if you continue it. If you are not in a rush, then I might be able to get something done, no promises.

@tgross
Copy link
Member

tgross commented Nov 8, 2024

What timeframe are you thinking about for getting that in? If you want it this year, it might be easier if you continue it. If you are not in a rush, then I might be able to get something done, no promises.

No real rush. As you might imagine things slow down a bit going into the end of the year anyways. Our next major release 1.10 LTS isn't until the spring, but this seems like the sort of thing we could land in a minor release no problem.

@apollo13
Copy link
Contributor Author

apollo13 commented Nov 9, 2024

Found some time today to clean up the PR, made the cmd hidden and renamed it to runcshim. Now you also need to pass the "next runc" binary via runtimeArgs in the docker daemon.json:

{
	"runtimes": {"nomad": {"path": "/home/florian/sources/nomad/bin/nomad", "runtimeArgs": ["runcshim", "runc"]}}
}

@apollo13
Copy link
Contributor Author

apollo13 commented Nov 9, 2024

So I have a deal for you @tgross. Would you mind taking this over and writing the tests? I'll adjust everything else you want (or you do it by your own if you are faster) but getting this tested will take me a fair amount of time to even get a basic test setup running (I'd happily improve them though -- assuming time permits it -- if you can show me an example of how to write those -- ie how should docker daemon.json get configured etc).

@tgross
Copy link
Member

tgross commented Nov 11, 2024

So I have a deal for you @tgross. Would you mind taking this over and writing the tests?

Sure thing, no problem. I'll look to getting it landed sometime in this major cycle.

@tgross tgross added theme/driver/docker type/enhancement and removed stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows labels Nov 11, 2024
@tgross tgross added this to the 1.10.0 milestone Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

idea: eliminate the use of "pause" containers for docker networking
2 participants