Mechanism for editing the nomad0 CNI template #13824

Closed
the-maldridge opened this issue Jul 18, 2022 · 24 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/cni, theme/config, theme/networking, type/enhancement

Comments

@the-maldridge

Proposal

Right now the configuration for the nomad0 bridge device is hard-coded. Among other things, this makes it impossible to use Consul Connect with Nomad and IPv6.

Use-cases

This would enable IPv6 on the bridge, and it would also allow the use of more advanced or configurable CNI topologies.

Attempted Solutions

To the best of my knowledge, there is no current solution that makes Consul Connect and Nomad play nice with IPv6 or other similarly advanced dual-stack network configurations.

@tgross
Member

tgross commented Jul 25, 2022

Hi @the-maldridge! This seems like a reasonable idea, and I've marked it for roadmapping.

The CNI configuration we use can be found at networking_bridge_linux.go#L141-L180. Note that it also configures the firewall and portmap plugins.

One approach we could take here is to allow the administrator to override that template with a config file somewhere on the host. The configuration seems fairly straightforward, but then it's a matter of hunting down anywhere in the client code that has specific assumptions about that template and figuring out how to detect the right behavior from there.

@tgross added the stage/accepted, theme/networking, theme/cni, and theme/config labels Jul 25, 2022
@the-maldridge
Author

@tgross I like the template idea; it would provide the most flexibility while removing a dependency on a hard-coded string literal, something I always like to do. What do you think about using go:embed to include the default template rather than the string literal, as a means of simplifying the code that loads the various options? I can't remember off the top of my head what version of Go that was introduced in, to know whether Nomad already targets that version.

@tgross
Member

tgross commented Jul 25, 2022

Yup, main targets go1.18, so we're fine on using embed and we're gradually moving some of our other embedded blobs over to that as well (the big lift still being the UI bundle).
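
As a rough illustration of the go:embed idea (a minimal sketch, not Nomad's actual implementation; the file name, package, and loadBridgeTemplate helper below are hypothetical), the default conflist template could be compiled in and optionally overridden by a host-level file:

package bridge

import (
	_ "embed"
	"os"
)

// The default bridge conflist template, embedded into the binary instead of
// living as a string literal in networking_bridge_linux.go.
//
//go:embed nomad_bridge.conflist.tmpl
var defaultCNITemplate string

// loadBridgeTemplate returns an operator-supplied template when one is
// configured and falls back to the embedded default otherwise.
func loadBridgeTemplate(overridePath string) (string, error) {
	if overridePath == "" {
		return defaultCNITemplate, nil
	}
	b, err := os.ReadFile(overridePath)
	if err != nil {
		return "", err
	}
	return string(b), nil
}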

@pruiz
Contributor

pruiz commented Dec 24, 2022

@tgross I think some sort of 'escape hatch', similar to those used for Envoy, may be an option here. If we could pass some additional JSON into some parts of Nomad's bridge conflist file, like adding additional plugins to the list, etc., that would make it easier to extend Nomad's bridge CNI setup.

In my case that would allow using Cilium along with Nomad's own bridge, and being able to mix and match Consul Connect enabled services with others policed by Cilium, or even have direct L3 reachability between tasks on different Nomad nodes, tunneled by Cilium under Nomad's bridge.

@pruiz
Contributor

pruiz commented Dec 28, 2022

Another option that came to my mind could be using something like https://github.com/qntfy/kazaam in order to allow the user to specify some 'JSON transformations' to apply to Nomad's bridge CNI config at runtime.

This would work like:

  • A new optional setting (bridge_transform_rules?) holding the JSON string with the transformation rules to apply.
  • Modify buildNomadBridgeNetConfig() so that, if bridge_transform_rules is provided, the transformations are applied using kazaam before returning the built configuration.
  • The new (transformed) configuration would be passed on to CNI in order to build the final CNI bridge.
  • We could even pass some Nomad variables to the transformations, just in case that is useful in some contexts.

While this might not be the most straightforward means to 'edit' the CNI template, it is probably the most flexible option, and it opens up a lot of possibilities for sysadmins to integrate Nomad's bridge with many different networking systems.
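
For illustration, a rough sketch of that flow in Go, assuming kazaam's NewKazaam / TransformJSONStringToString API as described in its README (the import path and signatures may differ between versions); the bridge_transform_rules setting and the applyBridgeTransforms helper are hypothetical:

package bridge

import (
	// Import path and API may differ between kazaam versions; check the
	// library's documentation before relying on this.
	"github.com/qntfy/kazaam"
)

// applyBridgeTransforms applies operator-provided kazaam rules (the proposed
// bridge_transform_rules setting) to the conflist JSON Nomad built for the
// nomad0 bridge.
func applyBridgeTransforms(conflistJSON, transformRules string) (string, error) {
	if transformRules == "" {
		// No rules configured: keep the default configuration untouched.
		return conflistJSON, nil
	}
	k, err := kazaam.NewKazaam(transformRules)
	if err != nil {
		return "", err
	}
	return k.TransformJSONStringToString(conflistJSON)
}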

I dunno what you think, @tgross... if this seems 'acceptable' from HashiCorp's point of view I could try to hack something together.

Regards
Pablo

@tgross
Member

tgross commented Jan 3, 2023

@pruiz my major worry with that specific approach is that it introduces a new DSL into the Nomad job spec. Combine that with HCL2 interpolation and Levant/nomad-pack interpolation and that could get really messy. If we were going to allow job operator configuration of the bridge at all, I'm pretty sure we'd want it to be HCL that generates the resulting JSON CNI config (which isn't all that complex an object, in any CNI config I've seen at least).

That also introduces a separation of duties concern. Right now the cluster administrator owns the bridge configuration to the limited degree we allow that; expanding that configuration is what's been proposed as the top-level issue here. Extending some of that ownership to the job operator blurs that line.

Can you describe in a bit more detail (ideally with examples) what kind of configurations you couldn't do with the original proposal here (along with the cni mode for the network block)? That might help us get to a workable solution here.

@pruiz
Contributor

pruiz commented Jan 3, 2023

Hi @tgross,

I probably explained myself poorly. I was not proposing to add the new 'bridge_transform_rules' parameter to Nomad's job spec, just adding it to the Nomad client/host config.

IMHO, being able to fine-tune the bridge's CNI config from the job spec would be good, but it opens up a lot more issues that are hard to solve, as the bridge instance (and the veths attached to it) should be consistent among jobs for things like Consul Connect to work.

However, being able to customize the bridge's CNI settings at the host level (i.e. from /etc/nomad.d/nomad.hcl) opens up (I think) a lot of potential. And keeping it (right now) restricted to cluster admins makes sense (at least to me), as the cluster admin is the one with actual knowledge of the networking & environment the node lives in.

As for the new-DSL issue, I understand your point about adding another sub-DSL to the config, but I just don't see how we can apply 'unlimited' modifications to a JSON document using HCL.

Adding some 'variables' to interpolate into the JSON emitted by networking_bridge_linux.go, and replacing them with new values from /etc/nomad.d/nomad.hcl, seems workable; but as happens with other similar approaches, user N+1 is going to find they need a new interpolable variable somewhere within the JSON which is not yet provided. That's why I was looking into something more unrestricted.

@pruiz
Contributor

pruiz commented Jan 3, 2023

In my use case, for example, my idea would be to mix Consul Connect & Cilium on top of nomad's bridge.

In order to do so, my nomad's host config (/etc/nomad.d/nomad.hcl) would include something like:

  • bridge_network_subnet => $Per-Node-Network-Prefix
  • bridge_transform_rules => Transform the emitted JSON with the intended goals:
      ◦ Add Cilium's CNI driver to the plugins chain, so we can use Cilium on Nomad's native bridge (i.e. using network=bridge instead of network=cni/* in the job [1]; see the sketch at the end of this comment)
      ◦ Maybe update 'ipam' on the bridge plugin to use Cilium as IPAM (not sure, but it may open up additional integration)
      ◦ Maybe disable 'ipMasq' on the bridge plugin, so we control outgoing masquerade with Cilium? (not sure)
      ◦ Maybe tune the 'firewall' plugin currently present in the JSON to skip outgoing masquerade for Cilium prefixes (so jobs can reach other nodes using Cilium when allowed to)

With this configuration applied on cluster nodes, I would be able to launch jobs using the native bridge (instead of cni/*) which would be able to make mixed use of Consul Connect and Cilium, enabling:

  • Connecting from a task in a job to another service exposed using Consul Connect (w/ mTLS, etc.), as we do right now.
  • Connecting from a task in a job to another endpoint (external to Nomad, or even the internet) using plain networking, but allowed/disallowed based on Cilium's BPF-installed policy.
  • Connecting from a task in a job to another task using direct IP networking, by means of Cilium's tunneling between Nomad cluster nodes.
  • etc.

All at the same time and from within the same Task Group.

Regards
Pablo

[1] Currently jobs using Cilium (by means of network=cni/*) cannot use Consul Connect (and vice versa).
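
As a concrete illustration of the plugin chaining described above (illustrative only, not how Nomad currently builds its bridge config), appending a cilium-cni entry to an existing bridge conflist could look roughly like this:

package main

import (
	"encoding/json"
	"fmt"
)

// chainCiliumPlugin appends a cilium-cni entry to the "plugins" array of a
// CNI conflist so it runs after the bridge/firewall/portmap chain.
func chainCiliumPlugin(conflistJSON []byte) ([]byte, error) {
	var conflist map[string]interface{}
	if err := json.Unmarshal(conflistJSON, &conflist); err != nil {
		return nil, err
	}
	plugins, ok := conflist["plugins"].([]interface{})
	if !ok {
		return nil, fmt.Errorf("conflist has no plugins array")
	}
	conflist["plugins"] = append(plugins, map[string]interface{}{"type": "cilium-cni"})
	return json.MarshalIndent(conflist, "", "  ")
}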

@the-maldridge
Author

That's a really complete and much better phrased explanation and feature matrix than I was typing up, @pruiz; it sounds like we have almost identical use cases here. I also think this is something that realistically only a cluster root operator should change, since it will potentially involve installing additional packages at the host level to make it work.

As to the HCL/JSON issue, what about writing the transforms in HCL and then converting that to the relevant JSON as is already done for jobspecs? It adds implementation complexity for sure, but it also keeps the operator experience uniform, which it sounds like is a primary goal here.

@tgross
Member

tgross commented Jan 3, 2023

Ok, I'm glad we're all on the same page then that this belongs to the cluster administrator.

So if I try to boil down the "transformations" proposal a bit, the primary advantage here over simply pointing to a CNI config file is avoiding unique-per-host CNI configuration files, so that you can do things like per-host IP prefixes (as opposed to having host configuration management do it). That seems reasonable given we already have Nomad creating the bridge. You'd still need a source for the per-host configuration, though. Suppose we had a 90/10 solution here by supporting a cni_bridge_config_template (happy to workshop that name) that also supports interpolation: where would we put the values we're interpolating without having per-host configuration anyway? Take it from the environment somehow?

@pruiz
Contributor

pruiz commented Jan 4, 2023

Hi @tgross, I think cni_bridge_config_template seems like a good middle ground, yes, because:

  • PROs: Simplicity (obvious ;))
  • CONs: When a new version of Nomad has a new 'default CNI template' in the code, the cluster operator needs to be aware and update their local template to match the changes.

And I think this is something everybody can cope with.

As for the actual template file to pass to cni_bridge_config_template, I think it could be a plain text file on which Nomad performs the variable interpolations, or a consul-template file which Nomad can render (passing the variables to consul-template's engine), as Nomad already uses consul-template for other similar stuff. Dunno, what do you guys think about this?

Last, with regard to interpolation variables, I think Nomad could pass at a minimum the same values it is already using when generating the bridge's JSON (a rough rendering sketch follows these lists):

  • bridgeName => This is being passed nowadays already.
  • subnet => This is also being passed from 'bridge_network_subnet' config setting.
  • iptablesAdminChainName

And we could also consider exposing for interpolation (though I'm not sure):

  • env.* => Environment variables (automatic if using consul-template)
  • node.* => Nomad's node current variables.
  • meta.* => Nomad's client meta values from config.
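
A minimal sketch of how such a cni_bridge_config_template file could be rendered with those variables using Go's text/template (the struct fields and placeholder names here are illustrative assumptions, not a settled design):

package bridge

import (
	"bytes"
	"os"
	"text/template"
)

// bridgeTemplateVars holds the values interpolated into the operator-supplied
// template, mirroring what Nomad already uses for the built-in conflist.
type bridgeTemplateVars struct {
	BridgeName             string
	Subnet                 string
	IptablesAdminChainName string
}

// renderBridgeTemplate reads the operator's template (e.g. containing
// placeholders like {{.BridgeName}}) and renders the final conflist JSON.
func renderBridgeTemplate(path string, vars bridgeTemplateVars) (string, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	tmpl, err := template.New("bridge").Parse(string(raw))
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}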

Regards

@lgfa29
Contributor

lgfa29 commented Feb 3, 2023

Hi everyone 👋

After further discussion we feel that adding more customization to the default bridge may result in unexpected outcomes that are hard for us to debug. The bridge network mode should be predictable and easily reproducible by the team so we can rely on a common standard configuration.

Users that require more advanced customization are able to create their own bridge network using CNI. The main downside of this is that Consul Service Mesh currently requires network_mode = "bridge", but this is a separate problem that is being tracked in #8953.

Feel free to 👍 and add more comments there.

Thank you everyone for the ideas and feedback!

@lgfa29 closed this as not planned Feb 3, 2023
@the-maldridge
Author

Hmm, that's a frustrating resolution, as it means that to use Consul Connect in conjunction with CNI I'd now need to edit every network block in every service template in every cluster, whether or not those tasks used a CNI network previously. At that point it seems like the better option for me is to abandon Consul Connect entirely and use a third-party CNI to achieve a similar result.

I'm following the other ticket, but it really doesn't look like any consideration is given there to the default path that nomad comes with out of the box. Any thoughts on how to continue to have working defaults and still enjoy both CNI and Consul Connect?

@pruiz
Contributor

pruiz commented Feb 6, 2023

@lgfa29 While Consul Connect is a good solution for common use cases, it is clearly lacking when trying to use it to deploy applications requiring more complex network setups (for example, applications requiring direct [non-NAT, non-proxied] connections from clients, clusters requiring flexible connections between nodes on dynamically allocated ports, solutions requiring maxing out the network I/O performance of the host, etc.).

For such situations the only option available is to use CNI, but even this is somewhat limited in Nomad (i.e. CNI has to be set up per host, networking has to be defined on a per-job basis, CNI stuff has to be already present and pre-deployed/running on the nomad-server before deploying the job, one cannot mix Connect with custom CNIs, etc.). And, at the same time, there is no solution for having more than "one networking" (i.e. CNI plus bridge) for a single Task, nor is there a clear solution for mixing jobs using Consul Connect and jobs using CNI.

This is clearly an issue for Nomad users, as it limits Consul Connect to simple use cases, forcing us to deploy anything that is not (let's say) Consul-Connect-compatible outside of Nomad, on top of a different solution (for deployment, traffic policing, etc.), and rely on an outbound gateway to provide access from Nomad's jobs to such 'outside' elements.

I understand HashiCorp needs a product that can be supported with some clear use cases and limits. But at the same time we as a community need some extensibility for use cases that don't need to be covered by commercial HashiCorp support options. That's why the idea of this being a setting for extending the standard Nomad feature made sense to me. HashiCorp could simply label this as 'community supported only' or something like that and focus on enhancing Consul Connect, but at the same time let the community work around it until something better arrives.

As stated, I was willing to provide a PR for this new feature, but right now I feel a bit stranded, as I don't really understand the reason for not supporting a use case which, in the Nomad code base, only requires being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for HashiCorp's business. I just hope you guys can reconsider this issue.

Regards
Pablo

@brotherdust


I, too, support @pruiz's use case. I had to abandon the Hashistack altogether because of Nomad's opinions on CNI. Consul Connect is a good generic solution, but it leaves much to be desired in the flexibility department. I tried to plumb in Cilium using their (deprecated) Consul integration and after a few months I had to bag it. It doesn't seem impossible, but it's beyond my current capabilities. So, yes, what Pablo is proposing doesn't seem unreasonable and I ask HCI to reconsider.

@lgfa29
Contributor

lgfa29 commented Feb 6, 2023

Hi everyone 👋

Thanks for the feedback. I think I either didn't do a good job explaining myself or completely misunderstood the proposal. I will go over the details and check with the rest of the team again to make sure I have things right.

Apologies for the confusion.

@lgfa29
Contributor

lgfa29 commented Feb 8, 2023

Hi everyone 👋

After a more thorough look into this I want to share what I have observed so far and expand on the direction we're planning to take for Nomad's networking story.

The main question I'm trying to answer is:

Does this proposal provide new functionality or is it a way to workaround shortcomings of the Nomad CNI implementation?

From my investigation so far I have not been able to find examples where a custom CNI configuration would not be able to accomplish the same results as the proposed cni_bridge_config_template. That being said, I have probably missed several scenarios, so I am very curious to hear more examples and use cases that I may have missed.

My first test attempted to validate the following:

Can I create a custom bridge network based on Nomad's default bridge?

For this I copied Nomad's bridge configuration from the docs and changed the IP range.

mybridge.conflist
{
  "cniVersion": "0.4.0",
  "name": "mybridge",
  "plugins": [
    {
      "type": "loopback"
    },
    {
      "type": "bridge",
      "bridge": "mybridge",
      "ipMasq": true,
      "isGateway": true,
      "forceAddress": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "192.168.15.0/24"
            }
          ]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "firewall",
      "backend": "iptables",
      "iptablesAdminChainName": "NOMAD-ADMIN"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}

I then used the following job to test each network.

example.nomad
job "example" {
  datacenters = ["dc1"]

  group "cache-cni" {
    network {
      mode = "cni/mybridge"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "local/script.sh"
      }
    }
  }

  group "cache-bridge" {
    network {
      mode = "bridge"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "/local/script.sh"
      }
    }
  }
}

I was able to access the allocations from the host via the port mapping, as expected from the default bridge network.

shell
$ nomad service info redis
Job ID   Address             Tags  Node ID   Alloc ID
example  192.168.15.46:6379  []    7c8fc26d  4068e3b7
example  172.26.64.135:6379  []    7c8fc26d  f94a4782

$ nc -v 192.168.15.46 6379
Connection to 192.168.15.46 6379 port [tcp/redis] succeeded!
ping
+PONG
^C

$ nc -v 172.26.64.135 6379
Connection to 172.26.64.135 6379 port [tcp/redis] succeeded!
ping
+PONG
^C

$ nomad alloc status 40
ID                  = 4068e3b7-b4f9-b935-db17-784a693aa134
Eval ID             = d60c7ff0
Name                = example.cache-cni[0]
Node ID             = 7c8fc26d
Node Name           = lima-default
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 1m28s ago
Modified            = 1m14s ago
Deployment ID       = 09df8981
Deployment Health   = healthy

Allocation Addresses (mode = "cni/mybridge"):
Label  Dynamic  Address
*db    yes      127.0.0.1:20603 -> 6379

Task "ping" (poststart sidecar) is "running"
Task Resources:
CPU         Memory           Disk     Addresses
48/100 MHz  840 KiB/300 MiB  300 MiB

Task Events:
Started At     = 2023-02-07T22:59:47Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-02-07T22:59:47Z  Started     Task started by client
2023-02-07T22:59:46Z  Task Setup  Building Task Directory
2023-02-07T22:59:42Z  Received    Task received by client

Task "redis" is "running"
Task Resources:
CPU         Memory           Disk     Addresses
17/100 MHz  3.0 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2023-02-07T22:59:46Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-02-07T22:59:46Z  Started     Task started by client
2023-02-07T22:59:45Z  Task Setup  Building Task Directory
2023-02-07T22:59:42Z  Received    Task received by client

$ nc -v 127.0.0.1 20603
Connection to 127.0.0.1 20603 port [tcp/*] succeeded!
ping
+PONG
^C

$ nomad alloc status f9
ID                  = f94a4782-d4ad-d0e9-ced7-de90c1cfadf3
Eval ID             = d60c7ff0
Name                = example.cache-bridge[0]
Node ID             = 7c8fc26d
Node Name           = lima-default
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 1m50s ago
Modified            = 1m35s ago
Deployment ID       = 09df8981
Deployment Health   = healthy

Allocation Addresses (mode = "bridge"):
Label  Dynamic  Address
*db    yes      127.0.0.1:20702 -> 6379

Task "ping" (poststart sidecar) is "running"
Task Resources:
CPU         Memory           Disk     Addresses
51/100 MHz  696 KiB/300 MiB  300 MiB

Task Events:
Started At     = 2023-02-07T22:59:47Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-02-07T22:59:47Z  Started     Task started by client
2023-02-07T22:59:47Z  Task Setup  Building Task Directory
2023-02-07T22:59:42Z  Received    Task received by client

Task "redis" is "running"
Task Resources:
CPU         Memory           Disk     Addresses
14/100 MHz  2.5 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2023-02-07T22:59:47Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-02-07T22:59:47Z  Started     Task started by client
2023-02-07T22:59:46Z  Task Setup  Building Task Directory
2023-02-07T22:59:42Z  Received    Task received by client

$ nc -v 127.0.0.1 20702
Connection to 127.0.0.1 20702 port [tcp/*] succeeded!
ping
+PONG
^C

So it seems to be possible to have a custom bridge network based off Nomad's default that behaves the same way, with the exception of some items that I will address below.

Next I wanted to test something different:

Can I create networks with other CNI plugins based on the Nomad bridge?

For the first test I used the macvlan plugin since it's a simple one.

macvlan
{
  "cniVersion": "0.4.0",
  "name": "mymacvlan",
  "plugins": [
    {
      "type": "loopback"
    },
    {
      "name": "mynet",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "192.168.10.0/24"
            }
          ]
        ],
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      },
      "snat": true
    }
  ]
}
job "example" {
  datacenters = ["dc1"]

  group "cache-bridge" {
    network {
      mode = "bridge"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "local/script.sh"
      }
    }
  }

  group "cache-cni" {
    network {
      mode = "cni/mymacvlan"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "/local/script.sh"
      }
    }
  }
}
$ nomad service info redis
Job ID   Address             Tags  Node ID   Alloc ID
example  192.168.10.2:6379   []    62e0ad12  af93e40f
example  172.26.64.137:6379  []    62e0ad12  c1483937

$ nc -v 192.168.10.2 6379
^C

$ nomad alloc logs -task ping af
Pinging 192.168.10.2:6379
PONG
Pinging 172.26.64.137:6379
Pinging 192.168.10.2:6379
PONG

$ nomad alloc logs -task ping c1
Pinging 192.168.10.2:6379
Pinging 172.26.64.137:6379
PONG
Pinging 192.168.10.2:6379
Pinging 172.26.64.137:6379
PONG

I wasn't able to get cross-network and host port mapping communication working, but allocations in the same network were able to communicate. I think this is where my lack of more advanced networking experience is a problem, and I wonder if I'm just missing a route configuration somewhere.

macvlan - same network
job "example" {
  datacenters = ["dc1"]

  group "cache-cni-1" {
    network {
      mode = "cni/mymacvlan"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "local/script.sh"
      }
    }
  }

  group "cache-cni-2" {
    network {
      mode = "cni/mymacvlan"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "/local/script.sh"
      }
    }
  }
}
$ nomad service info redis
Job ID   Address            Tags  Node ID   Alloc ID
example  192.168.10.3:6379  []    62e0ad12  a6b08599
example  192.168.10.4:6379  []    62e0ad12  abd6f643

$ nomad alloc logs -task ping ab
Pinging 192.168.10.3:6379
PONG
Pinging 192.168.10.4:6379
PONG
Pinging 192.168.10.3:6379
PONG
Pinging 192.168.10.4:6379
PONG
Pinging 192.168.10.3:6379
PONG

$ nomad alloc logs -task ping a6
Pinging 192.168.10.3:6379
PONG
Pinging 192.168.10.4:6379
PONG
Pinging 192.168.10.3:6379
PONG
Pinging 192.168.10.4:6379
PONG

Next I tried a Cilium network setup since @pruiz and @brotherdust mentioned it. It is indeed quite challenging to get working, but I think I was able to get enough running for what I needed. First I tried to run it as an external configuration using the generic Veth Chaining approach, because I think this is what is being suggested here: the ability to chain additional plugins to Nomad's bridge.

Cilium - custom CNI

Once again I started from the bridge configuration in our docs and chained "type": "cilium-cni" as mentioned in the Cilium docs.

{
  "cniVersion": "0.4.0",
  "name": "cilium",
  "plugins": [
    {
      "type": "loopback"
    },
    {
      "type": "bridge",
      "bridge": "mybridge",
      "ipMasq": true,
      "isGateway": true,
      "forceAddress": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "192.168.15.0/24"
            }
          ]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "firewall",
      "backend": "iptables",
      "iptablesAdminChainName": "NOMAD-ADMIN"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
      "type": "cilium-cni"
    }
  ]
}

I also used the Consul KV store backend because that's what I'm most familiar with; I don't think this choice influences the test.

$ consul agent -dev

I then copied the Cilium CNI plugin to my host's /opt/cni/bin/. I actually don't know where to download it from, so I just extracted it from the Docker image.

$ docker run --rm -it -v /opt/cni/bin/:/host cilium/cilium:v1.12.6 /bin/bash
root@df6cdba526a8:/home/cilium# cp /opt/cni/bin/cilium-cni /host
root@df6cdba526a8:/home/cilium# exit

Enable some Docker driver configuration to be able to mount host volumes and run the Cilium agent in privileged mode.

client {
  cni_config_dir = "..."
}

plugin "docker" {
  config {
    allow_privileged = true
    volumes {
      enabled = true
    }
  }
}

Start Nomad and run the Cilium agent job.

job "cilium" {
  datacenters = ["dc1"]

  group "agent" {
    task "agent" {
      driver = "docker"

      config {
        image   = "cilium/cilium:v1.12.6"
        command = "cilium-agent"
        args = [
          "--kvstore=consul",
          "--kvstore-opt", "consul.address=127.0.0.1:8500",
          "--enable-ipv6=false",
        ]

        privileged   = true
        network_mode = "host"

        volumes = [
          "/var/run/docker.sock:/var/run/docker.sock",
          "/var/run/cilium:/var/run/cilium",
          "/sys/fs/bpf:/sys/fs/bpf",
          "/var/run/docker/netns:/var/run/docker/netns:rshared",
          "/var/run/netns:/var/run/netns:rshared",
        ]
      }
    }
  }
}

Make sure things are good.

$ sudo cilium status
KVStore:                 Ok         Consul: 127.0.0.1:8300
Kubernetes:              Disabled
Host firewall:           Disabled
CNI Chaining:            none
Cilium:                  Ok   1.12.6 (v1.12.6-9cc8d71)
NodeMonitor:             Disabled
Cilium health daemon:    Ok
IPAM:                    IPv4: 2/65534 allocated from 10.15.0.0/16,
BandwidthManager:        Disabled
Host Routing:            Legacy
Masquerading:            IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:       20/20 healthy
Proxy Status:            OK, ip 10.15.100.217, 0 redirects active on ports 10000-20000
Global Identity Range:   min 256, max 65535
Hubble:                  Disabled
Encryption:              Disabled
Cluster health:          1/1 reachable   (2023-02-07T23:51:41Z)

Run job that uses bridge and cni/cilium.

job "example" {
  datacenters = ["dc1"]

  group "cache-cni" {
    network {
      mode = "cni/cilium"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "local/script.sh"
      }
    }
  }

  group "cache-bridge" {
    network {
      mode = "bridge"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "/local/script.sh"
      }
    }
  }
}

Remove the reserved:init label from the Cilium endpoint as pointed out by @pruiz (great catch!).

$ sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
10         Enabled            Enabled           5          reserved:init                        10.15.115.127   ready
310        Disabled           Disabled          4          reserved:health                      10.15.203.243   ready
4041       Disabled           Disabled          1          reserved:host                                        ready

$ sudo cilium endpoint labels -d reserved:init 10

$ sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
10         Disabled           Disabled          63900      no labels                            10.15.115.127   ready
310        Disabled           Disabled          4          reserved:health                      10.15.203.243   ready
4041       Disabled           Disabled          1          reserved:host                                        ready

Test connection. Results are the same as macvlan. I'm probably missing some routing rules again or maybe native-routing mode should be used.

$ nomad service  info redis
Job ID   Address             Tags  Node ID   Alloc ID
example  10.15.115.127:6379  []    bac2e14a  97b83d17
example  172.26.64.138:6379  []    bac2e14a  cf0cbb83

$ nc -v 10.15.115.127 6379
^C

$ nomad alloc logs -task ping 97
Pinging 10.15.115.127:6379
PONG
Pinging 172.26.64.138:6379
Pinging 10.15.115.127:6379
PONG
Pinging 172.26.64.138:6379

Change job so both groups are in the cni/cilium network.

job "example" {
  datacenters = ["dc1"]

  group "cache-cni-1" {
    network {
      mode = "cni/cilium"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "local/script.sh"
      }
    }
  }

  group "cache-cni-2" {
    network {
      mode = "cni/cilium"

      port "db" {
        to = 6379
      }
    }

    service {
      name         = "redis"
      port         = "db"
      provider     = "nomad"
      address_mode = "alloc"
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"
        ports = ["db"]
      }
    }

    task "ping" {
      driver = "docker"

      lifecycle {
        hook    = "poststart"
        sidecar = true
      }

      config {
        image   = "redis:7"
        command = "/bin/bash"
        args    = ["/local/script.sh"]
      }

      template {
        data        = <<EOF
#!/usr/bin/env bash

while true; do
  {{range nomadService "redis"}}
  echo "Pinging {{.Address}}:{{.Port}}"
  redis-cli -h {{.Address}} -p {{.Port}} PING
  {{end}}
  sleep 3
done
EOF
        destination = "/local/script.sh"
      }
    }
  }
}

Remove labels again.

$ sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
563        Enabled            Enabled           5          reserved:init                        10.15.251.107   ready
680        Enabled            Enabled           5          reserved:init                        10.15.203.84    ready
3561       Disabled           Disabled          4          reserved:health                      10.15.203.243   ready
4041       Disabled           Disabled          1          reserved:host                                        ready

$ sudo cilium endpoint labels -d reserved:init 563

$ sudo cilium endpoint labels -d reserved:init 680

$ sudo cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
563        Disabled           Disabled          63900      no labels                            10.15.251.107   ready
680        Disabled           Disabled          63900      no labels                            10.15.203.84    ready
3561       Disabled           Disabled          4          reserved:health                      10.15.203.243   ready
4041       Disabled           Disabled          1          reserved:host                                        ready

Check connection.

$ nomad service info redis
Job ID   Address             Tags  Node ID   Alloc ID
example  10.15.203.84:6379   []    bac2e14a  3414181a
example  10.15.251.107:6379  []    bac2e14a  e4fc7bf2

$ nomad alloc logs -task ping 34
Pinging 10.15.203.84:6379
PONG
Pinging 10.15.251.107:6379
Pinging 10.15.203.84:6379
PONG
Pinging 10.15.251.107:6379
PONG

$ nomad alloc logs -task ping e4
Pinging 10.15.203.84:6379
PONG
Pinging 10.15.251.107:6379
PONG
Pinging 10.15.203.84:6379
PONG
Pinging 10.15.251.107:6379
PONG

Although far from a production deployment, I think this does show that it's possible to set up custom CNI networks without modifying Nomad's default bridge.

The exceptions are the points I mentioned earlier, so I will try to list them all here and open follow-up issues for us to address them.

  • The logic to clean up iptables rules looks for rules with nomad as the comment. This is not true for custom CNI networks, so they may leak.
  • Nomad automatically creates an iptables rule to forward traffic to its bridge. Currently this may need to be done manually if the rule is not present or if the custom CNI network uses a different firewall chain.
  • CNI networks are not reloaded on SIGHUP, so they require the agent to restart. CNI plugins are sometimes deployed as fully bundled artifacts, like Helm charts, that are able to apply CNI configs to a live cluster.
  • CNI configuration and plugins must be placed on Nomad clients, usually requiring an additional configuration management layer.
  • Connect integration assumes bridge network mode at job validation.
  • Communication across networks is not guaranteed and may require additional user configuration.
  • Allocations are limited to a single network preventing them from accessing multiple networks at the same time (like a CNI network and bridge).
  • Lack of documentation about CNI networking and how to deploy popular solutions from vendors.

These are all limitations of our current CNI implementation that we need to address, and are planning to do so. The last item is more complicated since it requires more partnership and engagement with third-party providers, but we will also be looking into how to improve that.

What's left to analyze is the main question:

Does this proposal provide new functionality or is it a way to workaround shortcomings of the Nomad CNI implementation?

For this I applied the same Cilium configuration directly to the code that generates the Nomad bridge. If I understood the proposal correctly, chaining CNI plugins to the Nomad bridge would be the main use case for this feature, but please correct me if I'm wrong.

But things were not much better, and most of the items above were still an issue.

Cilium - embedded in Nomad

The first thing you notice is what I mentioned in my previous comment: network_mode = "bridge" now behaves completely differently from usual. Trying to run the default example.nomad job in bridge mode results in failures because Nomad's bridge is now actually Cilium.

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2023-02-08T00:41:51Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
cache       0       1         0        1       0         0     0

Latest Deployment
ID          = e0e9f013
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        2       0        1          2023-02-08T00:51:51Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
30ffb0ce  643dc5fa  cache       0        run      pending  23s ago    23s ago
d7c34e23  643dc5fa  cache       0        stop     failed   1m30s ago  22s ago

$ nomad alloc status d7
ID                   = d7c34e23-0c11-e57f-1b28-ff2274264854
Eval ID              = eccbefd9
Name                 = example.cache[0]
Node ID              = 643dc5fa
Node Name            = lima-default
Job ID               = example
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 1m45s ago
Modified             = 37s ago
Deployment ID        = e0e9f013
Deployment Health    = unhealthy
Replacement Alloc ID = 30ffb0ce

Allocation Addresses (mode = "bridge"):
Label  Dynamic  Address
*db    yes      127.0.0.1:30418 -> 6379

Task "redis" is "dead"
Task Resources:
CPU      Memory   Disk     Addresses
500 MHz  256 MiB  300 MiB

Task Events:
Started At     = N/A
Finished At    = 2023-02-08T00:42:29Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type             Description
2023-02-08T00:42:30Z  Killing          Sent interrupt. Waiting 5s before force killing
2023-02-08T00:42:29Z  Alloc Unhealthy  Unhealthy because of failed task
2023-02-08T00:42:29Z  Setup Failure    failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="cilium-cni" failed (add): unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Is the agent running?
2023-02-08T00:41:51Z  Received         Task received by client

Running the Cilium agent and deleting the endpoint labels as before fixes the problem and the allocation is now at least healthy. But, also like before, we can't access the task from the host or outside the Cilium network. Again, this is probably my fault and could be fixed with proper network configuration.

Since we're in bridge mode we are able to run Connect jobs, so I tested the countdash example generated from nomad job init -short -connect, but it did not work. As I described in #8953, I think there's more work missing than just removing the validations.

And so, looking at the list of issues above, the proposal here would only incidentally fix the first two items because of the way things are named and currently implemented, and both items are things we need to fix for CNI anyway.

Now, to address some of the comments since the issue was closed.

From @the-maldridge.

it means that to use consul connect in conjunction with CNI I'd now need to edit every network block in every service template in every cluster, whether or not those tasks used a CNI network previously.

Having to update jobspecs is indeed an unfortunate consequence, but this is often true for new features in general and, hopefully, it's a one-time process. Modifying Nomad's bridge would also likely require all allocations to be recreated, so a migration of workloads is expected in both scenarios. The upgrade path also seems risky? How would you go from the default bridge to a customized bridge?

At that point it seems like the better option to me is to abandon consul connect entirely and use a 3rd party CNI to achieve a similar result.

Nomad networking features and improvements have been lagging and we're planning to address them. CNI, Consul Connect, IPv6 (which was the original use case you mentioned) are all things we are looking into improving, but unfortunately I don't have any dates to provide at this point to help you make a decision on which tool to use.

I'm following the other ticket, but it really doesn't look like any consideration is given there to the default path that nomad comes with out of the box.

You are right, the issue I linked was about enabling Consul Connect on CNI networks. #14101 and #7905 are about IPv6 support in Consul Connect and Nomad's bridge.

Any thoughts on how to continue to have working defaults and still enjoy both CNI and Consul Connect?

Right now the only way I can think of to solve your issue is to run a patched version of Nomad to customize the hardcoded bridge config. But even that I'm not sure if it will be enough to fully enable Connect with IPv6.

From @pruiz.

While consul connect is a good solution for common use cases, it is clearly lacking when trying to use it to deploy applications requiring more complex network setups

Agreed. We (the Nomad team) need to find a way to address this and better integrate with other networking solutions. We don't have any specifics at this point, but community support is always a good start and much appreciated!

For such situations the only option available is to use CNI, but even this is somewhat limited on nomad

💯 we need to improve our CNI integration.

CNI has to be setup per host, networking has to be defined on a job-basis and CNI stuff has to be already present and pre-deployed/running on nomad-server before deploying the job

That's correct, but wouldn't the same apply to the proposal here, if I understood it correctly?

one can not mix connect with custom-CNIs

Right, and the plan is to address this in #8953. It may be that removing the validation is enough. Having more people test the custom binary I provided there would be very helpful.

And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task

That's also true, but also not covered by this proposal? As far as I know, Kubernetes also suffers from the same issue and there are meta-plugins to multiplex different networks, like Multus. I have this in my list above to be created as a follow-up issue.

nor there is a clear solution for mixing jobs using Consul Connect and jobs using CNI.

Yup, that's covered in #8953. One thing to clarify is what you mean by "mixing jobs". Do you envision an alloc that uses Consul Connect being able to reach an alloc on Cilium, for example? If that's the case I'm not sure it would work without a gateway 🤔

This is clearly an issue for Nomad users, as it limits Consul Connect to simple use cases, forcing us to deploy anything that is not (let's say) Consul-Connect-compatible outside of Nomad, on top of a different solution (for deployment, traffic policing, etc.), and rely on an outbound gateway to provide access from Nomad's jobs to such 'outside' elements.

I'm sorry, I didn't quite follow this part. Are you talking about, for example, having to deploy the Cilium infrastructure to use something beyond Connect?

I understand HashiCorp needs a product that can be supported with some clear use cases and limits. But at the same time we as a community need some extensibility for use cases that don't need to be covered by commercial HashiCorp support options. That's why the idea of this being a setting for extending the standard Nomad feature made sense to me. HashiCorp could simply label this as 'community supported only' or something like that and focus on enhancing Consul Connect, but at the same time let the community work around it until something better arrives.

This is the void we expect CNI to fill by allowing users to create their own custom networks that fit their specific needs. This specific item is not about commercial support but about feature support in general. We try to be careful about backwards compatibility, and this would introduce a feature we expect to deprecate. I understand the frustration but, historically, we treat code shipped as code being used. For experimentation, a temporary fork may be the best approach.

As stated, I was willing to provide a PR for this new feature, but right now I feel a bit stranded, as I don't really understand the reason for not supporting a use case which, in the Nomad code base, only requires being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for HashiCorp's business.

This is not a business decision, and I apologize if I made it sound like one. It was a technical decision: we found that arbitrary modifications to the default bridge network could be dangerous, as they can break things in very subtle ways, and the Nomad bridge has predictable behaviour that we often rely on to debug issues.

We are always happy to receive contributions, and I hope this doesn't discourage you from future contributions (we have lots to do!). But sometimes we need to close feature requests to make sure we are moving in a direction we feel confident in maintaining.

I just hope you guys can reconsider this issue.

Always! As I mentioned, the main point that I may be missing is understanding what you would be able to do with this feature that would not be possible with a well-functioning CNI integration. Could you provide an example of what you would like to add to Nomad's bridge config? That would help us understand the use case better, and yes, we are always willing to reconsider.

From @brotherdust.

I had to abandon Hashistack altogether because of Nomad's opinions on CNI. Consul Connect is a good generic solution, but it leaves much to be desired in the flexibility department. I tried to plumb in Cilium using their (deprecated) Consul integration and after a few months I had to bag it.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?


To finish this (already very) long comment I want to make sure it is clear that closing this issue is just an indication that we find a stronger and better CNI integration to be a better approach for customized networking. What "stronger and better" means depends a lot on your input, so I appreciate all the discussion and feedback so far; please keep it coming 🙂

@brotherdust

@lgfa29 , thank you for your thoughtful and detailed response. I'm sure it took some time out of your regular activities and I can appreciate it!

I agree with you 100% that Nomad needs better CNI integration and much better IPv6 support.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?

I need some time to gather my thoughts into something more cogent. I'll get back to you soon.

@the-maldridge
Author

Wow, kudos for such an in-depth survey of the available options. I'm truly impressed that you got Cilium working and were able to use it even in a demo environment.

I think the deeper issue I keep running into here is that operating an effective cluster puts you on a constant upgrade treadmill, one that often involves tracking down users in remote teams who do not have dedicated operations resources but still expect the things they want to do in the hosted cluster environment to work. The kubernetes world solved this long ago with mutating admission controllers that can monkey-patch jobspecs on the way in, and while I recognize the good arguments the Nomad team has made in the past against user-hosted admission controllers, I can't deny that this turns operations teams into the very same mutating-controller resource.

As for having to update jobspecs to make use of new features, I remember the 0.12 upgrade cycle far too well: I spent about a week trying to figure out why none of my network config worked the way I understood it at the time. I'm really starting to wonder if the answer here is to just not use any of the built-in networking at all, to always stand up a CNI network that I own, and then put everything there. That seems to be the supported mechanism for managing a stable experience for downstream Nomad consumers. Would you agree?

@brotherdust

brotherdust commented Feb 9, 2023

Edit: added mention of Fermyon-authored Cilium integration with Nomad.

@lgfa29 , thank you for your thoughtful and detailed response. I'm sure it took some time out of your regular activities and I can appreciate it!

I agree with you 100% that Nomad needs better CNI integration and much better IPv6 support.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?

I need some time to gather my thoughts into something more cogent. I'll get back to you soon.

Ok. Thoughts gathered! First, I want to qualify what I'm describing with the fact that I am, first and foremost, a network engineer. This isn't to say that I have expert opinions in this context, but to indicate that I might have a different set of tools in my bag than a software engineer or developer; therefore, there's a danger that I'm approaching this problem from the wrong perspective and I'm more than willing to hear advice on how to think about this differently.

The goals below are numbered for a reason: we'll be using them for reference later on.

1. Hardware Setup

  1. 3-node bare-metal cluster, AMD EPYC 7313P 16C CPU, 128GB RAM, SAS-backed SSD storage
  2. Each node has a bonded 4x25Gbps connection to the ToR switch

2. Design Goals

2.1 General

  1. Zero-trust principles shall be applied wherever possible and feasible. Where they cannot be applied, the justification shall be documented in an ADR.

2.2 Workload Characteristics

2.2.1 Types

  1. Mostly containers (preferably rootless), with a sprinkling of VMs
  2. If it's a VM, use firecracker or something like it

2.2.2 Primary Use-Cases

  1. IPFIX flow data processing, storage, and search
  2. Private PKI for device trust (SCEP) and services
  3. Systems/network monitoring, telemetry collection and analysis

2.3 Security

  1. I have complete control of the network, so inter-node transport encryption is not required. In fact, it may be a detriment to performance and should be avoided if possible. HOWEVER:
  2. Keeping the authentication component of mTLS is desirable to prevent unauthorized or unwanted traffic
  3. Cluster ingress/egress shall be secured with TLS where possible. Where it's not possible, IP whitelisting will be used.
  4. User authentication will be provided by Azure AD, 2FA required
  5. Service authentication may also be provided by Azure AD; tokens shall be issued from Vault
  6. Role-based access controls and principle of least privilege will be strictly enforced
  7. Vault will be automatically unsealed from Azure KMS

2.4 PKI

  1. Offline CA root is HSM backed and physically secured
  2. Online intermediate CAs shall be used for issuing certificates or as a backing for an RA
  3. Intended use of certificates (at present) shall be as follows:
     • SCEP registration authority for network devices (requires a dedicated non-EC intermediate CA!)
     • TLS for cluster ingress
     • mTLS for inter-node communications (encryption not required, just the authentication component if possible)

2.5 Networking

  1. L3-only, IPv6-only using public addressing shall be preferred
  2. Nomad groups shall be allocated a fixed, cluster-wide IPv6 address during their lifecycle, even if it migrates to another node
  3. Nomad group addresses shall be advertised to the network using a BGP session with the ToR switch
  4. Load balancing, if needed, shall be handled primarily by ECMP function on the ToR switch. If more control is required, a software LB shall be spun up as nomad.job.type = service

2.6 Storage

  1. Hyper-converged storage with options to control how data is replicated, for performance use-cases where the data need not be replicated and can live only on the node where the Nomad task runs

3. How It Went Down

I set off finding the pieces that would fit. It eventually came down to k8s and Hashistack. I selected Hashistack because it's basically the opposite of k8s. I'll skip my usual extended diatribe about k8s and just say that k8s is very... opinionated... and is the ideal solution for boiling the ocean, should one so desire.

Pain Points

In a general sense, the most difficult parts of the evaluation come down to one thing: where Hashistack doesn't cover the use-case, a third-party component must be integrated. Or, if it does cover the use-case, the docs are confusing or incomplete.

CNI

To the detriment of all, all the cool kids build service-mesh CNIs for k8s. They use k8s APIs, CRDs and such; things that Nomad (and Consul, indirectly) do not understand; and, frankly, shouldn't. Nomad has CNI support, but it's very basic in the sense that it cannot be programmatically or natively configured via Nomad jobspec. It seems there is some template functionality I wasn't aware of, as indicated by some of the content of this thread, so I'll have to revisit that.

I very much agree with @lgfa29 that probably the best outcome is just to integrate Cilium as part of Nomad. That creates its own burden on HashiCorp, so I'm not sure if they're going to be willing to do that. In this instance, I am happy to volunteer some time to maintain the integration once it is completed.

Which brings me to a related note: I saw a HashiConf talk by Taylor Thomas from Fermyon. In it he describes a full-featured Cilium integration with Nomad they are planning on open sourcing. It hasn't happened yet due to time constraints, so I reached out to them to see what the timeline is and if they would like some help. Hopefully I or someone more qualified (which is pretty much anyone) can get the ball rolling on that. If anyone wants me to keep them up to date on this item, let me know.

PKI

I realize this seems somewhat off-subject, but it is somewhat related.

This article covers some of the issues I experienced, which I'll quote from here:

What does it REALLY take to operate a whole hashistack in order to support the tiny strawberry atop the cake, namely nomad?

First of all, vault, which manages the secrets. To run vault in a highly available fashion, you would either need to provide it with a distributed database (which is another layer of complexity), or use the so-called integrated storage, which, needless to say, is based on raft1. Then, you have to prepare a self-signed CA1 in order to establish the root of trust, not to mention the complexity of unsealing the cluster on every restart manually (without the help of a cloud KMS).

Next is consul, which provides service discovery. Consul models the connectivity between nodes into two categories, lan and wan, and each lan is a consul datacenter. Consul datacenters federate over the wan to form a logical cluster. However, data is not replicated across datacenters; it is only stored in the respective datacenters (with raft2), and requests destined for other datacenters are simply forwarded (requiring full connectivity across all consul servers). For the clustering part, a gossip protocol is used, forming a lan gossip ring1 per datacenter and a wan gossip ring2 per cluster. In order to encrypt connections between consul servers, we need a PSK1 for the gossip protocol and another CA2 for the rpc and http api. Although the PSK and the CA can be managed by vault, there is no integration provided; you have to template files out of the secrets and manage all rotations by yourself. And, if you want to use the consul connect feature (a.k.a. service mesh), another CA3 is required.

Finally, we've got to nomad. Luckily, nomad claims to HAVE consul integration, and can automatically bootstrap itself given that a consul cluster is beneath it. You would expect (as I did) that nomad could rely on consul for interconnection and cluster membership, but the reality is a bloody NO. The so-called integration provides nothing more than saving you from typing a seed node for cluster bootstrap, and serves no purpose beyond that. Which means you still have to run a gossip ring3 per nomad region (which is like a consul datacenter) and another gossip ring4 for cross-region federation. And nomad also stores its state in per-region raft3 clusters. To secure nomad clusters, another PSK2 and CA4 are needed.

Let's recap what we have now, given that we run a single vault cluster and 2 nomad regions, each containing 2 consul datacenters: 2 PSKs, 4 CAs, 7 raft clusters, 8 gossip rings. And all the cluster states are scattered across dozens of services, making the backup and recovery process a pain in the ass.

So, besides experiencing exactly what the author mentioned, I can add: if you want to integrate any of these components with an existing enterprise CA, beware that, for example:

  • The bare minimum set of SANs required for TLS and mTLS to function, what is recommended, and what should be avoided are poorly documented or not documented at all.
  • The specific algorithms supported by a given Hashistack component are not documented. I had to learn the hard way that ed25519 isn't supported (or wasn't when I was doing the eval).
  • Debugging TLS issues in Hashistack is nearly impossible

I think what happened is that the developers assumed that we'd want to use the self-signed CA that came with each component and nothing else. So, they weren't expecting a particular kind of error, or didn't see the need to comprehensively document what a certificate should look like. For lab purposes, this is acceptable. When one is trying to set up a production cluster, it's pretty rough.
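To spell out what I mean by "bare minimum", and purely as a sketch of my current understanding rather than verified documentation: with verify_server_hostname enabled, a Nomad server certificate seems to need the server.<region>.nomad DNS SAN (server.global.nomad for the default region), clients need client.<region>.nomad, and the localhost/127.0.0.1 SANs are needed for local CLI access over HTTPS. The agent side is then wired up with a tls block roughly like this (shown in JSON agent-config form; paths are placeholders):

```json
{
  "tls": {
    "http": true,
    "rpc": true,
    "ca_file": "/etc/nomad.d/tls/ca.pem",
    "cert_file": "/etc/nomad.d/tls/server.pem",
    "key_file": "/etc/nomad.d/tls/server-key.pem",
    "verify_server_hostname": true,
    "verify_https_client": true
  }
}
```

Having even that much spelled out in one place, per component, would have saved me a lot of time.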

On a final note, I seriously appreciate that this is open source software and that I am more than welcome to provide a PR. I even thought about justifying an enterprise license. But, in this particular case, a PR wouldn't be enough to address the architectural decisions that led to where we are now; and, based on my experience with enterprise support contracts, those decisions would probably never be revisited unless there were some serious money on the table. I get it, I do. My expectations are low; but I thought it was at least worth the time to write all this out so that you would benefit from my experience.

Thanks again! Seriously great software!

@pruiz
Contributor

pruiz commented Feb 10, 2023

Hi @lgfa29,

First, thanks for the thoughtful response; I'll try to answer the points I think are relevant below. ;)

Hi everyone 👋

After a more thorough look into this I want to share what I have observed so far and expand on the direction we're planning to take for Nomad's networking story.

The main question I'm trying to answer is:

Does this proposal provide new functionality, or is it a way to work around shortcomings of the Nomad CNI implementation?

From my investigation so far I have not been able to find examples where a custom CNI configuration would not be able to accomplish the same results as the proposed cni_bridge_config_template. That being said, I have probably missed several scenarios, so I am very curious to hear more examples and use cases.

I think the main deviation between your tested scenarios and the one I have in mind is that I want a single task (within a given allocation) to be able to use both Consul Connect and Cilium's networking.
So the job would declare a single network stanza inherited by any tasks in it (which can be just one), and that would work like this:

  • When the (dockerized) process tries to connect to localhost (i.e. to a port where Envoy is listening), that would work (Cilium will just allow this traffic), so the process will be able to use Connect.
  • When the process tries to connect anywhere else (another host on the local LAN or the internet), the traffic will flow through the default route towards the bridge, and there Cilium takes over. That may mean just filtering/allowing the traffic, tunneling it (towards another cluster node), or applying NAT (or letting nomad's own NAT settings do their thing).
  • In the opposite direction, traffic coming in from within a Cilium tunnel will be filtered/controlled by Cilium. Specifically, traffic coming to Envoy [from another cluster node] will be allowed by default, so Consul Connect works in both directions.
  • Then, when it comes to addressing (so we can actually have tunneling), there are two options:
    • Just let the operator set non-overlapping network prefixes for each host in nomad's configuration file (and let nomad's bridge handle addressing)
    • Use CNI chaining so the nomad bridge uses Cilium's own IPAM (which may be more flexible, or even required, for some deployments); a rough sketch of such a chained config follows below.

This is the kind of integration between Consul Connect & Cilium I want to achieve.
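To make that concrete, here is roughly the kind of conflist I have in mind, whether it is produced by something like the proposed cni_bridge_config_template or simply dropped in as a standalone custom CNI network: nomad's usual bridge/firewall/portmap chain with cilium-cni appended at the end. This is only a sketch of the idea, not a tested recipe; the exact plugin ordering and the Cilium-side settings (e.g. running the agent in its generic-veth chaining mode) would have to be checked against Cilium's chaining docs, and the subnet below is just nomad's default bridge range used as an example:

```json
{
  "cniVersion": "0.4.0",
  "name": "nomad-cilium",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nomad",
      "ipMasq": true,
      "isGateway": true,
      "hairpinMode": true,
      "ipam": {
        "type": "host-local",
        "ranges": [[{ "subnet": "172.26.64.0/20" }]],
        "routes": [{ "dst": "0.0.0.0/0" }]
      }
    },
    {
      "type": "firewall",
      "backend": "iptables"
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true },
      "snat": true
    },
    {
      "type": "cilium-cni"
    }
  ]
}
```

For option 1 above, if I read the docs correctly, the per-host prefix of the built-in bridge is already tunable via bridge_network_subnet in the client block, so the chaining variant is mainly interesting when Cilium has to own IPAM.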

[...]

From @pruiz.

[...]

one can not mix connect with custom-CNIs

Right, and the plan is to address this in #8953. It may be that removing the validation is enough. Having more people test the custom binary I provided there would be very helpful.

That would be an option for me, provided we can use Connect on a custom CNI network, ideally still delegating the deployment/management of the Envoy proxy to nomad.

And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task

That's also true, but also not covered by this proposal? As far as I know, Kubernetes also suffers from the same issue and there are meta-plugins to multiplex different networks, like Multus. I have this in my list above to be created as a follow-up issue.

Yeah, I know, kubernetes is similar here, but my point was that support for more than one network could be another way around this: just provide my tasks with one network connecting to nomad's bridge, and another one connecting to cilium. :)

nor is there a clear solution for mixing jobs using Consul Connect and jobs using CNI.

Yup, that's covered in #8953. One thing to clarify is what you mean by "mixing jobs". Do you envision an alloc that uses Consul Connect being able to reach an alloc on Cilium, for example? If that's the case I'm not sure it would work without a gateway 🤔

This is what I explained at the top: I think we could make Connect and Cilium work on top of the same bridge, and have both working together side by side.

I understand hashicorp needs a product that can be supported with some clear use cases and limits. But at the same time, we as a community need some extensibility for use cases that don't need to be covered by commercial hashicorp support options. That's why the idea of this being a setting for extending the standard nomad feature made sense to me. HashiCorp could simply label it as 'community supported only' or something like that and focus on enhancing consul connect, but at the same time let the community work around the limitation until something better arrives.

This is the void we expect CNI to fill by allowing users to create their own custom networks that fit their specific needs. This specific item is not about commercial support but feature support in general. We try to be careful about backwards compatibility, and this would introduce a feature we expect to deprecate. I understand the frustration but, historically, we treat code shipped as code being used. For experimentation, a temporary fork may be the best approach.

As stated, I was willing to provide a PR for this new feature, but right now I feel a bit stranded, as I don't really understand why we can't support a use case which, in the nomad code base, only requires being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for hashicorp's business.

This is not a business decision, and I apologize if I made it sound like one. This was a technical decision as we found that arbitrary modifications to the default bridge network could be dangerous as it can break things in very subtle ways and the Nomad bridge has a predictable behaviour that we often rely on to debug issues.

No bad feelings ;) I understood your point. I just wish we could find an interim solution for the current limitations of Connect.

Regards
Pablo

@lgfa29
Contributor

lgfa29 commented Feb 15, 2023

@the-maldridge

The kubernetes world solved this long ago with mutating admission controllers that can monkey-patch jobspecs on the way in, and while I recognize the good arguments the Nomad team has made in the past against user-hosted admission controllers, I can't deny that this turns operations teams into the very same mutating-controller resource.

I've heard some people mentioning an approach like this before (for example, here is Seatgeek speaking at HashiConf 2022), but I'm not sure if there's been any final decision on this by the team.

I'm really starting to wonder if the answer here is to just not use any of the built-in networking at all, to always stand up a CNI network that I own, and then put everything there. That seems to be the supported mechanism for managing a stable experience for downstream Nomad consumers. Would you agree?

That's the direction we're going. The built-in networks should be enough for most users, and a custom CNI should be used by those who need more customization. The problem right now (in addition to the CNI issues mentioned previously) is that there's a big gap between the two. We need to figure out a way to make CNI adoption more seamless.
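To make that path a little more concrete, the client already has the knobs for picking up your own network definitions; a minimal sketch, shown in JSON agent-config form with the default paths:

```json
{
  "client": {
    "enabled": true,
    "cni_path": "/opt/cni/bin",
    "cni_config_dir": "/opt/cni/config"
  }
}
```

A conflist whose "name" is, say, mynet (just a placeholder here), placed in that directory, can then be referenced from a jobspec with network mode = "cni/mynet". The gap I mentioned is everything around that (Connect support on custom CNI networks, multiple interfaces per alloc, and so on), which is what the follow-up issues I mentioned are meant to cover.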

@brotherdust thanks for the detailed report of your experience!

To the detriment of all, all the cool kids build service-mesh CNIs for k8s. They use k8s APIs, CRDs and such; things that Nomad (and Consul, indirectly) do not understand; and, frankly, shouldn't.

Yup, that's the part about partnerships I mentioned in my previous comment. But those can take some time to be established. The work that @pruiz has done in Cilium is huge for this!

Nomad has CNI support, but it's very basic in the sense that it cannot be programmatically or natively configured via Nomad jobspec. It seems there is some template functionality I wasn't aware of, as indicated by some of the content of this thread, so I'll have to revisit that.

Could you expand a little on this? What kind of dynamic values would you like to set, and where?

I very much agree with @lgfa29 that probably the best outcome is just to integrate Cilium as part of Nomad.

Maybe I misspoke, but I don't expect any vendor-specific code in Nomad at this point. The problem I mentioned is that, in theory, the CNI spec is orchestrator-agnostic, but in practice a lot of plugins have components that rely on Kubernetes APIs and, unfortunately, there is not much we can do about it.

I am happy to volunteer some time to maintain the integration once it is completed.

And that's another important avenue as well. These types of integration are usually better maintained by people who actually use them, which is not our case. Everything I know about Cilium at this point is what I learned from the community in #12120 🙂

I think what happened is that the developers assumed that we'd want to use the self-signed CA that came with each component and nothing else. So, they weren't expecting a particular kind of error, or didn't see the need to comprehensively document what a certificate should look like. For lab purposes, this is acceptable. When one is trying to set up a production cluster, it's pretty rough.

I would suggest opening a separate issue for this (if one doesn't exist yet).

But, in this particular case, a PR wouldn't be enough to address the architectural decisions that lead to where we are now

You're right, this will be a big effort that will require multiple PRs, but my plan is to break it down into smaller issues (some of them already listed in my previous comment), so maybe there will be something smaller that you can contribute 🙂

Things like documentation, blog posts, demos etc. are also extremely valuable to contribute.

Thanks again! Seriously great software!
❤️

@pruiz

I think the main deviation between your tested scenarios and the one I have in mind is that I want a single task (within a given allocation) to be able to use both Consul Connect and Cilium's networking.

Yup, I got that. But I want to make sure we're on the same page as to why I closed this issue. So, imagining the feature requested here were implemented, which cni_bridge_config_template would you write to accomplish what you're looking for? And what is preventing you from using a separate CNI network for this?

From what I gathered so far the only things preventing you from doing what you want are shortcomings in our CNI implementation. If that's not the case I would like to hear what cni_bridge_config_template can do that a custom CNI would not be able to.

That would be an option for me, provided we can use Connect on a custom CNI network, ideally still delegating the deployment/management of the Envoy proxy to nomad.

Yes, the sidecar deployment is conditional on service.connect, not on the network type.

I would appreciate if you could test the binary I have linked in #8953 (comment) to see if it works for you.

Yeah, I know, kubernetes is similar here, but my point was that support for more than one network could be another way around this: just provide my tasks with one network connecting to nomad's bridge, and another one connecting to cilium. :)

Yup, I have this on my list and I will open a new issue about multiple network interfaces per alloc 👍

@lgfa29
Contributor

lgfa29 commented Mar 13, 2023

Hi all 👋

I just wanted to note that, as mentioned previously, I've created follow-up issues on specific areas that must be improved. You can find them linked above. Feel free to 👍, add more comments there, or create new issues if I missed anything.

Thanks!

@brotherdust

@lgfa29 , thanks much!
