Support Consul Service Mesh on CNI networks #8953

Open
timotheencl opened this issue Sep 23, 2020 · 15 comments

@timotheencl

Nomad version

Nomad v0.12.5 (514b0d6)

Operating system and Environment details

Ubuntu Focal amd64 20.04.1

Issue

Hi !

I would like to add a CNI macvlan network to use with Consul Connect to enable an ingress gateway to be part of this publicly available network for clients.

However, after setting up the CNI config file, Nomad says that only "bridge" or "host" is a valid network mode.

Error submitting job: Unexpected response code: 500 (1 error occurred:
        * Consul Connect Gateway service requires Task Group with network mode of type "bridge" or "host"

)

Thanks !

My CNI config:

{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "plugins": [
        {
            "type": "macvlan",
            "master": "enp0s10",
            "ipam": {
                "type": "dhcp"
            }
        },
        {
            "type": "portmap",
            "capabilities": {
                "portMappings": true
            },
            "snat": true
        }
    ]
}

And my job file:

job "http-echo" {
  datacenters = ["dc1"]

  group "ingress" {
    count = "2"

    network {
      mode = "cni/mynet"
      port "inbound" {
        to     = 8080
      }
    }

    service {
      name = "http-echo-ingress"
      port = "inbound"

      connect {
        gateway {
          proxy {
            connect_timeout = "500ms"
          }
          ingress {
            listener {
              port     = 8080
              protocol = "tcp"

              service {
                name = "http-echo"
              }
            }
          }
        }
      }
    }
  }
  
  group "api" {
    count = "2"

    network {
      mode = "bridge"
    }

    service {
      name = "http-echo"
      port = "5678"

      connect {
        sidecar_service {}
      }

      check {
        expose = true
        type = "http"
        path =  "/health"
        interval = "5s"
        timeout = "2s"
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args = [
          "-listen", ":5678",
          "-text", "Hello world !",
        ]
      }
    }
  }
}
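
For comparison, the error message names the only two modes the validator accepts for a gateway service. A minimal sketch of the same ingress group with the network mode switched to "bridge" is shown below. It assumes the built-in Nomad bridge is acceptable, and it gives up the macvlan address, which is the reason for this request. Everything except the mode line is copied from the job file above.

```hcl
group "ingress" {
  count = "2"

  network {
    # "bridge" (or "host") is what the gateway validation accepts today;
    # "cni/mynet" is rejected with the error quoted above.
    mode = "bridge"
    port "inbound" {
      to = 8080
    }
  }

  # The service block with connect.gateway is unchanged from the job above.
}
```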

@nickethier
Member

Hey @timotheenicolas

Unfortunately this is currently not supported with CNI, but I don't know of any technical limitation. Pinging @shoenig to see if this is a simple validation change or a more in depth one.

@timotheencl
Author

Thanks :) I think it would be a cool feature to have the ability to create Ingress GW which can bind on their own IP on a macvlan network

@pratheekrebala

pratheekrebala commented Apr 13, 2022

Just wanted to check if you had a chance to look at this further? Our use-case here is similar to @timotheenicolas's. We'd like to expose a connect sidecar service on a CNI based overlay network. Thank you!

@sinisterstumble

sinisterstumble commented Aug 13, 2022

Would like to see this happen too, similar use case. Thank you!

@shoenig shoenig changed the title Ingress gateway on macvlan CNI network [Connect] Ingress gateway on macvlan CNI network Jan 20, 2023
@lgfa29
Contributor

lgfa29 commented Feb 1, 2023

Hi everyone 👋

I've been looking into this issue, but I can't seem to get Connect working. That may indicate that more work is needed than just removing the validation, or it could be that I'm not configuring my CNI network properly (probably more likely 😅).

I have some custom binaries at the bottom of this page https://github.com/hashicorp/nomad/actions/runs/4059725660 that were built with my changes. The diff d375f60...1207475 shows what is in the binaries.

Would anyone with more experience with CNI be able to test them? One important note: these binaries are for development purposes only and should not be run in production, so make sure you don't accidentally run them with your production data.

I used the sample job file that is generated from nomad job init -short -connect with a few modifications:

  • network.mode set to cni/mycni.
  • service.address_mode = "alloc" to register the IP assigned by the CNI plugin in Consul instead of the host IP.
  • sidecar_service.disable_default_tcp_check = true to get around the fact that Nomad registers the sidecar proxy TCP check using the host IP. This is something else that may need to be fixed.
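
The three modifications listed above could be sketched in a job file roughly as follows. This is only an illustration: the job, group, and service names and the port number are hypothetical placeholders rather than the exact output of nomad job init, the task block is omitted, and the job is only expected to pass validation with the patched binaries, since released versions reject Connect on CNI networks as described in this issue.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "api" {
    network {
      # Use the CNI network named "mycni" instead of the built-in bridge.
      mode = "cni/mycni"
    }

    service {
      name = "example-api" # hypothetical service name
      port = "9001"        # hypothetical port the task listens on

      # Register the IP assigned by the CNI plugin in Consul
      # instead of the host IP.
      address_mode = "alloc"

      connect {
        sidecar_service {
          # Skip the default sidecar TCP check, which Nomad registers
          # using the host IP.
          disable_default_tcp_check = true
        }
      }
    }

    # task block omitted; it stays as generated by nomad job init.
  }
}
```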

Thanks in advance!

@lgfa29 lgfa29 changed the title [Connect] Ingress gateway on macvlan CNI network Support Consul Service Mesh on CNI networks Feb 3, 2023
@lgfa29
Contributor

lgfa29 commented Feb 3, 2023

I have edited the title here to expand the scope to all CNI networks (so not just macvlan) and to Consul Service Mesh in general (not just ingress gateways).

@netdata-be

I'm trying to use Consul Connect on my Nomad clusters.
However, I'm limited by the fact that I have to lower the MTU on the bridge created by Nomad.

Since it is hard-coded I'm not able to do that, so I thought of using a custom configuration and referring to it using mode = "cni/xxx".

But that fails because of this issue.

@lgfa29 Is there something I can do to help advancing this (older) issue?

@lgfa29
Contributor

lgfa29 commented Jan 4, 2024

Hi @netdata-be 👋

We're not currently working on this issue and I didn't receive feedback on the attempted fix mentioned in #8953 (comment) and haven't had the time to validate it further.

If I were to build another set of binaries with those changes, would you be able to help validate whether the changes work?

sundbry added a commit to arctype-co/nomad that referenced this issue Jan 17, 2024
@nakermann1973

@lgfa29 - It looks like this would solve most of my questions at https://discuss.hashicorp.com/t/configure-network-pinning-for-jobs/63434. I would be happy to test a patched version of 1.6.x or 1.7.x to validate the changes.

@nakermann1973

Hi @lgfa29 I have done some tests using your patch (applied to nomad 1.7.7). A job which includes CNI and consul connect starts correctly, but the health check uses the incorrect address.

I am using a macvlan cni config:

{
  "cniVersion": "1.0.0",
  "name": "vlan107_dhcp",
  "plugins": [
    {
      "type": "macvlan",
      "master": "eth0.107",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "portmap",
      "capabilities": {
          "portMappings": true
      },
      "snat": true
    }
  ]
}

Running ip a l in the Envoy sidecar shows that I get an address on vlan107 (assigned via DHCP). This address is also shown correctly in the top-right of the Consul service view.

2: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether fe:a6:92:0f:bc:7f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.107.139/24 brd 172.17.107.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::fca6:92ff:fe0f:bc7f/64 scope link 
       valid_lft forever preferred_lft forever

/secrets/envoy-bootstrap.cmd

connect envoy -grpc-addr unix://alloc/tmp/consul_grpc.sock -http-addr localhost:8500 -admin-bind 127.0.0.2:19001 -address 127.0.0.1:19101 -proxy-id _nomad-task-d04fe8fb-efa7-b2a6-565b-d709d1cf1a2e-group-nodered-nodered-1880-sidecar-proxy -bootstrap

There is a process (envoy?) listening on port 29130 (this is on IP 172.17.107.139):

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp        0      0 127.0.0.2:19001         0.0.0.0:*               LISTEN      101        35993407   -                   
tcp        0      0 0.0.0.0:1880            0.0.0.0:*               LISTEN      1000       35996775   -                   
tcp        0      0 0.0.0.0:29130           0.0.0.0:*               LISTEN      101        35993417   -                   

Consul is trying to health check the nomad client's address though (dial tcp 172.17.17.234:29130: i/o timeout).


From the docs (https://developer.hashicorp.com/nomad/docs/job-specification/service#address_mode), I expected the Consul check of the sidecar to use the IP provided by CNI. My service stanza is:

    service {
      name = "nodered"
      address_mode = "alloc"
      port = 1880
      connect {
        sidecar_service {
          proxy {}
        }
      }
    }

I can do more tests. Please let me know if anything more would help.

@lgfa29
Contributor

lgfa29 commented Jun 3, 2024

Ah nice, thanks for testing it @nakermann1973, I'm glad it kind of works 😅

Health checks are an interesting point. First you need to make sure the Consul agent would be able to reach the service at the IP:port allocated by the CNI plugin. Next we need a way to tell Nomad to use that IP:port as well.

For the first part, I'm not sure there's a single way to fix it. Each environment will need to be configured to fulfill this requirement.

The second part may require some code changes in how Nomad registers the service (and its health check) in Consul. If you run nomad job inspect <job ID> do you see any health checks in the sidecar or your task?

And as a last note, I no longer work for HashiCorp, so I probably won't be able to help much on this issue any more.

@nakermann1973

I rolled back to the prod release, as it seemed like with this patch that health checks were failing across multiple services. I didn't dig into it too much, as my focus was to recover the failing services.

> do you see any health checks in the sidecar or your task

I don't recall seeing any when I inspected the job

@nakermann1973

@tgross - Any chance of getting someone to have a look at this?

@tgross
Member

tgross commented Dec 3, 2024

@nakermann1973 it's not on the roadmap currently. I can't really speak to what would get it on the roadmap, either, but I'll flag it for @arodd and @jrasell to chat about (James is out this week though).

@sundbry
Contributor

sundbry commented Dec 3, 2024

There are patches for CNI + Connect support on my fork of nomad 1.6 here (Apache license) https://github.com/jupitercloud/nomad/commits/main/

I've been using nomad with Consul Connect & Calico CNI for a couple of years - it works great. The one thing I haven't yet resolved is the health check problem - it's on my roadmap as a priority.

Sad to say since the BSL license change I haven't felt motivated to try to upstream them. I was hoping when IBM came in that decision might get reversed.
