Consul Connect enabled jobs fail if using health check #7709
Hey @spuder, thanks for reporting, and sorry you're having trouble with this. Rather than us trying to debug from bits of your configuration, do you mind starting from some examples and working backwards to figure out what's going wrong? This configuration is working with the following `example.nomad`:

```
# example.nomad
job "example" {
datacenters = ["dc1"]
group "api" {
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "count-api"
port = "9001"
connect {
sidecar_service {}
}
check {
name = "api-health"
type = "http"
port = "healthcheck"
path = "/health"
interval = "10s"
timeout = "3s"
expose = true
}
}
task "web" {
driver = "docker"
config {
image = "hashicorpnomad/counter-api:v1"
}
}
}
group "dashboard" {
network {
mode = "bridge"
port "http" {
static = 9002
to = 9002
}
port "healthcheck" {
to = -1
}
}
service {
name = "count-dashboard"
port = "9002"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "count-api"
local_bind_port = 8080
}
}
}
}
check {
name = "dashboard-health"
type = "http"
port = "healthcheck"
path = "/health"
interval = "10s"
timeout = "3s"
expose = true
}
}
task "dashboard" {
driver = "docker"
env {
COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
config {
image = "hashicorpnomad/counter-dashboard:v1"
}
}
}
}
```

Running:

```
$ consul agent -dev
$ sudo nomad agent -dev-connect
$ nomad job run example.nomad
```

Check Nomad:

```
$ nomad job status example
ID = example
Name = example
Submit Date = 2020-04-13T16:41:44-06:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
api 0 0 1 0 0 0
dashboard 0 0 1 0 0 0
Latest Deployment
ID = 9c22115e
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
api 1 1 1 0 2020-04-13T16:51:58-06:00
dashboard 1 1 1 0 2020-04-13T16:52:04-06:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
e0f7ed62 c5839b0b api 0 run running 27s ago 14s ago
e7b3cd01 c5839b0b dashboard 0 run running 27s ago 7s ago
```

Checking Consul:

```
$ curl -s localhost:8500/v1/agent/checks | jq '.[] | select(.Name=="dashboard-health")'
{
"Node": "NUC10",
"CheckID": "_nomad-check-5794a0c4287f9d66c4a5450586f7410b33a6bd3f",
"Name": "dashboard-health",
"Status": "passing",
"Notes": "",
"Output": "HTTP GET http://192.168.1.53:25646/health: 200 OK Output: Hello, you've hit /health\n",
"ServiceID": "_nomad-task-e7b3cd01-3d24-a3f1-7841-ad897586fe0f-group-dashboard-count-dashboard-9002",
"ServiceName": "count-dashboard",
"ServiceTags": [],
"Type": "http",
"Definition": {},
"CreateIndex": 0,
"ModifyIndex": 0
}

$ curl -s localhost:8500/v1/agent/checks | jq '.[] | select(.Name=="api-health")'
{
"Node": "NUC10",
"CheckID": "_nomad-check-aab24708f3160bd44748d8b8f0a85b8c6e5ceb16",
"Name": "api-health",
"Status": "passing",
"Notes": "",
"Output": "HTTP GET http://192.168.1.53:21128/health: 200 OK Output: Hello, you've hit /health\n",
"ServiceID": "_nomad-task-e0f7ed62-a523-0544-75ca-2a41402a2c93-group-api-count-api-9001",
"ServiceName": "count-api",
"ServiceTags": [],
"Type": "http",
"Definition": {},
"CreateIndex": 0,
"ModifyIndex": 0
}
```

Checking Dashboard:

```
$ curl -s -w '%{response_code}\n' localhost:9002 -o /dev/null
200
```
Likewise, I get similar successful results using the underlying expose configuration directly:

```
job "example" {
datacenters = ["dc1"]
group "api" {
network {
mode = "bridge"
port "healthcheck" {
to = -1
}
}
service {
name = "count-api"
port = "9001"
connect {
sidecar_service {
proxy {
expose {
path {
path = "/health"
protocol = "http"
local_path_port = 9001
listener_port = "healthcheck"
}
}
}
}
}
check {
name = "api-health"
type = "http"
port = "healthcheck"
path = "/health"
interval = "10s"
timeout = "3s"
}
}
task "web" {
driver = "docker"
config {
image = "hashicorpnomad/counter-api:v1"
}
}
}
group "dashboard" {
network {
mode = "bridge"
port "http" {
static = 9002
to = 9002
}
port "healthcheck" {
to = -1
}
}
service {
name = "count-dashboard"
port = "9002"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "count-api"
local_bind_port = 8080
}
expose {
path {
path = "/health"
protocol = "http"
local_path_port = 9002
listener_port = "healthcheck"
}
}
}
}
}
check {
name = "dashboard-health"
type = "http"
port = "healthcheck"
path = "/health"
interval = "10s"
timeout = "3s"
}
}
task "dashboard" {
driver = "docker"
env {
COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
config {
image = "hashicorpnomad/counter-dashboard:v1"
}
}
}
}
```
I figured it out. The `name` attribute is required on both the service and the check. If you only put the name on one or the other, Consul Connect will fail without any errors.
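For illustration, a minimal sketch of what that looks like, following the examples above (the service name `count-api` and check name `api-health` are the ones used earlier in this thread; the other values are placeholders):

```
service {
  name = "count-api"          # name set on the service
  port = "9001"

  connect {
    sidecar_service {}
  }

  check {
    name     = "api-health"   # name set on the check as well
    type     = "http"
    path     = "/health"
    expose   = true
    interval = "10s"
    timeout  = "3s"
  }
}
```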
Possible remediations
Both of these sound like good suggestions, @spuder.
Possibly related to #7221: if the […]
Hi, I use this example. Differences: no need to register a dynamic port.

```
job "countdash" {
datacenters = ["dc1"]
group "api" {
network {
mode = "bridge"
}
service {
name = "count-api"
port = "9001"
connect {
sidecar_service {}
}
check {
expose = true
name = "api-alive"
type = "http"
path = "/health"
interval = "10s"
timeout = "2s"
}
}
task "web" {
driver = "docker"
config {
image = "hashicorpnomad/counter-api:v1"
}
}
}
group "dashboard" {
network {
mode ="bridge"
port "http" {
static = 9002
to = 9002
}
}
service {
name = "count-dashboard"
port = "9002"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "count-api"
local_bind_port = 8080
}
}
}
}
check {
expose = true
name = "dashboard-alive"
type = "http"
path = "/health"
interval = "10s"
timeout = "2s"
}
}
task "dashboard" {
driver = "docker"
env {
COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
config {
image = "hashicorpnomad/counter-dashboard:v1"
}
}
}
}
```
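To try it out, the same local dev workflow shown earlier in the thread should apply; the file name `countdash.nomad` is an assumed name for the job above:

```
# start a dev Consul agent and a dev Nomad agent with Connect enabled
$ consul agent -dev
$ sudo nomad agent -dev-connect

# run the job and hit the dashboard on its static port
$ nomad job run countdash.nomad
$ curl -s -w '%{response_code}\n' localhost:9002 -o /dev/null
```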
Another example with minio:

```
job "minio" {
type = "service"
datacenters = ["dc1"]
namespace = "default"
group "s3" {
network {
mode = "bridge"
}
service {
name = "minio"
port = 9000
# https://docs.min.io/docs/minio-monitoring-guide.html
check {
expose = true
name = "minio-live"
type = "http"
path = "/minio/health/live"
interval = "10s"
timeout = "2s"
}
check {
expose = true
name = "minio-ready"
type = "http"
path = "/minio/health/ready"
interval = "15s"
timeout = "4s"
}
connect {
sidecar_service {
}
}
}
task "server" {
driver = "docker"
config {
image = "minio/minio:latest"
memory_hard_limit = 2048
args = [
"server",
"/local/data",
"-address",
"127.0.0.1:9000"
]
}
resources {
cpu = 200
memory = 1024
}
}
}
}
```
These aren't required anymore, I think, in recent versions of Consul.
Using a network port label for a service port that will be fronted by a Connect sidecar is probably not what you intended […]. You could do something like […] and reference the […].
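A hedged sketch of the idea, based on the working example earlier in this thread: the Connect-fronted service uses its numeric in-namespace port, while a labeled dynamic port is used only for the exposed health check. The service name `bar`, port 3000, and check path are taken from the attempts below; the check name `bar-health` is an assumed placeholder:

```
network {
  mode = "bridge"

  # dynamic port used only for the exposed health-check listener
  port "healthcheck" {
    to = -1
  }
}

service {
  name = "bar"
  port = "3000"                # numeric in-namespace port, not a network port label

  connect {
    sidecar_service {}
  }

  check {
    name     = "bar-health"    # assumed placeholder
    type     = "http"
    port     = "healthcheck"   # the check references the labeled port
    path     = "/actuator/health"
    expose   = true
    interval = "5s"
    timeout  = "2s"
  }
}
```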
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad = 0.11.0
Consul = 1.7.2
ACLs = Enabled
Envoy = 1.13
Issue
Consul Connect-enabled jobs fail to connect through Envoy if a health check is defined (even if the health check passes). Connect-enabled jobs work as expected if no health check is defined on the service.
Consul Connect in Nomad is new, and others have had trouble, as reported here:
Setup
I have a Connect-enabled job in Nomad named `bar`. I have a legacy VM called `foo`, running Ubuntu 18.04 with Envoy installed. The VM running `foo` has the following in `/etc/consul/service_foobar.json`. Envoy has been started with the following command.

The following Nomad job works correctly (note that it does not have a health check). The VM `foo` is able to communicate with the Nomad job running on port 3000 through Envoy. I'm not sure if this is a bug or a documentation issue. Here are all the configurations that I have tried:
Attempt 1
This works and is able to communicate over the Envoy proxy service mesh, but there is no health check.

Result: ✅
Attempt 2
Result: ❌
Attempt 3
Result: ❌
Attempt 4
```
network {
  mode = "bridge"
  port "http" {
    to = "3000"
  }
}

service {
  name         = "bar"
  port         = "http"
  address_mode = "driver"

  check {
    port         = "http"
    type         = "http"
    path         = "/"
    interval     = "5s"
    timeout      = "2s"
    address_mode = "driver"
  }

  connect {
    sidecar_service {}
  }
}
```
Result: ❌
Attempt 5
Set port http `to = -1`:

```
group "group" {
  count = 1

  network {
    mode = "bridge"
    port "http" {
      to = -1
    }
  }

  service {
    name = "bar"
    port = "3000"

    check {
      port     = "http"
      type     = "http"
      path     = "/actuator/health"
      interval = "5s"
      timeout  = "2s"
    }

    connect {
      sidecar_service {}
    }
  }
}
```

Result: ❌
Attempt 6
Use `expose = true` as mentioned in issue #7556:

```
network {
  mode = "bridge"
  port "http" {
    to = -1
  }
}

service {
  name = "bar"
  port = "3000"

  check {
    port     = "http"
    type     = "http"
    path     = "/actuator/health"
    expose   = true
    interval = "5s"
    timeout  = "2s"
  }

  connect {
    sidecar_service {}
  }
}
```

Result: ❌
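For comparison, a hedged sketch of this last attempt with the resolution from earlier in the thread applied, namely a `name` on both the service and the check; the check name `bar-health` is an assumed placeholder:

```
network {
  mode = "bridge"
  port "http" {
    to = -1
  }
}

service {
  name = "bar"                 # name on the service
  port = "3000"

  check {
    name     = "bar-health"    # name on the check as well (assumed placeholder)
    port     = "http"
    type     = "http"
    path     = "/actuator/health"
    expose   = true
    interval = "5s"
    timeout  = "2s"
  }

  connect {
    sidecar_service {}
  }
}
```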