
health check fails during grace period #9431

Open
kneufeld opened this issue Nov 23, 2020 · 12 comments

@kneufeld
Contributor

Nomad version

Nomad v1.0.0-beta3 (fcb32ef7ba037b2b83e7a0fd1c53c7410a2990db)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

During startup the task is occasionally killed for being unhealthy despite (see the sketch of the relevant stanzas below):

  1. being within the check_restart { grace = "30s" } window
  2. being within the migrate { healthy_deadline = "10m" } window
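
For reference, a minimal sketch of the two stanzas involved (illustrative only, not the actual jobspec; the names, driver, and the check's type/interval/timeout are placeholders):

group "example" {
  migrate {
    healthy_deadline = "10m"
  }

  task "example" {
    driver = "docker"

    service {
      name = "example"
      port = "http"

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"

        check_restart {
          # failing checks should not trigger a restart during this window
          grace = "30s"
        }
      }
    }
  }
}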

Reproduction steps

Only happens occasionally

Job file (if appropriate)

A complicated setup with 5 tasks, 1 job per task, and Connect proxies between them. I can email it if desired.

Nomad Client logs (if appropriate)

From hashi-ui

WHEN         DURATION     TYPE                   MESSAGE                                    CODE
1 minute     0 seconds    Received               Task received by client                    0
37 seconds   35 seconds   Task Setup             Building Task Directory                    0
37 seconds   0 seconds    Downloading Artifacts  Client is downloading artifacts            0
34 seconds   3 seconds    Driver                 Downloading image                          0
26 seconds   8 seconds    Started                Task started by client                     0
4 seconds    21 seconds   Restart Signaled       healthcheck: check "director" unhealthy    0
0 seconds    6 seconds    Terminated             Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"    137
0 seconds    0 seconds    Restarting             Task restarting in 10.989924836s           0
1 second     13 seconds   Started                Task started by client
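
If I'm reading the hashi-ui relative timestamps correctly, the restart landed inside the grace window:

  Started             ~26 seconds before this snapshot
  Restart Signaled    ~4 seconds before this snapshot
  => the "director" check triggered a restart roughly 21 seconds after the task
     started, which is still inside the 30s check_restart grace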

Nomad Server logs (if appropriate)

I can't find anything useful. Hashi-ui's relative timestamps vs journald's absolute timestamps are annoying to correlate. The events above are actually from Friday (3 days ago).

@tgross
Member

tgross commented Nov 30, 2020

Hi @kneufeld! Is your check_restart.limit set to 1? From the documentation, it looks like we shouldn't be checking at all until the grace period has passed. Can you provide the jobspec?

@kneufeld
Contributor Author

kneufeld commented Dec 1, 2020

Is your check_restart.limit set to 1?

@tgross As a matter of fact it is...

I'll email the jobspec.

@tgross
Member

tgross commented Dec 7, 2020

Just wanted to circle back on this @kneufeld: I've been trying to reproduce the problem by incrementally building up a jobspec with all the features of the one you've provided, and I haven't yet hit on the source of the issue. I see you have #9307 open as well, so hopefully folks in that thread will have some more data for you.

@kneufeld
Contributor Author

kneufeld commented Dec 7, 2020

Thank you for taking a look. This isn't a major issue, just something I felt should be documented.

#9307, on the other hand, is a major problem for us and will likely have us moving to Kubernetes.

@spaulg

spaulg commented Apr 6, 2021

I'm finding a similar thing happening with one of my jobs. I have a MySQL container running, and MySQL does a temporary startup/shutdown cycle before starting properly. To get past this, based on the docs, the grace flag in the check_restart stanza should keep the failing health checks from counting against the service until it is up. But that's not happening: I'm seeing failed health check notices in Consul's log output even though the 60s grace period hasn't passed.

Nomad version: v1.0.4 (9294f35)
Consul version: v1.9.4 (10bb6cb3b)

Here's a timeline of the data, based on different data sources:

Time      Source   Event
14:28:14  Nomad    Task received by client
14:28:14  Nomad    Building task directory
14:28:14  Nomad    Downloading image
14:28:54  Nomad    Task started by client
14:28:54  MySQL    [temporary] init server in progress
14:29:00  Consul   Health check failed
14:29:03  MySQL    [temporary] ready for connections
14:29:07  MySQL    [temporary] shutdown started
14:29:09  MySQL    [temporary] shutdown complete
14:29:10  Consul   Health check failed
14:29:10  MySQL    init server in progress
14:29:11  MySQL    ready for connections

job "mysql-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    network {
      port "db" {
        static = 3306
      }
    }

    task "mysql-server" {
      driver = "docker"

      config {
        image = "mysql:8.0.22"
        ports = ["db"]

        # Add nice cap to prevent minutely logging
        cap_add = ["SYS_NICE"]
      }

      env {
        MYSQL_ROOT_PASSWORD = "password"
      }

      resources {
        cpu    = 100
        memory = 1000
      }

      service {
        name = "mysql-server"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            grace = "60s"
          }
        }
      }
    }
  }
}
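
For reference, lining the check_restart setting up against the timeline above (assuming the grace window is measured from when the task starts):

  task started     14:28:54
  grace = "60s"    failing checks should not trigger a restart before ~14:29:54
  failed checks    14:29:00 and 14:29:10 (both inside the grace window)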

@pznamensky

I can confirm that in Nomad v1.0.6 (592cd45) updating a task leads to restarts due to slow startup.
We also have a check_restart.grace that is longer than the time the application needs to become ready, but we see the same problem and a similar timeline to @spaulg's.

@maxramqvist

We can confirm the same issue on Nomad v1.1.4.

@maxramqvist

The same issue is still there with 1.5.1. The grace period is not honored; the task is marked as unhealthy just a few seconds after Nomad reports "task started".

Anything we can do to help with this?

Job spec:

{
  "ID": "application",
  "Name": "application",
  "Type": "service",
  "Datacenters": [
    "dev"
  ],
  "Meta": {
    "timestamp": "1681980389",
    "version": "cb84602"
  },
  "Constraints": [
    {
      "LTarget": "${node.class}",
      "RTarget": "build",
      "Operand": "!="
    }
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 1,
    "HealthCheck": "checks",
    "MinHealthyTime": 25000000000,
    "HealthyDeadline": 100000000000,
    "AutoRevert": true,
    "AutoPromote": true,
    "Canary": 1
  },
  "TaskGroups": [
    {
      "Name": "application",
      "Count": 1,
      "Networks": [
        {
          "Mode": "bridge",
          "DynamicPorts": [
            {
              "Label": "healthcheck",
              "Value": 0,
              "To": -1,
              "HostNetwork": "default"
            },
            {
              "Label": "envoy-metrics",
              "Value": 0,
              "To": 9102,
              "HostNetwork": "default"
            }
          ]
        }
      ],
      "Services": [
        {
          "Name": "application",
          "Port": 80,
          "PortLabel": "80",
          "AddressMode": "auto",
          "Tags": [
            "namespace=default",
            "service=application",
            "awesomejobid=e27853c7-8a3b-4332-9efc-8aa835775c93",
            "external",
            "mesh",
            "prometheus_metrics_path=/metrics",
            "prometheus_metrics_address=${attr.unique.network.ip-address}:${NOMAD_PORT_metrics}",
            "prometheus_metrics=false"
          ],
          "Checks": [
            {
              "Type": "http",
              "Protocol": "http",
              "Port": null,
              "Path": "/api/health",
              "AddressMode": "",
              "Interval": 10000000000,
              "Timeout": 2000000000,
              "CheckRestart": {
                "Limit": 2,
                "Grace": 30000000000,
                "IgnoreWarnings": true
              },
              "PortLabel": "healthcheck",
              "Expose": true
            }
          ],
          "Connect": {
            "Native": false,
            "SidecarService": {
              "Tags": [
                "mesh",
                "ingress",
                "prometheus_metrics_address=${NOMAD_HOST_ADDR_envoy-metrics}",
                "prometheus_metrics_path=/metrics"
              ],
              "Proxy": {
                "ExposeConfig": {
                  "Path": [
                    {
                      "Path": "/api/health",
                      "Protocol": "http",
                      "LocalPathPort": 80,
                      "ListenerPort": "healthcheck"
                    }
                  ]
                },
                "Upstreams": [
                  {
                    "DestinationName": "backend-app",
                    "LocalBindPort": 13370
                  }
                ],
                "Config": {
                  "envoy_prometheus_bind_addr": "0.0.0.0:9102",
                  "local_request_timeout_ms": 0
                }
              }
            },
            "SidecarTask": {
              "ShutdownDelay": 10000000000,
              "KillTimeout": 10000000000,
              "Config": {
                "labels": [
                  {
                    "environment": "dev",
                    "namespace": "default",
                    "service": "application-connect-proxy"
                  }
                ]
              },
              "Resources": {
                "CPU": 50,
                "MemoryMB": 96
              }
            }
          }
        }
      ],
      "Tasks": [
        {
          "Name": "application",
          "Vault": null,
          "Driver": "docker",
          "Config": {
            "network_mode": null,
            "image": "the-image:123abc",
            "volumes": [
              "/usr/local/share/ca-certificates/:/certs/"
            ],
            "logging": [],
            "labels": []
          },
          "Services": null,
          "Env": {
            "NAMESPACE": "default"
          },
          "Templates": [],
          "Resources": {
            "CPU": 100,
            "MemoryMB": 256
          },
          "RestartPolicy": {
            "Attempts": 10,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "Artifacts": [],
          "KillTimeout": 10000000000,
          "ShutdownDelay": 10000000000
        }
      ],
      "EphemeralDisk": {
        "SizeMB": 300
      }
    }
  ]
}

@jorgemarey
Contributor

jorgemarey commented Aug 3, 2023

Hi,
I noticed the same thing. I was expecting Nomad to wait for the grace period before restarting the allocation, but it didn't.

This happened not on the first start but after the allocation had been running for a while: the application failed, and while it was starting up again Nomad waited less than grace + check interval * limit before restarting it.

Looking at the code and at the logs, I noticed one thing that I don't know is intentional (if so, I didn't see it documented) or a bug: on a restart after the first boot, the grace interval includes the restart delay. So if the restart delay is 25s and the check_restart grace is 60s, the application only gets a useful grace of 35 seconds, after which the check_restart limits are applied (check interval * (limit - 1)).
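
A minimal sketch of that interaction, using the values from this example (the check type, interval, limit, and names are illustrative, not from a real jobspec):

group "example" {
  restart {
    delay = "25s"
  }

  service {
    name = "example"
    port = "http"

    check {
      type     = "tcp"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        # after a restart, the 25s restart delay is counted against this
        # grace, leaving only ~35s of useful grace before failing checks
        # start counting toward the limit
        grace = "60s"
        limit = 3
      }
    }
  }

  task "example" {
    driver = "docker"
  }
}

# Effective grace after a restart: 60s - 25s = 35s, after which the
# check_restart limit applies: interval * (limit - 1) = 10s * 2 = 20s.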

@mikenomitch
Contributor

Hey @jorgemarey, thanks for the sleuthing regarding the restart delay and grace. I think we'll probably want to change the behavior, or at least document it clearly if not. Added to our todo list, but not in the immediate todo column.

It will realistically be at least a month before we could take this up, so if anybody wants to take a crack at making sure the grace period accounts for the delay, let us know; we'd accept a PR!

@jorgemarey
Contributor

Hi @mikenomitch, I'll try to make a PR for this if I find the time.

@jorgemarey
Contributor

Hi @mikenomitch, I was looking into this again. I saw that this only happens when the check is defined at the group level (if it's defined at the task level, the delay is applied correctly). I think the problem is that the group service hook is executed in the prerun stage, so the workload is registered with the checkRestarter before the restart is completed.

Could the group_service_hook's RegisterWorkload be moved to the poststart phase, as it is in the service_hook? I don't know if this conflicts with other things, but it should solve this issue, or at least the part I'm referring to: that check_restart.grace includes the restart.delay.
