
health check fails during grace period #9431

Open
kneufeld opened this issue Nov 23, 2020 · 12 comments

@kneufeld
Contributor

Nomad version

Nomad v1.0.0-beta3 (fcb32ef7ba037b2b83e7a0fd1c53c7410a2990db)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

During startup the task is occasionally killed for being unhealthy despite (see the sketch of the relevant stanzas below):

  1. being within the check_restart { grace = "30s" } window
  2. being within the migrate { healthy_deadline = "10m" } window
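
For reference, a minimal sketch of the two stanzas involved (illustrative only, not the actual jobspec; the names, driver, and the check's type/interval/timeout are placeholders):

group "example" {
  migrate {
    healthy_deadline = "10m"
  }

  task "example" {
    driver = "docker"

    service {
      name = "example"
      port = "http"

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"

        check_restart {
          # failing checks should not trigger a restart during this window
          grace = "30s"
        }
      }
    }
  }
}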

Reproduction steps

Only happens occasionally

Job file (if appropriate)

A complicated setup with 5 tasks, 1 job per task, and Connect proxies between them. I can email it if desired.

Nomad Client logs (if appropriate)

From hashi-ui

WHEN         DURATION     TYPE                   MESSAGE                                    CODE
1 minute     0 seconds    Received               Task received by client                    0
37 seconds   35 seconds   Task Setup             Building Task Directory                    0
37 seconds   0 seconds    Downloading Artifacts  Client is downloading artifacts            0
34 seconds   3 seconds    Driver                 Downloading image                          0
26 seconds   8 seconds    Started                Task started by client                     0
4 seconds    21 seconds   Restart Signaled       healthcheck: check "director" unhealthy    0
0 seconds    6 seconds    Terminated             Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"    137
0 seconds    0 seconds    Restarting             Task restarting in 10.989924836s           0
1 second     13 seconds   Started                Task started by client
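
If I'm reading the hashi-ui relative timestamps correctly, the restart landed inside the grace window:

  Started             ~26 seconds before this snapshot
  Restart Signaled    ~4 seconds before this snapshot
  => the "director" check triggered a restart roughly 21 seconds after the task
     started, which is still inside the 30s check_restart grace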

Nomad Server logs (if appropriate)

I can't find anything useful. Hashi-ui's relative timestamps vs journald's absolute timestamps are annoying to correlate. The events above are actually from Friday (3 days ago).

@tgross
Member

tgross commented Nov 30, 2020

Hi @kneufeld! Is your check_restart.limit set to 1? From the documentation, it looks like we shouldn't be checking at all until the grace period has passed. Can you provide the jobspec?

@kneufeld
Contributor Author

kneufeld commented Dec 1, 2020

Is your check_restart.limit set to 1?

@tgross As a matter of fact it is...

I'll email the jobspec.

@tgross
Member

tgross commented Dec 7, 2020

Just wanted to circle back on this @kneufeld: I've been trying to reproduce the problem by incrementally building up a jobspec with all the features of the one you've provided, and I haven't yet hit on the source of the issue. I see you have #9307 open as well, so hopefully folks in that thread will have some more data for you.

@kneufeld
Contributor Author

kneufeld commented Dec 7, 2020

Thank you for taking a look. This isn't a major issue, just something I felt should be documented.

#9307, on the other hand, is a major problem for us and will likely have us moving to Kubernetes.

@spaulg

spaulg commented Apr 6, 2021

I'm finding a similar thing happening with one of my jobs. I have a MySQL container running, and MySQL does a temporary startup/shutdown cycle before starting properly. To get past this, based on the docs, the grace flag in the check_restart stanza should keep the failing health checks from counting against the service until it is up. But that's not happening: I'm seeing failed health check notices in Consul's log output even though the 60s grace period hasn't passed.

Nomad version: v1.0.4 (9294f35)
Consul version: v1.9.4 (10bb6cb3b)

Here's a timeline of the data, based on different data sources:

Time      Source   Event
14:28:14  Nomad    Task received by client
14:28:14  Nomad    Building task directory
14:28:14  Nomad    Downloading image
14:28:54  Nomad    Task started by client
14:28:54  MySQL    [temporary] init server in progress
14:29:00  Consul   Health check failed
14:29:03  MySQL    [temporary] ready for connections
14:29:07  MySQL    [temporary] shutdown started
14:29:09  MySQL    [temporary] shutdown complete
14:29:10  Consul   Health check failed
14:29:10  MySQL    init server in progress
14:29:11  MySQL    ready for connections

job "mysql-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    network {
      port "db" {
        static = 3306
      }
    }

    task "mysql-server" {
      driver = "docker"

      config {
        image = "mysql:8.0.22"
        ports = ["db"]

        # Add nice cap to prevent minutely logging
        cap_add = ["SYS_NICE"]
      }

      env {
        MYSQL_ROOT_PASSWORD = "password"
      }

      resources {
        cpu    = 100
        memory = 1000
      }

      service {
        name = "mysql-server"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            grace = "60s"
          }
        }
      }
    }
  }
}
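
For reference, lining the check_restart setting up against the timeline above (assuming the grace window is measured from when the task starts):

  task started     14:28:54
  grace = "60s"    failing checks should not trigger a restart before ~14:29:54
  failed checks    14:29:00 and 14:29:10 (both inside the grace window)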

@pznamensky

I can confirm that in Nomad v1.0.6 (592cd45) updating a task leads to restarts due to slow startup.
We also have a check_restart.grace that is longer than the time the application needs to become ready, but we see the same problem and a similar timeline to @spaulg's.

@maxramqvist

We can confirm the same issue on Nomad v1.1.4.

@maxramqvist

The same issue is still there with 1.5.1. The grace period is not honored; the task is marked as unhealthy just a few seconds after Nomad reports "task started".

Anything we can do to help with this?

Job spec:

{
  "ID": "application",
  "Name": "application",
  "Type": "service",
  "Datacenters": [
    "dev"
  ],
  "Meta": {
    "timestamp": "1681980389",
    "version": "cb84602"
  },
  "Constraints": [
    {
      "LTarget": "${node.class}",
      "RTarget": "build",
      "Operand": "!="
    }
  ],
  "Update": {
    "Stagger": 10000000000,
    "MaxParallel": 1,
    "HealthCheck": "checks",
    "MinHealthyTime": 25000000000,
    "HealthyDeadline": 100000000000,
    "AutoRevert": true,
    "AutoPromote": true,
    "Canary": 1
  },
  "TaskGroups": [
    {
      "Name": "application",
      "Count": 1,
      "Networks": [
        {
          "Mode": "bridge",
          "DynamicPorts": [
            {
              "Label": "healthcheck",
              "Value": 0,
              "To": -1,
              "HostNetwork": "default"
            },
            {
              "Label": "envoy-metrics",
              "Value": 0,
              "To": 9102,
              "HostNetwork": "default"
            }
          ]
        }
      ],
      "Services": [
        {
          "Name": "application",
          "Port": 80,
          "PortLabel": "80",
          "AddressMode": "auto",
          "Tags": [
            "namespace=default",
            "service=application",
            "awesomejobid=e27853c7-8a3b-4332-9efc-8aa835775c93",
            "external",
            "mesh",
            "prometheus_metrics_path=/metrics",
            "prometheus_metrics_address=${attr.unique.network.ip-address}:${NOMAD_PORT_metrics}",
            "prometheus_metrics=false"
          ],
          "Checks": [
            {
              "Type": "http",
              "Protocol": "http",
              "Port": null,
              "Path": "/api/health",
              "AddressMode": "",
              "Interval": 10000000000,
              "Timeout": 2000000000,
              "CheckRestart": {
                "Limit": 2,
                "Grace": 30000000000,
                "IgnoreWarnings": true
              },
              "PortLabel": "healthcheck",
              "Expose": true
            }
          ],
          "Connect": {
            "Native": false,
            "SidecarService": {
              "Tags": [
                "mesh",
                "ingress",
                "prometheus_metrics_address=${NOMAD_HOST_ADDR_envoy-metrics}",
                "prometheus_metrics_path=/metrics"
              ],
              "Proxy": {
                "ExposeConfig": {
                  "Path": [
                    {
                      "Path": "/api/health",
                      "Protocol": "http",
                      "LocalPathPort": 80,
                      "ListenerPort": "healthcheck"
                    }
                  ]
                },
                "Upstreams": [
                  {
                    "DestinationName": "backend-app",
                    "LocalBindPort": 13370
                  }
                ],
                "Config": {
                  "envoy_prometheus_bind_addr": "0.0.0.0:9102",
                  "local_request_timeout_ms": 0
                }
              }
            },
            "SidecarTask": {
              "ShutdownDelay": 10000000000,
              "KillTimeout": 10000000000,
              "Config": {
                "labels": [
                  {
                    "environment": "dev",
                    "namespace": "default",
                    "service": "application-connect-proxy"
                  }
                ]
              },
              "Resources": {
                "CPU": 50,
                "MemoryMB": 96
              }
            }
          }
        }
      ],
      "Tasks": [
        {
          "Name": "application",
          "Vault": null,
          "Driver": "docker",
          "Config": {
            "network_mode": null,
            "image": "the-image:123abc",
            "volumes": [
              "/usr/local/share/ca-certificates/:/certs/"
            ],
            "logging": [],
            "labels": []
          },
          "Services": null,
          "Env": {
            "NAMESPACE": "default"
          },
          "Templates": [],
          "Resources": {
            "CPU": 100,
            "MemoryMB": 256
          },
          "RestartPolicy": {
            "Attempts": 10,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "Artifacts": [],
          "KillTimeout": 10000000000,
          "ShutdownDelay": 10000000000
        }
      ],
      "EphemeralDisk": {
        "SizeMB": 300
      }
    }
  ]
}

@jorgemarey
Contributor

jorgemarey commented Aug 3, 2023

Hi,
I noticed the same thing. I was expecting Nomad to wait for the grace period before restarting the allocation, but it didn't.

This happened not on the first start but after the allocation had been running for a while: the application failed, and while it was starting up again Nomad waited less than grace + check interval * limit before restarting it.

Looking at the code and at the logs, I noticed one thing that I don't know is intentional (if so, I didn't see it documented) or a bug: on a restart after the first boot, the grace interval includes the restart delay. So if the restart delay is 25s and the check_restart grace is 60s, the application only gets a useful grace of 35 seconds, after which the check_restart limits are applied (check interval * (limit - 1)).
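
A minimal sketch of that interaction, using the values from this example (the check type, interval, limit, and names are illustrative, not from a real jobspec):

group "example" {
  restart {
    delay = "25s"
  }

  service {
    name = "example"
    port = "http"

    check {
      type     = "tcp"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        # after a restart, the 25s restart delay is counted against this
        # grace, leaving only ~35s of useful grace before failing checks
        # start counting toward the limit
        grace = "60s"
        limit = 3
      }
    }
  }

  task "example" {
    driver = "docker"
  }
}

# Effective grace after a restart: 60s - 25s = 35s, after which the
# check_restart limit applies: interval * (limit - 1) = 10s * 2 = 20s.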

@mikenomitch
Contributor

Hey @jorgemarey, thanks for the sleuthing regarding the restart delay and grace. I think we'll probably want to change the behavior, or at least document it clearly if not. Added to our todo list, but not in the immediate todo column.

It will realistically be at least a month before we could take this up, so if anybody wants to take a crack at making sure the grace period accounts for the delay, let us know; we'd accept a PR!

@jorgemarey
Contributor

Hi @mikenomitch, I'll try to make a PR for this if I find the time.

@jorgemarey
Contributor

Hi @mikenomitch, I was looking into this again. I saw that this only happens when the check is defined at the group level (if it's defined at the task level, the delay is applied correctly). I think the problem is that the group service hook is executed in the prerun stage, so the workload is registered with the checkRestarter before the restart is completed.

Could the group_service_hook's RegisterWorkload be moved to the poststart phase, as it is in the service_hook? I don't know if this conflicts with other things, but it should solve this issue, or at least the part I'm referring to: that check_restart.grace includes the restart.delay.
