When canaries fail in a deployment with a lower count than its predecessor, the culled allocations are never restored #19643

Open
philrenaud opened this issue Jan 5, 2024 · 0 comments
Labels: hcc/jira, stage/accepted, theme/deployments, theme/scheduling, type/bug


I've encountered a situation (in Nomad 1.7.3.dev) where Nomad eagerly culls previous-version allocations when a new job version calls for a lower count, even when canaries are involved, and does not reinstate the culled allocations if the new deployment fails.

Per #5033 (comment) (and internal discussion with the team), this seems like a bug.

Reproduction steps

  • Run a job with 20 allocs and 10 canaries; everything places and runs successfully. This is v0.
  • Update the job (see the CLI sketch below), modifying:
    • something about the tasks so that they fail.
    • the task group count, down from 20 to 15.
  • This deployment (v1) fails, and we fall back to v0 as the live state of affairs. My running v0 allocations had not been shut down by the deployment, because it waited to see whether the new allocations would succeed, and they didn't. Back to v0.
  • Except... 5 of them did shut down. Before the v1 deployment ever failed, Nomad eagerly shut down the “extra” v0 allocs whose indexes were greater than the new count.
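
Roughly, the CLI steps look like this (a minimal sketch; the file name, and the exact edit that forces the v1 tasks to fail, are illustrative rather than taken verbatim from my run):

# v0: submit the job below (count = 20, canary = 10) and let it become healthy
nomad job run fails_every_10.nomad.hcl
nomad job status fails_every_10

# v1: edit the same file so the task always exits non-zero (e.g. have the
# inline script call process.exit(1) unconditionally) and drop the group
# count from 20 to 15, then submit the new version
nomad job run fails_every_10.nomad.hcl

# watch the canary deployment fail, then look at what survived
nomad job deployments fails_every_10
nomad job allocs fails_every_10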

Expected Result

  • My job has 20 running v0 allocations (the v0 count).

Actual Result

  • My job has only 15 running v0 allocations (the count from the failed v1 deployment); see below for one way to verify which version the survivors belong to.
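
One way to check which job version the surviving allocations belong to (a sketch; <alloc-id> is a placeholder):

nomad job allocs fails_every_10   # per-allocation Version column
nomad alloc status <alloc-id>     # shows the Job Version for a single allocation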

Job file (if appropriate)

job "fails_every_10" {

  update {
    healthy_deadline  = "60s"
    progress_deadline = "1h"
    auto_revert       = false
    canary            = 10
    max_parallel      = 10
  }

  datacenters = ["dc1", "dc2"]

  constraint {
    attribute = "${attr.kernel.name}"
    operator  = "set_contains_any"
    value     = "darwin,linux"
  }

  type = "service"

  group "grouper" {
    count = 20

    task "roll dice" {
      driver = "raw_exec"
      resources {
        cpu    = 50
        memory = 64
      }

      config {
        command = "node"
        args    = ["-e", <<EOT
console.log('Hello, task!!');
const random = Math.random();
if (random < 0.10) {
  console.log('You rolled badly');
  process.exit(1);
} else {
  console.log('Stay awhile, and listen');
  setTimeout(function() { console.log('bye'); process.exit(0); }, 600000);
}
EOT
        ]
      }
    }

    restart {
      attempts = 1
      delay    = "10s"
      mode     = "fail"
    }

    reschedule {
      attempts  = 1
      interval  = "1h"
      delay     = "10s"
      unlimited = false
    }
  }
}
@jrasell added the theme/scheduling, theme/deployments, and stage/accepted labels on Jan 8, 2024
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024