When canaries fail in a deployment with a lower count than its predecessor, the culled allocations are never restored #19643

Open
philrenaud opened this issue Jan 5, 2024 · 0 comments
Labels: hcc/jira, stage/accepted, theme/deployments, theme/scheduling, type/bug


I've encountered a situation (in Nomad 1.7.3.dev) where Nomad eagerly culls previous-version allocations when a new job version calls for a lower count, even when canaries are involved, and does not reinstate the culled allocations if the new deployment fails.

Per #5033 (comment) (and internal discussion with the team), this seems like a bug.

Reproduction steps

  • Run a job with 20 allocs and 10 canaries; everything places and runs successfully. This is v0.
  • Update the job (see the CLI sketch below), modifying:
    • something about the tasks so that they fail.
    • the task group count, down from 20 to 15.
  • This deployment (v1) fails, and we fall back to v0 as the live state of affairs. My running v0 allocations had not been shut down by the deployment, because it waited to see whether the new allocations would succeed, and they didn't. Back to v0.
  • Except... 5 of them did shut down. Before the v1 deployment ever failed, Nomad eagerly shut down the “extra” v0 allocs whose indexes were greater than the new count.
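
Roughly, the CLI steps look like this (a minimal sketch; the file name, and the exact edit that forces the v1 tasks to fail, are illustrative rather than taken verbatim from my run):

# v0: submit the job below (count = 20, canary = 10) and let it become healthy
nomad job run fails_every_10.nomad.hcl
nomad job status fails_every_10

# v1: edit the same file so the task always exits non-zero (e.g. have the
# inline script call process.exit(1) unconditionally) and drop the group
# count from 20 to 15, then submit the new version
nomad job run fails_every_10.nomad.hcl

# watch the canary deployment fail, then look at what survived
nomad job deployments fails_every_10
nomad job allocs fails_every_10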

Expected Result

  • My job has 20 running v0 allocations (the v0 count).

Actual Result

  • My job has only 15 running v0 allocations (the count from the failed v1 deployment); see below for one way to verify which version the survivors belong to.
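
One way to check which job version the surviving allocations belong to (a sketch; <alloc-id> is a placeholder):

nomad job allocs fails_every_10   # per-allocation Version column
nomad alloc status <alloc-id>     # shows the Job Version for a single allocation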

Job file (if appropriate)

job "fails_every_10" {

  update {
    healthy_deadline  = "60s"
    progress_deadline = "1h"
    auto_revert       = false
    canary            = 10
    max_parallel      = 10
  }

  datacenters = ["dc1", "dc2"]

  constraint {
    attribute = "${attr.kernel.name}"
    operator  = "set_contains_any"
    value     = "darwin,linux"
  }

  type = "service"

  group "grouper" {
    count = 20

    task "roll dice" {
      driver = "raw_exec"
      resources {
        cpu    = 50
        memory = 64
      }

      config {
        command = "node"
        args    = ["-e", <<EOT
console.log('Hello, task!!');
const random = Math.random();
if (random < 0.10) {
  console.log('You rolled badly');
  process.exit(1);
} else {
  console.log('Stay awhile, and listen');
  setTimeout(function() { console.log('bye'); process.exit(0); }, 600000);
}
EOT
        ]
      }
    }

    restart {
      attempts = 1
      delay    = "10s"
      mode     = "fail"
    }

    reschedule {
      attempts  = 1
      interval  = "1h"
      delay     = "10s"
      unlimited = false
    }
  }
}
@jrasell added the theme/scheduling, theme/deployments, and stage/accepted labels on Jan 8, 2024
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024