System job port collision causes service job deployments to get stuck #18509
Labels
hcc/jira
stage/accepted
Confirmed, and intend to work on. No timeline commitment though.
theme/scheduling
type/bug
Nomad version
Output from nomad version
The issue was also reproduced on the previous Nomad versions:
Versions 1.1-1.3 behave as expected.
Operating system and Environment details
Issue
When a system job with a static port mapping is updated, its previous allocation may still be holding the port by the time the new allocation is started.
If such a system job update is followed by an update of a service job that changes its list of datacenters, the service job evaluation reports a port collision error and its deployment gets stuck indefinitely.
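To make "system job with a static port mapping" concrete, such a job might look roughly like the sketch below. This is not the sys.hcl from the reproduction: the job name, group and task names, image name, port number, and kill_timeout value are all illustrative assumptions.

```hcl
# Hypothetical system job; names and values are assumptions for illustration.
job "sys" {
  datacenters = ["dc1"]
  type        = "system"

  group "web" {
    network {
      # A static port means every allocation of this group on a node
      # asks for the same host port.
      port "http" {
        static = 8080
      }
    }

    task "server" {
      driver = "docker"

      # Give the old container time to shut down, so it can still be
      # holding the port while its replacement is being started.
      kill_timeout = "10s"

      config {
        image = "example/slow-shutdown:latest" # placeholder image name
        ports = ["http"]
      }
    }
  }
}
```

The slow shutdown is what lets the old allocation still hold the static port when the replacement allocation starts.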
Reproduction steps
1. Start a dev agent: nomad agent -dev
2. Run the system job (sys.hcl). The container should delay its reaction to SIGINT and SIGTERM signals to simulate a graceful shutdown.
3. Run the service job (serv.hcl).
4. After updating the system job, update the datacenters property of the service job by adding or removing a datacenter from its list.

Here's a link to the repository with the Nomad jobs, the sample Docker image and the script that reliably reproduces this issue: https://github.com/andrba/nomad-port-collision-repro.
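As a rough illustration of the datacenters change, a service job could be shaped like the sketch below. It is not the serv.hcl from the repository: the job, group, and task names, the image, and the second datacenter name are assumptions. Whether the service job needs its own port mapping to hit the problem is not stated here, so none is shown.

```hcl
# Hypothetical service job; names and values are assumptions for illustration.
job "serv" {
  # The update toggles this list, for example between
  # ["dc1"] and ["dc1", "dc2"], and the job is then re-submitted.
  datacenters = ["dc1", "dc2"]
  type        = "service"

  group "app" {
    count = 1

    task "app" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```

A dev agent registers its node in dc1 by default, so adding a second datacenter to the list changes the job specification without requiring any node in that datacenter.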
Expected Result
Actual Result
progress_deadline
Job file (if appropriate)
sys.hcl
serv.hcl
Nomad Server logs (if appropriate)
Port collision log record
Nomad Client logs (if appropriate)