Interrupt sent to allocs that have long terminated #24630

Open
EtienneBruines opened this issue Dec 9, 2024 · 13 comments

@EtienneBruines
Contributor

EtienneBruines commented Dec 9, 2024

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf10

Operating system and Environment details

Ubuntu 22.04.5 LTS on the server.

Ubuntu 24.04.1 LTS on the client.

Issue

Nomad clients send an interrupt to allocs that terminated long ago (but have not yet been GC'ed because of the settings).

Reproduction steps

  • Have some allocs (periodic / non-periodic)
  • Keep those allocs around
    • job_gc_threshold = "24h"
  • Run the GC

Expected Result

Tasks that terminated long ago should not receive any signal, interrupt, or event. Terminated should be a terminal state.

Actual Result

[Screenshot: Screenshot_20241209_121455]

Job file (if appropriate)

Probably irrelevant, although I did set shutdown_delay: 10s on the group.

Nomad Server logs (if appropriate)

Nothing related in that timespan.

Nomad Client logs (if appropriate)

2024-12-09T11:20:33.049Z [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=surrealdb
2024-12-09T11:20:33.049Z [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=surrealdb event_type=Killing event_reason=""
2024-12-09T11:20:33.050Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=surrealdb type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-12-09T11:20:33.052Z [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=connect-proxy-surrealdb
2024-12-09T11:20:33.052Z [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=connect-proxy-surrealdb event_type=Killing event_reason=""
2024-12-09T11:20:33.052Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23 task=connect-proxy-surrealdb type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
2024-12-09T11:20:33.054Z [INFO]  client.gc: marking allocation for GC: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23
2024-12-09T11:20:33.056Z [DEBUG] client.gc: alloc garbage collected: alloc_id=3218e2e3-9973-b587-d518-ba9564720e23
2024-12-09T11:20:33.057Z [INFO]  client.gc: garbage collecting allocation: alloc_id=af2223c2-87e9-7358-b847-f68fc05eca19 reason="number of allocations (184) is over the limit (50)"
2024-12-09T11:20:33.057Z [DEBUG] client.alloc_runner.runner_hook.group_services: delay before killing tasks: alloc_id=af2223c2-87e9-7358-b847-f68fc05eca19 group=sync shutdown_delay=10s
2024-12-09T11:20:33.619Z [TRACE] client: next heartbeat: period=19.809518426s
2024-12-09T11:20:33.705Z [DEBUG] client: updated allocations: index=4549798 total=347 pulled=1 filtered=346
2024-12-09T11:20:33.705Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=346
2024-12-09T11:20:33.707Z [TRACE] client.alloc_runner: AllocRunner has terminated, skipping alloc update: alloc_id=ad3504fc-c1af-2abb-d0fb-6b5f6746b56c modify_index=4549749
2024-12-09T11:20:33.707Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=346 errors=0
2024-12-09T11:20:33.719Z [TRACE] client: next heartbeat: period=13.938668795s

Potentially related to #19917

@EtienneBruines
Contributor Author

EtienneBruines commented Dec 9, 2024

A few minutes before the interrupt is sent, the task is also Received again. Perhaps the interrupt is the consequence of it being Received?


Edit: That theory is apparently invalid: Received does happen often, but not always:

[Screenshot: Screenshot_20241209_124017]

This also shows it's not related to the exec2 driver, since it happened with the docker driver here as well.

As shown here, it's also not limited to just 'periodic' jobs:

[Screenshot: Screenshot_20241209_124541]


Either way, neither the Received nor the Killing event should happen after the task has already terminated.

@tgross
Member

tgross commented Dec 9, 2024

Hi @EtienneBruines! This is puzzling for sure. I'm having a little trouble putting together the right order of events, though, and I think that's preventing me from reproducing. The logs you're showing don't quite line up with the task event screenshots and I think I'm missing a step.

I'm using the following server config:

server {
  job_gc_threshold = "24h"
  job_gc_interval = "15s"
  # ....
}

My steps are:

  • I run a batch job that exits 0 after 10s
  • The allocations terminate normally (i.e., they're batch jobs)
  • I wait for the periodic GC to happen. Nothing happens because the terminal allocations are newer than the threshold

But then in your case there's a new Received event out of the blue several minutes later?

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage on Dec 9, 2024
tgross self-assigned this on Dec 9, 2024
@EtienneBruines
Contributor Author

EtienneBruines commented Dec 9, 2024

The GC doesn't always trigger that weird behavior. I have the job_gc_interval set to the default value, which I think is 60s.

The alloc IDs in the logs do indeed not line up with the screenshots (because they were difficult to line up), but all of the alloc IDs in that log snippet would belong to any one of those allocs that this weird behavior happened to.

The older the alloc, the more such 'ghost' events are present. Some recent allocs (from the last hour) do not have any, whilst those that are 4+ hours old usually have 1, and some have 3-5 of them.

The Received event is indeed out of the blue, without me doing anything. Sometimes it's not the Received event but the Killing event; sometimes it's both, in either order or shuffled. They also don't follow "quickly" after one another: between the Received and Killing events there's often quite a while, longer than the GC interval.

[Screenshot: Screenshot_20241209_221508]

Unsure if related:

  • Ineligible nodes can still send Killing events to those already-terminated allocs.
  • These ineligible nodes would sometimes receive a batch of jobs (about 200 in my case) from "somewhere" (the server, I'm assuming) that they would then start to process, two at a time (the default GC size)
  • The issue persisted even after there were no longer any ineligible nodes

These logs match the alloc in this screenshot.

[Screenshot: Screenshot_20241209_222621]

The "garbage collecting allocation" log line looks like this:

{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2024-12-09T10:56:11.032825Z","alloc_id":"dda1a84f-ce9b-0d62-6569-96af86aedbdf","reason":"number of allocations (193) is over the limit (50)"}

@tgross
Member

tgross commented Dec 11, 2024

Thanks for that extra info @EtienneBruines. A few thoughts on the architecture here that might inform where we can look next:

  • Clients "pull" their list of allocations from the leader with a non-stale blocking query (Node.GetClientAllocs). If there are no changes from the last query after 5min, the query unblocks and the client receives whatever the current set is which should be unchanged.
  • The list the client gets is a list of allocation IDs and their last modified index. The client filters this list by allocations it knows haven't changed, and then makes a second query Alloc.GetAllocs with a list of IDs it needs new info on. This query uses the stale flag to reduce load on the leader.
  • The client uses the DesiredStatus of the allocations it gets from Alloc.GetAllocs and its current state to determine what to do with the allocations it has.
  • Ineligible clients still get a list from the server (e.g., you might stop an allocation running on an ineligible node, but Nomad doesn't stop them on its own unless there's a job change requiring it or an alloc on the node fails).
  • The Task Received event is fired not when an allocation is received from the server (as one might very reasonably expect) but when a new "allocrunner" is created along with its "taskrunners". Note this is before the allocation is started. When the allocrunner starts, it fires a variety of "hooks" before starting the taskrunner (which has its own hooks and then executes the driver). A taskrunner is persistent across task restarts, so we don't see this event on simple task restarts.
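As a rough illustration of the filtering step described in the list above, here is a minimal Python sketch. It is not Nomad's actual client code: the function name `allocs_to_pull`, the dictionary shapes, and the simple "newer modify index" comparison are all assumptions made for illustration. It only shows how a list of (alloc ID, modify index) pairs could be reduced to the IDs that need a second query, which is what the `pulled=1 filtered=346` client log line reports.

```python
from typing import Dict, List


def allocs_to_pull(server_index: Dict[str, int], known_index: Dict[str, int]) -> List[str]:
    """Return alloc IDs whose server-side modify index is newer than the one
    the client last saw (hypothetical helper, not a Nomad API)."""
    return [
        alloc_id
        for alloc_id, modify_index in server_index.items()
        # Unknown allocs default to -1 so they are always pulled.
        if modify_index > known_index.get(alloc_id, -1)
    ]


# Illustrative data only: the alloc names and indexes are made up.
server_index = {"alloc-a": 4549749, "alloc-b": 4549798, "alloc-c": 4549000}
known_index = {"alloc-a": 4549749, "alloc-b": 4549700, "alloc-c": 4549000}

pulled = allocs_to_pull(server_index, known_index)
filtered = len(server_index) - len(pulled)
print(f"total={len(server_index)} pulled={len(pulled)} filtered={filtered}")
```

Under this sketch, an alloc is only re-fetched when its modify index moves forward, so a terminal alloc that keeps being re-processed would point at either a changing index on the server or a client that forgot what it had already seen.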

From what I can see in the info-level logs you've provided, we're seeing that the allocation is being marked for GC immediately (~100ms) after we see the Task Received event. That suggests the client knows the allocation is terminal at that point. But the client still has to run the allocrunner for terminal allocations to make sure that any resources they were using get cleaned up. (This is mostly to handle what happens when a client is restarted and allocs stopped while it was down.)

So that provides a couple possibilities for where the problem could be and next steps:

  • The allocations have a terminal ClientStatus but for some baffling reason the servers aren't setting them to DesiredStatus: "stop". I'd expect in that case that the allocation would actually start up. You should be able to eliminate this possibility by looking at the output of nomad alloc status $allocid.
  • One of the followers in your cluster is running with very stale state and that's resulting in the Node.GetClientAllocs and Alloc.GetAllocs list coming from different windows of time. You should be able to eliminate this possibility by looking at nomad operator autopilot health to check that no server is far behind the others.
  • The servers are working perfectly fine but there's a logic bug in the client. Maybe GC is failing silently and therefore the client is never properly marking the allocation as having been cleaned up, so it needs to keep trying each time it hears from the server. To diagnose this, we're going to need debug or even trace-level logs from one of the affected clients when this is happening (along with the alloc status output for an example allocation). This is likely to be a large amount of data, so you may want to send it to the [email protected] write-only mailing list. Use this issue ID in the subject line and that'll make sure I can find it.
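The first possibility above can be checked across many allocations at once. The following is a minimal Python sketch, assuming you have saved a JSON list of allocation objects with `ID`, `ClientStatus`, and `DesiredStatus` fields (for example from an allocation listing endpoint via `nomad operator api`; the exact source of the JSON is an assumption here). The set of terminal client statuses and the helper name `terminal_but_desired_run` are also assumptions for illustration.

```python
import json
from typing import Any, Dict, List

# Assumption: client statuses treated as terminal for this check.
TERMINAL_CLIENT_STATUSES = {"complete", "failed", "lost"}


def terminal_but_desired_run(allocs: List[Dict[str, Any]]) -> List[str]:
    """Return IDs of allocs that are finished on the client but that the
    server still wants running (hypothetical helper)."""
    return [
        alloc["ID"]
        for alloc in allocs
        if alloc.get("ClientStatus") in TERMINAL_CLIENT_STATUSES
        and alloc.get("DesiredStatus") == "run"
    ]


# Illustrative input; the second entry is made up for contrast.
sample = json.loads("""
[
  {"ID": "32e28097-c8e8-7430-c904-7c40ecb97061", "ClientStatus": "complete", "DesiredStatus": "run"},
  {"ID": "00000000-0000-0000-0000-000000000000", "ClientStatus": "complete", "DesiredStatus": "stop"}
]
""")

for alloc_id in terminal_but_desired_run(sample):
    print(alloc_id)
```

Any ID this prints is a candidate for the same follow-up as a single `nomad alloc status` check.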

@EtienneBruines
Contributor Author

EtienneBruines commented Dec 12, 2024

Thank you for your thorough investigation!

  • Looking at one of the allocs that are showing symptoms:
# nomad alloc status 32e28097-c8e8-7430-c904-7c40ecb97061
ID                  = 32e28097-c8e8-7430-c904-7c40ecb97061
Eval ID             = 7b0003d1
Name                = yeti-middleware/periodic-1733983800.sync[0]
Node ID             = 2f5ed2f3
Node Name           = hashivault02-del.q-mex.net
Job ID              = yeti-middleware/periodic-1733983800
Job Version         = 28
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 4h15m ago
Modified            = 20m48s ago

Task "sync" is "dead"
Task Resources:
CPU        Memory           Disk     Addresses
0/100 MHz  6.2 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2024-12-12T06:10:10Z
Finished At    = 2024-12-12T06:10:14Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2024-12-12T10:05:10Z  Killing                Sent interrupt. Waiting 5s before force killing
2024-12-12T06:10:14Z  Terminated             Exit Code: 0
2024-12-12T06:10:10Z  Started                Task started by client
2024-12-12T06:10:10Z  Downloading Artifacts  Client is downloading artifacts
2024-12-12T06:10:10Z  Task Setup             Building Task Directory
2024-12-12T06:10:10Z  Received               Task received by client

The logs that are related to this alloc:

{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-12-12T10:05:10.043810Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T10:05:10.041617Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Sent interrupt. Waiting 5s before force killing","task":"sync","type":"Killing"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2024-12-12T10:05:00.039600Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","reason":"new allocations and over max (50)"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-12-12T06:10:14.692138Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061"}
{"@level":"info","@message":"plugin process exited","@module":"client.alloc_runner.task_runner.task_hook.logmon","@timestamp":"2024-12-12T06:10:14.691657Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","id":"2249936","plugin":"/usr/bin/nomad","task":"sync"}
{"@level":"info","@message":"not restarting task","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:14.682850Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","reason":"Policy allows no restarts","task":"sync"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:14.680659Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Exit Code: 0","task":"sync","type":"Terminated"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:10.900909Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Task started by client","task":"sync","type":"Started"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:10.293327Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Client is downloading artifacts","task":"sync","type":"Downloading Artifacts"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:10.247409Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Building Task Directory","task":"sync","type":"Task Setup"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-12-12T06:10:10.234480Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","failed":false,"msg":"Task received by client","task":"sync","type":"Received"}
  • It's unlikely for the server followers to be that stale: every time I checked, they were all up-to-date with the exact index. Nevertheless, I am now logging the output of nomad operator autopilot health at a 10-second interval to see if that changes over time.

I set up collection of those TRACE logs and am now waiting to see if the issue occurs again (to find a new alloc ID), which may take a few hours. But I think we found the culprit? The Desired Status is still run instead of stop?
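For spotting further affected allocs in the JSON client logs, a small script can flag any Killing task event that arrives after a Terminated event for the same alloc and task. This is only a sketch: it assumes the log field names shown in the log lines above (`@message`, `@timestamp`, `alloc_id`, `task`, `type`), and the function name `ghost_kills` is made up. The two sample lines are shortened copies of lines from the log excerpt above.

```python
import json
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def ghost_kills(lines: Iterable[str]) -> List[Tuple[str, str, timedelta]]:
    """Return (alloc_id, task, delay) for every Killing event logged after a
    Terminated event of the same alloc and task (hypothetical helper)."""
    events = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("@message") != "Task event":
            continue
        ts = datetime.strptime(rec["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        events.append((ts, rec["alloc_id"], rec["task"], rec["type"]))

    events.sort()  # the log excerpt is newest-first, so order by timestamp
    terminated_at: Dict[Tuple[str, str], datetime] = {}
    ghosts = []
    for ts, alloc_id, task, event_type in events:
        key = (alloc_id, task)
        if event_type == "Terminated":
            terminated_at[key] = ts
        elif event_type == "Killing" and key in terminated_at:
            ghosts.append((alloc_id, task, ts - terminated_at[key]))
    return ghosts


sample_lines = [
    '{"@message":"Task event","@timestamp":"2024-12-12T10:05:10.041617Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","task":"sync","type":"Killing"}',
    '{"@message":"Task event","@timestamp":"2024-12-12T06:10:14.680659Z","alloc_id":"32e28097-c8e8-7430-c904-7c40ecb97061","task":"sync","type":"Terminated"}',
]

for alloc_id, task, delay in ghost_kills(sample_lines):
    print(alloc_id, task, delay)
```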

@tgross
Member

tgross commented Dec 12, 2024

But I think we found the culprit? The Desired Status is still run instead of stop?

Yeah, that's almost certainly it! Which suggests the problem is somewhere in the scheduler. It would probably help to look at the evals associated with that job (there's no direct CLI for that but you can use nomad operator api /v1/job/:job_id/evaluations).

This looks like a periodic dispatch job, right? Is there any chance this job has a disconnect block (or the deprecated max_client_disconnect)?

@EtienneBruines
Contributor Author

EtienneBruines commented Dec 12, 2024

Which suggests the problem is somewhere in the scheduler. It would probably help to look at the evals associated with that job (there's no direct CLI for that but you can use nomad operator api /v1/job/:job_id/evaluations).

At the moment, I'm only seeing one eval for that job:

[
  {
    "ID": "7b0003d1-0918-5ef5-1aaf-0f47c6b40a3c",
    "Namespace": "K4565-bphone",
    "Priority": 20,
    "Type": "batch",
    "TriggeredBy": "periodic-job",
    "JobID": "yeti-middleware/periodic-1733983800",
    "JobModifyIndex": 4570014,
    "Status": "complete",
    "QueuedAllocations": {
      "sync": 0
    },
    "SnapshotIndex": 4570014,
    "CreateIndex": 4570014,
    "ModifyIndex": 4570016,
    "CreateTime": 1733983800004762600,
    "ModifyTime": 1733983800329242400
  }
]

Converting the eval's ModifyTime shows that it hasn't been modified since the alloc was received and/or started (same second), and that it wasn't modified when the Terminated event or the buggy Killing event happened. I do not know whether it should have been modified since then.
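The CreateTime and ModifyTime values in the eval JSON above are Unix timestamps in nanoseconds. A minimal Python sketch of the conversion follows; the helper name `ns_to_utc` is made up for illustration, and sub-microsecond precision is dropped because Python's `datetime` only stores microseconds.

```python
from datetime import datetime, timezone


def ns_to_utc(ns: int) -> datetime:
    """Convert a Unix timestamp in nanoseconds to a timezone-aware UTC
    datetime, truncated to microseconds (hypothetical helper)."""
    seconds, remainder_ns = divmod(ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=remainder_ns // 1000
    )


# Values copied from the eval shown above.
create_time = ns_to_utc(1733983800004762600)
modify_time = ns_to_utc(1733983800329242400)

print("created: ", create_time.isoformat())
print("modified:", modify_time.isoformat())
```

Integer arithmetic is used on purpose: dividing the nanosecond value as a float would lose precision at this magnitude.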

This looks like a periodic dispatch job, right?

This one, yes. But the bug is not exclusive to those. It also happened to regular type = service jobs.

Is there any chance this job has a disconnect block (or the deprecated max_client_disconnect)?

No, none of them do.

@tgross
Member

tgross commented Dec 12, 2024

Ok yeah I wouldn't expect the eval to be modified once it's complete. Just as an experiment, would you mind running nomad job eval on that periodic dispatch (the child job that's already run, not the parent)? It should cause the existing allocation to be marked Desired Status: stop and not recreate any allocs. And then grab debug-level logs from the scheduler for that and the nomad alloc status of the old alloc once that eval is done?

@EtienneBruines
Contributor Author

would you mind running nomad job eval on that periodic dispatch (the child job that's already run, not the parent)?

The alloc we were looking at did not change at all after doing this:

# nomad job eval yeti-middleware/periodic-1733983800
==> 2024-12-12T14:50:45Z: Monitoring evaluation "c5fac5c9"
    2024-12-12T14:50:45Z: Evaluation triggered by job "yeti-middleware/periodic-1733983800"
    2024-12-12T14:50:46Z: Evaluation status changed: "pending" -> "complete"
==> 2024-12-12T14:50:46Z: Evaluation "c5fac5c9" finished with status "complete"

(I ran the command twice, hence the multitude of evals):

[
  {
    "ID": "7b0003d1-0918-5ef5-1aaf-0f47c6b40a3c",
    "Namespace": "K4565-bphone",
    "Priority": 20,
    "Type": "batch",
    "TriggeredBy": "periodic-job",
    "JobID": "yeti-middleware/periodic-1733983800",
    "JobModifyIndex": 4570014,
    "Status": "complete",
    "QueuedAllocations": {
      "sync": 0
    },
    "SnapshotIndex": 4570014,
    "CreateIndex": 4570014,
    "ModifyIndex": 4570016,
    "CreateTime": 1733983800004762600,
    "ModifyTime": 1733983800329242400
  },
  {
    "ID": "8c7b7c2a-d0e6-3694-a82f-9c8414dbcdb3",
    "Namespace": "K4565-bphone",
    "Priority": 20,
    "Type": "batch",
    "TriggeredBy": "job-register",
    "JobID": "yeti-middleware/periodic-1733983800",
    "JobModifyIndex": 4570019,
    "Status": "complete",
    "QueuedAllocations": {
      "sync": 0
    },
    "SnapshotIndex": 4573181,
    "CreateIndex": 4573181,
    "ModifyIndex": 4573182,
    "CreateTime": 1734015165761423000,
    "ModifyTime": 1734015165851041000
  },
  {
    "ID": "c5fac5c9-6975-2d56-8b98-d1635040399e",
    "Namespace": "K4565-bphone",
    "Priority": 20,
    "Type": "batch",
    "TriggeredBy": "job-register",
    "JobID": "yeti-middleware/periodic-1733983800",
    "JobModifyIndex": 4570019,
    "Status": "complete",
    "QueuedAllocations": {
      "sync": 0
    },
    "SnapshotIndex": 4573173,
    "CreateIndex": 4573173,
    "ModifyIndex": 4573174,
    "CreateTime": 1734015045279735300,
    "ModifyTime": 1734015045377960200
  }
]

And then grab debug-level logs from the scheduler for that and the nomad alloc status of the old alloc once that eval is done?

# nomad alloc status 32e28097-c8e8-7430-c904-7c40ecb97061
ID                  = 32e28097-c8e8-7430-c904-7c40ecb97061
Eval ID             = 7b0003d1
Name                = yeti-middleware/periodic-1733983800.sync[0]
Node ID             = 2f5ed2f3
Node Name           = hashivault02-del.q-mex.net
Job ID              = yeti-middleware/periodic-1733983800
Job Version         = 28
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 8h41m ago
Modified            = 4h45m ago

Task "sync" is "dead"
Task Resources:
CPU        Memory           Disk     Addresses
0/100 MHz  6.2 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2024-12-12T06:10:10Z
Finished At    = 2024-12-12T06:10:14Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2024-12-12T10:05:10Z  Killing                Sent interrupt. Waiting 5s before force killing
2024-12-12T06:10:14Z  Terminated             Exit Code: 0
2024-12-12T06:10:10Z  Started                Task started by client
2024-12-12T06:10:10Z  Downloading Artifacts  Client is downloading artifacts
2024-12-12T06:10:10Z  Task Setup             Building Task Directory
2024-12-12T06:10:10Z  Received               Task received by client

I sent a bunch of DEBUG and TRACE logs from the server leader and from the client, covering a 30-second timespan surrounding the eval, to the email address [email protected]. A lot of it can be ignored because of template watches, but you'll be able to browse through it as you see fit.

@tgross
Member

tgross commented Dec 12, 2024

Thanks @EtienneBruines. Clearly there's a scheduler bug here... I'll take a look at those and see if there are any clues.

@tgross
Member

tgross commented Dec 12, 2024

I've had a look through those and I only see one evaluation in the logs, which was for a periodic job issued by the leader and not any of the 3 evals that you ran above (maybe those were processed on another server?):

logs for eval 0feaa6e2
2024-12-12T15:00:00.016Z [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=0feaa6e2-bf7b-d82e-2426-5aa7242d43a0 job_id=yeti-middleware/periodic-1734015600 namespace=K4565-bphone worker_id=357da35d-0f30-04c1-2f49-c4387d043cd2
  results=
  | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Desired Changes for "sync": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2024-12-12T15:00:00.019Z [TRACE] nomad: evaluating plan: plan="(eval 0feaa6e2, job yeti-middleware/periodic-1734015600, NodeAllocations: (node[2f5ed2f3] (b076b37d yeti-middleware/periodic-1734015600.sync[0] run)))"
2024-12-12T15:00:00.022Z [DEBUG] nomad.periodic:  launching job: job="<ns: \"default\", id: \"halopsa-to-q-manager-2\">" launch_time="2024-12-12 15:00:00 +0000 UTC"
2024-12-12T15:00:00.042Z [DEBUG] nomad.periodic: scheduled periodic job launch: launch_delay=-42.639679ms job="<ns: \"default\", id: \"netbox-to-halopsa-partial\">"
2024-12-12T15:00:00.044Z [DEBUG] nomad.periodic:  launching job: job="<ns: \"default\", id: \"netbox-to-halopsa-partial\">" launch_time="2024-12-12 15:00:00 +0000 UTC"
2024-12-12T15:00:00.047Z [DEBUG] worker: submitted plan for evaluation: worker_id=357da35d-0f30-04c1-2f49-c4387d043cd2 eval_id=0feaa6e2-bf7b-d82e-2426-5aa7242d43a0
2024-12-12T15:00:00.047Z [DEBUG] worker.batch_sched: setting eval status: eval_id=0feaa6e2-bf7b-d82e-2426-5aa7242d43a0 job_id=yeti-middleware/periodic-1734015600 namespace=K4565-bphone worker_id=357da35d-0f30-04c1-2f49-c4387d043cd2 status=complete
2024-12-12T15:00:00.073Z [TRACE] nomad: evaluating plan: plan="(eval e1335812, job halopsa-to-q-manager-2/periodic-1734015600, NodeAllocations: (node[f0ab1802] (e91ef8cc halopsa-to-q-manager-2/periodic-1734015600.sync[0] run)))"
2024-12-12T15:00:00.080Z [DEBUG] nomad.periodic: scheduled periodic job launch: launch_delay=4m59.919021419s job="<ns: \"K4565-bphone\", id: \"yeti-middleware\">"

That eval results in the allocation b076b37d, which according to the client logs runs successfully, eventually stops, and updates the server with its completed status:

{"@level":"trace","@message":"sending updated alloc","@module":"client.alloc_runner","@timestamp":"2024-12-12T15:00:24.495649Z","alloc_id":"b076b37d-0a47-35ad-d8ee-31349dc76045","client_status":"complete","desired_status":""}

Is that allocation still in DesiredStatus: "running"?

@EtienneBruines
Contributor Author

EtienneBruines commented Jan 6, 2025

Is that allocation still in DesiredStatus: "running"?

Yes. Well, to be pedantic, its desired status is run.

Knowing that, #11017 and #10456 are probably related / the same as this issue.

tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage on Jan 6, 2025
tgross removed their assignment on Jan 6, 2025
@tgross
Member

tgross commented Jan 6, 2025

Ok, thanks @EtienneBruines. I'm going to get this marked for further examination.
