-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interrupt sent to allocs that have long terminated #24630
Comments
A few minutes before the interrupt is sent, the task is also Edit: That theory is apparently invalid: This also shows it's not related to the As shown here, it's also not limited to just 'periodic' jobs: Either way, neither the |
Hi @EtienneBruines! This is puzzling for sure. But I'm having a little trouble putting together the right order of events though and I think that's preventing me from reproducing. The logs you're showing don't quite line up with the task event screen shots and I think I'm missing a step. I'm using the following server config: server {
job_gc_threshold = "24h"
job_gc_interval = "15s"
# ....
} My steps are:
But then in your case there's a new |
The GC doesn't always trigger that weird behavior. I have the The alloc IDs in the logs do indeed not line up with the screenshots (because they were difficult to line up), but all of the alloc IDs in that log snippet would belong to any one of those allocs that this weird behavior happened to. The older the alloc, the more such 'ghost' events would be present. Some recent allocs (last hour) do not have any, whilst those that are 4h+ hours old have usually 1, and some have like 3-5 of those. The Unsure if related:
These logs match the alloc in this screenshot. The {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2024-12-09T10:56:11.032825Z","alloc_id":"dda1a84f-ce9b-0d62-6569-96af86aedbdf","reason":"number of allocations (193) is over the limit (50)"} |
Thanks for that extra info @EtienneBruines. A few thoughts on the architecture here that might inform where we can look next:
From what I can see in the info-level logs you've provided, we're seeing that the allocation is being marked for GC immediately (~100ms) after we see the So that provides a couple possibilities for where the problem could be and next steps:
|
Thank you for your thorough investigation!
The logs that are related to this alloc:
I set up the collecting of those TRACE logs and am now checking to see if the issue occurs again (finding a new alloc ID) - which may take a few hours. But I think we found the culprit? The |
Yeah, that's almost certainly it! Which suggests the problem is somewhere in the scheduler. It would probably help to look at the evals associated with that job (there's no direct CLI for that but you can use This looks like a periodic dispatch job, right? Is there any chance this job has a |
At the moment, I'm only seeing one eval for that job: [
{
"ID": "7b0003d1-0918-5ef5-1aaf-0f47c6b40a3c",
"Namespace": "K4565-bphone",
"Priority": 20,
"Type": "batch",
"TriggeredBy": "periodic-job",
"JobID": "yeti-middleware/periodic-1733983800",
"JobModifyIndex": 4570014,
"Status": "complete",
"QueuedAllocations": {
"sync": 0
},
"SnapshotIndex": 4570014,
"CreateIndex": 4570014,
"ModifyIndex": 4570016,
"CreateTime": 1733983800004762600,
"ModifyTime": 1733983800329242400
}
] Converting the
This one, yes. But the bug is not exclusive to those. It also happened to regular
No, none of them do. |
Ok yeah I wouldn't expect the eval to be modified once it's complete. Just as an experiment, would you mind running |
The alloc we were looking at now has not changed at all by doing this:
(I ran the command twice, hence the multitude of evals): [
{
"ID": "7b0003d1-0918-5ef5-1aaf-0f47c6b40a3c",
"Namespace": "K4565-bphone",
"Priority": 20,
"Type": "batch",
"TriggeredBy": "periodic-job",
"JobID": "yeti-middleware/periodic-1733983800",
"JobModifyIndex": 4570014,
"Status": "complete",
"QueuedAllocations": {
"sync": 0
},
"SnapshotIndex": 4570014,
"CreateIndex": 4570014,
"ModifyIndex": 4570016,
"CreateTime": 1733983800004762600,
"ModifyTime": 1733983800329242400
},
{
"ID": "8c7b7c2a-d0e6-3694-a82f-9c8414dbcdb3",
"Namespace": "K4565-bphone",
"Priority": 20,
"Type": "batch",
"TriggeredBy": "job-register",
"JobID": "yeti-middleware/periodic-1733983800",
"JobModifyIndex": 4570019,
"Status": "complete",
"QueuedAllocations": {
"sync": 0
},
"SnapshotIndex": 4573181,
"CreateIndex": 4573181,
"ModifyIndex": 4573182,
"CreateTime": 1734015165761423000,
"ModifyTime": 1734015165851041000
},
{
"ID": "c5fac5c9-6975-2d56-8b98-d1635040399e",
"Namespace": "K4565-bphone",
"Priority": 20,
"Type": "batch",
"TriggeredBy": "job-register",
"JobID": "yeti-middleware/periodic-1733983800",
"JobModifyIndex": 4570019,
"Status": "complete",
"QueuedAllocations": {
"sync": 0
},
"SnapshotIndex": 4573173,
"CreateIndex": 4573173,
"ModifyIndex": 4573174,
"CreateTime": 1734015045279735300,
"ModifyTime": 1734015045377960200
}
]
I sent a bunch of DEBUG and TRACE logs from the server-leader and from the client, for a 30-second timespan surrounding the eval to the email address [email protected]. A bunch of it can be ignored because of template watches, but you'll be able to browse through it as you see fit. |
Thanks @EtienneBruines. Clearly there's a scheduler bug here... I'll take a look at those and see if there are any clues. |
I've had a look through those and I only see one evaluation in the logs, which was for a periodic job issued by the leader and not any of the 3 evals that you ran above (maybe those were processed on another server?): logs for eval 0feaa6e2
That eval results in the allocation
Is that allocation still in |
Ok, thanks @EtienneBruines. I'm going to get this marked for further examination. |
Nomad version
Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf10
Operating system and Environment details
Ubuntu 22.04.5 LTS on the server.
Ubuntu 24.04.1 LTS on the client.
Issue
Nomad clients send an interrupt on allocs that have long been terminated (but not yet gc'ed because of the settings).
Reproduction steps
job_gc_threshold = "24h"
Expected Result
Tasks that have long terminated to not receive any signal / interrupt / event.
Terminated
should be a terminal state.Actual Result
Job file (if appropriate)
Probably irrelevant, although I did set
shutdown_delay: 10s
on the group.Nomad Server logs (if appropriate)
Nothing related in that timespan.
Nomad Client logs (if appropriate)
Potentially related to #19917
The text was updated successfully, but these errors were encountered: