Task getting killed with OOM error is marked as complete #23412
Comments
Hi @vikramsg! I've seen a similar report about OOM'd tasks being marked as complete, and I'm not sure whether the problem is that the tasks are being marked complete incorrectly or that the report that the task was OOM'd is incorrect. Do you have any metrics that suggest the tasks are really being OOM'd, or at least exiting with an error, rather than just completing? That would help us dig into what the underlying problem is. Also, I just wanted to note that versions before 1.6.0 are out of support, so you'll want to upgrade sooner rather than later.
Hi @tgross, can you tell me a little bit more about what metrics you want to see? The error is very flaky and non-repeatable, so I can instrument it to capture issues. Right now I am mostly working with the Nomad REST API, so are there endpoint responses that I can record in the logs? On the task itself, hopefully the below points answer your questions.
Let me know if that helped.
@vikramsg I'm thinking of out-of-band data like
Nomad version
Nomad v1.5.2
Operating system and Environment details
Running on AWS.
Issue
We have various batch jobs running on Nomad, which runs on EC2 instances. We are now connecting Airflow to Nomad, so we don't want Nomad to handle restarts and reschedules. For this to work, we need to know accurately whether a job completed or failed.
This mostly works, but on OOM errors I am seeing Nomad mark the job as complete.
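One way to cross-check what Nomad reports is to record the allocation's task events and classify them outside of Nomad. The sketch below is only an illustration, not something confirmed in this thread: it assumes the allocation JSON (for example from `GET /v1/allocation/<alloc_id>`) carries a `TaskStates` map whose `Terminated` events include `Details["exit_code"]` and `Details["oom_killed"]` as strings. The function names and the sample payload are hypothetical.

```python
from typing import Any, Dict, List


def classify_task(task_state: Dict[str, Any]) -> str:
    """Return 'oom', 'failed', or 'ok' for one entry of an allocation's TaskStates."""
    events: List[Dict[str, Any]] = task_state.get("Events") or []
    for event in events:
        if event.get("Type") != "Terminated":
            continue
        details = event.get("Details") or {}
        # Assumption: the driver reports an OOM kill as Details["oom_killed"] == "true".
        if details.get("oom_killed") == "true":
            return "oom"
        # Assumption: the exit code is stored as a string in Details["exit_code"].
        if details.get("exit_code", "0") != "0":
            return "failed"
    # Fall back to the Failed flag Nomad itself set on the task state.
    return "failed" if task_state.get("Failed") else "ok"


def classify_allocation(alloc: Dict[str, Any]) -> Dict[str, str]:
    """Map each task name in an allocation payload to its classification."""
    task_states = alloc.get("TaskStates") or {}
    return {name: classify_task(state) for name, state in task_states.items()}


# Hypothetical payload: Nomad says Failed=False, but the event shows an OOM kill.
sample_alloc = {
    "TaskStates": {
        "etl": {
            "State": "dead",
            "Failed": False,
            "Events": [
                {"Type": "Terminated",
                 "Details": {"exit_code": "137", "oom_killed": "true"}},
            ],
        }
    }
}
print(classify_allocation(sample_alloc))
```

Logging the raw `TaskStates` next to this classification each time Airflow polls would show whether the task really exited with an error while the job was still marked complete. Note that any `Terminated` event with an OOM or non-zero exit counts here, including ones from earlier restarts of the same task.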
Expected Result
With the reschedule and restart blocks set to 0, Nomad is still trying to run the job again.
Actual Result
Nomad marks the job as complete and restarts the job.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)