Task getting killed with OOM error is marked as complete #23412

Open
vikramsg opened this issue Jun 21, 2024 · 3 comments

@vikramsg

Nomad version

Nomad v1.5.2

Operating system and Environment details

Running on AWS.

Issue

We have various batch jobs running on Nomad, which runs on EC2 instances. We are now connecting Airflow to Nomad, so we don't want Nomad to handle restarts and reschedules; for this to work, we need to know accurately whether a job completed or failed.

This mostly works, but when a task is killed with an OOM error, I see Nomad mark the job as complete.
[Screenshot 2024-06-21 at 16.57.18]

Expected Result

  1. If a job fails due to Nomad killing it, it should not be marked as complete.
  2. Alternatively, how do we determine whether it was killed due to OOM?
  3. Also, even though we have the reschedule and restart blocks set to 0 attempts, Nomad still tries to run the job again.
    reschedule {
      attempts  = 0
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }
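
For point 2, one possible approach is to inspect the task events on the allocation rather than only its top-level status. The sketch below assumes the JSON returned by `GET /v1/allocation/:alloc_id` has a `TaskStates` map whose `Terminated` events carry an `oom_killed` entry in `Details`. Whether that entry is populated depends on the task driver and the Nomad version, and this thread suggests the OOM report itself may not be reliable, so treat the result as a hint. The function name `classify_allocation` is hypothetical.

```python
from typing import Any


def classify_allocation(alloc: dict[str, Any]) -> dict[str, str]:
    """Label each task in an allocation payload.

    Labels: 'oom_killed', 'failed', 'complete', or 'not_finished'.
    Assumes the payload shape of GET /v1/allocation/:alloc_id.
    """
    results: dict[str, str] = {}
    for task_name, state in (alloc.get("TaskStates") or {}).items():
        events = state.get("Events") or []
        # Assumption: drivers that detect an OOM kill set
        # Details["oom_killed"] = "true" on the Terminated event.
        oom = any(
            event.get("Type") == "Terminated"
            and (event.get("Details") or {}).get("oom_killed") == "true"
            for event in events
        )
        if oom:
            results[task_name] = "oom_killed"
        elif state.get("Failed"):
            results[task_name] = "failed"
        elif state.get("State") == "dead":
            results[task_name] = "complete"
        else:
            results[task_name] = "not_finished"
    return results
```

The OOM check runs first on purpose: in the behaviour reported here, a task can be flagged as OOM-killed while its state still reads as a normal completion.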

Actual Result

Nomad marks the job as complete and restarts the job.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@tgross
Member

tgross commented Jun 21, 2024

Hi @vikramsg! I've seen a similar report about OOM'd tasks being marked as complete, and I'm not sure whether the problem is that the tasks are being marked complete incorrectly or whether the report that the task has been OOM'd is incorrect. Do you have any metrics that suggest the tasks are really being OOM'd, or at least exiting with an error, vs just completing? That would help us dig into what the underlying problem is.

Also, I just wanted to note that versions before 1.6.0 are out of support, so you'll want to upgrade sooner rather than later.

@vikramsg
Author

Hi @tgross,

Can you tell me a little more about which metrics you want to see? The error is very flaky and non-repeatable, so I can add instrumentation to capture it when it occurs. Right now I am mostly working with the Nomad REST API, so are there endpoint responses that I can record in the logs?

On the task itself, hopefully the below points answer your questions.

  1. The first allocation of the task did not complete; it just stopped, and the REST API returned a completion response.
  2. So Airflow, which essentially polls the REST API, thinks the job succeeded.
  3. However, Nomad seems to have created a second allocation even though I set restart and reschedule to 0.
  4. But since I already got the completion response, I do not wait and poll for this new allocation to complete.

Let me know if that helped.
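
As a possible workaround on the polling side, the outcome could be derived from every allocation of the job instead of the first completion response. The sketch below assumes the list returned by `GET /v1/job/:job_id/allocations`, where each entry has a `ClientStatus` field. The function name `job_outcome` and the `expected_allocs` parameter are hypothetical, and treating an unexpected extra allocation as a failure is a design assumption for this setup, not Nomad behaviour.

```python
from typing import Any, Iterable

# Client statuses after which an allocation will not change again.
TERMINAL_STATUSES = {"complete", "failed", "lost"}


def job_outcome(allocations: Iterable[dict[str, Any]], expected_allocs: int = 1) -> str:
    """Return 'pending', 'failed', or 'succeeded' for a batch job.

    allocations: entries as returned by GET /v1/job/:job_id/allocations.
    expected_allocs: how many allocations a clean run should create
    (assumption: 1 when restart and reschedule attempts are both 0).
    """
    statuses = [alloc.get("ClientStatus") for alloc in allocations]
    if not statuses:
        return "pending"
    # Keep polling while any allocation is still pending or running.
    if any(status not in TERMINAL_STATUSES for status in statuses):
        return "pending"
    # A replacement allocation means the first one did not finish cleanly.
    if len(statuses) > expected_allocs:
        return "failed"
    if any(status != "complete" for status in statuses):
        return "failed"
    return "succeeded"
```

With this check, the sequence described above (first allocation reports complete, a second allocation appears) would keep the poller waiting and then report a failure instead of a success.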

@tgross
Member

tgross commented Jun 25, 2024

@vikramsg I'm thinking of out-of-band data like dmesg logs that show which process, if any, was actually OOM'd. That is, you're saying "just stopped" but in this situation we suspect we can't trust Nomad's report of why that is (otherwise you wouldn't have reported a bug! 😀 ). So I'm trying to figure out if we can verify that it's really an OOM and not a kill for some unrelated reason that's getting misreported as an OOM.
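
A minimal sketch of how such dmesg output could be scanned on the client node, assuming the kernel logs OOM kills in the form `Out of memory: Killed process <pid> (<name>)` or `Memory cgroup out of memory: Killed process <pid> (<name>)`. The exact wording varies between kernel versions, so the pattern may need adjusting; the function name `find_oom_kills` is hypothetical.

```python
import re
from typing import Iterable

# Assumed message format; older or newer kernels may word this differently.
OOM_KILL = re.compile(
    r"(?P<scope>Memory cgroup out of memory|Out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


def find_oom_kills(dmesg_lines: Iterable[str]) -> list[dict]:
    """Extract the pid and process name of each OOM kill in dmesg output."""
    kills = []
    for line in dmesg_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append(
                {
                    "pid": int(match["pid"]),
                    "comm": match["comm"],
                    # True when the kill came from a cgroup memory limit
                    # rather than system-wide memory exhaustion.
                    "cgroup": match["scope"].startswith("Memory cgroup"),
                }
            )
    return kills
```

Comparing the reported pids and timestamps against the task's process would show whether the kernel actually OOM-killed it, independently of what Nomad reports.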
