Nomad has the wrong unique.storage.bytesfree after a restart #14871

Closed
TimoWilken opened this issue Oct 11, 2022 · 2 comments

Nomad version

Output from nomad version: Nomad v1.3.3 (428b2cd8014c48ee9eae23f02712b7219da16d30)

Operating system and Environment details

CentOS 7; Nomad installed from Hashicorp Stable RPM repo and run via the bundled systemd service.

Issue

When Nomad is restarted, it resets its unique.storage.bytesfree property to however much disk space is actually free at that moment. However, if allocations are currently running, the disk space they use is counted twice -- once in the reduced unique.storage.bytesfree value, and again when Nomad subtracts each allocation's ephemeral_disk reservation while deciding whether there is enough space to place another alloc on the same host.

Would it be possible to specify the value of the unique.storage.bytesfree metric manually (e.g. a client.disk_total_free_bytes setting in /etc/nomad.d/nomad.hcl), analogous to the way we can specify the amount of memory (client.memory_total_mb) or CPU (client.cpu_total_compute) that Nomad should assume is present on the host? Or could Nomad allocate based on unique.storage.bytestotal instead, and let us specify an amount of disk space to reserve for the system, as is done with CPU and memory in client.reserved.*?
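For reference, a minimal client stanza showing the existing overrides next to the proposed one; the disk_total_free_bytes name and all values here are illustrative, and the disk setting does not exist in Nomad today:

client {
  memory_total_mb   = 8192  # override fingerprinted memory (MB)
  cpu_total_compute = 8000  # override fingerprinted CPU (MHz)

  # Hypothetical analogue proposed above; not a real Nomad option:
  # disk_total_free_bytes = 118111600640  # ~110G

  reserved {
    cpu    = 500  # MHz held back for the system
    memory = 512  # MB held back for the system
  }
}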

Reproduction steps

  1. Run a Nomad agent on a host with 110G of disk space available on an external disk, and place Nomad's state directory on that external disk.
  2. Run a job on that host that requests a large ephemeral_disk (e.g. 50G), creates a 50G file in its task directory, and sleeps forever (a sketch of such a job follows these steps).
  3. Restart the Nomad agent on that host.
  4. Now try to run a second copy of the above job on the same host.
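A minimal jobspec along the lines of steps 1-2 might look like this (the job name, driver, image, and exact dd invocation are illustrative assumptions; it writes to the shared alloc dir, which counts against ephemeral_disk just like the task dir):

job "disk-hog" {
  datacenters = ["dc1"]

  group "hog" {
    ephemeral_disk {
      size = 51200 # MB, i.e. 50G
    }

    task "fill" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "/bin/sh"
        args    = ["-c", "dd if=/dev/zero of=/alloc/data/fill.bin bs=1M count=51200 && tail -f /dev/null"]
      }
    }
  }
}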

Expected Result

The second job should run in parallel to the first (as notionally 60G should still be available to allocate -- 110G minus the 50G taken by the first job).

Actual Result

unique.storage.bytesfree goes down to 60G after the restart, and Nomad thinks it only has 10G of disk space left to allocate on the host (which is 60G minus 50G "taken" by the first, running job).

tgross (Member) commented Nov 22, 2022

Hi @TimoWilken! Thanks to your clear instructions, I was able to reproduce this pretty easily. As it turns out, this is a duplicate of #6172. I'm going to rename that issue to clarify the problem a bit and then close this one out.

In short, there's an unfortunate interaction here where the fingerprint for storage is a StaticFingerprinter. That means it runs only once at startup, and the scheduler then has to account for the ephemeral disk reservations it knows about on top of that fingerprinted value.

So if we check the storage on our node:

$ nomad node status -self -verbose | grep storage
unique.storage.bytesfree              = 19552616448
unique.storage.bytestotal             = 41555521536
unique.storage.volume                 = /dev/sda1

Then create a job with:

ephemeral_disk {
  size = 5000 # MB, i.e. ~5G reserved
}

We'll exec into that allocation to create a large file:

$ nomad alloc exec f880fb6d /bin/sh
/ # dd if=/dev/urandom of=/alloc/data/example.bin bs=5MB count=100
100+0 records in
100+0 records out
500000000 bytes (476.8MB) copied, 1.996275 seconds, 238.9MB/s

/ # ls -lah /alloc/data/
total 477M
drwxrwxrwx    2 nobody   nobody      4.0K Nov 22 21:06 .
drwxrwxrwx    5 nobody   nobody      4.0K Nov 22 21:04 ..
-rw-r--r--    1 root     root      476.8M Nov 22 21:06 example.bin

But we'll see the free storage isn't changed:

$ nomad node status -self -verbose | grep storage
unique.storage.bytesfree              = 19552616448
unique.storage.bytestotal             = 41555521536
unique.storage.volume                 = /dev/sda1

Then we restart. At that point we see the reported free storage has dropped by exactly the amount we wrote to disk (500793344 bytes, or about 477.6MB). Note this isn't the same as the reserved amount, which would be 10x that.

$ nomad node status -self -verbose | grep storage
unique.storage.bytesfree              = 19051823104
unique.storage.bytestotal             = 41555521536
unique.storage.volume                 = /dev/sda1

Letting the user set this value is a nice hack and might help out with the situation where the storage fingerprinter can't read the correct value for some exotic storage configuration (I can't think of what that might be).

I suspect the right behavior here is to change the storage fingerprinter. Currently it runs df(1) against the data_dir, so we get the results of df for the file system the data_dir is on. We should also rummage around in the allocation directories and account for the amount of space we've actually used in alloc/data, so it isn't double-counted.
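A rough sketch of that idea in shell terms, assuming the packaged data_dir of /opt/nomad/data (the path and the exact accounting are assumptions on my part, not a committed design):

$ df -B1 --output=avail /opt/nomad/data | tail -n1   # free bytes on the data_dir's filesystem
$ du -sb /opt/nomad/data/alloc | cut -f1             # bytes already written by allocations

Reporting bytesfree as the df figure plus the du figure would add back what running allocations have consumed, so the scheduler's subtraction of ephemeral_disk reservations no longer double-counts it after a restart.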

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Mar 23, 2023