
[Bug]: incorrect stats #1852

Closed
james-boydell opened this issue Oct 16, 2024 · 3 comments · Fixed by #1856

@james-boydell (Contributor)

Steps to reproduce

Compare `htop` to `dstack stats`; the values are incorrect.

Actual behaviour

No response

Expected behaviour

No response

dstack version

0.18.18

Server logs

No response

Additional information

(Two screenshots attached comparing htop and dstack stats output.)
james-boydell added the bug label on Oct 16, 2024
r4victor self-assigned this on Oct 17, 2024
@r4victor (Collaborator)

@james-boydell, so CPU usage seems to be reported correctly: 92+100+85+100=377 (the diff is likely due to measurement timing). The memory usage reported by dstack stats is indeed misleading. It reports the cgroup's memory.usage_in_bytes, but it would be more sensible to report working_set_memory = memory.usage_in_bytes - cache, which is what docker stats and kubectl top report. We already had this metric in the API, so I'm going to fix the CLI output. It should be close to top/htop reporting.

Please note that there can still be discrepancies with htop. htop's reporting may be more accurate, but it's when the working_set_memory reported by dstack stats reaches the max that the container gets OOM-killed.
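For illustration, here is a minimal, hypothetical sketch (not dstack's actual code) of how a docker stats / kubectl top style working set can be derived from cgroup v1 files: memory.usage_in_bytes minus the page cache counted in memory.stat. The paths and helper name are assumptions for the example.

```python
from pathlib import Path

CGROUP_V1_MEMORY_ROOT = Path("/sys/fs/cgroup/memory")  # assumed cgroup v1 mount point

def working_set_bytes(cgroup_relpath: str) -> int:
    """Approximate the working set of a cgroup the way docker stats / kubectl top do."""
    base = CGROUP_V1_MEMORY_ROOT / cgroup_relpath
    usage = int((base / "memory.usage_in_bytes").read_text())
    # memory.stat is a list of "<key> <value>" lines
    stat = {}
    for line in (base / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stat[key] = int(value)
    # Subtract the reclaimable page cache; kubectl uses total_inactive_file,
    # the simpler variant above uses "cache" -- both approximate the same idea.
    cache = stat.get("total_inactive_file", stat.get("cache", 0))
    return max(usage - cache, 0)
```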

See also:

@r4victor (Collaborator)

To sum up, after the fix dstack stats should report memory usage the same way as docker stats and kubectl top, but that may not be the best approach:

> usage_in_bytes
> For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

It remains to be seen whether there are any downsides to using RSS+CACHE(+SWAP) instead of memory.usage_in_bytes, besides being different from Kubernetes and Docker.
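A hypothetical sketch of the alternative the kernel docs suggest: summing rss + cache (+ swap) from memory.stat instead of reading the fuzzier memory.usage_in_bytes counter. Key names follow cgroup v1; the function name and path handling are assumptions for the example.

```python
from pathlib import Path

def exact_memory_usage_bytes(cgroup_dir: str, include_swap: bool = True) -> int:
    """Compute RSS+CACHE(+SWAP) from a cgroup v1 memory.stat file."""
    stat = {}
    for line in (Path(cgroup_dir) / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stat[key] = int(value)
    total = stat["rss"] + stat["cache"]
    if include_swap:
        # "swap" is only present when swap accounting is enabled
        total += stat.get("swap", 0)
    return total
```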

@james-boydell (Contributor, Author)

Hey @r4victor, thanks for looking into this!

For memory stats, I think it's important to report them the same way the OOM killer sees them. Most people will be watching whether memory reaches the limit and the container gets killed. This will become important as you work towards issue #1780 and multiple jobs/runs land on the same node (important for SSH/on-prem fleets).

As for CPU, I don't think reporting the sum of all CPU core percentages makes sense as a single metric. If I see more than 100%, I assume something is wrong. I'm not sure how you're pulling CPU metrics and I'm more familiar with Kubernetes, but reporting the percentage of the CPU limit and/or request would be more useful, or averaging the percentage across all cores.
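To make the two conventions concrete, here is a small hypothetical sketch (function names are illustrative, not dstack's API) contrasting the "sum of cores" style, which can exceed 100%, with a percentage normalized to the container's CPU limit.

```python
def cpu_percent_sum(per_core_percents: list[float]) -> float:
    # htop/docker style: four busy cores report ~400%
    return sum(per_core_percents)

def cpu_percent_of_limit(per_core_percents: list[float], cpu_limit_cores: float) -> float:
    # Kubernetes style: usage relative to the requested/limited number of cores
    return sum(per_core_percents) / (cpu_limit_cores * 100.0) * 100.0

# Using the per-core numbers from the screenshot above:
print(cpu_percent_sum([92, 100, 85, 100]))           # 377.0
print(cpu_percent_of_limit([92, 100, 85, 100], 4))   # 94.25
```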
