
[Bug]: incorrect stats #1852

Closed
james-boydell opened this issue Oct 16, 2024 · 3 comments · Fixed by #1856

@james-boydell (Contributor)

Steps to reproduce

Compare `htop` to `dstack stats`; the values are incorrect.

Actual behaviour

No response

Expected behaviour

No response

dstack version

0.18.18

Server logs

No response

Additional information

(Two screenshots attached comparing htop and dstack stats output.)
james-boydell added the bug label on Oct 16, 2024
r4victor self-assigned this on Oct 17, 2024
@r4victor (Collaborator)

@james-boydell, so CPU usage seems to be reported correctly: 92+100+85+100=377 (the diff is likely due to measurement timing). The memory usage reported by dstack stats is indeed misleading. It reports the cgroup's memory.usage_in_bytes, but it would be more sensible to report working_set_memory = memory.usage_in_bytes - cache, which is what docker stats and kubectl top report. We already had this metric in the API, so I'm going to fix the CLI output. It should be close to top/htop reporting.

Please note that there can still be discrepancies with htop. htop's reporting may be more accurate, but it's when the working_set_memory reported by dstack stats reaches the max that the container gets OOM-killed.
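For illustration, here is a minimal, hypothetical sketch (not dstack's actual code) of how a docker stats / kubectl top style working set can be derived from cgroup v1 files: memory.usage_in_bytes minus the page cache counted in memory.stat. The paths and helper name are assumptions for the example.

```python
from pathlib import Path

CGROUP_V1_MEMORY_ROOT = Path("/sys/fs/cgroup/memory")  # assumed cgroup v1 mount point

def working_set_bytes(cgroup_relpath: str) -> int:
    """Approximate the working set of a cgroup the way docker stats / kubectl top do."""
    base = CGROUP_V1_MEMORY_ROOT / cgroup_relpath
    usage = int((base / "memory.usage_in_bytes").read_text())
    # memory.stat is a list of "<key> <value>" lines
    stat = {}
    for line in (base / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stat[key] = int(value)
    # Subtract the reclaimable page cache; kubectl uses total_inactive_file,
    # the simpler variant above uses "cache" -- both approximate the same idea.
    cache = stat.get("total_inactive_file", stat.get("cache", 0))
    return max(usage - cache, 0)
```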

See also:

@r4victor (Collaborator)

To sum up, after the fix dstack stats should report memory usage the same way as docker stats and kubectl top, but that may not be the best approach:

> usage_in_bytes
> For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

It remains to be seen whether there are any downsides to using RSS+CACHE(+SWAP) instead of memory.usage_in_bytes, besides being different from Kubernetes and Docker.
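A hypothetical sketch of the alternative the kernel docs suggest: summing rss + cache (+ swap) from memory.stat instead of reading the fuzzier memory.usage_in_bytes counter. Key names follow cgroup v1; the function name and path handling are assumptions for the example.

```python
from pathlib import Path

def exact_memory_usage_bytes(cgroup_dir: str, include_swap: bool = True) -> int:
    """Compute RSS+CACHE(+SWAP) from a cgroup v1 memory.stat file."""
    stat = {}
    for line in (Path(cgroup_dir) / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stat[key] = int(value)
    total = stat["rss"] + stat["cache"]
    if include_swap:
        # "swap" is only present when swap accounting is enabled
        total += stat.get("swap", 0)
    return total
```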

@james-boydell (Contributor, Author)

Hey @r4victor, thanks for looking into this!

For memory stats, I think it's important to report them the same way the OOM killer sees them. Most people will be watching whether memory reaches the limit and the container gets killed. This will become important as you work towards issue #1780 and multiple jobs/runs land on the same node (important for SSH/on-prem fleets).

As for CPU, I don't think reporting the sum of all CPU core percentages makes sense as a single metric. If I see more than 100%, I assume something is wrong. I'm not sure how you're pulling CPU metrics and I'm more familiar with Kubernetes, but reporting the percentage of the CPU limit and/or request would be more useful, or averaging the percentage across all cores.
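To make the two conventions concrete, here is a small hypothetical sketch (function names are illustrative, not dstack's API) contrasting the "sum of cores" style, which can exceed 100%, with a percentage normalized to the container's CPU limit.

```python
def cpu_percent_sum(per_core_percents: list[float]) -> float:
    # htop/docker style: four busy cores report ~400%
    return sum(per_core_percents)

def cpu_percent_of_limit(per_core_percents: list[float], cpu_limit_cores: float) -> float:
    # Kubernetes style: usage relative to the requested/limited number of cores
    return sum(per_core_percents) / (cpu_limit_cores * 100.0) * 100.0

# Using the per-core numbers from the screenshot above:
print(cpu_percent_sum([92, 100, 85, 100]))           # 377.0
print(cpu_percent_of_limit([92, 100, 85, 100], 4))   # 94.25
```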
