zdb memory usage #129
There may be a memory issue somewhere other than in normal operation. This test was made while uploading chunks for the hub. It's possible that some memory issue occurs with
The limit is 256MB on zos, and the container (on which the limit is applied) only runs the zdb process.
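For reference, a minimal sketch (not from the thread) of how the applied limit and reported usage could be read, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup and a hypothetical cgroup path for the zdb container:

```python
# Sketch only: read the memory limit and current usage of a container's
# cgroup. The cgroup path below is a hypothetical placeholder, not the real
# path used on zos.
CGROUP = "/sys/fs/cgroup/memory/zdb-container"  # hypothetical path

def read_value(name: str) -> int:
    """Read a single integer value from a cgroup control file."""
    with open(f"{CGROUP}/{name}") as f:
        return int(f.read().strip())

limit = read_value("memory.limit_in_bytes")
usage = read_value("memory.usage_in_bytes")
print(f"limit: {limit / 2**20:.0f} MiB, usage: {usage / 2**20:.0f} MiB")
```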
It's possible that it's a cache problem and the memory is allocated as cache rather than directly by zdb. From the OOM logs, the vm (virtual memory) used is about ~143MB, but the "file" metric is ~281MB. The cgroup of the container running zdb reports these data:

Per section 5.5 of the cgroup memory stats documentation, the usage is rss + cache (the usage_in_bytes file is only an approximation). OOM logs:
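A minimal sketch of how the rss + cache figure could be recomputed from memory.stat, assuming cgroup v1 and the same hypothetical cgroup path as above; per that documentation section this is more exact than usage_in_bytes:

```python
# Sketch: recompute cgroup memory usage as rss + cache from memory.stat,
# instead of trusting the fuzzier usage_in_bytes value.
CGROUP = "/sys/fs/cgroup/memory/zdb-container"  # hypothetical path

stat = {}
with open(f"{CGROUP}/memory.stat") as f:
    for line in f:
        key, value = line.split()
        stat[key] = int(value)

rss, cache = stat["rss"], stat["cache"]
print(f"rss:         {rss / 2**20:.1f} MiB")
print(f"cache:       {cache / 2**20:.1f} MiB")
print(f"rss + cache: {(rss + cache) / 2**20:.1f} MiB")
```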
I also think the buf/cache is freed if needed, since it is only used by the kernel to improve the performance of file access. If memory is needed, cached data will be flushed to disk and the pages handed to the process, not the other way around.
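To illustrate that page cache counts as reclaimable memory, here is a small sketch (not part of the original discussion) that compares MemFree with MemAvailable from /proc/meminfo; MemAvailable includes the cache the kernel expects it can drop under pressure:

```python
# Sketch: show that most of the page cache is considered reclaimable by the
# kernel (it is counted in MemAvailable but not in MemFree).
def meminfo() -> dict:
    """Parse /proc/meminfo into a dict of values in kB."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.strip().split()[0])
    return values

m = meminfo()
print(f"MemFree:      {m['MemFree'] / 1024:.1f} MiB")
print(f"Cached:       {m['Cached'] / 1024:.1f} MiB")
print(f"MemAvailable: {m['MemAvailable'] / 1024:.1f} MiB (free + reclaimable cache)")
```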
Some things to investigate:
- hashmem also OOMs; the zdb process restarts occasionally when using 32M as datasize, but not because of OOM (no dmesg logs, and the container logs are truncated; a segfault?).
- Not reproducible locally at all: I started the zos VM on an HDD and the usage (rss + cache) always oscillates a little below the limit, just like on the nodes, but it never gets killed.
- Stracing the original zdb version's process while it's operating, to check what it's doing at the end, causes it to not be killed by OOM!
- I tried another image that pushes its logs to files to prevent them from being truncated. The store command fulfillment rate became much slower and the processes piled up, causing zstor in the qsfs container to be OOMed, but the zdb container didn't misbehave at all (except for being slow?).
- Chunking the writes to cope with the slow rate didn't result in any OOM or crash.

OOM logs with stacktrace and registers (for hashmem):
If I understand correctly from the logs, it failed on a write syscall (with 2MB of data?), and the
Thanks! It's possible to get a segfault on hashmem; it's still in the testing stage. There is a specific sync call on file rotation. I don't know if the stacktrace talks about a write or a sync; it's btrfs related.
The zdb that restarts occasionally is the latest version, using 32MB as datasize (not sure why, as the logs are truncated). The memory limit is set on the node, eventually applied using cgroups. But I wasn't able to reproduce it outside the node with:
I guessed the syscall from the syscall table. I edited the OOM logs to add a missing line at the beginning containing the gfp_mask; it might be useful.
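For what it's worth, a hypothetical sketch of that "guess the syscall from the table" step, assuming an x86_64 node: the syscall number sits in rax (saved as orig_rax) at entry, so the value from the register dump can be looked up in the syscall table. Only a few write/sync-related entries are listed here:

```python
# Sketch: map an x86_64 syscall number (taken from the register dump in the
# OOM stacktrace) to its name. Only a handful of relevant entries included.
X86_64_SYSCALLS = {
    1: "write",
    26: "msync",
    74: "fsync",
    75: "fdatasync",
    162: "sync",
}

def syscall_name(number: int) -> str:
    return X86_64_SYSCALLS.get(number, f"unknown ({number})")

print(syscall_name(1))  # -> write
```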
How much memory should be provided to zdb to prevent it from dying? It got killed by OOM while zstor was writing 20GB.