Discrepancy between get_mem_status and get_mem_avail #3888

Open
stress-tess opened this issue Nov 6, 2024 · 5 comments

@stress-tess
Member

Users reported a discrepancy between get_mem_status and get_mem_avail. In particular, get_mem_status seems to show only ~40% of the memory allocated to the server.

@stress-tess
Member Author

It seems like get_mem_status parses the contents of /proc/meminfo and divides available_mem by total_mem. I'm not familiar enough with that interface to know whether this is sensible. The relevant lines of code are:

proc getAvailMemory() : uint(64) throws {
  if !isSupportedOS() {
    throw new owned ErrorWithContext("getAvailMemory can only be invoked on Unix and Linux systems",
                                     getLineNumber(),
                                     getRoutineName(),
                                     getModuleName(),
                                     "UnsupportedOSError");
  }
  var lines = openReader('/proc/meminfo').lines();
  var line : string;
  var memAvail:uint(64);
  for line in lines do {
    if line.find('MemAvailable:') >= 0 {
      var splits = line.split('MemAvailable:');
      memAvail = splits[1].strip().strip(' kB'):uint(64);
      break;
    }
  }
  return (Math.round(availableMemoryPct/100 * memAvail)*1000):uint(64);
}

proc getTotalMemory() : uint(64) throws {
  if !isSupportedOS() {
    throw new owned ErrorWithContext("getTotalMemory can only be invoked on Unix and Linux systems",
                                     getLineNumber(),
                                     getRoutineName(),
                                     getModuleName(),
                                     "UnsupportedOSError");
  }
  var lines = openReader('/proc/meminfo').lines();
  var line : string;
  var totalMem:uint(64);
  for line in lines do {
    if line.find('MemTotal:') >= 0 {
      var splits = line.split('MemTotal:');
      totalMem = splits[1].strip().strip(' kB'):uint(64);
      break;
    }
  }
  return totalMem*1000:uint(64);
}

proc getLocaleMemoryStatuses() throws {
  var memStatuses: [0..numLocales-1] LocaleMemoryStatus;
  coforall loc in Locales with (ref memStatuses) {
    on loc {
      var availMem = getAvailMemory();
      var totalMem = getTotalMemory();
      var pctAvailMem = (availMem:real/totalMem)*100:int;
      memStatuses[here.id] = new LocaleMemoryStatus(total_mem=totalMem,
                                                    avail_mem=availMem,
                                                    pct_avail_mem=pctAvailMem:int,
                                                    arkouda_mem_alloc=getArkoudaMemAlloc(),
                                                    mem_used=memoryUsed(),
                                                    locale_id=here.id,
                                                    locale_hostname=here.hostname);
    }
  }
  return memStatuses;
}
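
To make the arithmetic easier to follow, here is a minimal Python sketch of the same calculation, run on a made-up /proc/meminfo excerpt. The sample numbers and the value 90 for availableMemoryPct are assumptions for illustration only, and the helper names are hypothetical. One thing the sketch makes visible: avail_mem is already scaled by availableMemoryPct before the percentage is taken, so the reported percentage ends up lower than MemAvailable / MemTotal alone.

```python
# Illustration only: mirrors the Chapel logic above on invented sample data.
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:          200000 kB
MemAvailable:     500000 kB
"""


def read_field_kb(meminfo_text, field):
    """Return the numeric value of `field` (e.g. 'MemTotal') as listed, in kB."""
    for line in meminfo_text.splitlines():
        if line.startswith(field + ":"):
            # "MemTotal:   1000000 kB" -> "1000000"
            return int(line.split(":", 1)[1].split()[0])
    raise KeyError(field)


def mem_status(meminfo_text, available_memory_pct=90.0):
    """Compute (total, avail, pct_avail) the way the quoted code does."""
    total = read_field_kb(meminfo_text, "MemTotal") * 1000
    avail_kb = read_field_kb(meminfo_text, "MemAvailable")
    # avail is scaled by available_memory_pct (assumed 90 here) before use
    avail = round(available_memory_pct / 100 * avail_kb) * 1000
    pct = int(100 * avail / total)
    return total, avail, pct


total, avail, pct = mem_status(SAMPLE_MEMINFO)
raw_pct = int(
    100 * read_field_kb(SAMPLE_MEMINFO, "MemAvailable")
    / read_field_kb(SAMPLE_MEMINFO, "MemTotal")
)
print(pct, raw_pct)
```

With these sample values the scaled percentage is 45 while the unscaled ratio is 50. Whether that scaling contributes to the numbers users reported is not established here.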

@e-kayrakli
Contributor

Some of my notes from slack conversations:

  • get_mem_status gets its information from /proc/meminfo. It is unclear to me how accurate the available-memory information from that interface is when all allocations are Chapel-based. With the fast GASNet segment, we register a good portion of memory at launch time, but that amount will likely be less than the physical memory available as /proc/meminfo reports it.
    • A likely advantage of this method (as we might be observing) is that it also includes non-Chapel-based allocations, such as those made by extern calls into IO libraries.
  • get_mem_avail asks the Chapel runtime what is available. So it will capture the fact that the registered heap (essentially the maximum Chapel-allocatable memory) is smaller than the physical memory, but it will miss extern allocations.
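
As a rough illustration of why the two views can disagree, here is a toy Python model. All names and numbers are hypothetical, and the model deliberately ignores details such as how the OS accounts for registered but untouched pages. It only shows that one view subtracts every allocation from physical memory while the other subtracts only Chapel allocations from a smaller registered heap.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    physical: int         # physical memory on the node (arbitrary units)
    registered_heap: int  # heap registered by the runtime at launch (assumed)
    chapel_used: int      # memory allocated through the Chapel runtime
    extern_used: int      # memory allocated outside Chapel, e.g. by an IO library


def os_view_available(m):
    """OS-style view: physical memory minus all allocations, Chapel or not."""
    return m.physical - m.chapel_used - m.extern_used


def runtime_view_available(m):
    """Runtime-style view: registered heap minus Chapel allocations only."""
    return m.registered_heap - m.chapel_used


node = NodeMemory(physical=100, registered_heap=60, chapel_used=20, extern_used=10)
print(os_view_available(node), runtime_view_available(node))
```

In this toy case the OS-style view reports 70 units available and the runtime-style view reports 40, even though both describe the same node.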

I think this interface could use some improvements. We have a use case where the output from the two functions can lead to confusion, and we can probably use that as a metric for improving the user experience.

@stress-tess stress-tess removed their assignment Nov 13, 2024
@bradcray
Contributor

For those following this issue, in today’s call, Mike mentioned that the system is (mis?)configured in a way that only makes half of the memory available per node, which probably accounts for part of what we’re seeing here. The other question is why some interfaces are reporting the full memory if that’s the case.

With a quick check, I’m seeing that the Arkouda memory-related calls refer directly to /proc/meminfo while the Chapel runtime tends to use calls like sysctlbyname(). So, at a guess, maybe one of those reflects the configuration constraints, and the other doesn’t?

@vasslitvinov suggested he would look further into this.

@vasslitvinov
Contributor

Here are Arkouda's various memory-related measurements:

| source | feature | notes |
|---|---|---|
| /proc/meminfo | MemAvailable | |
| /proc/meminfo | MemTotal | |
| pmap | total | |
| SymbolTable | st.memUsed() | |
| Chapel runtime | chpl_memoryUsed() | needs --memTrack |
| Arkouda function | getMemLimit() | |

By Arkouda's definition,

get_mem_avail() + get_mem_used() = getMemLimit()

where getMemLimit() is one of:

| source | feature |
|---|---|
| Chapel's config | memMax, if not zero |
| Chapel's comm | 0.9*chpl_comm_regMemHeapInfo() |
| Chapel's runtime | 0.9*sysconf(_SC_PHYS_PAGES) * page size |
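
The selection above can be sketched in Python as follows. This is an assumption-laden illustration, not Arkouda's implementation: the function names are hypothetical, and the rule that the comm-layer heap size is used whenever it is non-zero is a guess, since the list above only says the limit is "one of" the three.

```python
def get_mem_limit(mem_max, reg_heap_size, phys_pages, page_size):
    """Pick a memory limit in bytes, following the three sources listed above.

    Assumed precedence: explicit memMax, then 90% of the registered heap,
    then 90% of physical memory.
    """
    if mem_max != 0:
        return mem_max
    if reg_heap_size != 0:
        return reg_heap_size * 9 // 10  # 90%, in integer arithmetic
    return phys_pages * page_size * 9 // 10


def get_mem_avail(mem_limit, mem_used):
    """By the stated definition: avail + used = limit."""
    return mem_limit - mem_used


limit = get_mem_limit(mem_max=0, reg_heap_size=1000, phys_pages=10, page_size=4096)
print(limit, get_mem_avail(limit, mem_used=300))
```

On Linux, the last two arguments could be taken from `os.sysconf("SC_PHYS_PAGES")` and `os.sysconf("SC_PAGE_SIZE")`.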

I propose to extend get_mem_status() to show most of the above, with totals over locales. This will give the user a comprehensive view of memory availability. It will show, for example, what is used by Chapel vs. outside of Chapel. We can then drop get_mem_avail and get_mem_used and give get_mem_status a better name, perhaps get_memory_info.

Note that the current implementation of get_mem_used() extrapolates from the initial locale when --memTrack is set, which is an over-estimate; otherwise it reflects memory use only by the symbol table, which is an under-estimate. This affects the value of get_mem_avail() correspondingly.
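
A small Python sketch of the two estimates, using invented per-locale numbers. It assumes that "extrapolates" means multiplying the initial locale's usage by the number of locales, and that the initial locale uses more memory than the others. Both are assumptions made for illustration.

```python
def extrapolated_from_initial_locale(per_locale_used):
    """Estimate total use as (initial locale's use) * (number of locales)."""
    return per_locale_used[0] * len(per_locale_used)


def actual_total(per_locale_used):
    """Sum of what each locale actually uses."""
    return sum(per_locale_used)


# Hypothetical numbers: locale 0 carries extra bookkeeping, so it uses more.
runtime_used = [120, 80, 80, 80]       # all allocations, per locale
symbol_table_used = [70, 60, 60, 60]   # symbol-table entries only, per locale

over = extrapolated_from_initial_locale(runtime_used)  # --memTrack path
under = actual_total(symbol_table_used)                # symbol-table path
print(over, actual_total(runtime_used), under)
```

With these numbers the extrapolated figure (480) overshoots the actual total (360), while the symbol-table figure (250) undershoots it.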

@e-kayrakli
Contributor

> It will show, for example, what is used by Chapel vs. outside of Chapel

What do you think this will look like in practice? E.g., will it report something like

Memory used by the Chapel runtime: 100
Memory used by the symbol table entries: 80
Memory used as per the OS: 120

> Note that the current implementation of get_mem_used() extrapolates from the initial locale when --memTrack is set, which is an over-estimate; otherwise it reflects memory use only by the symbol table, which is an under-estimate. This affects the value of get_mem_avail() correspondingly.

I didn't realize this was the case. I agree that this needs to change.
