prometheus metrics for job summaries cause ever-increasing memory on leader #18113
Hi @shantanugadgil! Most of the time when folks report high/climbing memory usage on the servers, it's because of evaluations piling up from large batch workloads that aren't getting GC'd. But if that were the case here I'd expect to see the memory increase across all the servers (not just the leader). I would definitely expect the leader to use moderately more memory than the other servers, because it does quite a bit more work: it has long-running processes that start up at leader transition, so I'd expect a spike right away. But as you've noted, the memory isn't yielding back to the OS a while after the leader transition, either. So all that suggests we've got a leak. Given that you can reproduce this reliably, can you generate the following profiles for me from the leader and from one of the servers that has stepped down from leadership for a little while (at least a few minutes)?
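(The original command list did not survive formatting; roughly, these would be agent pprof captures along the following lines, using endpoints from the Nomad agent API docs. The file names are illustrative and the server_id value is taken from the cluster shown later in this thread.)

```shell
# Run against the local agent; repeat with the server_id of the leader and of
# a recently stepped-down server.
SERVER="nomadserver-1.us-west-2"

curl -s -o goroutine.prof "http://localhost:4646/v1/agent/pprof/goroutine?server_id=${SERVER}"
curl -s -o heap.prof      "http://localhost:4646/v1/agent/pprof/heap?server_id=${SERVER}"
curl -s -o profile.prof   "http://localhost:4646/v1/agent/pprof/profile?seconds=30&server_id=${SERVER}"
curl -s -o trace.prof     "http://localhost:4646/v1/agent/pprof/trace?seconds=5&server_id=${SERVER}"
```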
If you don't have …, you can tar/zip those up and send them to [email protected]. I'll do an analysis on those and report back here. I'll also see if I can repro here, but it's possible there's an environment-specific or feature-specific leak, so it'd be best if we can get those profiles. Thanks!
@tgross I got time to look at the agent runtime profiles API. I think this may not work as-is: we have not specified `enable_debug`. After a bit of confusion, and then squinting hard at the docs, I think an …
You don't need to do anything extra for this.
Sorry, but I am confused. We do not have ACLs, i.e. there is no `acl` (and no debug) setting anywhere in the config:

```shell
[root]@[nomadserver-1]
$ grep -r debug /etc/nomad.d/

[root]@[nomadserver-1]
$ grep -r acl /etc/nomad.d/
```

Do you mean that I can somehow get profiling information without doing anything extra here? When I try the following, I get a `Permission denied`.

ref: https://developer.hashicorp.com/nomad/api-docs/agent#sample-request-9

```shell
[ec2-user]@[nomadserver-1]
$ nomad server members
Name                     Address       Port  Status  Leader  Raft Version  Build  Datacenter   Region
nomadserver-1.us-west-2  aa.bb.cc.ddd  4648  alive   false   3             1.6.1  nomadserver  us-west-2
nomadserver-2.us-west-2  aa.bb.cc.ddd  4648  alive   false   3             1.6.1  nomadserver  us-west-2
nomadserver-3.us-west-2  aa.bb.cc.ddd  4648  alive   true    3             1.6.1  nomadserver  us-west-2

[ec2-user]@[nomadserver-1]
$ curl -O http://localhost:4646/v1/agent/pprof/goroutine?server_id=nomadserver-1.us-west-2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0  22636      0 --:--:-- --:--:-- --:--:-- 17000

[ec2-user]@[nomadserver-1]
$ cat goroutine
Permission denied
```
Also, although off-topic ... I think the "/profile" and "/trace" example commands in the docs are incorrect ... shouldn't it be …? In case some extra configuration is needed here, what would need to be done? I don't foresee us enabling ACLs. Maybe we could set `enable_debug = true`?
Ah, I think you're right. I was misreading the chart at https://developer.hashicorp.com/nomad/api-docs/agent#default-behavior, where the first row very clearly shows that with `enable_debug` unset and ACLs disabled, these endpoints are not available. Sorry about that. So yes, you'll need to set `enable_debug = true` on the servers.
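A minimal sketch of that change, assuming an HCL drop-in file under /etc/nomad.d/ and a systemd-managed agent (both assumptions based on the paths and OS mentioned in this thread):

```shell
# Add the debug setting in a new config file and restart the agent so the
# pprof endpoints become available.
cat <<'EOF' | sudo tee /etc/nomad.d/debug.hcl
# Expose the agent's runtime profiling (pprof) endpoints.
enable_debug = true
EOF
sudo systemctl restart nomad
```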
🤦 Oops, yes! I'll try to get a PR to fix that shortly. (Edit: #18142)
I have added `enable_debug = true`. I have confirmed that the previously failing commands now succeed and generate the expected pprof data. As I have restarted the agents, the symptom will take about 14 days to manifest itself.
Fortunately we don't need to wait until the memory usage is super high, only until it has increased. If you were to take the profiles now and then again, say, 24h from now, we should definitely see a difference between the two (because your graphs show an increase of several GB a day). That should be enough to debug what's happening here.
Done, sent some logs.
Received. As you noted in the email, these are just the goroutines, so still incomplete. The leader (server-1) has almost the same number of goroutines in the Aug 5 profile as in the Aug 7 profile (635 vs 641), and that small variation is just servicing RPC requests. The total number feels unusually high for a server, but what immediately jumps out at me is that these servers have goroutines in the client allocrunner code paths even though they're running server workloads. So these are both servers and clients! We strongly recommend against this because your workloads can starve the server. So there's no goroutine leak here. I'm eagerly awaiting those heap profiles to see what we find there. I didn't think to ask this before because I expected Nomad to be the only major application on the host, but have you confirmed it's actually the Nomad process that's using up the memory and not something else on the box? (Post-edit remark: I originally had nonsense here about the …)
By the way, if you want to see what I'm seeing with these profiles, you can use the analysis tools in the Go toolchain. A command like the one sketched below shows that the diff is just RPC handlers, as expected.
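A sketch of that kind of goroutine-profile diff, assuming the Go toolchain is installed; the file names are illustrative:

```shell
# Show the top entries of the Aug 7 profile relative to the Aug 5 baseline.
go tool pprof -top -base goroutine-aug5.prof goroutine-aug7.prof

# Or explore the same diff interactively in a browser.
go tool pprof -http=:8080 -base goroutine-aug5.prof goroutine-aug7.prof
```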
The Nomad servers are in a separate … Making the servers also "clients" is a very recent change in our setup (about 2 (two) months ago). Server memory climbing like this is something we have observed in our setup since forever (since about version 1.3.1, I think, not exactly sure). Although we have Slack alerts from DataDog, we need to guarantee that the machine does not go "OOM" in case no one reacts to the alert. Hence, the local "mem free checker" job.

Yes, Nomad is the only major application on this machine. In the next set of logs, I'll send …
@shantanugadgil I've received the new set of profiles... but none of these are the heap profile that I asked you for last week. You've collected and sent the … (Aside, if you hit …)
Ok, got that set of new profiles, thanks @shantanugadgil! What I can tell from the heap and goroutine profiles above is mostly what we're not seeing: …

The next step will be to take the profiles again for the current leader.
102 client nodes (excluding the header line, and the 3 servers)
For "reasons" we have BOTH I just noticed that Once these experiments are done, I could disable the prometheus metrics on the servers. For now leaving the servers alone for the next profile collection.
|
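As a rough way to gauge how many job-summary series the agent's Prometheus endpoint is exposing (a sketch; the metric name prefix is an assumption based on Nomad's nomad.nomad.job_summary.* telemetry keys, and it requires prometheus_metrics = true):

```shell
# Count the exported job-summary series on the local agent.
curl -s "http://localhost:4646/v1/metrics?format=prometheus" \
  | grep -c 'nomad_nomad_job_summary'
```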
Ok, I've had a chance to examine the before and after profiles. tl;dr: you may want to try `disable_dispatched_job_summary_metrics = true`.

As you noted, the resident set size is increasing over time. If we look at the leader in the first bundle we see a resident size of 1.4G, and in the second it's 1.6G, confirming that we have ~200M of growth.
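(The output showing those numbers did not survive formatting; as a sketch, one way to check the Nomad process's resident set with standard procps options:)

```shell
# RSS is reported in kilobytes.
ps -o pid,rss,etime,args -C nomad
```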
If we look at the goroutine profiles with … The major difference I can see is in the heap profiles. As I noted earlier, there's a surprising amount of memory being used in the go-metrics library, specifically in the Prometheus sink and the children of its … In the "after" profile, there's ~787MB of heap memory rooted in go-metrics, up from roughly 650MB in the "before" profile. That's a difference of 137MB, and if I take a diff of the two profiles, I can see that it makes up the meaningful part of the diff. If I go further and use the …, … If we look at the non-leader servers, neither of them is showing such a large chunk of heap in go-metrics. Non-leader servers do emit metrics with … The job summaries metrics (ref …) are published by the leader, which lines up with only the leader's memory growing.
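For reference, a sketch of the kind of heap-profile inspection described above, using the Go toolchain; the file names are illustrative:

```shell
# Top heap consumers by in-use space in the "after" leader profile.
go tool pprof -top -inuse_space heap-leader-after.prof

# Diff the "before" and "after" heaps, narrowing to the metrics code paths.
go tool pprof -top -inuse_space -base heap-leader-before.prof \
  -focus='go-metrics|prometheus' heap-leader-after.prof
```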
@tgross I'll try the above setting, but we do not use a lot of dispatch (parameterized) jobs; we do use a lot of periodic (cron) jobs.

I might even go ahead and disable the Prometheus metrics on the servers.
Oh, I'm realizing that the logic we have for `disable_dispatched_job_summary_metrics` …
I have gone ahead and made the changes to the server configs and restarted the servers.

About `disable_dispatched_job_summary_metrics`: if there are any changes being planned to squish these metrics, do consider how the failing/pending/queued periodic jobs can be tracked going forward.
Yeah, I don't love that we throw out those metrics entirely when `disable_dispatched_job_summary_metrics` is set. We could conceivably merge all the summaries from all the dispatch/periodic instances with the same parent ID, which feels like what most people are going to want. But we'd need to keep a …
Yes, this sounds fair. Tracking via the parent ID … though for … and I am not even considering …
Not without blowing up the cardinality again, which is what we're trying to avoid here. I'm going to rename this issue for clarity and try to carve out a little bit of time to experiment with some options soon. But just a heads up that I'm headed into a vacation until September, so I'll have to pick this back up then. 😀
I think it is the Prometheus telemetry setting and not the periodic job metrics. For 5 (five) days now, I have had the following settings in the Nomad server configuration, and the graph is shown below. The changes were done as per this comment:

```hcl
telemetry {
  publish_allocation_metrics = false
  publish_node_metrics       = true

  # reduce cardinality (?)
  disable_hostname    = true
  collection_interval = "10s"

  datadog_address = "127.0.0.1:8125"

  prometheus_metrics = false

  disable_dispatched_job_summary_metrics = true
}
```
Note the steady increase during the time I was sending debug logs, and no memory increase since I switched off `prometheus_metrics`. I don't think periodic jobs had anything to do with the problem. 🤷‍♂️
Thanks @shantanugadgil. I suspect the Prometheus collector doesn't age out the metrics we're collecting for those periodic jobs the way I'd expect.
@tgross should the title be renamed to something about the Prometheus metrics?
Sure, done. I don't have a fix for this at the moment or the time to dig in further right now, but I'll mark this for roadmapping.
Some memory leak seems to have come back in some form as part of version 1.7.2. Sending logs to [email protected].
Nomad version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c
(this issue has always existed in all previous versions)
Operating system and Environment details
Amazon Linux 2
Issue
The server leader's memory keeps increasing with no sign of it flattening.
We have to periodically restart the leader to counter this problem.
Reproduction steps
You should be able to observe a graph similar to the following in your environment.
Expected Result
Server leader's memory should stay under control.
We run `nomad system gc` and `nomad system reconcile summaries` every hour, but that does not help either.

Actual Result
Leader memory keeps increasing
Job file (if appropriate)
N/A
Nomad Server logs (if appropriate)
N/A
Nomad Client logs (if appropriate)
N/A
FWIW, there are many other memory-related issues, but they don't mention "server leader" specifically, hence creating a new issue.
Example: #17789 (comment)