[BUG] [3006.9] TCPReqClient is leaking connections to master port 4506 #67076
Current situation on one of our masters:

```
root@salt01:~# ./concache-cli.py
=> Got 3000 connected minions from ConCache
root@salt01:~# ./ckminions-cli.py
=> Got 3000 connected minions from CkMinions
root@salt01:~# lsof -ni :4506 | wc -l
24587
```

I would expect a maximum of 3000 connections to 4506, which would be one TCPReqClient connection per minion.
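For cross-checking, here is a hypothetical helper along the same lines as the scripts above (not part of Salt; it assumes psutil is installed and root privileges on the master) that groups the established connections on port 4506 by remote address:

```python
# Hypothetical helper (not part of Salt or the scripts above): count established
# TCP connections to the master's ret_port (4506) grouped by remote address.
from collections import Counter

import psutil

RET_PORT = 4506

per_minion = Counter()
for conn in psutil.net_connections(kind="tcp"):
    if (
        conn.status == psutil.CONN_ESTABLISHED
        and conn.laddr
        and conn.laddr.port == RET_PORT
        and conn.raddr
    ):
        per_minion[conn.raddr.ip] += 1

print(f"{sum(per_minion.values())} established connections to :{RET_PORT}")
# Any remote address holding more than one connection points at leaked TCPReqClient streams.
for ip, count in per_minion.most_common(10):
    print(f"{ip:>15}  {count}")
```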
After a lot of digging, I came up with the following two patches:

```diff
--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py.orig	2024-12-02 15:56:24.820178791 +0100
+++ /opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py	2024-12-02 15:56:53.970493570 +0100
@@ -582,6 +582,8 @@
     def check_close(self):
         if not self.send_future_map:
             self._tcp_client.close()
+            if self._stream is not None:
+                self._stream.close()
             self._stream = None
             self._closing = False
             self._closed = True
@@ -636,7 +638,7 @@
     def _stream_return(self):
         self._stream_return_running = True
         unpacker = salt.utils.msgpack.Unpacker()
-        while not self._closing:
+        while (not self._closed and not self._closing):
             try:
                 wire_bytes = yield self._stream.read_bytes(4096, partial=True)
                 unpacker.feed(wire_bytes)
```

This makes sure that the stream is actually closed; closing the TCP client alone does not close the stream. The patch also makes the while loop in `_stream_return` stop once the client has been closed, otherwise it keeps reading from a stream that has already been set to None and fails with:

```
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py", line 643, in _stream_return
    wire_bytes = yield self._stream.read_bytes(4096, partial=True)
AttributeError: 'NoneType' object has no attribute 'read_bytes'
```
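To illustrate what the first patch restores, here is a minimal asyncio sketch of the same close/read-loop pattern (my own analogue with made-up names, not the Salt/Tornado code): close() must tear down the underlying stream itself, and the read loop must stop as soon as the client is marked closed, not only while it is closing.

```python
# Minimal asyncio analogue of the TCP message client's close/read-loop pattern.
# SketchClient and _read_loop are made up for illustration; Salt's real
# implementation lives in salt/transport/tcp.py and uses Tornado streams.
import asyncio


class SketchClient:
    def __init__(self, host: str, port: int):
        self.host, self.port = host, port
        self._reader = None
        self._writer = None
        self._closing = False
        self._closed = False

    async def connect(self):
        self._reader, self._writer = await asyncio.open_connection(self.host, self.port)
        asyncio.create_task(self._read_loop())

    async def _read_loop(self):
        # Checking only `_closing` (as before the patch) would let the loop run
        # again after close() has finished and touch a torn-down stream.
        while not self._closed and not self._closing:
            try:
                data = await self._reader.read(4096)
                if not data:  # peer closed the connection
                    break
                # ... feed `data` into a message unpacker here ...
            except (ConnectionError, asyncio.CancelledError):
                break

    def close(self):
        self._closing = True
        if self._writer is not None:
            # The actual fix: close the stream itself, not only the client object,
            # so the socket is released instead of lingering.
            self._writer.close()
        self._reader = self._writer = None
        self._closing = False
        self._closed = True
```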
The second patch reconnects the minion's long-running req channel after it loses the connection to a master:

```diff
--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py.orig	2024-12-02 15:25:06.509934340 +0100
+++ /opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py	2024-12-02 15:43:45.281986764 +0100
@@ -2863,6 +2863,12 @@
             self.opts, io_loop=self.io_loop
         )
 
+        if hasattr(
+            self.req_channel, "connect"
+        ):  # TODO: consider generalizing this for all channels
+            log.debug("Reconnecting minion's long-running req channel")
+            yield self.req_channel.connect()
+
         # put the current schedule into the new loaders
         self.opts["schedule"] = self.schedule.option("schedule")
         (
```

Connecting the req channel is normally done on minion startup. The reconnect, however, is handled in `handle_event`, which initialises the req channel again but never connects it. If the minion sends multiple events after reconnecting, the master connection will be re-established by the send function of the TCP message client. This can lead to races where multiple connections are opened in parallel, resulting again in a leak of stream handles.
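A minimal sketch of how such a race can be avoided by making connection setup idempotent (again my own illustration with made-up names, not Salt's code): concurrent senders share one stream instead of each opening their own.

```python
# Hypothetical sketch of serialising connection setup so that concurrent
# senders reuse one stream instead of racing to open parallel connections.
# ReqChannelSketch and its members are made up for illustration.
import asyncio


class ReqChannelSketch:
    def __init__(self, host: str, port: int):
        self.host, self.port = host, port
        self._writer = None
        self._connect_lock = asyncio.Lock()

    async def _ensure_connected(self):
        # Without the lock, two send() calls arriving back to back (e.g. two
        # minion events right after a reconnect) would both see _writer is None
        # and both open a connection; only one of them would ever be closed again.
        async with self._connect_lock:
            if self._writer is None or self._writer.is_closing():
                _, self._writer = await asyncio.open_connection(self.host, self.port)

    async def send(self, payload: bytes):
        await self._ensure_connected()
        self._writer.write(payload)
        await self._writer.drain()
```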
Both issues also seem to be present in 3007.x when the TCP transport is used.
Description
Over time, a salt minion can accumulate a lot of connections to the master on port 4506:
lsof -ni :4506
After a minion restart, the connections go down to 5, but build up again over time. On our previous version (3000.8) only one connection to port 4506 was open at all times.
On the master, single ReqServer workers can get huge:
excerpt from top
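A hypothetical way to track that growth per worker (assumes psutil is installed and that the salt-master workers expose their role in the process title via setproctitle; adjust the match strings if yours differ):

```python
# Hypothetical helper: print RSS of salt-master worker processes so the growth
# of individual ReqServer/MWorker processes can be watched over time.
import psutil

for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmd = " ".join(proc.info["cmdline"] or [])
    if "salt-master" in cmd and ("MWorker" in cmd or "ReqServer" in cmd):
        rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
        print(f"{proc.info['pid']:>7}  {rss_mb:10.1f} MiB  {cmd}")
```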
Setup
Minion config (directories excluded)
Master config (directories excluded)
Steps to Reproduce the behavior
Right after the start of the minion, at least 3 connections to master port 4506 are already open:
If you run schedules that execute states, there might be more open connections. Check the connections again after a day or a few days, depending on the activity on your servers (salt-call from the minion, schedules on the minion, etc.). The number of connections should have increased. On our systems we have five schedules with varying intervals (180s ± 60s -> most frequent, 14400s ± 2700s -> least frequent), all of which had return_job: True, which is the default.
Expected behavior
The minion has one permanent connection to master:4505 and on-demand connections to master:4506.
Versions Report
salt --versions-report (Minion)
salt --versions-report (Master)
Additional context
There were multiple attempts at fixing this, but those mostly seemed to address ZeroMQ, e.g. Bug #61837 and PRs #65508, #65061, #65247 and #65559.