[BUG] [3006.9] TCPReqClient is leaking connections to master port 4506 #67076

Open
yiip87 opened this issue Nov 28, 2024 · 3 comments
Labels
Bug (broken, incorrect, or confusing behavior), needs-triage

yiip87 commented Nov 28, 2024

Description
Over time, a Salt minion can accumulate a large number of connections to the master on port 4506:

lsof -ni :4506
COMMAND      PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
/opt/salt 484470 root   17u  IPv4 6186861      0t0  TCP 87.XXX.XXX.XXX:52452->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   24u  IPv4 6186867      0t0  TCP 87.XXX.XXX.XXX:52460->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   25u  IPv4 6186868      0t0  TCP 87.XXX.XXX.XXX:52472->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   26u  IPv4 6194226      0t0  TCP 87.XXX.XXX.XXX:52714->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   27u  IPv4 6778844      0t0  TCP 87.XXX.XXX.XXX:40964->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   28u  IPv4 6352325      0t0  TCP 87.XXX.XXX.XXX:35408->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   29u  IPv4 6778845      0t0  TCP 87.XXX.XXX.XXX:40968->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   30u  IPv4 6974213      0t0  TCP 87.XXX.XXX.XXX:46170->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   31u  IPv4 6977733      0t0  TCP 87.XXX.XXX.XXX:46172->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   32u  IPv4 7174154      0t0  TCP 87.XXX.XXX.XXX:57334->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   33u  IPv4 6352610      0t0  TCP 87.XXX.XXX.XXX:35418->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   34u  IPv4 6191663      0t0  TCP 87.XXX.XXX.XXX:52698->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   35u  IPv4 7174692      0t0  TCP 87.XXX.XXX.XXX:57336->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   36u  IPv4 8416860      0t0  TCP 87.XXX.XXX.XXX:39858->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   37u  IPv4 7364717      0t0  TCP 87.XXX.XXX.XXX:44166->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   38u  IPv4 6493059      0t0  TCP 87.XXX.XXX.XXX:56894->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   39u  IPv4 6495901      0t0  TCP 87.XXX.XXX.XXX:56902->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   40u  IPv4 8419406      0t0  TCP 87.XXX.XXX.XXX:39864->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   41u  IPv4 7363063      0t0  TCP 87.XXX.XXX.XXX:44186->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   42u  IPv4 8660855      0t0  TCP 87.XXX.XXX.XXX:36052->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   43u  IPv4 7550338      0t0  TCP 87.XXX.XXX.XXX:49082->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   44u  IPv4 8665928      0t0  TCP 87.XXX.XXX.XXX:36062->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   49u  IPv4 7553362      0t0  TCP 87.XXX.XXX.XXX:49086->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   50u  IPv4 7809973      0t0  TCP 87.XXX.XXX.XXX:40708->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   51u  IPv4 7815404      0t0  TCP 87.XXX.XXX.XXX:40710->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   55u  IPv4 8209315      0t0  TCP 87.XXX.XXX.XXX:42400->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 484470 root   57u  IPv4 8210883      0t0  TCP 87.XXX.XXX.XXX:42410->10.XXX.XXX.5:4506 (ESTABLISHED)
With up to 3000 minions per master, this leads to oom_killer invocations on the master, killing the master completely or individual `ReqServer MWorker` processes.

After a minion restart, the connection count drops to 5 but builds up again over time. On our previous version (3000.8), only one connection to port 4506 was open at all times.
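To make the build-up easy to observe over time, here is a minimal monitoring sketch (not part of the original report; it assumes the third-party psutil package is installed and typically needs root to see other processes' sockets):

import time

import psutil  # third-party: pip install psutil

MASTER_REQ_PORT = 4506  # Salt master ReqServer port


def count_req_connections():
    """Count ESTABLISHED TCP connections whose remote port is 4506."""
    return sum(
        1
        for conn in psutil.net_connections(kind="tcp")
        if conn.raddr
        and conn.raddr.port == MASTER_REQ_PORT
        and conn.status == psutil.CONN_ESTABLISHED
    )


if __name__ == "__main__":
    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), count_req_connections())
        time.sleep(300)  # sample every 5 minutes

On a healthy minion the printed count should stay flat; on an affected one it climbs with scheduler and salt-call activity.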

On the master, single ReqServer workers can get huge:

Excerpt from `top`:
top - 18:02:48 up 1 day, 22:12,  1 user,  load average: 2.29, 2.09, 1.95
Tasks: 287 total,   2 running, 284 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  0.7 sy,  0.0 ni, 94.6 id,  3.2 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :     30.9 total,      0.4 free,     21.8 used,      9.1 buff/cache
GiB Swap:     14.9 total,      0.3 free,     14.6 used.      9.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2199 salt      20   0 6100.3m   2.3g   5.8m S  22.9   7.4     10,57 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-10
   2188 salt      20   0 4367.6m   2.2g   5.8m S   0.0   7.2 328:30.73 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1
   2207 salt      20   0 5829.6m   2.1g   5.8m S   4.0   6.9      9,35 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-18
   2200 salt      20   0 5525.8m   2.0g   5.9m S   1.0   6.4      8,22 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-11
   2201 salt      20   0 5151.1m   1.8g   5.8m S   3.7   5.8      6,56 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-12
   2205 salt      20   0 3619.7m   1.3g   5.8m S   0.0   4.3 254:38.61 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-16
   2202 salt      20   0 2913.4m 979.8m   5.9m S   0.0   3.1 205:37.18 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-13
   2107 salt      20   0 3812.7m 839.1m   5.0m S   0.0   2.6   0:37.68 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon
   2198 salt      20   0 2274.6m 819.6m   5.7m S   0.0   2.6 155:13.57 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-9
   2203 salt      20   0 1782.1m 597.9m   5.7m S   0.0   1.9 129:52.60 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-14
   2193 salt      20   0 1255.6m 410.9m   5.5m S   0.0   1.3  79:11.94 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5
   2208 salt      20   0 1478.6m 408.7m   5.8m S   0.0   1.3  82:23.37 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-19
   2206 salt      20   0 1180.7m 330.7m   5.4m S   0.0   1.0  58:19.45 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-17
   2195 salt      20   0 1085.7m 328.9m   5.5m S   0.0   1.0  53:19.51 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7
   2187 salt      20   0 1217.6m 317.1m   5.4m S   0.0   1.0  62:02.60 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0
   2194 salt      20   0 1101.8m 304.7m   5.6m S   0.0   1.0  56:01.66 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6
   2189 salt      20   0 1076.1m 294.7m   5.5m S   0.0   0.9  49:39.19 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2
   2197 salt      20   0 1147.0m 292.2m   5.5m S   0.0   0.9  43:41.30 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-8
   2204 salt      20   0 1052.0m 286.9m   5.5m S   0.0   0.9  55:27.10 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-15
   2191 salt      20   0 1064.6m 282.2m   5.4m S   0.0   0.9  55:14.47 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3
   2192 salt      20   0 1048.0m 276.7m   5.5m S   0.0   0.9  60:56.03 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4
   2109 salt      20   0 1417.4m 192.8m   4.3m S   0.0   0.6  19:39.64 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Engine(salt.loaded.int.engines.reactor) Reactor-3:1
   2108 salt      20   0  300.5m 102.0m   3.3m S   0.3   0.3   5:17.45 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
 377163 salt      20   0  369.2m  55.4m   5.8m R  18.3   0.2   4:07.55 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance

Setup

Minion config (directories excluded)
acceptance_wait_time: 60
auth_timeout: 30
auth_tries: 2
keysize: 2048
log_level: error
log_level_logfile: info
master:
- 10.XXX.XXX.5
- 10.XXX.XXX.6
- 10.XXX.XXX.7
- 10.XXX.XXX.8
- 10.XXX.XXX.9
- 10.XXX.XXX.10
- 10.XXX.XXX.11
- 10.XXX.XXX.12
master_alive_interval: 240
master_failback: true
master_failback_interval: 1800
master_type: failover
random_master: true
random_reauth_delay: 270
recon_default: 1000
recon_max: 199000
recon_randomize: true
retry_dns: 0
transport: tcp
verify_master_pubkey_sign: true
Master config (directories excluded)
interface: 10.XXX.XXX.5
transport: tcp
keep_jobs_seconds: 86400
loop_interval: 60
job_cache: True
minion_data_cache: True
max_minions: 3000
con_cache: True
worker_threads: 20
master_sign_pubkey: True
master_pubkey_signature: pubkey_signature
master_use_pubkey_signature: True
rotate_aes_key: False
pillar_includes_override_sls: True
log_level: info
log_level_logfile: info
  • on-prem machine
  • onedir packaging

Steps to Reproduce the behavior

  • Install one or more salt-master servers and one salt-minion on different machines
  • Configure them with the configs above in /etc/salt/master.d/custom.conf and /etc/salt/minion.d/custom.conf respectively

Right after the minion starts, at least 3 connections to master port 4506 are already open:

~ # lsof -ni :4506
COMMAND    PID USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
/opt/salt 5197 root   17u  IPv4 237399288      0t0  TCP 87.XXX.XXX.XXX:39310->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 5197 root   24u  IPv4 237395968      0t0  TCP 87.XXX.XXX.XXX:39326->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 5197 root   25u  IPv4 237399295      0t0  TCP 87.XXX.XXX.XXX:39334->10.XXX.XXX.5:4506 (ESTABLISHED)

If you run schedules that execute states, there may be more open connections. Check the connections again after a day or a few days, depending on the activity on your servers (salt-call from the minion, schedules on the minion, etc.). The connection count should increase. On our systems we have five schedules with varying intervals (180s ± 60s being the most frequent, 14400s ± 2700s the least frequent), all of which have return_job: True, which is the default.
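For illustration, a hypothetical example of what one such schedule entry looks like (job name and values invented; shown as the Python dict Salt builds from the YAML minion config, since opts["schedule"] comes up again in the patches below):

# Hypothetical schedule entry; Salt parses the YAML minion config into
# a dict like this under opts["schedule"].
schedule = {
    "apply_states_frequently": {      # invented job name
        "function": "state.apply",
        "seconds": 180,               # base interval
        "splay": 60,                  # randomize the start by up to 60s
        "return_job": True,           # default: job returns go to master:4506
    },
}

Every firing of such a job with return_job: True produces traffic on the req channel, which is what makes the leak visible at this cadence.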

Expected behavior
The minion has one permanent connection to master:4505 and on-demand connections to master:4506.

Versions Report

salt --versions-report (Minion)
Salt Version:
          Salt: 3006.9

Python Version:
        Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.17.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 12 bookworm
        locale: utf-8
       machine: x86_64
       release: 6.1.114-xenu
        system: Linux
       version: Debian GNU/Linux 12 bookworm
salt --versions-report (Master)
Salt Version:
          Salt: 3006.9

Python Version:
        Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.17.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 12 bookworm
        locale: utf-8
       machine: x86_64
       release: 6.1.0-28-amd64
        system: Linux
       version: Debian GNU/Linux 12 bookworm

Additional context
There were multiple attempts at fixing this, but those mostly seemed to address ZeroMQ, e.g. Bug #61837 and PRs #65508, #65061, #65247, and #65559.


yiip87 commented Nov 28, 2024

Current situation on one of our masters:

root@salt01:~# ./concache-cli.py
=> Got 3000 connected minions from ConCache

root@salt01:~# ./ckminions-cli.py
=> Got 3000 connected minions from CkMinions

root@salt01:~# lsof -ni :4506 | wc -l
24587

I would expect a maximum of 3000 connections to 4506, which would be one TCPReqClient connection per minion.

yiip87 commented Dec 2, 2024

After a lot of digging, I came up with the following two patches:

--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py.orig	2024-12-02 15:56:24.820178791 +0100
+++ /opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py	2024-12-02 15:56:53.970493570 +0100
@@ -582,6 +582,8 @@
     def check_close(self):
         if not self.send_future_map:
             self._tcp_client.close()
+            if self._stream is not None:
+                self._stream.close()
             self._stream = None
             self._closing = False
             self._closed = True
@@ -636,7 +638,7 @@
     def _stream_return(self):
         self._stream_return_running = True
         unpacker = salt.utils.msgpack.Unpacker()
-        while not self._closing:
+        while (not self._closed and not self._closing):
             try:
                 wire_bytes = yield self._stream.read_bytes(4096, partial=True)
                 unpacker.feed(wire_bytes)

The first hunk makes sure the stream is actually closed. Closing the TCPClientKeepAlive instance in self._tcp_client only closes its resolver; it doesn't touch the stream, so the stream has to be closed explicitly as well.
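For context, a minimal sketch (plain Tornado, with invented class and attribute names mirroring the ones above, not the actual Salt code) of the cleanup order the hunk establishes:

import tornado.tcpclient


class ReqClientSketch:
    """Illustration only: tornado.tcpclient.TCPClient.close() shuts down
    the client's resolver, but any IOStream previously returned by
    connect() stays open until it is closed explicitly."""

    def __init__(self):
        self._tcp_client = tornado.tcpclient.TCPClient()
        self._stream = None  # IOStream set by an earlier connect()

    def check_close(self):
        self._tcp_client.close()    # closes the resolver only
        if self._stream is not None:
            self._stream.close()    # this is what closes the socket
        self._stream = None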

This patch also makes the while loop in _stream_return watch for self._closed in addition to self._closing, since the latter is only true for a very short time (until the stream is closed, that is). This prevents the following exception from being thrown:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/tcp.py", line 643, in _stream_return
    wire_bytes = yield self._stream.read_bytes(4096, partial=True)
AttributeError: 'NoneType' object has no attribute 'read_bytes'

The second patch reconnects the minion's long-running req channel after it loses the connection to a master:

--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py.orig	2024-12-02 15:25:06.509934340 +0100
+++ /opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py	2024-12-02 15:43:45.281986764 +0100
@@ -2863,6 +2863,12 @@
                         self.opts, io_loop=self.io_loop
                     )

+                    if hasattr(
+                        self.req_channel, "connect"
+                    ):  # TODO: consider generalizing this for all channels
+                        log.debug("Reconnecting minion's long-running req channel")
+                        yield self.req_channel.connect()
+
                     # put the current schedule into the new loaders
                     self.opts["schedule"] = self.schedule.option("schedule")
                     (

This is normally done on minion startup in connect_master:
https://github.com/saltstack/salt/blob/3006.x/salt/minion.py#L1401-L1405

However, the reconnect is done in `handle_event`, where the req channel is initialised again but never connected:
https://github.com/saltstack/salt/blob/3006.x/salt/minion.py#L2865

If the minion sends multiple events after reconnecting, the master connection will be re-established by the send function of the TCP message client. This can lead to races where multiple connections are opened in parallel, again leaking stream handles.
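One way to avoid that kind of race, as a hedged sketch (hypothetical names, not the actual Salt fix): serialize connection establishment behind a tornado.locks.Lock so that concurrent send() calls share a single connect attempt instead of each opening a stream:

import tornado.locks
import tornado.tcpclient


class SingleConnectClient:
    """Hypothetical sketch: the first sender connects; later senders
    that raced with it reuse the same stream instead of opening more."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self._tcp_client = tornado.tcpclient.TCPClient()
        self._stream = None
        self._connect_lock = tornado.locks.Lock()

    async def _ensure_connected(self):
        async with self._connect_lock:
            # Re-check under the lock: another sender may have connected
            # while we were waiting for it.
            if self._stream is None:
                self._stream = await self._tcp_client.connect(
                    self.host, self.port
                )
        return self._stream

    async def send(self, payload: bytes):
        stream = await self._ensure_connected()
        await stream.write(payload)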

yiip87 commented Dec 2, 2024

Both issues also seem to be present in 3007.x when the TCP transport is used.
