[BUG] Long duration zookeeper health probe makes lots of 'defunct' child processes #1397

nayutah · 2025-01-10T05:35:52Z

Describe the bug
When running a ZooKeeper cluster, the health probe can take more than 1 minute to complete and then times out. Meanwhile, kb-agent may not call 'waitpid' on the probe script properly, so the periodic probe scripts leave many 'defunct' processes in the container.
I traced the probe process in the container and found that, while this health probe runs:

        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
          exec:
            command:
              - /bin/bash
              - -c
              - |
                . "$ZOOBINDIR"/zkEnv.sh > /dev/null
                java -cp "$CLASSPATH" $CLIENT_JVMFLAGS $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
                localhost 2181 ruok | grep imok

the JVM spends a very long time on JIT compilation:

288835 1001 20 0 2492424 47588 25044 R 0.3 0.3 0:00.19 C2 CompilerThread
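To put a number on the probe latency, the command can be timed by hand inside the container and compared with `timeoutSeconds: 5`. This is only a rough sketch: `elapsed_ms` is a hypothetical helper name, and it assumes bash and GNU `date` (for `%N`) are available in the image.

```shell
#!/usr/bin/env bash
# elapsed_ms (hypothetical helper): run a command and print its wall-clock
# duration in milliseconds. Assumes GNU date, which supports %N (nanoseconds).
elapsed_ms() {
  local start end
  start=$(date +%s%N)
  # Output and exit status of the command are discarded; only timing matters here.
  "$@" >/dev/null 2>&1 || true
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}
```

For example, `elapsed_ms bash -c 'echo ruok | timeout 2 nc localhost 2181'` times the nc variant; the java command from the probe above can be wrapped in `bash -c '...'` the same way.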

When I replace the 'java' command with nc for the health probe, the timeout and defunct-process problems disappear.
This defect may occur frequently on VMs with low-performance CPUs.
So I suggest using the following health probe command:

            - bash
            - -c
            - |
              ZK_CLIENT_PORT=2181
              echo "ruok" | timeout 2 nc localhost "${ZK_CLIENT_PORT}" | grep imok
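One detail worth checking in any shell-based probe: the kubelet treats an exec probe as passing when the command exits with status 0, so the last command in the script should fail unless ZooKeeper actually answered `imok`. A minimal sketch of such a check, where `zk_reply_ok` is a hypothetical helper name and not something taken from the addon:

```shell
#!/usr/bin/env bash
# zk_reply_ok (hypothetical helper): read a four-letter-word reply on stdin
# and exit 0 only if one whole line of it is exactly "imok".
zk_reply_ok() {
  grep -qx 'imok'
}
```

It would be used as the final stage of the probe pipeline, e.g. `echo ruok | timeout 2 nc localhost 2181 | zk_reply_ok`, so that an empty or unexpected reply makes the probe fail.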

To Reproduce
Steps to reproduce the behavior:

Run a ZooKeeper cluster on a GCP 4C16G VM.
Wait for the defunct processes to appear.
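To check whether defunct processes are piling up, the process states can be counted inside the container. A small sketch, assuming a procps-style `ps` that accepts `-eo stat=`; `count_defunct` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# count_defunct (hypothetical helper): read process state codes on stdin,
# one per line, and print how many are zombies (state starts with "Z").
count_defunct() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}
```

Running `ps -eo stat= | count_defunct` every few probe periods should show the count growing if the probe children are not being reaped.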

shanshanying · 2025-01-10T05:37:59Z

@kubeJocker PTAL

nayutah · 2025-01-10T05:53:56Z

For reference, see the ClickHouse addon:
addons/clickhouse/templates/cmpd-keeper.yaml- # command: ['/bin/bash', '-c', 'echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok']

nayutah assigned shanshanying Jan 10, 2025

shanshanying assigned kubeJocker and unassigned shanshanying Jan 10, 2025

kubeJocker mentioned this issue Jan 10, 2025

chore: replace java by nc for zookeeper probe #1399

Merged

kubeJocker closed this as completed in #1399 Jan 10, 2025
