Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Long duration zookeeper health probe makes lots of 'defunct' child processes #1397

Closed
nayutah opened this issue Jan 10, 2025 · 2 comments · Fixed by #1399
Closed

[BUG] Long duration zookeeper health probe makes lots of 'defunct' child processes #1397

nayutah opened this issue Jan 10, 2025 · 2 comments · Fixed by #1399
Assignees

Comments

@nayutah
Copy link
Contributor

nayutah commented Jan 10, 2025

Describe the bug
When running a zookeeper cluster, the health probe operation may take more than 1 minute to complete and timeout. At the same time, the kb-agent may not 'waitpid' for the probe script properly, finally, the periodic probe scripts incurs lots of 'defunct' processes in the container.
I observe and trace the probe process in the container, find that in the process of health probe:

        livenessProbe:
          failureThreshold: 6
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
          exec:
            command:
              - /bin/bash
              - -c
              - |
                . "$ZOOBINDIR"/zkEnv.sh > /dev/null
                java -cp "$CLASSPATH" $CLIENT_JVMFLAGS $JVMFLAGS org.apache.zookeeper.client.FourLetterWordMain \
                localhost 2181 ruok | grep imok

the jvm takes a very long time to do the JIT work

288835 1001 20 0 2492424 47588 25044 R 0.3 0.3 0:00.19 C2 CompilerThread

when I use nc to replace 'java' cmd for health probe, the timeout and defunct problems disappear.
And this defection may occur frequently on VM with low-performance CPUs.
So I suggest using health probe cmd as belows:

            - bash
            - -c
            - |
              ZK_CLIENT_PORT=2181
              echo "ruok" | timeout 2 nc localhost ${ZK_CLIENT_PORT}; if [ $? -eq 0 ]; then echo "yes"; fi

To Reproduce
Steps to reproduce the behavior:

  1. run a zookeeper cluster in GCP 4C16G vm
  2. wait for the defunct processes to come out

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@shanshanying
Copy link
Contributor

@kubeJocker PTAL

@nayutah
Copy link
Contributor Author

nayutah commented Jan 10, 2025

refer to clickhouse:
addons/clickhouse/templates/cmpd-keeper.yaml- # command: ['/bin/bash', '-c', 'echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok']

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants