bug: Creating and then deleting workspaces causes leaked goroutines #3350

Open
julian-hj opened this issue Mar 25, 2025 · 4 comments
@julian-hj

Describe the bug

baseline pprof from kcp before workspace creation/deletion:
pre-ws.pprof.txt

pprof after creating and then deleting an org workspace 100 times:
post-ws.pprof.txt

go tool rendering of pprof:

[image: go tool pprof graph rendering of the goroutine profile]

Steps To Reproduce

  1. Patch kcp to enable pprof: addpprof.patch (see the sketch after this list)
  2. Start kcp in tilt and set the kubeconfig to the kind cluster
  3. Port-forward the pprof port: kubectl port-forward -n kcp-alpha deployment/alpha 6060:6060 &
  4. Get a baseline pprof, e.g. curl http://localhost:6060/debug/pprof/goroutine > pre-ws.pprof
  5. Create a simple YAML file for an organization workspace: org-ws.yaml.txt
  6. Target kcp, then create and delete the same org 100 times in a loop: for i in {1..100}; do k apply -f org-ws.yaml; sleep 3; k delete -f org-ws.yaml; sleep 3; done
  7. Collect pprof again and observe the large number of goroutines.
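The attached addpprof.patch is not reproduced here; for orientation only, a minimal sketch of the usual way such a patch exposes pprof in a Go binary (assuming the standard net/http/pprof approach, which may differ from the actual patch) looks like this:

```go
// Minimal sketch of a typical pprof-enabling change; the real addpprof.patch
// attached to this issue may wire things up differently.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Side HTTP server exposing the pprof endpoints on :6060, matching the
	// port-forward in step 3.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the server would start here in the real binary ...
	select {} // keep this sketch running so the endpoint stays reachable
}
```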

Expected Behaviour

Fewer goroutines: goroutines started for a workspace should be cleaned up when the workspace is deleted, so the count should return to roughly the pre-test baseline.

Additional Context

This looks related to #3016, although that issue has long since been fixed and did not require new workspaces to reproduce.

As far as I can tell, creating a new workspace causes the creation of listeners that won't get cancelled until the sharedIndexInformer is terminated.
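For illustration, here is a hedged sketch of that suspected pattern, written against the upstream client-go cache API rather than kcp's forked informers (the function names startWorkspaceController and stopWorkspaceController are hypothetical, and the kcp fork may not expose RemoveEventHandler at all):

```go
// Illustration only: the suspected pattern, shown with upstream client-go types.
// kcp's forked informers (github.com/kcp-dev/apimachinery/v2/third_party/informers)
// likewise spawn processorListener goroutines per registered handler.
package example

import (
	"k8s.io/client-go/tools/cache"
)

// startWorkspaceController stands in for whatever per-workspace setup registers
// handlers on shared informers. Every AddEventHandler call creates a new
// processorListener with its own pop/run goroutines.
func startWorkspaceController(informer cache.SharedIndexInformer) (cache.ResourceEventHandlerRegistration, error) {
	return informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue */ },
		DeleteFunc: func(obj interface{}) { /* enqueue */ },
	})
}

// stopWorkspaceController is the cleanup that appears to be missing when a
// workspace is deleted: without it, the listener's goroutines are only reclaimed
// when the whole sharedIndexInformer is terminated.
func stopWorkspaceController(informer cache.SharedIndexInformer, reg cache.ResourceEventHandlerRegistration) error {
	return informer.RemoveEventHandler(reg)
}
```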

julian-hj added the kind/bug label Mar 25, 2025
@embik
Member

embik commented Mar 26, 2025

Hi @julian-hj, thanks for reporting this issue in such detail! We'll take a look at fixing this.

embik moved this from New to Backlog in kcp Mar 26, 2025
@ntnn
Contributor

ntnn commented Mar 27, 2025

@embik Could you please assign the issue to me? Thanks!

@embik
Member

embik commented Mar 27, 2025

/assign @ntnn

Thanks!

@sanjimoh

The repro was straightforward using the steps @julian-hj described. Below are some of my observations from the collected pprofs.

There is a significant spike in goroutine count, roughly 4.6x:

  • Pre-workspace creation/deletion: 3,392 goroutines
  • Post-workspace creation/deletion: 15,592 goroutines

Major contributors to goroutine growth, deduced from the pprofs:

  1. The informer system shows significant growth, from 285 goroutines (8.40%) to 6,385 (40.95%):
     • github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).pop
     • github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run
     • github.com/kcp-dev/apimachinery/v2/third_party/informers.(*processorListener).run.func1
  2. The wait group and backoff utilities show substantial growth:
     • k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1 increased from 726 (21.40%) to 12,926 (82.90%)
     • k8s.io/apimachinery/pkg/util/wait.BackoffUntil increased from 756 (22.29%) to 6,856 (43.97%)
     • k8s.io/apimachinery/pkg/util/wait.JitterUntil increased from 680 (20.05%) to 6,780 (43.48%)
  3. Channel operations show significant growth:
     • runtime.chanrecv increased from 715 (21.08%) to 6,815 (43.71%)
     • runtime.selectgo increased from 2,074 (61.14%) to 8,176 (52.44%)
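These frames fit together: in the upstream client-go sharedProcessor, which kcp's third_party/informers fork appears to mirror judging by the symbols above, every registered handler gets its own processorListener whose goroutines are started via wait.Group and wait.Until and only exit when the informer stops. A rough, non-authoritative paraphrase of that structure:

```go
// Rough paraphrase (not the actual kcp or client-go source) of how each
// processorListener is started; it shows why one leaked listener contributes
// several goroutines with exactly the frames seen in the profiles.
package example

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type processorListener struct {
	addCh  chan interface{}
	nextCh chan interface{}
}

// pop shuffles notifications from addCh to nextCh; it blocks on channel
// receive/send (the real code selects over a ring buffer), which is where the
// runtime.chanrecv and runtime.selectgo growth comes from.
func (p *processorListener) pop() {
	for obj := range p.addCh {
		p.nextCh <- obj
	}
}

// run delivers notifications to the registered handler. It wraps the loop in
// wait.Until, which calls JitterUntil and BackoffUntil -- the other growing frames.
func (p *processorListener) run() {
	stopCh := make(chan struct{})
	wait.Until(func() { // this closure is the run.func1 frame in the profiles
		for range p.nextCh {
			// call the handler for each notification
		}
		close(stopCh)
	}, time.Second, stopCh)
}

// startListener mirrors how the shared processor starts every listener: two
// wait.Group goroutines per handler (the wait.(*Group).Start.func1 frames) that
// only exit when the informer shuts down and the channels are closed.
func startListener(p *processorListener, wg *wait.Group) {
	wg.Start(p.run)
	wg.Start(p.pop)
}
```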
