
fix(discovery): discovery synchronization for stale lost targets #689

Merged
38 commits merged on Nov 28, 2024

Conversation

@andrewazores (Member) commented on Oct 9, 2024

Welcome to Cryostat! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Fixes: #634

Description of the change:

  1. Changes some logic within the KubeApiDiscovery class to try to ensure that Informer events are handled serially (see sketch 1 below).
  2. Adds logic to the DiscoveryNode utility methods to retrieve existing entities when they already exist, rather than potentially creating duplicates. When many events are processed at once there can be multiple transactions that interrupt each other, and each may try to create the same entities. The added logic lets later executions retrieve and reuse the previously created entities instead of erroneously creating duplicates (see sketch 2 below).
  3. Uses the existing k8s discovery resync period to perform a periodic "full resync". This consists of listing the Informer store for each target namespace and verifying that the database matches the expected set. It should not be strictly necessary if everything else is implemented correctly, but if anything does go wrong it helps keep the database up to date and eventually consistent, even if an Informer notification is missed or handled incorrectly (see sketch 3 below).
  4. (Not directly related to the original change) Adds periodic retry logic to the S3 storage bucket existence/creation check performed at startup. Previously this was done only once when the Cryostat container first started, but there is no guarantee that the storage container is up and ready before Cryostat is. Related to fix(startup): improve startup detection for bucket creation cryostat-storage#23 (see sketch 4 below).
  5. Extracts the JVM ID updating logic into a separate utility service class and decouples it from the Target persistence lifecycle. JVM IDs are now updated asynchronously by a separate worker, so Targets can be persisted in the database without blocking the transaction to open a network connection and retrieve the JVM ID. The initial connection attempt occurs a short delay after the target is first discovered, and a scheduled task fires periodically to connect to any known target that does not yet have a JVM ID and update it (see sketch 5 below).
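
Sketch 1 (for item 1): a minimal illustration of one way to serialize Informer event handling, not the actual KubeApiDiscovery code. All fabric8 ResourceEventHandler callbacks are funneled onto a single-threaded executor so notifications are applied one at a time, in arrival order.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;

// Illustrative only: funnel every Informer callback onto one worker thread so
// add/update/delete notifications are handled serially rather than concurrently.
class SerialPodHandler implements ResourceEventHandler<Pod> {

    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    @Override
    public void onAdd(Pod pod) {
        queue.submit(() -> handle("ADDED", pod));
    }

    @Override
    public void onUpdate(Pod oldPod, Pod newPod) {
        queue.submit(() -> handle("MODIFIED", newPod));
    }

    @Override
    public void onDelete(Pod pod, boolean deletedFinalStateUnknown) {
        queue.submit(() -> handle("DELETED", pod));
    }

    private void handle(String kind, Pod pod) {
        // placeholder for the real discovery transaction
        System.out.printf("%s %s/%s%n", kind,
                pod.getMetadata().getNamespace(), pod.getMetadata().getName());
    }
}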
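
Sketch 2 (for item 2): the "get or create" pattern, shown against a hypothetical Panache entity standing in for DiscoveryNode. The entity fields, query, and service class are illustrative assumptions, not Cryostat's actual schema or API.

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.persistence.Entity;
import jakarta.transaction.Transactional;

// Hypothetical stand-in for Cryostat's DiscoveryNode entity.
@Entity
class Node extends PanacheEntity {
    public String name;
    public String nodeType;
}

@ApplicationScoped
class NodeService {
    // "Get or create": look the entity up first, so a transaction racing with
    // the one that created the row reuses it instead of inserting a duplicate.
    @Transactional
    Node getOrCreate(String name, String nodeType) {
        return Node.<Node>find("name = ?1 and nodeType = ?2", name, nodeType)
                .firstResultOptional()
                .orElseGet(() -> {
                    Node n = new Node();
                    n.name = name;
                    n.nodeType = nodeType;
                    n.persist();
                    return n;
                });
    }
}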
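
Sketch 3 (for item 3): a periodic full resync that reconciles the Informer's cached view of a namespace against the set of targets the database believes exists. The schedule and the placeholder methods are assumptions, not Cryostat's actual implementation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import io.fabric8.kubernetes.api.model.Pod;
import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
class FullResyncJob {

    @Scheduled(every = "5m") // in practice this would reuse the k8s discovery resync period
    void resync() {
        Set<String> observed = listCachedPods().stream()
                .map(p -> p.getMetadata().getNamespace() + "/" + p.getMetadata().getName())
                .collect(Collectors.toSet());
        Set<String> persisted = loadPersistedTargetKeys();

        // anything persisted but no longer observed in the cluster is stale: remove it
        Set<String> stale = new HashSet<>(persisted);
        stale.removeAll(observed);
        stale.forEach(this::removeTarget);

        // anything observed but not persisted was missed earlier: add it
        Set<String> missing = new HashSet<>(observed);
        missing.removeAll(persisted);
        missing.forEach(this::addTarget);
    }

    List<Pod> listCachedPods() { return List.of(); }           // placeholder: the Informer store contents
    Set<String> loadPersistedTargetKeys() { return Set.of(); } // placeholder: query the discovery database
    void removeTarget(String key) { /* placeholder: prune the stale target */ }
    void addTarget(String key) { /* placeholder: register the missed target */ }
}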
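
Sketch 4 (for item 4): retrying the storage bucket existence/creation check until the storage container is reachable, instead of attempting it only once at startup. The bucket name, retry period, and injection of the AWS SDK v2 S3Client are assumptions.

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import software.amazon.awssdk.core.exception.SdkException;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;
import software.amazon.awssdk.services.s3.model.HeadBucketRequest;
import software.amazon.awssdk.services.s3.model.NoSuchBucketException;

@ApplicationScoped
class BucketStartupCheck {

    @Inject
    S3Client s3; // assumed to be provided by the application's S3 configuration

    private final String bucket = "archivedrecordings"; // hypothetical bucket name
    private volatile boolean ready = false;

    @Scheduled(every = "30s")
    void ensureBucket() {
        if (ready) {
            return; // nothing to do once the bucket is known to exist
        }
        try {
            s3.headBucket(HeadBucketRequest.builder().bucket(bucket).build());
            ready = true;
        } catch (NoSuchBucketException e) {
            // storage is reachable but the bucket is missing: create it
            s3.createBucket(CreateBucketRequest.builder().bucket(bucket).build());
            ready = true;
        } catch (SdkException e) {
            // storage not reachable yet: stay not-ready and retry on the next tick
        }
    }
}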
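
Sketch 5 (for item 5): an asynchronous JVM ID worker. Targets are persisted immediately with a null JVM ID, and a scheduled task later connects to any target still missing one, keeping network I/O outside of database transactions. The entity, repository, and connection helper are placeholders, not Cryostat's actual classes.

import java.util.List;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.persistence.Entity;
import jakarta.transaction.Transactional;

// Hypothetical target entity; Cryostat's real Target class differs.
@Entity
class DiscoveredTarget extends PanacheEntity {
    public String connectUrl;
    public String jvmId; // null until resolved by the worker below
}

// Transactional persistence lives in its own bean so the scheduled worker can
// keep network I/O outside of any database transaction.
@ApplicationScoped
class DiscoveredTargetRepo {
    @Transactional
    public List<DiscoveredTarget> missingJvmId() {
        return DiscoveredTarget.list("jvmId is null");
    }

    @Transactional
    public void storeJvmId(Long id, String jvmId) {
        DiscoveredTarget t = DiscoveredTarget.findById(id);
        if (t != null) {
            t.jvmId = jvmId; // managed entity, flushed on commit
        }
    }
}

@ApplicationScoped
class JvmIdUpdater {
    @Inject
    DiscoveredTargetRepo repo;

    // Periodically try to resolve JVM IDs for any target that still lacks one.
    @Scheduled(every = "1m")
    void updateMissingJvmIds() {
        for (DiscoveredTarget target : repo.missingJvmId()) {
            try {
                String jvmId = connectAndReadJvmId(target.connectUrl); // network I/O, no open transaction
                repo.storeJvmId(target.id, jvmId);
            } catch (Exception e) {
                // connection failed; leave jvmId null and retry on a later tick
            }
        }
    }

    String connectAndReadJvmId(String connectUrl) throws Exception {
        return "TODO"; // placeholder for opening a JMX/agent connection to the target
    }
}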

Motivation for the change:

See #634

How to manually test:

  1. Check out and build PR, or use quay.io/andrewazores/cryostat:k8s-discovery-30
  2. Deploy in Kubernetes (or OpenShift) using the Operator or Helm chart. I have been using Helm: helm install cryostat --set authentication.openshift.enabled=true --set core.route.enabled=true --set core.discovery.kubernetes.enabled=true --set core.image.repository=quay.io/andrewazores/cryostat --set core.image.tag=k8s-discovery-30 ./charts/cryostat/
  3. Open UI and go to Topology view to visually observe what Cryostat has discovered
  4. See comments below about using oc rollout restart and oc scale deployment to try to trigger the bad behaviour.

@andrewazores (Member, Author) commented on Oct 9, 2024

Deployed on OpenShift with:

$ helm install cryostat \
  --set authentication.openshift.enabled=true \
  --set core.route.enabled=true \
  --set core.image.repository=quay.io/andrewazores/cryostat \
  --set core.image.tag=k8s-discovery-2 \
  ./charts/cryostat/

Manually testing with various ways to mess around with a sample application deployment:

$ oc rollout restart deployment/quarkus-test
$ oc exec -it deployment/quarkus-test -- /bin/bash -c 'kill 1'
$ oc scale deployment quarkus-test --replicas=3
$ oc scale deployment quarkus-test --replicas=1
$ oc scale deployment quarkus-test --replicas=0

With or without this PR I can eventually get into a bad state as described in the original issue - either a stale discovered Pod that is not really there anymore, or else some Pods that exist but are not discovered.

cryostat-6659dd8598-d2n75-cryostat.log

After getting Cryostat into this state, with the various discovery exceptions logged above, I can no longer get it to discover other sample applications - even fully undeploying the sample application and deploying it fresh, or restarting or re-scaling the deployment.

@andrewazores (Member, Author):

$ oc scale deployment quarkus-test --replicas=3
$ oc rollout restart deployment/quarkus-test

and repeatedly restarting the deployment to cause rollouts of multiple replicas seems to be a good way to eventually trigger the buggy behaviour. It isn't deterministic and there are definitely multiple worker threads involved, so this looks like a thread synchronization issue and/or race condition.

@andrewazores (Member, Author):

/build_test

Workflow started at 10/17/2024, 12:03:01 PM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11388705321

@andrewazores (Member, Author):

/build_test

Workflow started at 10/18/2024, 10:35:06 AM. View Actions Run.

@andrewazores marked this pull request as ready for review on October 18, 2024 at 14:35

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11405720776

@andrewazores (Member, Author):

Seems like this is not a proper fix yet, only a mitigation - the problem occurs less frequently, but I can still get it to happen on occasion. I'll keep working on it.

@andrewazores removed the request for review from Josh-Matsuoka on October 22, 2024 at 17:56
@andrewazores marked this pull request as draft on October 22, 2024 at 17:56
@andrewazores force-pushed the k8s-discovery-sync branch 2 times, most recently from 6805f84 to 6b4a902, on October 23, 2024 at 14:37
@andrewazores (Member, Author):

/build_test

Workflow started at 10/23/2024, 10:45:31 AM. View Actions Run.

No GraphQL schema changes detected.

No OpenAPI schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/11482383330

@andrewazores (Member, Author):

/build_test

Workflow started at 10/23/2024, 11:19:56 AM. View Actions Run.

No OpenAPI schema changes detected.

@andrewazores (Member, Author):

/build_test

Workflow started at 11/25/2024, 2:04:39 PM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅
https://github.com/cryostatio/cryostat/actions/runs/12016993150

@andrewazores merged commit c3ee63b into cryostatio:main on Nov 28, 2024 (9 checks passed)
@andrewazores deleted the k8s-discovery-sync branch on November 28, 2024 at 20:07
Successfully merging this pull request may close these issues:

[Bug] Kubernetes discovery - targets not being removed after pods tear down