Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

neonvm: add readiness probe for sysfs scaling #1190

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mikhail-sakhnov
Copy link
Contributor

@mikhail-sakhnov mikhail-sakhnov commented Dec 30, 2024

Call runner /ready which, in sysfs scaling mode, proxifies to daemon's /cpu to check if runner pod and vm is ready. Runner's endpoint /ready does nothing in case of qmp scaling model.
Move neonvm-daemon line in the inittab to have it starting later since we use it for the health checks.

It is a blocker for #1141 because working without ready probe introduces race in some e2e tests and kuttl tries to execute shell command kill while VM is not fully booted and acpid rules are not in action yet.

It closes #1147 which is in the backlog but since it is a blocker I extracted it from the PR #1141

#1147

Copy link

github-actions bot commented Dec 30, 2024

No changes to the coverage.

HTML Report

Click to open

@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from e5c9385 to 249d38b Compare December 30, 2024 14:01
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from 249d38b to 1ea3177 Compare January 7, 2025 15:06
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, left a couple questions

vm-builder/files/inittab Show resolved Hide resolved
pkg/neonvm/controllers/vm_controller.go Outdated Show resolved Hide resolved
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch 3 times, most recently from 61c45f4 to e71bf59 Compare January 23, 2025 18:59
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from e71bf59 to 5825069 Compare February 3, 2025 11:41
@mikhail-sakhnov mikhail-sakhnov marked this pull request as ready for review February 3, 2025 12:09
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch 2 times, most recently from 17b03f6 to abf64fe Compare February 3, 2025 12:27
@sharnoff sharnoff self-assigned this Feb 3, 2025
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions — high level, looks fine, just fussing over the details.

neonvm-runner/cmd/main.go Outdated Show resolved Hide resolved
neonvm-runner/cmd/main.go Show resolved Hide resolved
pkg/neonvm/controllers/vm_controller.go Outdated Show resolved Hide resolved
pkg/neonvm/controllers/vm_controller.go Outdated Show resolved Hide resolved
pkg/neonvm/controllers/vm_controller.go Outdated Show resolved Hide resolved
Call runner /ready which, in sysfs scaling mode, proxifies to daemon's /cpu to
check if runner pod and vm is ready. Runner's endpoint /ready does nothing in
case of qmp scaling model.

Move neonvm-daemon line in the inittab to start it right before vmstart.

Modify logic in the migration controller to not waiting for the pod
readiness - neonvm-daemon doesn't start until the migration is finished.
De-facto, that doesn't change behavior for the migration at all since
before the PR we had no readiness probe.

Signed-off-by: Mikhail Sakhnov <[email protected]>
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some blocking questions then will be good to merge

default:
panic(fmt.Errorf("unknown pod phase: %q", pod.Status.Phase))
}
}

// isRunnerPodReady returns whether the runner pod is ready respecting the readiness probe of its containers.
func isRunnerPodReady(pod *corev1.Pod) runnerStatusKind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: functions titled isFoo should return bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to runnerContainerStatus(..)

if containerSpec.Name == "neonvm-runner" {
for _, env := range containerSpec.Env {
if env.Name == "RECEIVE_MIGRATION" && env.Value == "true" {
return runnerRunning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question(blocking): Does this mean that we'll return runnerRunning even when the neonvm-runner container is not yet running? (or alternately: can this return ready even when the pod is still pending?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I introduce the new status and change the vmmigration controller to use it instead of running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the discussion with you and @Omrigan I changed the code here to check the pod references to determine if it is a target pod for migration, which looks like the most reliable way we can currently afford.

Comment on lines 885 to 889
for _, containerSpec := range pod.Spec.Containers {
if containerSpec.Name == "neonvm-runner" {
for _, env := range containerSpec.Env {
if env.Name == "RECEIVE_MIGRATION" && env.Value == "true" {
return runnerRunning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(non-blocking): Maybe extract the "is this a target pod" check into a separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pkg/neonvm/controllers/vm_controller.go Outdated Show resolved Hide resolved
@@ -713,6 +713,7 @@ func (r *VirtualMachineMigrationReconciler) targetPodForVirtualMachine(
pod.Name = migration.Status.TargetPodName

// add env variable to turn on migration receiver
// TODO: make it false or empty after the migration is done to enable correct readiness probe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question(non-blocking): what's the plan here? I don't think we can modify container environments at runtime... And, is this TODO tracked somewhere?

Copy link
Contributor Author

@mikhail-sakhnov mikhail-sakhnov Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed it verbally with @Omrigan and I promised to create an issue for that, since I am a bit afraid of the scenario like:

  • we start VM
  • we migrate VM to a new pod
  • new pod has RECEIVE_MIGRATION eq true which means that in some hypothetical scenario of restarting pod (which tbh I am not even sure is possible in our lifecycle) qemu will not boot, but will wait for some migration (which is not actually happening at the moment).

Since we have already an issue to audit vmmigration controller, I will add it as a context there.

@sharnoff sharnoff assigned mikhail-sakhnov and unassigned sharnoff Feb 7, 2025
@mikhail-sakhnov mikhail-sakhnov marked this pull request as draft February 11, 2025 11:48
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from 22644c2 to b21b327 Compare February 11, 2025 12:10
@mikhail-sakhnov mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from b21b327 to e243d9d Compare February 11, 2025 12:11
@mikhail-sakhnov mikhail-sakhnov marked this pull request as ready for review February 11, 2025 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add readiness probe for the vm-runner process
3 participants