neonvm: add readiness probe for sysfs scaling #1190
base: main
Conversation
No changes to the coverage.
Mostly LGTM, left a couple questions
Left some questions — high level, looks fine, just fussing over the details.
Call the runner's /ready endpoint which, in sysfs scaling mode, proxies to the daemon's /cpu to check whether the runner pod and the VM are ready. The runner's /ready endpoint does nothing in the case of the qmp scaling model. Move the neonvm-daemon line in the inittab to start it right before vmstart. Modify the logic in the migration controller to not wait for pod readiness, since neonvm-daemon doesn't start until the migration is finished. De facto, that doesn't change the behavior of migration at all, since before this PR we had no readiness probe. Signed-off-by: Mikhail Sakhnov <[email protected]>
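For illustration only, here is a minimal sketch of what such a proxying /ready handler could look like. The daemon address, port, and the scalingModeSysfs flag are assumptions for the sketch, not the PR's actual code:

```go
package main

import "net/http"

// Assumed configuration for this sketch; the real runner wires these differently.
var (
	scalingModeSysfs = true
	neonvmDaemonAddr = "127.0.0.1:25183" // hypothetical daemon address
)

// handleReady reports readiness. In sysfs scaling mode it proxies the check to
// the daemon's /cpu endpoint; in qmp scaling mode it is effectively a no-op.
func handleReady(w http.ResponseWriter, r *http.Request) {
	if !scalingModeSysfs {
		// qmp scaling mode: readiness is a no-op, always report ready.
		w.WriteHeader(http.StatusOK)
		return
	}
	// sysfs scaling mode: only ready once neonvm-daemon answers /cpu.
	resp, err := http.Get("http://" + neonvmDaemonAddr + "/cpu")
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		http.Error(w, "neonvm-daemon not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/ready", handleReady)
	_ = http.ListenAndServe(":8080", nil) // hypothetical runner API port
}
```

The point of the design is that in qmp mode the probe always passes, while in sysfs mode the pod only turns Ready once the daemon inside the VM is reachable.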
Looks good. Some blocking questions, then it will be good to merge.
```go
	default:
		panic(fmt.Errorf("unknown pod phase: %q", pod.Status.Phase))
	}
}

// isRunnerPodReady returns whether the runner pod is ready respecting the readiness probe of its containers.
func isRunnerPodReady(pod *corev1.Pod) runnerStatusKind {
```
nit: functions titled isFoo should return bool
renamed to runnerContainerStatus(..)
```go
	if containerSpec.Name == "neonvm-runner" {
		for _, env := range containerSpec.Env {
			if env.Name == "RECEIVE_MIGRATION" && env.Value == "true" {
				return runnerRunning
```
question(blocking): Does this mean that we'll return runnerRunning even when the neonvm-runner container is not yet running? (or alternately: can this return ready even when the pod is still pending?)
Fair point. I introduced a new status and changed the vmmigration controller to use it instead of running.
After the discussion with you and @Omrigan, I changed the code here to check the pod references to determine whether it is a target pod for migration, which looks like the most reliable approach we can currently afford.
```go
for _, containerSpec := range pod.Spec.Containers {
	if containerSpec.Name == "neonvm-runner" {
		for _, env := range containerSpec.Env {
			if env.Name == "RECEIVE_MIGRATION" && env.Value == "true" {
				return runnerRunning
```
suggestion(non-blocking): Maybe extract the "is this a target pod" check into a separate function?
done
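For illustration, a hypothetical version of such an extracted check, keyed off owner references (as described above) rather than the RECEIVE_MIGRATION env var. The helper name and the Kind string are assumptions for this sketch, not the PR's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isMigrationTargetPod is a hypothetical extracted helper: it reports whether
// the runner pod is the target of a live migration by looking at its owner
// references instead of the RECEIVE_MIGRATION env var on the runner container.
// The Kind string below is an assumption for this sketch.
func isMigrationTargetPod(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "VirtualMachineMigration" {
			return true
		}
	}
	return false
}

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			OwnerReferences: []metav1.OwnerReference{
				{Kind: "VirtualMachineMigration", Name: "example-migration"},
			},
		},
	}
	fmt.Println(isMigrationTargetPod(pod)) // prints: true
}
```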
```
@@ -713,6 +713,7 @@ func (r *VirtualMachineMigrationReconciler) targetPodForVirtualMachine(
	pod.Name = migration.Status.TargetPodName

	// add env variable to turn on migration receiver
	// TODO: make it false or empty after the migration is done to enable correct readiness probe
```
question(non-blocking): what's the plan here? I don't think we can modify container environments at runtime... And, is this TODO tracked somewhere?
I think we discussed this verbally with @Omrigan and I promised to create an issue for it, since I am a bit afraid of a scenario like:
- we start a VM
- we migrate the VM to a new pod
- the new pod has RECEIVE_MIGRATION set to true, which means that in some hypothetical pod-restart scenario (which, to be honest, I am not even sure is possible in our lifecycle) qemu will not boot, but will wait for a migration that is not actually happening.
Since we already have an issue to audit the vmmigration controller, I will add this as context there.
…probe using MigrationOwnerForPod
Call the runner's /ready endpoint which, in sysfs scaling mode, proxies to the daemon's /cpu to check whether the runner pod and the VM are ready. The runner's /ready endpoint does nothing in the case of the qmp scaling model.
Move the neonvm-daemon line in the inittab so that it starts later, since we use it for the health checks.
This is a blocker for #1141 because working without a readiness probe introduces a race in some e2e tests: kuttl tries to execute the shell command kill while the VM is not fully booted and the acpid rules are not in effect yet. It closes #1147, which is in the backlog, but since it is a blocker I extracted it from PR #1141.
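For context, a rough sketch of how a readiness probe pointing at the runner's /ready endpoint could be attached to the runner pod spec by the controller. The port and timings here are illustrative assumptions, not the PR's values:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// runnerReadinessProbe builds a readiness probe that hits the runner's /ready
// endpoint. In sysfs scaling mode that endpoint proxies to neonvm-daemon's
// /cpu, so the pod only turns Ready once the VM has booted far enough for the
// daemon to respond. Port and timings are assumptions for this sketch.
func runnerReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/ready",
				Port: intstr.FromInt(8080), // hypothetical runner API port
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       5,
	}
}

func main() {
	fmt.Printf("%+v\n", runnerReadinessProbe())
}
```

Such a probe is what lets the e2e tests wait for a genuinely booted VM instead of racing against the acpid rules.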