tests: Check that volume is functional after live migration. #256

Open: markylaing wants to merge 1 commit into main from live-migration-followup

markylaing
Contributor

Changes are as follows:

  • Increases the memory of the nested VM to 1GiB (the parent VMs have 2GiB each).
  • Waits indefinitely for the nested VM to start up by checking the number of processes via lxc info.
  • Formats the volume with ext4, mounts it, and adds a file.
  • Checks that the file is present and has the same contents after live migrating the instance.

tomponline previously approved these changes Aug 6, 2024
Member

Thanks!

@markylaing
Contributor Author

Hmmm, well, it works locally but not in CI... what a surprise! 🤦

I'll have to investigate more tomorrow with tmate.

sleep 60

# Wait for a long time for it to boot (doubly nested VM takes a while).
while [ "$(lxc exec member1 -- sh -c "lxc info v1 | grep -F 'Processes:' | cut -d':' -f2 | tr -d '[:blank:]'")" -le 1 ]; do
Member

I think the wrapper shell is not needed.

Suggested change
- while [ "$(lxc exec member1 -- sh -c "lxc info v1 | grep -F 'Processes:' | cut -d':' -f2 | tr -d '[:blank:]'")" -le 1 ]; do
+ while [ "$(lxc exec member1 -- lxc info v1 | awk '{if ($1 == "Processes:") print $2}')" -le 1 ]; do

Also changed to use awk like we do elsewhere to get the process count.

Contributor Author

With awk I think I need the subshell:

$ [ lxc exec member1 -- lxc info v1 | awk '{if ($1 == "Processes:") print $2}' -le 1 ]
awk: fatal: cannot open file `-le' for reading: No such file or directory
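
For context, the error above appears because, without `$(...)`, the shell splits the line at the pipe: `[` receives everything before the `|`, and awk receives `-le 1 ]` as file names. A minimal sketch of the working form, using a hypothetical `fake_lxc_info` function in place of `lxc exec member1 -- lxc info v1` (its output layout is an assumption based on the `Processes:` line used in this thread):

```shell
# Hypothetical stand-in for `lxc exec member1 -- lxc info v1`; the exact
# output layout is an assumption, only the "Processes:" line matters here.
fake_lxc_info() {
    printf 'Name: v1\nStatus: RUNNING\nResources:\n  Processes: 12\n'
}

# Inside a command substitution, awk only reads the pipe and prints the
# second field of the "Processes:" line.
process_count() {
    fake_lxc_info | awk '{if ($1 == "Processes:") print $2}'
}

# `[` now compares a single number, so `-le 1 ]` stays with the test.
if [ "$(process_count)" -le 1 ]; then
    echo "v1 still booting"
else
    echo "v1 booted"
fi
```

The command substitution is still needed in either variant; only the `sh -c` wrapper around the `lxc info` pipeline is optional.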

Member

Sorry for not being clear. With the subshell, I meant the sh -c bit used while doing the lxc exec.

lxc exec member1 -- lxc move v1 --target member2

# The VM is slow. So the agent isn't immediately available after the live migration.
while [ "$(lxc exec member2 -- sh -c "lxc info v1 | grep -F 'Processes:' | cut -d':' -f2 | tr -d '[:blank:]'")" -le 1 ]; do
Member

Same note about subshell.

done

# The volume should be functional, still mounted, and the file we created should still be there with the same contents.
[ "$(lxc exec member2 -- sh -c "lxc exec v1 -- cat /mnt/vol1/bar")" = "foo" ]
Member

This should work, I think:

Suggested change
- [ "$(lxc exec member2 -- sh -c "lxc exec v1 -- cat /mnt/vol1/bar")" = "foo" ]
+ [ "$(lxc exec member2 -- lxc exec v1 -- cat /mnt/vol1/bar)" = "v1" ]

Comment on lines 153 to 156
lxc exec member1 -- sh -c "lxc exec v1 -- mkfs -t ext4 /dev/sdb"
lxc exec member1 -- sh -c "lxc exec v1 -- mkdir /mnt/vol1"
lxc exec member1 -- sh -c "lxc exec v1 -- mount -t auto /dev/sdb /mnt/vol1"
lxc exec member1 -- sh -c "lxc exec v1 -- sh -c 'echo foo > /mnt/vol1/bar'"
Member

Could we do without the subshells and also make it explicit that we expect an ext4-formatted disk rather than accepting anything:

Suggested change
- lxc exec member1 -- sh -c "lxc exec v1 -- mkfs -t ext4 /dev/sdb"
- lxc exec member1 -- sh -c "lxc exec v1 -- mkdir /mnt/vol1"
- lxc exec member1 -- sh -c "lxc exec v1 -- mount -t auto /dev/sdb /mnt/vol1"
- lxc exec member1 -- sh -c "lxc exec v1 -- sh -c 'echo foo > /mnt/vol1/bar'"
+ lxc exec member1 -- lxc exec v1 -- mkfs -t ext4 /dev/sdb
+ lxc exec member1 -- lxc exec v1 -- mkdir /mnt/vol1
+ lxc exec member1 -- lxc exec v1 -- mount -t ext4 /dev/sdb /mnt/vol1
+ lxc exec member1 -- lxc exec v1 -- cp /etc/hostname /mnt/vol1/bar

@tomponline
Member

ready for rebase

@markylaing markylaing force-pushed the live-migration-followup branch 2 times, most recently from 11eaed0 to 462a0c7 Compare August 7, 2024 08:51
@markylaing
Contributor Author

markylaing commented Aug 7, 2024

I'm pretty sure that the doubly nested VM isn't booting because the CPU usage is at 100% for qemu:
image

I wonder if we can request a larger runner for this test?

Comment on lines 59 to 63
lxc launch "${TEST_IMG:-ubuntu-minimal-daily:24.04}" member1 --vm -c security.devlxd.images=true
lxc launch "${TEST_IMG:-ubuntu-minimal-daily:24.04}" member2 --vm -c security.devlxd.images=true
else
lxc launch "${TEST_IMG:-ubuntu-minimal-daily:24.04}" member1 --vm
lxc launch "${TEST_IMG:-ubuntu-minimal-daily:24.04}" member2 --vm
Member

The GHA runner VMs have 16G of RAM (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories) so you can probably allocate more RAM to those memberX VMs.

Have you already tried with -c limits.cpu=2 -c limits.memory=4GiB?

Contributor Author

I've bumped all the resources to what is used in microcloud and it's still not booting after over an hour. I'll go back to trying with containers and see how I get on.

Member

The while loop checking for v1's state seems to carry significant overhead: it is not running every 30s as it should, and over time it drifted by many seconds. One way to reduce that overhead would be to lower the polling frequency and run the loop inside member1 rather than from the GHA runner itself.

In other words, replace the while :; do lxc exec member1 -- lxc info v1 ...; sleep 30; done with lxc exec member1 -- sh -c 'while :; do lxc info v1 ...; sleep 60; done'.

Another thing that could possibly help would be to delay starting member2 until after v1 is confirmed booted on member1.

Member

Yes, this would avoid needing to keep setting up the websockets over vsock for each exec session.
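
Since the loop waits indefinitely, a run where v1 never boots only ends when the job itself times out. One possible way to bound the wait is sketched below. The `wait_until` helper and its arguments are hypothetical and not part of this PR; the stand-in check `true` would be replaced by the process count comparison.

```shell
# Hypothetical helper (not part of this PR): poll a command until it
# succeeds, giving up after a fixed number of attempts.
# Usage: wait_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    max_attempts="$1"
    interval="$2"
    shift 2

    attempt=1
    while ! "$@"; do
        if [ "${attempt}" -ge "${max_attempts}" ]; then
            echo "Gave up after ${attempt} attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${interval}"
    done
}

# Example with a stand-in check that succeeds immediately.
wait_until 3 0 true && echo "ready"
```

With a 60s interval, the attempt count directly sets the longest time the test is allowed to wait before failing.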

@markylaing markylaing force-pushed the live-migration-followup branch 4 times, most recently from 30564cc to e7f3e43 Compare August 13, 2024 09:01
sleep 60

# Wait for a long time for it to boot (doubly nested VM takes a while).
lxc exec member1 -- sh -c 'while [ "$(lxc info v1 | awk '"'"'{if ($1 == "Processes:") print $2}'"'"')" -le 1 ]; do echo "Instance v1 still not booted, waiting 60s..." && sleep 60; done'

Check warning (Code scanning / shellcheck): SC2016 Warning
Expressions don't expand in single quotes, use double quotes for that.
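
SC2016 fires here because the loop is deliberately single-quoted so that `$(...)` expands inside member1 rather than on the runner. One way to avoid both the warning and the `'"'"'` sequences might be to feed the loop on stdin through a quoted heredoc. The sketch below runs plain `sh -s` locally with a hypothetical `lxc` stub; using `lxc exec member1 -- sh -s` in its place is an assumption that has not been tried in this PR.

```shell
# The quoted delimiter ('EOF') stops the outer shell from expanding anything,
# so the awk program needs no nested quote escaping.
wait_for_v1() {
    sh -s <<'EOF'
# Hypothetical stub standing in for the real `lxc` client inside member1.
lxc() { printf 'Name: v1\n  Processes: 12\n'; }

while [ "$(lxc info v1 | awk '{if ($1 == "Processes:") print $2}')" -le 1 ]; do
    echo "Instance v1 still not booted, waiting 60s..."
    sleep 60
done
echo "Instance v1 booted"
EOF
}

wait_for_v1
```

The loop body is unchanged from the line above; only the way it is handed to the inner shell differs.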
lxc exec member1 -- lxc move v1 --target member2

# The VM is slow. So the agent isn't immediately available after the live migration.
lxc exec member1 -- sh -c 'while [ "$(lxc info v1 | awk '"'"'{if ($1 == "Processes:") print $2}'"'"')" -le 1 ]; do echo "Instance v1 still not booted, waiting 60s..." && sleep 60; done'

Check warning (Code scanning / shellcheck): SC2016 Warning
Expressions don't expand in single quotes, use double quotes for that.
@markylaing markylaing force-pushed the live-migration-followup branch 2 times, most recently from 3139e54 to 28ad7f1 Compare August 13, 2024 13:38
@markylaing markylaing force-pushed the live-migration-followup branch 2 times, most recently from 13191d2 to 5e7488f Compare August 21, 2024 09:17
@simondeziel
Member

simondeziel commented Aug 21, 2024

@markylaing it seems the vm-migration test is using a dir pool. The microcloud test suite uses a zfs one, which might make booting the many VMs faster, since CoW means the disk isn't hit as hard. Maybe it's worth trying zfs here as well?

Also, on microcloud, we use the extra/ephemeral disk available on GHA VMs, see https://github.com/canonical/microcloud/blob/main/.github/workflows/tests.yml#L202-L209

@markylaing markylaing force-pushed the live-migration-followup branch 2 times, most recently from 60e2d21 to a4b95cf Compare August 22, 2024 13:47
@markylaing markylaing force-pushed the live-migration-followup branch 4 times, most recently from 02f31e0 to 98a9a13 Compare August 22, 2024 14:35
@markylaing markylaing force-pushed the live-migration-followup branch 3 times, most recently from 054e18d to e416b1d Compare August 28, 2024 09:46
@markylaing markylaing force-pushed the live-migration-followup branch from e416b1d to 93e3402 Compare August 28, 2024 09:49
@tomponline
Member

@markylaing @simondeziel any progress on using container cluster members for Ceph-backed VMs?

@simondeziel
Member

simondeziel commented Dec 6, 2024

@tomponline it's in my queue for next pulse, but no progress as of yet.
