Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Expanding memory balloon causes VM to freeze #4990

Closed
2 of 3 tasks
maggie-lou opened this issue Jan 10, 2025 · 7 comments
Closed
2 of 3 tasks

[Bug] Expanding memory balloon causes VM to freeze #4990

maggie-lou opened this issue Jan 10, 2025 · 7 comments
Labels
Status: Awaiting author Indicates that an issue or pull request requires author action Type: Bug Indicates an unexpected problem or unintended behavior

Comments

@maggie-lou
Copy link

maggie-lou commented Jan 10, 2025

Describe the bug

Even when there should be enough free memory in the VM, expanding the balloon sometimes causes the VM to freeze.

During a sample run (using the scripts linked below), after restoring the VM from a snapshot, free -h returned:

               total        used        free      shared  buff/cache   available
Mem:           108Mi        35Mi        45Mi       2.5Mi        35Mi        72Mi
Swap:             0B          0B          0B

Originally, the balloon was initialized to 5MB. When I inflated it to 20MB, it inflated successfully. When I inflated it to 30MB, the VM froze and there were a bunch of "Failed to update balloon stats, missing descriptor." errors.

To Reproduce

You can use the scripts in this branch: #4989

  1. Build firecracker with this patch: Fix unregistering memory ranges from UFFD when expanding the balloon #4988
  2. (Only needs to be run once): Prepare rootfs and guest kernel: get_rootfs_guest_kernel.sh
  3. Run firecracker: run_firecracker.sh
  4. Initialize a VM with a balloon and snapshot it: snapshot_vm.sh
  5. You will probably need to kill the former firecracker process and restart it: run_firecracker.sh
  6. Start the UFFD handler with the snapshot: run_uffd_handler.sh
  7. Expand the balloon. : trigger_remove_events.sh
  8. Expand the balloon even more : If you edit trigger_remove_events.sh. to inflate the balloon to 40MB, the VM will freeze and there are "Failed to update balloon stats, missing descriptor." errors

Expected behavior

I expected the balloon to be able to expand to 30MB because there is 72Mi of memory available.

Environment

Additional context

We are using UFFD to restore snapshots. The memory snapshots are quite large, so we're looking into using memory balloons with the goal of having the UFFD handler process removed memory ranges, so we don't have to save those memory ranges in the snapshot files. We've noticed that the VM will sometimes freeze when expanding the balloon, even when there should be sufficient memory.

Around the same time as the freeze, we always see the "Failed to update balloon stats, missing descriptor." errors as well as vsock connection errors VIRTIO_VSOCK_OP_RST.

I've tried disabling async page faults, in case the freezing was related to some sort of race condition in the kernel but the problem persists.

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
  • Are you certain the bug being reported is a Firecracker issue?

No - It could be a Linux bug as well. Though we've read cases where people seem to be successfully using UFFD + the balloon, so this use case seems like it should be possible now.

@JackThomson2
Copy link
Contributor

Hi @maggie-lou,

Thanks for raising the issue, with the helpful reproducer steps. I'll investigate myself and let you know what I find.

Thanks

@JackThomson2 JackThomson2 added Status: WIP Indicates that an issue is currently being worked on or triaged Type: Bug Indicates an unexpected problem or unintended behavior and removed Status: WIP Indicates that an issue is currently being worked on or triaged labels Jan 15, 2025
@roypat roypat added the Status: WIP Indicates that an issue is currently being worked on or triaged label Jan 22, 2025
@maggie-lou
Copy link
Author

In case this is helpful, when testing the balloon in a more production environment, I've noticed Kernel RCU stalls. I wonder if the balloon has a bad interaction with the RCU.

[  234.560950] rcu: INFO: rcu_preempt self-detected stall on CPU
[  234.562786] rcu: 	1-....: (14819 ticks this GP) idle=bb24/1/0x4000000000000000 softirq=4838/4841 fqs=5790
[  234.565430] 	(t=14750 jiffies g=5389 q=4 ncpus=2)
[  234.566859] CPU: 1 PID: 20 Comm: kworker/1:0 Not tainted 6.2.0 #1
[  234.568654] Workqueue: virtio_vsock virtio_transport_rx_work
[  234.570932] RIP: 0010:vring_unmap_one_split+0x41/0x50

roypat added a commit to roypat/firecracker that referenced this issue Jan 30, 2025
The UFFD handler might receive events out of order compared to how they
actually happened. For example, if the guest first frees a page to the
balloon device, and then immediately faults it in again, the UFFD
handler might see the page fault before the freeing. This is a problem,
as any pending `Remove` events in the queue will "block" the userfault
FD (all ioctls return -EAGAIN). Fix this by always draining all events
from the fd's queue, and gracefully handling -EAGAIN. Please see the
code comment for in-depth analysis of the flow.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
@roypat
Copy link
Contributor

roypat commented Jan 30, 2025

Hey! I've had a look into this today, and think what you're seeing is a combination of two issues: The first one is indeed that the Firecracker incorrectly handles balloon inflation events on restored VMs, and the fix from your PR for that is indeed the right one. However, the behavior you see after that (VM freezes) are not actually a bug in Firecracker, but rather a shortcoming of the simplistic UFFD handler used in our integration tests.

Essentially, the problem is with the handling of -EAGAIN when its returned by UFFDIO_COPY - it's not actually a race condition on one page (e.g. a fault and a free for the same page), but rather a race between any pair of fault and free: UFFD will have any ioctl return -EAGAIN if there's a remove message pending in the queue. So if we get EAGAIN, we cannot ignore the pagefault message its associated with, but rather we must first process all pending remove messages before we can proceed with the pagefault (and this is where the problem was: if we instead ignore the pagefault, then the guest thread that caused this page fault will never get awoken again). I've opened #5021 to fix all that up. Could you please give that a try to see if it resolves your issue?
Thanks!

roypat added a commit to roypat/firecracker that referenced this issue Jan 30, 2025
The crux of the issue was that UFFD gets blocked (all ioctls return
-EAGAIN) when there's any `remove` events pending in the queue, which
means during processing we not only need to look at the "head" of the
queue, but also make sure there's no `remove` events in the "tail".

Deal with these scenarios correctly by always greedily reading the
entire queue, to ensure there's nothing pending, and only then
processing things one-by-one. Please see the new code comments for
intricacies with this approach.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
roypat added a commit to roypat/firecracker that referenced this issue Jan 30, 2025
The crux of the issue was that UFFD gets blocked (all ioctls return
-EAGAIN) when there's any `remove` events pending in the queue, which
means during processing we not only need to look at the "head" of the
queue, but also make sure there's no `remove` events in the "tail".

Deal with these scenarios correctly by always greedily reading the
entire queue, to ensure there's nothing pending, and only then
processing things one-by-one. Please see the new code comments for
intricacies with this approach.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat added Status: Awaiting author Indicates that an issue or pull request requires author action and removed Status: WIP Indicates that an issue is currently being worked on or triaged labels Jan 30, 2025
roypat added a commit to roypat/firecracker that referenced this issue Jan 31, 2025
The crux of the issue was that UFFD gets blocked (all ioctls return
-EAGAIN) when there's any `remove` events pending in the queue, which
means during processing we not only need to look at the "head" of the
queue, but also make sure there's no `remove` events in the "tail".

Deal with these scenarios correctly by always greedily reading the
entire queue, to ensure there's nothing pending, and only then
processing things one-by-one. Please see the new code comments for
intricacies with this approach.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
roypat added a commit to roypat/firecracker that referenced this issue Jan 31, 2025
The crux of the issue was that UFFD gets blocked (all ioctls return
-EAGAIN) when there's any `remove` events pending in the queue, which
means during processing we not only need to look at the "head" of the
queue, but also make sure there's no `remove` events in the "tail".

Deal with these scenarios correctly by always greedily reading the
entire queue, to ensure there's nothing pending, and only then
processing things one-by-one. Please see the new code comments for
intricacies with this approach.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat closed this as completed in e92a7ff Jan 31, 2025
@maggie-lou
Copy link
Author

Thanks so much @roypat ! After applying a similar fix to our UFFD handler, it's resolved our issues.

In case anyone else hits this, I had to implement your suggestion A production handler will most likely want to ensure that remove events for a specific range are always handled before pagefault events. to fully resolve the issue.

@maggie-lou
Copy link
Author

@roypat Would you mind sharing how you debugged this? It would be helpful to have some strategies to debug similar issues in the future. Thanks!

@roypat
Copy link
Contributor

roypat commented Feb 3, 2025

Admittedly, there wasn't much finesse involved. I had the uffd handler print out all events it received (which showed that the guest didn't actually completely freeze, since page fault events still came in after inflating the balloon), and then I started looking at the EAGAIN return from uffdio_copy, because that was the only change done in the uffd handler. After reading the kernel code a bit to figure out why EAGAIN was being returned, the connection with pending remove events became clear.

@maggie-lou
Copy link
Author

Fair enough - thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting author Indicates that an issue or pull request requires author action Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants