Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UFFD handler never receiving remove messages when balloon device inflates #5021

Merged
merged 4 commits into from
Jan 31, 2025

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Jan 30, 2025

When the balloon inflates, and the guest gives us back some pages of memory, we need to free those pages. In booted VMs, we do this with madvise(MADV_DONTNEED), and in restored VMs we do it by MAP_FIXED-ing a new VMA on top of the range-to-free. This is because if guest memory is a MAP_PRIVATE of a memory file, madvise has no effect.

However, we also do this MAP_FIXED trick if the snapshot is restored with UFFD. In this case, its not needed (madvise works perfectly fine), and in fact, its wrong: If we map over the memory range, UFFD will not receive Remove events for the specified range.

Fix this by only using the mmap trick for file-based restored.

After doing this, fix our test uffd handler to actually handle remove events correctly (since it never received any before, we never noticed that its was completely broken - but now that we actually exercise this in the test, it needed fixing).

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

In test_valid_handler, we had a comment to validate that the guest still
works if we mess with the balloon after UFFD-based restoration. However,
we didn't actually do so. Fix this by running some simple SSH command
after both inflation and deflation.

Signed-off-by: Patrick Roy <[email protected]>
When the balloon inflates, and the guest gives us back some pages of
memory, we need to free those pages. In booted VMs, we do this with
madvise(MADV_DONTNEED), and in restored VMs we do it by MAP_FIXED-ing a
new VMA on top of the range-to-free. This is because if guest memory is
a MAP_PRIVATE of a memory file, madvise has no effect.

However, we also do this MAP_FIXED trick if the snapshot is restored
with UFFD. In this case, its not needed (madvise works perfectly fine),
and in fact, its wrong: If we map over the memory range, UFFD will not
receive Remove events for the specified range.

Fix this by only using the mmap trick for file-based restored.

Fixes firecracker-microvm#4988
Signed-off-by: Patrick Roy <[email protected]>
Copy link

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.14%. Comparing base (0d2713b) to head (5c2dec5).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/devices/virtio/balloon/device.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5021      +/-   ##
==========================================
+ Coverage   83.13%   83.14%   +0.01%     
==========================================
  Files         245      245              
  Lines       26697    26704       +7     
==========================================
+ Hits        22194    22203       +9     
+ Misses       4503     4501       -2     
Flag Coverage Δ
5.10-c5n.metal 83.61% <94.11%> (+0.01%) ⬆️
5.10-m5n.metal 83.60% <94.11%> (+0.01%) ⬆️
5.10-m6a.metal 82.81% <94.11%> (+0.01%) ⬆️
5.10-m6g.metal 79.57% <94.11%> (+0.01%) ⬆️
5.10-m6i.metal 83.59% <94.11%> (+<0.01%) ⬆️
5.10-m7g.metal 79.57% <94.11%> (+0.01%) ⬆️
6.1-c5n.metal 83.61% <94.11%> (+0.01%) ⬆️
6.1-m5n.metal 83.59% <94.11%> (+<0.01%) ⬆️
6.1-m6a.metal 82.80% <94.11%> (+0.01%) ⬆️
6.1-m6g.metal 79.57% <94.11%> (+0.01%) ⬆️
6.1-m6i.metal 83.59% <94.11%> (+0.01%) ⬆️
6.1-m7g.metal 79.57% <94.11%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roypat roypat force-pushed the uffd-balloon-fix branch 2 times, most recently from 75fb5c5 to eb07e2f Compare January 30, 2025 16:27
@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jan 30, 2025
src/firecracker/examples/uffd/uffd_utils.rs Show resolved Hide resolved
src/firecracker/examples/uffd/valid_handler.rs Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@roypat roypat requested a review from kalyazin January 31, 2025 10:55
src/firecracker/examples/uffd/valid_handler.rs Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
The crux of the issue was that UFFD gets blocked (all ioctls return
-EAGAIN) when there's any `remove` events pending in the queue, which
means during processing we not only need to look at the "head" of the
queue, but also make sure there's no `remove` events in the "tail".

Deal with these scenarios correctly by always greedily reading the
entire queue, to ensure there's nothing pending, and only then
processing things one-by-one. Please see the new code comments for
intricacies with this approach.

Fixes firecracker-microvm#4990
Signed-off-by: Patrick Roy <[email protected]>
Add entry about the UFFD fix.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat merged commit f6bd4b6 into firecracker-microvm:main Jan 31, 2025
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants