Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support FDT setup for CPU caches with high number of sets #4869

Merged
merged 3 commits into from
Oct 24, 2024

Conversation

ckatsak
Copy link
Contributor

@ckatsak ckatsak commented Oct 23, 2024

Changes

The first commit improves the logging/reporting of errors during FDT creation (e.g., failures while parsing CPU cache information from sysfs).

The second commit changes the type of field number_of_sets in struct vmm::arch::aarch64::cache_info::CacheEntry from u16 to u32, thus allowing support for hosts with CPU caches with a number of sets higher than u16::MAX (e.g., NVIDIA GH200 Grace Hopper).

Reason

When we attempted to boot a microVM on a GH200 host following the Getting Started guide (with a minor modification, to download the aarch64 kernel binary), building the microVM failed, and the following was logged:

 . . .
2024-10-22T16:11:02.966071476 [anonymous-instance:fc_api:INFO:src/firecracker/src/api_server/parsed_request.rs:69] The API server received a Put request on "/actions" with body "{\n        \"action_type\": \"InstanceStart\"\n    }".
2024-10-22T16:11:02.966135189 [anonymous-instance:main:DEBUG:src/vmm/src/builder.rs:399] event_start: build microvm for boot
2024-10-22T16:11:02.972842006 [anonymous-instance:main:DEBUG:src/vmm/src/devices/acpi/vmgenid.rs:61] vmgenid: building VMGenID device. Address: 0x801ffff0. IRQ: 36
2024-10-22T16:11:02.972877750 [anonymous-instance:main:DEBUG:src/vmm/src/devices/acpi/vmgenid.rs:69] vmgenid: writing new generation ID to guest: 0xc9a01e17708cc1e2bc0518f9d2cc685e
2024-10-22T16:11:02.973158455 [anonymous-instance:main:INFO:src/vmm/src/lib.rs:833] Vmm is stopping.
2024-10-22T16:11:02.986500410 [anonymous-instance:fc_api:ERROR:src/firecracker/src/api_server/parsed_request.rs:190] Received Error. Status code: 400 Bad Request. Message: Start microvm error: System configuration error: Failed to create a Flattened Device Tree for this aarch64 microVM.
2024-10-22T16:11:02.986543450 [anonymous-instance:fc_api:DEBUG:src/firecracker/src/api_server/mod.rs:114] Total previous API call duration: 20487 us.

while this was the response of the API server:

{"fault_message":"Start microvm error: System configuration error: Failed to create a Flattened Device Tree for this aarch64 microVM."}

The first commit improves the reporting of failures while setting up FDT, by expanding the doc comments of the vmm::arch::aarch64::ConfigurationError::SetupFDT variant (thus also its implementation of std::fmt::Display, thanks to crates thiserror and displaydoc) to include the underlying vmm::arch::aarch64::fdt::FdtError.
As a result, the improved logs end up being as follows:

 . . .
2024-10-22T16:16:59.912859809 [anonymous-instance:fc_api:INFO:src/firecracker/src/api_server/parsed_request.rs:69] The API server received a Put request on "/actions" with body "{\n        \"action_type\": \"InstanceStart\"\n    }".
2024-10-22T16:16:59.912930657 [anonymous-instance:main:DEBUG:src/vmm/src/builder.rs:399] event_start: build microvm for boot
2024-10-22T16:16:59.919450370 [anonymous-instance:main:DEBUG:src/vmm/src/devices/acpi/vmgenid.rs:61] vmgenid: building VMGenID device. Address: 0x801ffff0. IRQ: 36
2024-10-22T16:16:59.919484802 [anonymous-instance:main:DEBUG:src/vmm/src/devices/acpi/vmgenid.rs:69] vmgenid: writing new generation ID to guest: 0x9934a6187f502a62579e99b35bed8b8f
2024-10-22T16:16:59.919760995 [anonymous-instance:main:INFO:src/vmm/src/lib.rs:833] Vmm is stopping.
2024-10-22T16:16:59.930398232 [anonymous-instance:fc_api:ERROR:src/firecracker/src/api_server/parsed_request.rs:190] Received Error. Status code: 400 Bad Request. Message: Start microvm error: System configuration error: Failed to create a Flattened Device Tree for this aarch64 microVM: Read cache info error: Invalid cache configuration found for number_of_sets: number too large to fit in target type
2024-10-22T16:16:59.930448408 [anonymous-instance:fc_api:DEBUG:src/firecracker/src/api_server/mod.rs:114] Total previous API call duration: 17606 us.

API server's HTTP response becomes more descriptive as well:

{"fault_message":"Start microvm error: System configuration error: Failed to create a Flattened Device Tree for this aarch64 microVM: Read cache info error: Invalid cache configuration found for number_of_sets: number too large to fit in target type"}

The second commit fixes the actual bug for the GH200 host (unveiled by and shown in the improved logs above). In this case, host CPU's number of sets for the unified L3 cache is higher than u16::MAX:

$ bat --style header /sys/devices/system/cpu/cpu0/cache/index3/{level,size,number_of_sets}
File: /sys/devices/system/cpu/cpu0/cache/index3/level
3

File: /sys/devices/system/cpu/cpu0/cache/index3/size
116736K

File: /sys/devices/system/cpu/cpu0/cache/index3/number_of_sets
155648

Changing the type of the number_of_sets field in struct vmm::arch::aarch64::cache_info::CacheEntry from u16 to u32 suffices to fit a high number of CPU cache sets, thus allowing the microVM to be built and booted without any problem.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

Change the doc comment (which also affects the implementation of
`std::fmt::Display`) for `ConfigurationError::SetupFDT` to include
the underlying `fdt::FtdError`.
This improves both the reporting of failures during FDT setup on the
logs, as well as the message in API server's corresponding HTTP
response.

Signed-off-by: Christos Katsakioris <[email protected]>
Signed-off-by: Filippos Tofalos <[email protected]>
@ckatsak ckatsak force-pushed the fix-fdt-cache_entry-no_sets branch from 7b6e3c8 to 1c49679 Compare October 23, 2024 11:08
Copy link

codecov bot commented Oct 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.07%. Comparing base (5ef02a8) to head (aa50499).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4869   +/-   ##
=======================================
  Coverage   84.07%   84.07%           
=======================================
  Files         251      251           
  Lines       28052    28052           
=======================================
  Hits        23586    23586           
  Misses       4466     4466           
Flag Coverage Δ
5.10-c5n.metal 84.71% <ø> (ø)
5.10-m5n.metal 84.69% <ø> (ø)
5.10-m6a.metal 84.00% <ø> (ø)
5.10-m6g.metal 80.70% <100.00%> (ø)
5.10-m6i.metal 84.69% <ø> (-0.01%) ⬇️
5.10-m7g.metal 80.70% <100.00%> (ø)
6.1-c5n.metal 84.71% <ø> (ø)
6.1-m5n.metal 84.69% <ø> (-0.01%) ⬇️
6.1-m6a.metal 84.00% <ø> (ø)
6.1-m6g.metal 80.69% <100.00%> (-0.01%) ⬇️
6.1-m6i.metal 84.69% <ø> (ø)
6.1-m7g.metal 80.70% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bchalios
Copy link
Contributor

Hey guys,

thanks a lot for your contribution. It generally LGTM. Could you fix the style issue? It's not in your changes, but your changes did make redundant a u32::from call (since the argument is now a u32) and clippy complains about it.

Change the type of `CacheEntry`'s `number_of_sets` field from `u16` to
`u32`, to allow correctly parsing CPU cache information from sysfs and
successfully setting up the FDT on hosts with CPU caches with a number
of sets that is higher than `u16::MAX`.

This also makes a couple of `u16`->`u32` conversions redundant, which
we therefore remove.

Signed-off-by: Christos Katsakioris <[email protected]>
Signed-off-by: Filippos Tofalos <[email protected]>
@ckatsak ckatsak force-pushed the fix-fdt-cache_entry-no_sets branch from 1c49679 to 9b48dee Compare October 23, 2024 14:57
Copy link
Contributor

@bchalios bchalios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the contribution

@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Oct 24, 2024
@bchalios bchalios self-assigned this Oct 24, 2024
@bchalios bchalios added Type: Fix Indicates a fix to existing code rust Pull requests that update Rust code labels Oct 24, 2024
@roypat roypat merged commit 0b9cf39 into firecracker-microvm:main Oct 24, 2024
9 of 10 checks passed
bchalios added a commit to bchalios/firecracker that referenced this pull request Oct 24, 2024
firecracker-microvm#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
pb8o pushed a commit that referenced this pull request Oct 24, 2024
#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
ShadowCurse pushed a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
firecracker-microvm#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
ShadowCurse pushed a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
firecracker-microvm#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
ShadowCurse pushed a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
firecracker-microvm#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
ShadowCurse pushed a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
firecracker-microvm#4869 changed the
type that holds the number of sets of a CPU cache from `u16` to `u32`.
This allows us to support ARM systems with CPU caches with sets more
than `u16::MAX`.

Add a CHANGELOG entry for this change

Signed-off-by: Babis Chalios <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rust Pull requests that update Rust code Status: Awaiting review Indicates that a pull request is ready to be reviewed Type: Fix Indicates a fix to existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants