Support DDA on VA-backed OHCL VM: Configurable Bounce Buffer for DMA #275
base: main
Conversation
```
@@ -1246,12 +1246,89 @@ async fn new_underhill_vm(

    let boot_info = runtime_params.parsed_openhcl_boot();

    // Determine if x2apic is supported so that the topology matches
```
Are these changes indeed part of your commit? Or stale from a fork?
I had to move this code up so we can calculate device_dma using processor_topology.vp_count() below.
Force-pushed 0e6886c to d632d32.
Thanks for the PR, Juan. Overall, this is not the approach we want to take for supporting this scenario. As your changes show us, having pinning be handled on a per-device basis adds a lot of complexity and scenario-specificity to the device driver, which is supposed to be a generic thing that we could use anywhere (even outside of OpenHCL). And you can imagine that in the future, we will want to have other actions that we take as part of preparing for an assigned device DMA, e.g., locking VTL protections for the pages, mapping the pages into an IOMMU, that kind of thing. Or we will have yet more scenarios where we need to double buffer, e.g., to support unenlightened guests running in a CVM.

I think the approach that makes sense is to centralize a set of DMA operations, like an OS kernel would. Then devices can prepare some memory for DMA by calling into the DMA API, issue the device transaction, and then release the memory for DMA after the transaction completes. In some scenarios, this will do nothing. In others, it might double buffer or pin or lock the memory. But the device driver doesn't need to know the details.

I don't have a full design of what this DMA API should look like yet, but I think I am probably the person best suited to designing it. So I would suggest that you let me define a DMA API scaffolding and update the existing device drivers to use it, and then you or Bhargav can provide an implementation that does this pinning/double buffering thing in a subsequent change.
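To make the shape of that suggestion concrete, here is a minimal sketch of what such a centralized DMA layer could look like; the trait and type names below are illustrative assumptions, not the scaffolding being proposed.

```rust
// Illustrative sketch only: names and signatures are assumptions, not the actual
// scaffolding. The idea is that a device driver asks a central DMA client to
// prepare a range, issues the transaction, then releases the mapping.
use std::ops::Range;

/// How the backend satisfied the request; the driver does not need to care.
pub enum DmaMapping {
    /// Guest pages used directly (already pinned / device-accessible).
    Direct { gpa_ranges: Vec<Range<u64>> },
    /// I/O was redirected through a bounce buffer owned by the DMA layer.
    Bounced { buffer_gpa: u64, len: usize },
}

pub trait DmaClient {
    type Error;

    /// Prepare guest memory for an assigned-device DMA. Depending on the
    /// scenario this may be a no-op, pin the pages, lock VTL protections,
    /// program an IOMMU, or allocate a bounce buffer.
    fn map(&self, gpa_ranges: &[Range<u64>], write: bool) -> Result<DmaMapping, Self::Error>;

    /// Release the mapping after the device transaction completes (unpin,
    /// copy bounce-buffered data back, etc.).
    fn unmap(&self, mapping: DmaMapping) -> Result<(), Self::Error>;
}
```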
meta: avoid using the term UH moving forward; prefer OHCL. (re: the PR title)
Got it. Thanks!
Thank you, John, for your detailed feedback on the approach for supporting this scenario. I understand the need to centralize DMA operations to reduce complexity and keep the device driver generic. Regarding the design of the DMA API, do you need a task to track this design work? Additionally, could you provide an estimated timeline for when you expect to complete the design? Looking forward to your response.
```rust
    let x2apic = if isolation.is_hardware_isolated() {
        // For hardware CVMs, always enable x2apic support at boot.
        vm_topology::processor::x86::X2ApicState::Enabled
    } else if safe_x86_intrinsics::cpuid(x86defs::cpuid::CpuidFunction::VersionAndFeatures.0, 0).ecx
```
fyi, this will conflict with Eric's recent change that refactors this crate.
```rust
    // TODO: determine actual memory usage by NVME/MANA. hardcode as 10MB
    let device_dma = 10 * 1024 * 1024
```
Can you give a rough estimate of how much memory this will use?
I suspect that John's comments obviate this discussion, but it does seem inefficient: we're allocating (what I assume is) a lot of memory for devices that may not be using it.
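As an illustration of the sizing question being raised (the formula and constants below are assumptions for discussion, not the calculation in this PR): if the pool is sized per queue and queues scale with VP count, the up-front reservation grows quickly.

```rust
// Hypothetical sizing arithmetic, not the PR's actual formula: assume one NVMe
// queue pair per VP and a fixed number of bounce-buffer pages per queue.
const PAGE_SIZE: u64 = 4096;

fn estimated_device_dma(vp_count: u64, dma_bounce_buffer_pages_per_queue: u64) -> u64 {
    vp_count * dma_bounce_buffer_pages_per_queue * PAGE_SIZE
}

fn main() {
    // e.g. 64 VPs * 128 pages/queue * 4 KiB = 32 MiB reserved up front,
    // whether or not an assigned device ever uses it.
    let bytes = estimated_device_dma(64, 128);
    println!("~{} MiB", bytes / (1024 * 1024));
}
```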
```rust
    let mut prp_result = None;
    let mut is_pinned = false;
    if let Some(io_threshold) = self.io_threshold {
        if is_va_backed && self.partition.is_some() && mem.len() as u32 > io_threshold {
```
I would rather keep is_va_backed out of this routine. The pinning code (whatever form that takes) should take care of this.
```
@@ -50,6 +51,8 @@ pub enum DiskError {
    ReservationConflict,
    #[error("unsupported eject")]
    UnsupportedEject,
    #[error("failed to pin/unpin guest memory {0}")]
    Hv(HvError),
```
Why not just "PinFailure" instead of Hv?
```
@@ -206,6 +214,8 @@ pub enum RequestError {
    Memory(#[source] GuestMemoryError),
    #[error("i/o too large for double buffering")]
    TooLarge,
    #[error("hv error")]
    Hv(#[source] HvError),
```
Ditto - why not "PinError" ?
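A small sketch of the rename being suggested (using thiserror, which the #[error(...)] attributes imply; the variant name is the reviewer's proposal, and the hvdef::HvError path is an assumption):

```rust
// Sketch of the suggested naming: keep HvError as the source, but name the
// variant after the operation that failed rather than the mechanism.
use thiserror::Error;

#[derive(Debug, Error)]
pub enum DiskError {
    // ...existing variants elided...
    #[error("failed to pin/unpin guest memory")]
    PinFailure(#[source] hvdef::HvError), // import path assumed
}
```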
```rust
    let mut prp_result = None;
    let mut is_pinned = false;
    if let Some(io_threshold) = self.io_threshold {
        if is_va_backed && self.partition.is_some() && mem.len() as u32 > io_threshold {
```
In addition, let's factor out the check for io size > io_threshold as well. The routine should either pin or buffer. It can take in the threshold as one config. It should also figure out if the memory is VA backed.
Okay, I still think I'm talking about what I think John will scaffold. But I want to give the feedback here as well: this code is challenging to follow, so I'm making point suggestions on how to factor the code so that (IMO) it'll be easier to read and maintain in the future.
Yeah, I believe the logic can be moved from the nvme_driver to the common DMA APIs once John's design is done. We might not need the io_threshold and can always buffer first if there's enough space. However, considering performance, adding the io_threshold could benefit larger IOs and provide flexibility for performance tuning later.
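A rough sketch of the factoring being discussed (names are placeholders, not the eventual DMA API): the driver hands the transfer to one helper, and the helper decides, from its own config and from whether the memory is VA-backed, whether to bounce-buffer or pin.

```rust
// Placeholder sketch of the suggested factoring; not the actual nvme_driver code
// or the planned DMA API. The driver no longer checks is_va_backed or the
// io_threshold itself; it asks for a DMA preparation and is told what happened.
pub enum DmaPrep {
    /// Memory is directly usable by the device (e.g. not VA-backed).
    Direct,
    /// I/O was copied into a bounce buffer owned by this helper.
    Bounced { bounce_offset: u64 },
    /// Pages were pinned for the duration of the transfer.
    Pinned,
}

pub struct DmaPolicy {
    /// I/O sizes above this are pinned instead of bounce-buffered (bytes).
    pub io_threshold: Option<u32>,
}

impl DmaPolicy {
    pub fn prepare(&self, len: usize, va_backed: bool, bounce_space_left: usize) -> DmaPrep {
        if !va_backed {
            return DmaPrep::Direct;
        }
        let too_large = self.io_threshold.is_some_and(|t| len as u64 > t as u64);
        if too_large || len > bounce_space_left {
            DmaPrep::Pinned
        } else {
            DmaPrep::Bounced { bounce_offset: 0 }
        }
    }
}
```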
@yupavlen-ms, will anything here need to change after the nvme keepalive changes go in?
```
@@ -228,6 +228,7 @@ fn map_disk_error(err: disk_backend::DiskError) -> NvmeError {
    disk_backend::DiskError::MemoryAccess(err) => {
        NvmeError::new(spec::Status::DATA_TRANSFER_ERROR, err)
    }
    disk_backend::DiskError::Hv(_) => spec::Status::DATA_TRANSFER_ERROR.into(),
```
Forgive my lack of Rust-idiom knowledge, but shouldn't this be ...
```rust
disk_backend::DiskError::Hv(_) => {
    NvmeError::new(spec::Status::DATA_TRANSFER_ERROR, err)
}
```
There are two ways to convert spec::Status to NvmeError: 1) through NvmeError::new(), or 2) using the implicit .into(). For this specific case, NvmeError::new() requires the source err to satisfy extra trait bounds which HvError does not implement, so I just used .into().
We can implement the required traits or take another approach if there's a need to keep the source error.
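For readers less familiar with the idiom being discussed, here is a small self-contained analogy (toy types, not the actual nvme/disk_backend crates) of the two conversion paths: a constructor that captures the source error, and therefore constrains its type, versus a bound-free From/.into() that drops it.

```rust
// Toy analogy of the two conversion idioms; names are invented for illustration.
use std::error::Error;

#[derive(Debug, Clone, Copy)]
struct Status(u16);

#[derive(Debug)]
struct NvmeishError {
    status: Status,
    source: Option<Box<dyn Error + Send + Sync>>,
}

impl NvmeishError {
    // Path 1: keeps the source, but the source must implement Error + Send + Sync.
    fn new(status: Status, source: impl Error + Send + Sync + 'static) -> Self {
        Self { status, source: Some(Box::new(source)) }
    }
}

// Path 2: no bounds at all, so it works even when the source error type does
// not implement the traits required by `new`; the cause is simply dropped.
impl From<Status> for NvmeishError {
    fn from(status: Status) -> Self {
        Self { status, source: None }
    }
}

fn main() {
    let io_err = std::io::Error::new(std::io::ErrorKind::Other, "transfer failed");
    let with_source = NvmeishError::new(Status(0x04), io_err); // keeps the cause
    let status_only: NvmeishError = Status(0x04).into();       // drops the cause
    println!("{with_source:?} / {status_only:?}");
}
```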
```
@@ -947,6 +947,7 @@ impl SimpleScsiDisk {
    | ScsiError::UnsupportedVpdPageCode(_)
    | ScsiError::SrbError
    | ScsiError::Disk(DiskError::InvalidInput)
    | ScsiError::Disk(DiskError::Hv(_))
```
This error could rectify itself on retry, right? I don't think we want to return this as an INVALID REQUEST / INVALID CDB. We can use something like insufficient resources.
I think we want to return SCSI CHECK_CONDITION, Sense Key 0x2 (NOT READY), Additional Sense Code 0x04 (LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE).
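A small sketch of the sense data being suggested; the type and constant names are illustrative, not the actual scsi crate definitions in the repo.

```rust
// Illustrative only: CHECK CONDITION with Sense Key 0x2 (NOT READY) and
// ASC 0x04 / ASCQ 0x00 (LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE) tells
// the initiator the request may succeed on retry, unlike INVALID REQUEST.
#[derive(Debug, Clone, Copy)]
pub struct SenseCode {
    pub sense_key: u8, // 0x2 = NOT READY
    pub asc: u8,       // 0x04
    pub ascq: u8,      // 0x00 = CAUSE NOT REPORTABLE
}

pub const NOT_READY_CAUSE_NOT_REPORTABLE: SenseCode =
    SenseCode { sense_key: 0x2, asc: 0x04, ascq: 0x00 };
```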
You mean the pin/unpin may succeed on retry? I just followed the same logic as when we fail to allocate the double buffer today: we return InvalidInput (in nvme_driver the error is TooLarge).
Yeah, I think it would.
```
@@ -1069,6 +1070,7 @@ impl SimpleScsiDisk {
    },
    DiskError::InvalidInput
    | DiskError::MemoryAccess(_)
    | DiskError::Hv(_)
```
I think this is handled above, right?
oh, I get it - all matches are unreachable. Then why have them here? (sorry for the n00b question)
```
@@ -27,6 +27,9 @@ message Vtl2SettingsFixed {
    optional uint32 io_ring_size = 2;
    // Specify the maximum number of bounce buffer pages allowed per cpu
    optional uint32 max_bounce_buffer_pages = 3;
    optional uint64 dma_bounce_buffer_pages_per_queue = 4;
    optional uint32 dma_bounce_buffer_pages_per_io_threshold = 5;
    optional uint32 max_nvme_drivers = 6;
```
Again a moot point, but I want to get in the habit of asking: we need a test that ensures that not passing in these values results in acceptable behavior. I see the code handling it, but since this is our API surface: we need to be certain we aren't regressing it. Both JSON (until we delete ...) and PB.
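A hypothetical, self-contained sketch of the kind of default-handling test being asked for; the struct mirrors the new optional proto fields, but the resolve helper, default constants, and their values are illustrative assumptions, not the actual openvmm code.

```rust
// Toy model of the new optional settings fields and their fallback behavior.
#[derive(Default)]
struct Vtl2SettingsFixedToy {
    dma_bounce_buffer_pages_per_queue: Option<u64>,
    dma_bounce_buffer_pages_per_io_threshold: Option<u32>,
    max_nvme_drivers: Option<u32>,
}

// Assumed defaults, for illustration only.
const DEFAULT_PAGES_PER_QUEUE: u64 = 128;
const DEFAULT_PAGES_PER_IO_THRESHOLD: u32 = 64;
const DEFAULT_MAX_NVME_DRIVERS: u32 = 8;

struct ResolvedDmaSettings {
    pages_per_queue: u64,
    pages_per_io_threshold: u32,
    max_nvme_drivers: u32,
}

fn resolve(settings: &Vtl2SettingsFixedToy) -> ResolvedDmaSettings {
    ResolvedDmaSettings {
        pages_per_queue: settings
            .dma_bounce_buffer_pages_per_queue
            .unwrap_or(DEFAULT_PAGES_PER_QUEUE),
        pages_per_io_threshold: settings
            .dma_bounce_buffer_pages_per_io_threshold
            .unwrap_or(DEFAULT_PAGES_PER_IO_THRESHOLD),
        max_nvme_drivers: settings.max_nvme_drivers.unwrap_or(DEFAULT_MAX_NVME_DRIVERS),
    }
}

#[test]
fn omitted_fields_fall_back_to_defaults() {
    // No dma_bounce_buffer_* or max_nvme_drivers fields supplied at all.
    let resolved = resolve(&Vtl2SettingsFixedToy::default());
    assert_eq!(resolved.pages_per_queue, DEFAULT_PAGES_PER_QUEUE);
    assert_eq!(resolved.pages_per_io_threshold, DEFAULT_PAGES_PER_IO_THRESHOLD);
    assert_eq!(resolved.max_nvme_drivers, DEFAULT_MAX_NVME_DRIVERS);
}
```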
This pull request introduces several enhancements to support Direct Device Assignment (DDA) on VA-backed OHCL VMs. The key changes include:
- Make dma_bounce_buffer_pages_per_queue, dma_bounce_buffer_pages_per_io_threshold, and max_nvme_drivers configurable in VTL2 settings.
- Use the calculated device_dma to create the shared pool.
- Pass dma_bounce_buffer_pages_per_queue and dma_bounce_buffer_pages_per_io_threshold from VTL2 settings to the NvmeDriver to create bounce buffers per queue.
- Each transaction uses a common VA-backed memory helper trait (pending VID support) to determine whether the GPA (guest physical address) is VA-backed and, if so, considers using the bounce buffer.
- If the bounce buffer is full or the I/O size exceeds dma_bounce_buffer_pages_per_io_threshold, the common VA-backed memory helper trait is used to pin the GPA.