@likebreath likebreath commented Jan 30, 2026

Motivation

Add infrastructure to enable VFIO devices to leverage hardware IOMMU acceleration through iommufd's uAPIs. This allows userspace VMMs to attach VFIO devices to hardware-accelerated virtual IOMMUs, particularly enabling userspace to configure stage-1 (guest-managed) page tables that are composed with stage-2 (host-managed) page tables in hardware.

This depends on the IommufdVIommu and IommufdVDevice abstractions introduced in the iommufd-ioctls crate [1].

Architecture Overview

New Public Interfaces

  1. VfioIommufd::new() extended with nested hwpt configuration:

    • Added s1_hwpt_data_type: Option<iommu_hwpt_data_type> parameter
    • Signature:
     pub fn new(
         iommufd: Arc<IommuFd>,
         ioas_id: Option<u32>,
         device_fd: Option<VfioContainerDeviceHandle>,
         s1_hwpt_data_type: Option<iommu_hwpt_data_type>,
     ) -> Result<Self> 
    • When Some, enables nested translation mode for subsequently attached VFIO devices
    • Supported types: IOMMU_HWPT_DATA_ARM_SMMUV3, IOMMU_HWPT_DATA_VTD_S1
  2. VfioDevice::new_with_iommufd():

    • New constructor for VFIO devices backed by iommufd with hardware-accelerated nested HWPT support
    • Signature:
      pub fn new_with_iommufd(
          sysfspath: &Path,
          vfio_ops: Arc<dyn VfioOps>,
          viommu: &mut Option<Arc<IommufdVIommu>>,
          virt_sid: Option<u64>,
      ) -> Result<(Self, Option<IommufdVDevice>)>
    • Automatically creates IommufdVIommu/IommufdVDevice when nested mode is enabled via VfioIommufd
    • Supports sharing a single IommufdVIommu instance across multiple VFIO devices
    • Returns IommufdVDevice handle for subsequent S1 HWPT operations
    • Attaches device to bypass HWPT by default (until guest enables IOMMU)
  3. VfioDevice::install_s1_hwpt():

    • Install guest-configured stage-1 page tables into hardware
    • Signature:
      pub fn install_s1_hwpt(
          &self,
          vdevice: &mut IommufdVDevice,
          hwpt_data: &IommufdHwptData,
      ) -> Result<()>
    • Called when guest writes to virtual IOMMU stream table entries
    • Atomically replaces existing S1 HWPT if present
    • Uses IommufdHwptData enum for type-safe hardware-specific configuration
  4. VfioDevice::uninstall_s1_hwpt():

    • Revert device to bypass or abort mode
    • Signature:
      pub fn uninstall_s1_hwpt(
          &self,
          vdevice: &mut IommufdVDevice,
          abort: bool,
      ) -> Result<()>
    • abort=true: Use abort HWPT (fault all DMA)
    • abort=false: Use bypass HWPT (passthrough translation)
    • Called during guest IOMMU reset or shutdown
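The attachment lifecycle described by interfaces 2 to 4 can be pictured as a small state machine. The sketch below is a self-contained model, not the crate's implementation: `S1Attachment` and `MockVDevice` are hypothetical names, and HWPT IDs are plain counters rather than kernel objects.

```rust
/// Which hardware page table a device is currently attached to.
/// Hypothetical model type; not part of vfio-ioctls or iommufd-ioctls.
#[derive(Debug, Clone, PartialEq, Eq)]
enum S1Attachment {
    /// Default after device creation: DMA passes through stage-1 untranslated.
    Bypass,
    /// All DMA faults.
    Abort,
    /// Guest-managed stage-1 table, composed with the host stage-2 in hardware.
    Nested { hwpt_id: u32 },
}

/// Stand-in for the per-device state an `IommufdVDevice` might track (assumption).
#[derive(Debug)]
struct MockVDevice {
    attachment: S1Attachment,
    next_hwpt_id: u32,
}

impl MockVDevice {
    /// A new device starts in bypass until the guest enables its IOMMU.
    fn new() -> Self {
        Self { attachment: S1Attachment::Bypass, next_hwpt_id: 1 }
    }

    /// Models `install_s1_hwpt()`: allocate a fresh nested HWPT and replace
    /// whatever the device was attached to before.
    fn install_s1_hwpt(&mut self) -> u32 {
        let id = self.next_hwpt_id;
        self.next_hwpt_id += 1;
        self.attachment = S1Attachment::Nested { hwpt_id: id };
        id
    }

    /// Models `uninstall_s1_hwpt()`: fall back to the abort or bypass HWPT.
    fn uninstall_s1_hwpt(&mut self, abort: bool) {
        self.attachment = if abort {
            S1Attachment::Abort
        } else {
            S1Attachment::Bypass
        };
    }
}
```

In this model a second `install_s1_hwpt()` call simply overwrites the previous nested attachment, which corresponds to the "replaces existing S1 HWPT if present" behaviour listed above.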

Dependencies on iommufd-ioctls:

This implementation builds upon three types from iommufd-ioctls [1]:

  • IommufdVIommu: Represents a physical IOMMU slice managing S2 HWPT and default S1 HWPTs (bypass/abort). Shared across devices behind the same virtual IOMMU.

  • IommufdVDevice: Represents a device attached to an IommufdVIommu. Handles dynamic S1 HWPT allocation and lifecycle management.

  • IommufdHwptData: Type-safe enum for architecture-specific HWPT configuration (SMMUv3 STE data, VT-d context entries).
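To illustrate what "type-safe" buys here, the following sketch pairs each payload with its data type so a mismatch is caught before any ioctl is issued. Everything in it is an assumption for illustration: `HwptDataSketch` and `DataTypeSketch` are hypothetical names, and the field layouts are only loosely modelled on the kernel uAPI structs; the real `IommufdHwptData` definition lives in iommufd-ioctls [1] and may differ.

```rust
/// Hypothetical mirror of the two supported stage-1 data types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataTypeSketch {
    ArmSmmuv3,
    VtdS1,
}

/// Hypothetical stand-in for `IommufdHwptData`; field layouts are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
enum HwptDataSketch {
    /// SMMUv3: the guest's stream table entry words.
    ArmSmmuv3 { ste: [u64; 2] },
    /// VT-d: the guest's stage-1 page table description.
    VtdS1 { flags: u64, pgtbl_addr: u64, addr_width: u32 },
}

impl HwptDataSketch {
    /// The data type is derived from the variant, so it cannot disagree
    /// with the payload.
    fn data_type(&self) -> DataTypeSketch {
        match self {
            HwptDataSketch::ArmSmmuv3 { .. } => DataTypeSketch::ArmSmmuv3,
            HwptDataSketch::VtdS1 { .. } => DataTypeSketch::VtdS1,
        }
    }

    /// Reject data whose type differs from the one the VMM configured.
    fn check(&self, configured: DataTypeSketch) -> Result<(), String> {
        if self.data_type() == configured {
            Ok(())
        } else {
            Err(format!(
                "hwpt data is {:?} but {:?} was configured",
                self.data_type(),
                configured
            ))
        }
    }
}
```

A caller that configured SMMUv3 nesting and then passes VT-d data gets an error from `check()` rather than a failing ioctl further down.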

Integration Notes for VMMs:

  1. VMM creates VfioIommufd with s1_hwpt_data_type set when a hardware-accelerated virtual IOMMU is enabled and manages the VFIO devices
  2. VMM calls VfioDevice::new_with_iommufd() per passthrough device
    • Devices behind the same virtual IOMMU should share one IommufdVIommu instance
    • Each VFIO device has its own VfioDevice and IommufdVDevice instance
  3. VMM must ensure the virtual IOMMU is compatible with the physical IOMMU:
    • IommufdVDevice::get_hw_info retrieves the hardware information of the physical IOMMU
  4. VMM traps guest IOMMU commands and calls:
    • install_s1_hwpt() when guest enables IOMMU
    • uninstall_s1_hwpt() when guest disables IOMMU
    • IommufdVIommu::invalidate_hwpt() when guest invalidates IOTLB entries
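Steps 2 and 4 can be sketched as follows, under loud assumptions: `MockVIommu`, `MockDevice`, `attach_device`, `GuestCommand` and `dispatch` are hypothetical names that only model the `viommu: &mut Option<Arc<...>>` sharing pattern and the trap-to-call mapping; none of them exist in vfio-ioctls or iommufd-ioctls.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `IommufdVIommu`.
#[derive(Debug)]
struct MockVIommu {
    id: u32,
}

/// Hypothetical stand-in for a passthrough device and its vIOMMU link.
#[derive(Debug)]
struct MockDevice {
    virt_sid: u64,
    viommu: Arc<MockVIommu>,
}

/// Mirrors the `viommu: &mut Option<Arc<..>>` parameter: the first device
/// creates the vIOMMU, later devices reuse the instance left in the option.
fn attach_device(viommu: &mut Option<Arc<MockVIommu>>, virt_sid: u64) -> MockDevice {
    let shared = Arc::clone(viommu.get_or_insert_with(|| Arc::new(MockVIommu { id: 1 })));
    MockDevice { virt_sid, viommu: shared }
}

/// Guest IOMMU commands a VMM might trap (illustrative set only).
enum GuestCommand {
    EnableIommu,
    DisableIommu,
    InvalidateIotlb,
}

/// Maps each trapped command to the call named in the integration notes.
fn dispatch(cmd: &GuestCommand) -> &'static str {
    match cmd {
        GuestCommand::EnableIommu => "install_s1_hwpt",
        GuestCommand::DisableIommu => "uninstall_s1_hwpt",
        GuestCommand::InvalidateIotlb => "IommufdVIommu::invalidate_hwpt",
    }
}
```

Passing the same `Option` to every `attach_device()` call is what keeps all devices behind one virtual IOMMU on a single shared instance.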

This lets a VMM manage VFIO devices through a hardware-accelerated virtual IOMMU, so that the physical IOMMU hardware processes guest page tables directly.

[1] cloud-hypervisor/iommufd#5

Signed-off-by: Bo Chen <[email protected]>
@likebreath likebreath force-pushed the 0130/rfc_iommufd_nested_hwpt branch from c3a9795 to f4bae29 Compare January 30, 2026 22:53
@likebreath likebreath changed the title [RFC] vfio-ioctls: Support hardware-accelerated nested HWPT via iommufd [RFC] Support hardware-accelerated nested translation via iommufd Jan 30, 2026
@likebreath likebreath marked this pull request as ready for review January 31, 2026 05:05
iommufd: Arc<IommuFd>,
ioas_id: Option<u32>,
device_fd: Option<VfioContainerDeviceHandle>,
s1_hwpt_data_type: Option<iommu_hwpt_data_type>,
Collaborator

Suggested change
- s1_hwpt_data_type: Option<iommu_hwpt_data_type>,
+ nested_hwpt: Option<iommu_hwpt_data_type>,

I think using nested_hwpt conveys more clearly that we're trying to use HWPT_NESTED with IOMMUFD.

Collaborator Author

That's a good suggestion.

Comment on lines +1141 to +1142
/// - If `None` and nested HWPT is enabled, a new vIOMMU instance is created and returned.
/// - If `Some`, the provided instance is reused.
Collaborator

Not sure we should allow so much flexibility. If we want to maintain a clear API, I'd rather expect the caller to always create the IommufdVIommu (when needed). That means this function should only return Result<Self>.

Collaborator Author

Very good discussion here.

The goal is to provide a unified API that supports both use cases: standard mode (current behavior with iommufd where devices are not managed by a virtual IOMMU in userspace) and accelerated mode (nested HWPT with a hardware-accelerated virtual IOMMU).

With the current design, the VMM maintains a consistent workflow. The only variation is whether the VfioIommufd instance is initialized with nested_hwpt enabled.

While the interfaces could be decomposed into more primitive operations, this would significantly increase the management burden on the VMM without providing clear added value.

Comparison of the workflows from the caller (e.g. userspace VMM):

// 1. Current Proposal (Unified API)
// The VMM only handles high-level initialization.
// `new_with_iommufd()` returns a `Result`, so propagate errors with `?`.
let (vfio_device, iommufd_vdevice) = VfioDevice::new_with_iommufd(
    vfio_path,
    vfio_iommufd,
    &mut iommufd_viommu,
    virt_sid,
)?;
// 2. Hypothetical "Primitive" API
// This forces the VMM to manually glue the components together.
let vfio_device = VfioDevice::new(vfio_path, vfio_iommufd);

// The VMM must manually extract IDs and link objects:
// new API for the accelerated mode only
let vfio_dev_id = vfio_device.get_dev_id();  
// new API from iommufd that VMM needs to interact with directly
let iommufd_viommu = IommufdVIommu::new(iommufd, vfio_dev_id); 
// new API from iommufd that VMM needs to interact with directly
let iommufd_vdevice = IommufdVDevice::new(iommufd_viommu, virt_sid); 

// And manually attach the Stage-1 page table:
vfio_device.attach_default_s1_hwpt(iommufd_viommu);  // new API for the accelerated mode only

Collaborator Author

I acknowledge that balancing API simplicity with effective encapsulation is always a trade-off.

It will be easier to gauge the trade-off with this design once we have a concrete implementation. We are currently working on that reference case: integrating accelerated vSMMUv3 support into Cloud Hypervisor.

Collaborator

I think the decision comes down to the expectation we have from this crate. I always think of this crate as a simple Rust layer, which is why I'm expecting the implementation to be as simple as possible. But if others think it's a good idea to embed a bit more logic into it, I'm fine with it!

/// # Parameters
/// * `vdevice`: the `IommufdVDevice` instance associated with the vfio device.
/// * `hwpt_data`: the hwpt data to create s1 hwpt.
pub fn install_s1_hwpt(
Collaborator

nit: I'd define the install function above the uninstall one.

hwpt_data: &IommufdHwptData,
) -> Result<()> {
// Uninstall existing s1 hwpt if exists
self.uninstall_s1_hwpt(vdevice, true)?;
Collaborator

I don't think this should be part of the install function. The function should fail if some page tables are already there (meaning the caller should be in charge of installing/uninstalling). The API should be as simple as possible, which means it shouldn't perform too many tasks (I think the caller should be in charge of driving the creation/cleanup).

Collaborator Author

I think this more or less falls into the same trade-off as discussed above, though I am much less opinionated in this case, given that uninstall_s1_hwpt() is always exposed to the caller and is very simple to use.
