fix(pillar): unbind IOMMU group siblings during PCI passthrough reserve #5670

Draft
rucoder wants to merge 2 commits into lf-edge:master from rucoder:rucoder/fix-vfio-iommu-group-siblings

Conversation

rucoder (Contributor) commented Mar 12, 2026

Description

When reserving a PCI device for VFIO passthrough, PCIReserveGeneric only bound the target device to vfio-pci, leaving other devices in the same IOMMU group with their kernel drivers. Any kernel driver bound to an IOMMU group sibling calls iommu_device_use_default_domain() during probe, which increments the group's DMA owner_cnt. This makes the VFIO group non-viable because iommu_group_dma_owner_claimed() returns true, and QEMU refuses to use the group for passthrough with: vfio: group <N> is not viable

This was exposed after upgrading from EVE-OS 14.5.3 (kernel 6.1.112) to 16.0.0 (kernel 6.12.49) because CONFIG_I2C_I801 was added as a module in the new kernel. The i801_smbus driver auto-loaded and bound to the SMBus controller (80:1f.4) sharing IOMMU group 19 with the target NIC (80:1f.6), making passthrough impossible.

Fix by enumerating actual IOMMU group members from sysfs and unbinding kernel drivers from all sibling devices before binding the target to vfio-pci. On release, re-probe siblings so their original drivers rebind.

The IOMMU group helpers are implemented on an iommuGroupContext struct with configurable sysfs paths, enabling unit testing with a fake sysfs tree.
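
As a rough illustration of the sysfs lookups described above, the following Go sketch shows one way group discovery and member enumeration could work with configurable roots. The names iommuGroupContext, getIOMMUGroup and getMembers come from this description; the field names, signatures, and the newFakeSysfs helper are assumptions made for the example and are not taken from the actual patch.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// iommuGroupContext carries the sysfs roots so tests can swap in a fake tree.
// Field names are illustrative (assumed); the PR's real fields are not shown.
type iommuGroupContext struct {
	pciDevicesPath  string // normally /sys/bus/pci/devices
	iommuGroupsPath string // normally /sys/kernel/iommu_groups
}

// getIOMMUGroup reads the <pciDevicesPath>/<addr>/iommu_group symlink and
// returns the last element of its target, which is the group number.
func (c *iommuGroupContext) getIOMMUGroup(addr string) (string, error) {
	target, err := os.Readlink(filepath.Join(c.pciDevicesPath, addr, "iommu_group"))
	if err != nil {
		return "", fmt.Errorf("read iommu_group of %s: %w", addr, err)
	}
	return filepath.Base(target), nil
}

// getMembers lists the PCI addresses under <iommuGroupsPath>/<group>/devices.
func (c *iommuGroupContext) getMembers(group string) ([]string, error) {
	entries, err := os.ReadDir(filepath.Join(c.iommuGroupsPath, group, "devices"))
	if err != nil {
		return nil, fmt.Errorf("list members of group %s: %w", group, err)
	}
	members := make([]string, 0, len(entries))
	for _, e := range entries {
		members = append(members, e.Name())
	}
	return members, nil // os.ReadDir returns entries sorted by name
}

// newFakeSysfs (hypothetical helper) builds a temporary tree with one group.
// Real sysfs uses symlinks for group members; plain directories suffice here.
func newFakeSysfs(group string, addrs ...string) (*iommuGroupContext, func(), error) {
	root, err := os.MkdirTemp("", "fake-sysfs")
	if err != nil {
		return nil, func() {}, err
	}
	cleanup := func() { os.RemoveAll(root) }
	c := &iommuGroupContext{
		pciDevicesPath:  filepath.Join(root, "bus", "pci", "devices"),
		iommuGroupsPath: filepath.Join(root, "kernel", "iommu_groups"),
	}
	groupDir := filepath.Join(c.iommuGroupsPath, group)
	for _, addr := range addrs {
		devDir := filepath.Join(c.pciDevicesPath, addr)
		if err := os.MkdirAll(devDir, 0o755); err != nil {
			return c, cleanup, err
		}
		if err := os.MkdirAll(filepath.Join(groupDir, "devices", addr), 0o755); err != nil {
			return c, cleanup, err
		}
		if err := os.Symlink(groupDir, filepath.Join(devDir, "iommu_group")); err != nil {
			return c, cleanup, err
		}
	}
	return c, cleanup, nil
}

func main() {
	c, cleanup, err := newFakeSysfs("19", "0000:80:1f.4", "0000:80:1f.6")
	defer cleanup()
	if err != nil {
		panic(err)
	}
	group, err := c.getIOMMUGroup("0000:80:1f.6")
	if err != nil {
		panic(err)
	}
	members, err := c.getMembers(group)
	fmt.Println("group:", group, "members:", members, "error:", err)
}
```

Keeping both roots as fields means the same methods can run against /sys in production and against a temp directory in tests, which is the property the description relies on.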

Changes:

  • Add iommuGroupContext struct with configurable sysfs paths and methods:
    getIOMMUGroup, getMembers, isBoundToVfioPci, unbindSiblings, reprobeSiblings
  • Modify PCIReserveGeneric to unbind IOMMU group siblings before binding target to vfio-pci
  • Modify PCIReleaseGeneric to re-probe siblings after releasing the target device
  • Add 9 unit tests covering group discovery, member enumeration, driver detection,
    sibling unbind, and sibling reprobe using a fake sysfs tree
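
The unbind and re-probe steps listed above might look roughly like the sketch below. It assumes the standard sysfs interfaces: writing a PCI address to the bound driver's unbind file, and writing it to /sys/bus/pci/drivers_probe to trigger a new probe. The sysfsPaths type, the signatures, the driver-name comparison, and the newFakeSysfs helper are hypothetical and stand in for whatever the patch actually does.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// sysfsPaths is a hypothetical stand-in for the PR's iommuGroupContext.
type sysfsPaths struct {
	pciDevices   string // normally /sys/bus/pci/devices
	driversProbe string // normally /sys/bus/pci/drivers_probe
}

// driverName returns the bound driver's name, or "" when no driver is bound.
func (p sysfsPaths) driverName(addr string) (string, error) {
	target, err := os.Readlink(filepath.Join(p.pciDevices, addr, "driver"))
	if errors.Is(err, os.ErrNotExist) {
		return "", nil
	}
	return filepath.Base(target), err // callers must check err first
}

// unbindSiblings detaches kernel drivers from all members except the target.
func (p sysfsPaths) unbindSiblings(target string, members []string) ([]string, error) {
	var unbound []string
	for _, addr := range members {
		drv, err := p.driverName(addr)
		if err != nil {
			return unbound, err
		}
		if addr == target || drv == "" || drv == "vfio-pci" {
			continue // nothing to detach
		}
		unbind := filepath.Join(p.pciDevices, addr, "driver", "unbind")
		if err := os.WriteFile(unbind, []byte(addr), 0o644); err != nil {
			return unbound, fmt.Errorf("unbind %s from %s: %w", addr, drv, err)
		}
		unbound = append(unbound, addr)
	}
	return unbound, nil
}

// reprobeSiblings asks the PCI core to re-match each driverless sibling.
func (p sysfsPaths) reprobeSiblings(target string, members []string) ([]string, error) {
	var probed []string
	for _, addr := range members {
		drv, err := p.driverName(addr)
		if err != nil {
			return probed, err
		}
		if addr == target || drv != "" {
			continue // already bound, leave it alone
		}
		if err := os.WriteFile(p.driversProbe, []byte(addr), 0o644); err != nil {
			return probed, fmt.Errorf("reprobe %s: %w", addr, err)
		}
		probed = append(probed, addr)
	}
	return probed, nil
}

// newFakeSysfs builds a throwaway tree; drivers maps address to driver ("" = unbound).
func newFakeSysfs(drivers map[string]string) (sysfsPaths, func(), error) {
	root, err := os.MkdirTemp("", "fake-sysfs")
	if err != nil {
		return sysfsPaths{}, func() {}, err
	}
	p := sysfsPaths{filepath.Join(root, "devices"), filepath.Join(root, "drivers_probe")}
	cleanup := func() { os.RemoveAll(root) }
	for addr, drv := range drivers {
		devDir, drvDir := filepath.Join(p.pciDevices, addr), filepath.Join(root, "drivers", drv)
		if err := errors.Join(os.MkdirAll(devDir, 0o755), os.MkdirAll(drvDir, 0o755)); err != nil {
			return p, cleanup, err
		}
		if drv == "" {
			continue // unbound device: no "driver" link
		}
		if err := os.Symlink(drvDir, filepath.Join(devDir, "driver")); err != nil {
			return p, cleanup, err
		}
	}
	return p, cleanup, nil
}

func main() {
	members := []string{"0000:80:1f.4", "0000:80:1f.6"}
	p, cleanup, err := newFakeSysfs(map[string]string{members[0]: "i801_smbus", members[1]: "vfio-pci"})
	defer cleanup()
	if err != nil {
		panic(err)
	}
	unbound, err := p.unbindSiblings(members[1], members)
	fmt.Println("unbound siblings:", unbound, "error:", err)
}
```

In the fake tree the driver link is not removed by the write, so a test can only check which devices were selected. On a real kernel the unbind write removes the link, and the later re-probe then finds the sibling driverless.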

How to test and validate this PR

  1. Use hardware with a multi-function PCH device where multiple functions share the same IOMMU group (e.g. Intel C620 chipset with 80:1f.4 SMBus and 80:1f.6 NIC in group 19)
  2. Assign the NIC for PCI passthrough to a VM
  3. Verify the VM starts successfully and the NIC is accessible inside the VM
  4. Before this fix, step 3 failed with "vfio: group 19 is not viable" when i801_smbus was bound to a sibling device
  5. Shut down the VM and verify sibling device drivers (e.g. i801_smbus) rebind automatically
  6. Unit tests: go test ./hypervisor/... -run "TestGetIOMMU|TestGetMembers|TestIsBound|TestUnbind|TestReprobe" -v

Changelog notes

Fixed VFIO PCI passthrough failure ("group is not viable") that occurred when kernel drivers were bound to sibling devices in the same IOMMU group. This commonly affected systems after upgrading to EVE-OS 16.0.0 where the i801_smbus driver was newly enabled.

PR Backports

  • 16.0-stable: To be backported.
  • 14.5-stable: No, the issue does not manifest there (CONFIG_I2C_I801 not enabled in 6.1 kernel).
  • 13.4-stable: No, same reason.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

rucoder requested review from rene, rouming and shjala as code owners March 12, 2026 14:30
rucoder added the stable label (Should be backported to stable release(s)) March 12, 2026
rucoder added 2 commits March 12, 2026 15:51
When reserving a PCI device for VFIO passthrough, PCIReserveGeneric only
bound the target device to vfio-pci, leaving other devices in the same
IOMMU group with their kernel drivers. Any kernel driver bound to an
IOMMU group sibling calls iommu_device_use_default_domain() during probe,
which increments the group's DMA owner_cnt. This makes the VFIO group
non-viable because iommu_group_dma_owner_claimed() returns true, and
QEMU refuses to use the group for passthrough.

This was exposed after upgrading from EVE-OS 14.5.3 (kernel 6.1.112) to
16.0.0 (kernel 6.12.49) because CONFIG_I2C_I801 was added as a module
in the new kernel. The i801_smbus driver auto-loaded and bound to the
SMBus controller (80:1f.4) sharing IOMMU group 19 with the target NIC
(80:1f.6), making passthrough impossible.

Fix by enumerating actual IOMMU group members from sysfs and unbinding
kernel drivers from all sibling devices before binding the target to
vfio-pci. On release, re-probe siblings so their original drivers rebind.

The IOMMU group helpers are implemented on an iommuGroupContext struct
with configurable sysfs paths to enable unit testing with fake sysfs.

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>

Add 9 test cases covering IOMMU group operations using a fake sysfs
tree in a temp directory:
- IOMMU group discovery from sysfs symlinks
- group member enumeration (multi-device and single-device groups)
- vfio-pci driver detection via os.SameFile
- sibling unbind (kernel drivers unbound, vfio-pci and unbound skipped)
- sibling reprobe (unbound devices probed, already-bound skipped)

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
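
The driver-detection case mentioned in the commit message above (vfio-pci detection via os.SameFile) can be sketched as follows. This is a minimal example under assumptions: the function signature, the fakeDevice and checkFake helpers, and the fake tree layout are invented for illustration, and only the use of os.SameFile is taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// isBoundToVfioPci reports whether devDir/driver and vfioDriverDir resolve to
// the same directory. os.Stat follows symlinks, and os.SameFile compares the
// resulting file identities rather than the path strings.
func isBoundToVfioPci(devDir, vfioDriverDir string) (bool, error) {
	drv, err := os.Stat(filepath.Join(devDir, "driver"))
	if errors.Is(err, os.ErrNotExist) {
		return false, nil // no driver bound
	}
	if err != nil {
		return false, err
	}
	vfio, err := os.Stat(vfioDriverDir)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil // vfio-pci driver directory absent
	}
	if err != nil {
		return false, err
	}
	return os.SameFile(drv, vfio), nil
}

// fakeDevice (hypothetical) creates <root>/devices/<addr> and, when driver is
// non-empty, links its "driver" entry to <root>/drivers/<driver>.
func fakeDevice(root, addr, driver string) (string, error) {
	devDir := filepath.Join(root, "devices", addr)
	if err := os.MkdirAll(devDir, 0o755); err != nil {
		return "", err
	}
	if driver == "" {
		return devDir, nil
	}
	drvDir := filepath.Join(root, "drivers", driver)
	if err := os.MkdirAll(drvDir, 0o755); err != nil {
		return "", err
	}
	return devDir, os.Symlink(drvDir, filepath.Join(devDir, "driver"))
}

// checkFake builds a one-device fake tree that always contains a vfio-pci
// driver directory, then runs the detection against it.
func checkFake(driver string) (bool, error) {
	root, err := os.MkdirTemp("", "fake-sysfs")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)
	vfioDir := filepath.Join(root, "drivers", "vfio-pci")
	if err := os.MkdirAll(vfioDir, 0o755); err != nil {
		return false, err
	}
	devDir, err := fakeDevice(root, "0000:80:1f.6", driver)
	if err != nil {
		return false, err
	}
	return isBoundToVfioPci(devDir, vfioDir)
}

func main() {
	for _, drv := range []string{"vfio-pci", "i801_smbus", ""} {
		bound, err := checkFake(drv)
		fmt.Printf("driver=%q boundToVfioPci=%v err=%v\n", drv, bound, err)
	}
}
```

Comparing file identities avoids depending on whether the driver link is relative or absolute, which differs between real sysfs and a fake tree built in a temp directory.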
rucoder force-pushed the rucoder/fix-vfio-iommu-group-siblings branch from 8e92dc8 to 6fd1719 March 12, 2026 14:51
codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (6fd1719).
⚠️ Report is 339 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5670      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     

rene (Contributor) commented Mar 13, 2026

@rucoder I'm not sure we should deliberately unbind sibling devices like this. You might end up unbinding important devices without notice and cause the system to freeze or crash. Users must be aware of devices in the same IOMMU group that cannot be split through the ACS patch; if they really want to perform the passthrough, they should pass through all the devices of the group. We have had different cases of this same situation where the sibling device was a system device or a Thunderbolt controller, so I think this approach might be error-prone. I certainly see the advantages as well, which is why I'm not against it, but I'd like to discuss it a bit more.

☝️ @eriknordmark ....

rucoder (Contributor, Author) commented Mar 13, 2026

@rene yes, but EVE only has some devices in the model, and we unbind only devices we know about. The kernel upgrade introduced a new driver that did not exist before, so previously there was no problem passing through the whole group. Now we are still trying to pass through the whole group, BUT one device got a driver and EVE doesn't know about it. So nothing changed in the way we treat the group ("all or nothing"), and the group content is exactly the same on both EVE versions, but since we do not care about devices unknown to EVE, we cannot pass through the whole group anymore: the driver prevents it.

rene (Contributor) left a comment

LGTM

rucoder marked this pull request as draft March 14, 2026 12:56

rucoder (Contributor, Author) commented Mar 14, 2026

@rene the fix works for the pass-through issue, but a WD reset was reported. Converting to draft for now.
