CXL Type-2 device passthrough via vfio-pci #17
Sort sparse mmap regions by offset during region setup to ensure a predictable mapping order, avoid overlaps, and handle the gaps between sub-regions properly. Add validation to detect overlapping sparse regions early during setup, before any mapping operations begin. The sorting is performed on the subregion ranges during vfio_setup_region_sparse_mmaps(). This also ensures that subsequent mapping code can rely on subregions being in ascending offset order. This is preparatory work for alignment adjustments needed to support hugepfnmap on systems (e.g., Grace-based systems) where device memory may have non-power-of-2 sizes. cc: Alex Williamson <alex@shazbot.org> Reviewed-by: Alex Williamson <alex@shazbot.org> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-2-ankita@nvidia.com Signed-off-by: Cédric Le Goater <clg@redhat.com> (cherry picked from commit da02b21) Signed-off-by: Jiandi An <jan@nvidia.com>
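The sort-then-validate step described in this commit can be sketched as follows. This is a minimal illustration, not the code from the patch: the SparseArea type and both function names are hypothetical, and the sketch assumes offset + size does not overflow 64 bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one sparse mmap area (offset and size in bytes). */
typedef struct SparseArea {
    uint64_t offset;
    uint64_t size;
} SparseArea;

/* qsort comparator: ascending offset, written to avoid subtraction overflow. */
static int sparse_area_cmp(const void *a, const void *b)
{
    const SparseArea *x = a;
    const SparseArea *y = b;

    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort areas by offset, then reject any area that starts before the
 * previous one ends. Returns true when the sorted list is overlap-free.
 */
static bool sparse_areas_sort_and_validate(SparseArea *areas, size_t n)
{
    qsort(areas, n, sizeof(*areas), sparse_area_cmp);

    for (size_t i = 1; i < n; i++) {
        uint64_t prev_end = areas[i - 1].offset + areas[i - 1].size;

        if (areas[i].offset < prev_end) {
            return false;
        }
    }
    return true;
}
```

Because the overlap check runs on the sorted array, it only needs to compare neighbours, and later code can compute each gap as the distance from one area's end to the next area's offset.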
Add an Error **errp parameter to vfio_region_setup() and vfio_setup_region_sparse_mmaps() to allow proper error handling instead of just returning error codes. The functions set errors via error_setg() when a failure occurs. Suggested-by: Cedric Le Goater <clg@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-3-ankita@nvidia.com Signed-off-by: Cédric Le Goater <clg@redhat.com> (cherry picked from commit c420101) [jan: [jan: Update vfio_region_setup() caller in platform.c; file was removed upstream (762c855) before this commit landed, but nvidia_unstable-10.1 still has it]] Signed-off-by: Jiandi An <jan@nvidia.com>
CXL VFIO passthrough needs a stable guest physical address range for device memory (DPA) that falls inside a CFMWS entry the guest discovers from ACPI CEDT. Without a dedicated range in the address map, the HDM decoder has nowhere to point. Add VIRT_HIGH_CXL_MMIO immediately after the second PCIe MMIO window. It gets its own highmem_cxl_mmio flag in VirtMachineState rather than sharing highmem_cxl, so the two slots are independently controllable even though both are currently tied to CXL bridge presence. The base and size flow through GPEXConfig.cxl_mmio to acpi_dsdt_add_gpex(), which carves out a QWord memory descriptor in the first CXL root bridge's _CRS. The CFMWS window is system-wide, so only the first CXL bridge gets the descriptor - subsequent ones would produce duplicate resource claims for the same range. build_crs() already emits the bridge's own 64-bit ranges into crs. The CFMWS window is a separate system-wide range, so only that window is appended as a new QWord descriptor; the bridge ranges are not re-emitted. A warn_report() fires if the CFMWS window overlaps any existing bridge 64-bit range, since that would indicate an address layout conflict. Signed-off-by: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) [jan: Resolve #include conflict with hw/acpi/acpi_egm_memory.h added by nvidia_unstable-10.1 base; keep both includes] Signed-off-by: Jiandi An <jan@nvidia.com>
Before this patch, pxb-cxl bridges had no _DSM method at all. When the OS called _DSM on a CXL host bridge, ACPI returned an error and the OS defaulted to reassigning resources across suspend/resume. On machines where firmware pre-commits the HDM decoder, that reassignment breaks the DPA mapping. Wire preserve_config through GPEXConfig into build_cxl_osc_method() so pxb-cxl host bridges get a _DSM method that signals the OS to keep resource assignments stable when needed. The _DSM function 5 (preserve firmware PCI configuration) is the mechanism used to convey this. build_pci_host_bridge_dsm_method() is promoted from static to exported so cxl.c can call it without duplicating the AML. The x86 build_cxl_osc_method() call site passes false since x86 does not use firmware-committed HDM decoders. build_cxl_osc_method() is renamed to acpi_dsdt_add_cxl_host_bridge_methods(). The function now appends both the CXL _OSC method and the _DSM method, so its old name is misleading. The new name matches the pxb-pcie analogue acpi_dsdt_add_host_bridge_methods(), making the two root bridge code paths symmetric. The rename itself causes no AML change. Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…sthrough Sync the VFIO UAPI additions from the kernel CXL Type-2 passthrough series. VFIO_DEVICE_FLAGS_CXL (bit 9) marks a device as CXL Type-2 and guarantees the capability chain includes a vfio_device_info_cap_cxl entry (cap id 6). That capability carries the BAR index holding the CXL component registers, flags for firmware-committed and cache-capable devices, the byte offset to the HDM Decoder Capability block within that BAR, and region indices for both the DPA memory region and the Component Register shadow. Two new region subtypes: VFIO_REGION_SUBTYPE_CXL (1): mmappable DPA memory; VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2): HDM decoder shadow, r/w only. Note: UAPI headers are normally kept in sync via scripts/update-linux-headers.sh once upstream kernel changes merge. This patch manually adds the CXL Type-2 additions as a temporary measure to unblock QEMU development. It should be dropped and replaced with a proper header sync once the kernel series is accepted. Signed-off-by: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…ustom region ops vfio_region_setup() always initializes the region MemoryRegion with vfio_region_ops. CXL needs custom pread/pwrite ops for the Component Register shadow region. Add vfio_region_setup_with_ops() which accepts a const MemoryRegionOps * parameter. When non-NULL it is passed to memory_region_init_io(); when NULL the existing vfio_region_ops is used. vfio_region_setup() is retained unchanged as a thin wrapper for all existing callers. Signed-off-by: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…n setup When VFIO_DEVICE_FLAGS_CXL is set, the kernel has identified a CXL Type-2 device and populated the capability chain with a vfio_device_info_cap_cxl entry. Read that entry to locate the DPA and CXL Component Register shadow regions, then call vfio_region_setup() for each. DPA covers the device's host-managed memory and is faulted in lazily by the VMM. The CXL Component Register shadow gives the VMM access to the HDM Decoder Capability block so it can intercept decoder commits without touching the hardware register page directly. vfio_cxl_derive_hdm_info() walks the CXL Capability Array inside the Component Register shadow to find the HDM Decoder capability (ID 0x5) and extracts hdm_decoder_offset and hdm_count. All reads use le32_to_cpu() since the capability array is little-endian per the CXL spec. Dword 0 is the array header; capability entries start at dword 1, which is why the loop begins at i = 1. CXL register constants are defined here using names that mirror <linux/cxl.h> to make cross-referencing straightforward. Add the VFIOCXL struct embedded in VFIOPCIDevice. Signed-off-by: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) [jan: Resolve include path conflict (nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h not yet renamed to hw/core/); add VMChangeStateEntry vmstate field from base] Signed-off-by: Jiandi An <jan@nvidia.com>
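The capability-array walk described in this commit can be sketched roughly as below. This is not vfio_cxl_derive_hdm_info() itself: the macro and function names are hypothetical, read_le32() is a portable stand-in for le32_to_cpu(), and the field layout (array size in header bits 31:24, capability ID in entry bits 15:0, block offset in entry bits 31:20) reflects my reading of the CXL spec and should be checked against it. Decoding hdm_count is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative field accessors; names are not taken from the patch. */
#define CAP_ARRAY_SIZE(hdr)  (((hdr) >> 24) & 0xff)   /* entries after the header */
#define CAP_ENTRY_ID(ent)    ((ent) & 0xffff)
#define CAP_ENTRY_PTR(ent)   (((ent) >> 20) & 0xfff)  /* byte offset of the block */
#define CAP_ID_HDM_DECODER   0x5

/* Decode a little-endian dword byte by byte, independent of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Scan the capability array held in a copy of the Component Register
 * shadow. Dword 0 is the array header, so entries start at index 1.
 * Returns true and stores the block offset when the HDM Decoder
 * capability is found.
 */
static bool find_hdm_offset(const uint8_t *shadow, size_t len, uint32_t *offset)
{
    if (len < 4) {
        return false;
    }

    uint32_t count = CAP_ARRAY_SIZE(read_le32(shadow));

    for (uint32_t i = 1; i <= count; i++) {
        if ((size_t)i * 4 + 4 > len) {
            break;                      /* entry would lie outside the buffer */
        }

        uint32_t ent = read_le32(shadow + (size_t)i * 4);

        if (CAP_ENTRY_ID(ent) == CAP_ID_HDM_DECODER) {
            *offset = CAP_ENTRY_PTR(ent);
            return true;
        }
    }
    return false;
}
```

The bounds check inside the loop is an addition of this sketch; it keeps a corrupt array-size field from driving reads past the end of the shadow buffer.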
…_REGS overlay
The CXL Component Register BAR contains two types of ranges that need
different handling:
- Accelerator register windows: passed through as direct hardware
mmaps for performance. The kernel reports the real BAR size and
lists mmappable windows via VFIO_REGION_INFO_CAP_SPARSE_MMAP,
excluding the HDM Decoder Capability block. vfio_region_mmap()
creates hardware-backed sub-regions for each sparse area.
- HDM Decoder Capability block: guest accesses must go through
emulated ops so QEMU can observe and program decoder state. The
kernel blocks direct mmap of this range.
vfio_bar_register(): after the normal mmap path, overlay the COMP_REGS
emulation region at hdm_regs_offset with priority 1. In QEMU's
MemoryRegion model, overlapping subregions are resolved by priority;
the default is 0. Priority 1 ensures guest accesses to the HDM range
always dispatch through the emulated COMP_REGS ops regardless of any
hardware-backed sub-region at a neighbouring offset.
vfio_pci_bars_exit(): remove the COMP_REGS overlay before the normal
BAR teardown path.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve pci_register_bar conflict; nvidia_unstable-10.1 uses &vdev->pdev not local pdev variable]
Signed-off-by: Jiandi An <jan@nvidia.com>
… firmware-committed devices setup_locked_hdm() runs as a machine_done notifier after all devices have been realized. It programs HDM decoder 0 with the CFMWS base address so the guest can fault into device memory from the first instruction. The notifier is only registered when the kernel reports the device as firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED). The host is responsible for HDM decoder programming; the guest has no mechanism to remap host physical address mappings. The function uses cxl->fmws_base (set by the optional cxl-fmws-base device property) if non-zero; otherwise it falls back to the cxl_fmws_base global captured by cxl_fmws_set_memmap() during machine memory-map init. If neither is set, it warns and returns without programming anything. If COMMIT_LOCK is set in decoder 0 CTRL at machine_done time (left-over from a prior FLR?), it is cleared before writing BASE so the subsequent write is not blocked. COMMIT_LOCK is re-set after programming so the hardware enforces the committed base. read_region() return is checked; failure aborts programming rather than leaving ctrl uninitialized. All write_region() failures are propagated. The function exits cleanly rather than leaving the decoder half-programmed. Add cxl_fmws_base as a hwaddr global in cxl-host.c (and a stub in cxl-host-stubs.c). It is set once by cxl_fmws_set_memmap() and read later at machine_done time. Signed-off-by: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
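The base selection and the unlock, program, re-lock ordering described above can be modelled as in the sketch below. Everything here is a stand-in: the Decoder struct, the register indices, the CTRL_COMMIT_LOCK bit position, and the reg_read()/reg_write() helpers are hypothetical and do not reproduce the patch's read_region()/write_region() signatures or the real register layout. The model also does not emulate hardware blocking writes while the lock is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register model; indices and the lock bit are placeholders. */
enum { REG_BASE_LO, REG_BASE_HI, REG_CTRL, REG_COUNT };
#define CTRL_COMMIT_LOCK (1u << 8)

typedef struct Decoder {
    uint32_t regs[REG_COUNT];
    bool fail_io;                 /* test knob: make every access fail */
} Decoder;

static int reg_read(Decoder *d, int reg, uint32_t *val)
{
    if (d->fail_io) {
        return -1;
    }
    *val = d->regs[reg];
    return 0;
}

static int reg_write(Decoder *d, int reg, uint32_t val)
{
    if (d->fail_io) {
        return -1;
    }
    d->regs[reg] = val;
    return 0;
}

/* The per-device property wins when non-zero; otherwise use the global base. */
static uint64_t pick_fmws_base(uint64_t prop_base, uint64_t global_base)
{
    return prop_base ? prop_base : global_base;
}

/* Returns 0 on success, -1 if no base is configured or any access fails. */
static int program_locked_base(Decoder *d, uint64_t prop_base,
                               uint64_t global_base)
{
    uint64_t base = pick_fmws_base(prop_base, global_base);
    uint32_t ctrl;

    if (!base) {
        return -1;                /* nothing configured: program nothing */
    }
    if (reg_read(d, REG_CTRL, &ctrl)) {
        return -1;                /* never continue with ctrl uninitialized */
    }
    if (ctrl & CTRL_COMMIT_LOCK) {
        ctrl &= ~CTRL_COMMIT_LOCK;
        if (reg_write(d, REG_CTRL, ctrl)) {
            return -1;
        }
    }
    if (reg_write(d, REG_BASE_LO, (uint32_t)base) ||
        reg_write(d, REG_BASE_HI, (uint32_t)(base >> 32))) {
        return -1;
    }
    /* Re-assert the lock so the committed base is enforced again. */
    return reg_write(d, REG_CTRL, ctrl | CTRL_COMMIT_LOCK);
}
```

Checking the read before touching any register is what keeps a failed access from producing a write based on an indeterminate ctrl value.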
The SMMUv3 primary bus check only accepted pxb-pcie as a valid root. pxb-cxl uses the same PCIe-compatible bus implementation, but because the check rejected it, CXL devices behind a pxb-cxl bridge could not reach the IOMMU. Extend the check to also accept CXL buses so SMMUv3 translation applies to passthrough CXL devices. Update the comment above the check to mention pxb-cxl alongside pxb-pcie. Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
…ice regions vfio_container_region_add() attempts an IOMMU DMA mapping for every RAM section that enters the guest address space. For VFIO mmap-backed regions (PCI BAR windows, CXL.mem regions), this mapping always fails: the backing VMAs carry VM_IO | VM_PFNMAP flags and pin_user_pages() refuses to pin VM_IO pages, so IOMMU_IOAS_MAP returns -EFAULT. CPU access to these regions goes through KVM Stage-2 page faults independently of the SMMU/IOMMU, so no IOMMU entry is required for correct operation. Add an early return for RAM-device sections owned by a VFIO device. vfio_get_vfio_device(memory_region_owner(section->mr)) returns non-NULL for any mmap subregion created by vfio_region_mmap(), since memory_region_init_ram_device_ptr() propagates the VFIOPCIDevice owner from the containing region. Matching on ownership covers both normal PCI BAR windows and CXL.mem regions uniformly; non-VFIO RAM-device regions such as NVDIMMs are unaffected and continue through the normal mapping path. Signed-off-by: Manish Honap <mhonap@nvidia.com> (backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/) Signed-off-by: Jiandi An <jan@nvidia.com>
bf63a03 vfio: Add Error ** parameter to vfio_region_setup()
Can you change the trailer from “cherry picked from” to “backported from”? Also, instead of [jan: [jan: …]], can you just use [jan: …]?
Codex found 2 "high" findings (I'm not considering anything lower than high since this series is still WIP and this is an internal branch):
The code sets up CXL VFIO regions for any device with VFIO_DEVICE_FLAGS_CXL. In vfio_cxl_setup(), it creates the DPA memory region and the COMP_REGS region, then derives HDM decoder info. That part happens regardless of
The actual DPA mapping into guest physical memory only happens in setup_locked_hdm():
    memory_region_add_subregion(sys_mem, cxl->fmws_base, cxl->region.mem);
But setup_locked_hdm() is only registered here:
    if (cap->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED) {
So for a non-firmware-committed CXL device, QEMU accepts the device and exposes the component-register overlay, but never inserts the DPA region into the guest address space. Guest HDM decoder writes go through the
If the supported scope is firmware-committed only, this should be enforced explicitly. Otherwise users can boot a VM with a device that appears partially configured but whose memory is not reachable.
Recommended resolution: in vfio_cxl_setup(), fail early unless VFIO_CXL_CAP_FIRMWARE_COMMITTED is set.
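The recommended early failure could look roughly like the sketch below. It is a standalone illustration: the flag's bit position, the function name, and the error-buffer interface are assumptions, and the real code would report the failure through QEMU's Error **errp and error_setg() instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder bit; the real value comes from the kernel UAPI header. */
#define CAP_FIRMWARE_COMMITTED (1u << 0)

/*
 * Reject devices outside the supported scope before any region is set
 * up, so a partially configured device is never exposed to the guest.
 */
static bool cxl_check_supported(uint32_t cap_flags, char *err, size_t errlen)
{
    if (!(cap_flags & CAP_FIRMWARE_COMMITTED)) {
        snprintf(err, errlen,
                 "CXL device is not firmware-committed; passthrough unsupported");
        return false;
    }
    return true;
}
```

Placing the check first means realize fails with a clear message instead of succeeding with unreachable device memory.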
Firmware-committed devices register a machine-init-done notifier:
    cxl->machine_done.notify = setup_locked_hdm;
That notifier is stored in QEMU’s global machine_init_done_notifiers list. QEMU provides a matching cleanup API:
    qemu_remove_machine_init_done_notifier(&cxl->machine_done);
but this series never calls it.
That matters because vfio_cxl_setup() runs relatively early in vfio_pci_realize(). After the notifier is registered, later realize steps can still fail: config setup, IOMMU attachment, PCI capability setup, quirks,
When machine init later completes, QEMU walks that global list and calls the stale notifier. setup_locked_hdm() then does:
    VFIOCXL *cxl = container_of(notifier, VFIOCXL, machine_done);
If the VFIO device was finalized, that is a use-after-free risk. If it was partially torn down, it can still access invalid VFIO region/device state.
The same lifecycle problem exists for hot-unplug or normal device exit: vfio_exitfn() tears down BARs, interrupts, migration, and IOMMU state, but does not remove the CXL notifier.
Recommended resolution: add a boolean such as machine_done_registered, set it after registration, and remove the notifier in all teardown paths before the CXL/VFIO state is finalized.
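The lifecycle fix recommended above can be modelled with a toy notifier list, as sketched below. This is not QEMU's notifier implementation: the list, the CxlState struct, and every function name are hypothetical. In QEMU the removal would go through qemu_remove_machine_init_done_notifier().

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a global notifier list; not QEMU's implementation. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n);
    Notifier *next;
};

static Notifier *machine_done_list;

static void list_add(Notifier *n)
{
    n->next = machine_done_list;
    machine_done_list = n;
}

static void list_remove(Notifier *n)
{
    for (Notifier **p = &machine_done_list; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Hypothetical device state carrying the flag the review suggests. */
typedef struct CxlState {
    Notifier machine_done;
    bool machine_done_registered;
    int hdm_programmed;
} CxlState;

static void on_machine_done(Notifier *n)
{
    /* Same idea as container_of(): recover the owner from the member. */
    CxlState *cxl = (CxlState *)((char *)n - offsetof(CxlState, machine_done));

    cxl->hdm_programmed++;
}

static void cxl_register(CxlState *cxl)
{
    cxl->machine_done.notify = on_machine_done;
    list_add(&cxl->machine_done);
    cxl->machine_done_registered = true;
}

/* Idempotent; call from every teardown path before the state is freed. */
static void cxl_unregister(CxlState *cxl)
{
    if (cxl->machine_done_registered) {
        list_remove(&cxl->machine_done);
        cxl->machine_done_registered = false;
    }
}

static void run_machine_done(void)
{
    for (Notifier *n = machine_done_list; n; n = n->next) {
        n->notify(n);
    }
}
```

The flag makes the unregister call safe on paths where registration never happened, which is what lets a single cleanup helper serve realize failure, normal exit, and hot-unplug.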
CXL Type-2 Device Passthrough via VFIO-PCI
Summary
Port RFC series for CXL Type-2 device passthrough to the
nvidia_unstable-10.1 branch. This enables VFIO-based passthrough of CXL
Type-2 (accelerator) devices with host-managed device memory (DPA) to
arm64 virtual machines.
Upstream series: https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/
Base branch: nvidia_unstable-10.1 (1:10.1.0+nvidia-unstable8-1)
Prerequisites cherry-picked from upstream QEMU
Two upstream commits were cherry-picked as prerequisites because Manish's
series depends on the Error **errp parameter in vfio_region_setup():
1. da02b21cc7
2. c42010197e
Backport note for commit 2: Upstream removed vfio-platform (762c855439)
before this commit landed, so platform.c was not modified in the original.
nvidia_unstable-10.1 still has platform.c, so the vfio_region_setup()
caller there was updated to pass errp.
Patch series (9 patches from the RFC)
Files changed, per patch:
Patch 1: hw/arm/virt.c, hw/pci-host/gpex-acpi.c, include/hw/arm/virt.h
Patch 2: hw/acpi/cxl.c, hw/pci-host/gpex-acpi.c, include/hw/acpi/cxl.h
Patch 3: linux-headers/linux/vfio.h
Patch 4: hw/vfio/region.c, include/hw/vfio/vfio-region.h
Patch 5: hw/vfio/pci.c, hw/vfio/pci.h, hw/vfio/trace-events
Patch 6: hw/vfio/pci.c
Patch 7: hw/vfio/pci.c, hw/cxl/cxl-host.c, include/hw/cxl/cxl_host.h
Patch 8: hw/arm/smmu-common.c
Patch 9: hw/vfio/listener.c
Conflict resolutions during backport
The following conflicts were manually resolved when applying to
nvidia_unstable-10.1:
Patch 1 - hw/arm/virt: Add CXL FMWS PA window
File: hw/pci-host/gpex-acpi.c
Issue: #include "hw/acpi/acpi_egm_memory.h" added by nvidia_unstable-10.1 conflicts with new #include "hw/acpi/cxl.h".
Resolution: Keep both includes.
Patch 5 - hw/vfio/pci: Add CXL Type-2 device detection
File: hw/vfio/pci.c
Issue: nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h (not yet renamed to hw/core/hw-error.h and hw/core/iommu.h).
Resolution: Use old include names matching the base branch.
File: hw/vfio/pci.h
Issue: nvidia_unstable-10.1 has VMChangeStateEntry *vmstate in VFIOPCIDevice; upstream does not.
Resolution: Keep vmstate from base, add VFIOCXL cxl from patch.
Patch 6 - Wire CXL component-register BAR
File: hw/vfio/pci.c
Issue: nvidia_unstable-10.1 uses &vdev->pdev in pci_register_bar() calls, not a local pdev variable.
Resolution: Use &vdev->pdev to match the base.
Testing
dpkg-buildpackage -b -uc)