CXL Type-2 device passthrough via vfio-pci by JiandiAnNVIDIA · Pull Request #17 · NVIDIA/QEMU

JiandiAnNVIDIA · 2026-05-22T04:37:42Z

CXL Type-2 Device Passthrough via VFIO-PCI

Summary

Port RFC series for CXL Type-2 device passthrough to the
nvidia_unstable-10.1 branch. This enables VFIO-based passthrough of CXL
Type-2 (accelerator) devices with host-managed device memory (DPA) to
arm64 virtual machines.

Upstream series: https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/

Base branch: nvidia_unstable-10.1 (1:10.1.0+nvidia-unstable8-1)

Prerequisites cherry-picked from upstream QEMU

Two upstream commits were cherry-picked as prerequisites because Manish's
series depends on the Error **errp parameter in vfio_region_setup():

#	Upstream SHA	Subject
1	`da02b21cc7`	hw/vfio: sort and validate sparse mmap regions by offset
2	`c42010197e`	vfio: Add Error ** parameter to vfio_region_setup()

Backport note for commit 2: Upstream removed vfio-platform (762c855439)
before this commit landed, so platform.c was not modified in the original.
nvidia_unstable-10.1 still has platform.c, so the vfio_region_setup()
caller there was updated to pass errp.

Patch series (9 patches from the RFC)

#	Subject	Key files
1	hw/arm/virt: Add CXL FMWS PA window for device memory	`hw/arm/virt.c`, `hw/pci-host/gpex-acpi.c`, `include/hw/arm/virt.h`
2	cxl: Add preserve_config to pxb-cxl OSC method	`hw/acpi/cxl.c`, `hw/pci-host/gpex-acpi.c`, `include/hw/acpi/cxl.h`
3	linux-headers: Update vfio.h for CXL Type-2 device passthrough	`linux-headers/linux/vfio.h`
4	hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops	`hw/vfio/region.c`, `include/hw/vfio/vfio-region.h`
5	hw/vfio/pci: Add CXL Type-2 device detection and region setup	`hw/vfio/pci.c`, `hw/vfio/pci.h`, `hw/vfio/trace-events`
6	hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay	`hw/vfio/pci.c`
7	hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices	`hw/vfio/pci.c`, `hw/cxl/cxl-host.c`, `include/hw/cxl/cxl_host.h`
8	hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus	`hw/arm/smmu-common.c`
9	vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions	`hw/vfio/listener.c`

Conflict resolutions during backport

The following conflicts were manually resolved when applying to nvidia_unstable-10.1:

Patch 1 — hw/arm/virt: Add CXL FMWS PA window

File: hw/pci-host/gpex-acpi.c
Issue: #include "hw/acpi/acpi_egm_memory.h" added by nvidia_unstable-10.1
conflicts with new #include "hw/acpi/cxl.h".
Resolution: Keep both includes.

Patch 5 — hw/vfio/pci: Add CXL Type-2 device detection

File: hw/vfio/pci.c
Issue: nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h (not yet
renamed to hw/core/hw-error.h and hw/core/iommu.h).
Resolution: Use old include names matching the base branch.
File: hw/vfio/pci.h
Issue: nvidia_unstable-10.1 has VMChangeStateEntry *vmstate in
VFIOPCIDevice; upstream does not.
Resolution: Keep vmstate from base, add VFIOCXL cxl from patch.

Patch 6 — Wire CXL component-register BAR

File: hw/vfio/pci.c
Issue: nvidia_unstable-10.1 uses &vdev->pdev in pci_register_bar()
calls, not a local pdev variable.
Resolution: Use &vdev->pdev to match the base.

Testing

Build on arm64 (dpkg-buildpackage -b -uc)
Boot arm64 VM with CXL Type-2 device passed through via VFIO
Verify CFMWS window appears in guest ACPI CEDT
Verify HDM decoder 0 is programmed at boot for firmware-committed devices
Verify DPA memory is accessible from guest
Verify SMMUv3 translation works for CXL devices behind pxb-cxl

Sort sparse mmap regions by offset during region setup to ensure a predictable mapping order, avoid overlaps, and handle the gaps between sub-regions properly.

Add validation to detect overlapping sparse regions early during setup, before any mapping operations begin.

The sorting is performed on the sub-region ranges during vfio_setup_region_sparse_mmaps(). This also ensures that subsequent mapping code can rely on sub-regions being in ascending offset order.

This is preparatory work for alignment adjustments needed to support hugepfnmap on systems where device memory (e.g., Grace-based systems) may have non-power-of-2 sizes.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-2-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit da02b21)
Signed-off-by: Jiandi An <jan@nvidia.com>
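The sort-then-validate step described in this commit message can be sketched in standalone C. This is a minimal illustration, not the QEMU code: `SparseRange` and `sparse_ranges_sort_and_validate()` are hypothetical names, and the real implementation operates on QEMU's own sparse mmap structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one sparse mmap sub-region (not the QEMU type). */
typedef struct {
    uint64_t offset;
    uint64_t size;
} SparseRange;

/* qsort() comparator: ascending by offset, without subtraction overflow. */
static int sparse_range_cmp(const void *a, const void *b)
{
    const SparseRange *ra = a;
    const SparseRange *rb = b;

    if (ra->offset < rb->offset) {
        return -1;
    }
    if (ra->offset > rb->offset) {
        return 1;
    }
    return 0;
}

/*
 * Sort ranges by ascending offset, then reject overlapping neighbours.
 * Once sorted, any overlapping pair implies an overlapping adjacent pair,
 * so one linear pass is enough. Returns true when no ranges overlap.
 */
static bool sparse_ranges_sort_and_validate(SparseRange *r, size_t n)
{
    if (n == 0) {
        return true;
    }
    qsort(r, n, sizeof(*r), sparse_range_cmp);
    for (size_t i = 1; i < n; i++) {
        /* Equivalent to prev.offset + prev.size > cur.offset, overflow-safe. */
        if (r[i - 1].size > r[i].offset - r[i - 1].offset) {
            return false;
        }
    }
    return true;
}
```

Validating before any mapping happens means a bad layout from the kernel is rejected without having to unwind partially created mappings.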

Add an Error **errp parameter to vfio_region_setup() and vfio_setup_region_sparse_mmaps() to allow proper error handling instead of just returning error codes. The function sets errors via error_setg() when a failure occurs.

Suggested-by: Cedric Le Goater <clg@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-3-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit c420101)
[jan: [jan: Update vfio_region_setup() caller in platform.c; file was removed upstream (762c855) before this commit landed, but nvidia_unstable-10.1 still has it]]
Signed-off-by: Jiandi An <jan@nvidia.com>

CXL VFIO passthrough needs a stable guest physical address range for device memory (DPA) that falls inside a CFMWS entry the guest discovers from ACPI CEDT. Without a dedicated range in the address map, the HDM decoder has nowhere to point.

Add VIRT_HIGH_CXL_MMIO immediately after the second PCIe MMIO window. It gets its own highmem_cxl_mmio flag in VirtMachineState rather than sharing highmem_cxl, so the two slots are independently controllable even though both are currently tied to CXL bridge presence.

The base and size flow through GPEXConfig.cxl_mmio to acpi_dsdt_add_gpex(), which carves out a QWord memory descriptor in the first CXL root bridge's _CRS. The CFMWS window is system-wide, so only the first CXL bridge gets the descriptor - subsequent ones would produce duplicate resource claims for the same range.

build_crs() already emits the bridge's own 64-bit ranges into crs. The CFMWS window is a separate system-wide range, so only that window is appended as a new QWord descriptor; the bridge ranges are not re-emitted. A warn_report() fires if the CFMWS window overlaps any existing bridge 64-bit range, since that would indicate an address layout conflict.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve #include conflict with hw/acpi/acpi_egm_memory.h added by nvidia_unstable-10.1 base; keep both includes]
Signed-off-by: Jiandi An <jan@nvidia.com>

Before this patch, pxb-cxl bridges had no _DSM method at all. When the OS called _DSM on a CXL host bridge, ACPI returned an error and the OS defaulted to reassigning resources across suspend/resume. On machines where firmware pre-commits the HDM decoder, that reassignment breaks the DPA mapping.

Wire preserve_config through GPEXConfig into build_cxl_osc_method() so pxb-cxl host bridges get a _DSM method that signals the OS to keep resource assignments stable when needed. The _DSM function 5 (preserve firmware PCI configuration) is the mechanism used to convey this.

build_pci_host_bridge_dsm_method() is promoted from static to exported so cxl.c can call it without duplicating the AML. The x86 build_cxl_osc_method() call site passes false since x86 does not use firmware-committed HDM decoders.

build_cxl_osc_method() is renamed to acpi_dsdt_add_cxl_host_bridge_methods(). The function now appends both the CXL _OSC method and the _DSM method, so its old name is misleading. The new name matches the pxb-pcie analogue acpi_dsdt_add_host_bridge_methods(), making the two root bridge code paths symmetric. No AML change.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>

…sthrough

Sync the VFIO UAPI additions from the kernel CXL Type-2 passthrough series.

VFIO_DEVICE_FLAGS_CXL (bit 9) marks a device as CXL Type-2 and guarantees the capability chain includes a vfio_device_info_cap_cxl entry (cap id 6). That capability carries the BAR index holding the CXL component registers, flags for firmware-committed and cache-capable devices, the byte offset to the HDM Decoder Capability block within that BAR, and region indices for both the DPA memory region and the Component Register shadow.

Two new region subtypes:
VFIO_REGION_SUBTYPE_CXL (1): mmappable DPA memory
VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2): HDM decoder shadow, r/w only

Note: UAPI headers are normally kept in sync via scripts/update-linux-headers.sh once upstream kernel changes merge. This patch manually adds the CXL Type-2 additions as a temporary measure to unblock QEMU development. It should be dropped and replaced with a proper header sync once the kernel series is accepted.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>

…ustom region ops

vfio_region_setup() always initializes the region MemoryRegion with vfio_region_ops. CXL needs custom pread/pwrite ops for the Component Register shadow region.

Add vfio_region_setup_with_ops() which accepts a const MemoryRegionOps * parameter. When non-NULL it is passed to memory_region_init_io(); when NULL the existing vfio_region_ops is used. vfio_region_setup() is retained unchanged as a thin wrapper for all existing callers.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
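The "optional ops with a thin wrapper" pattern from this commit message can be shown with simplified stand-in types. `RegionOps`, `Region`, `region_setup_with_ops()` and `region_setup()` below are hypothetical and much smaller than QEMU's MemoryRegionOps and VFIORegion; the sketch only illustrates the NULL-means-default convention.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not QEMU's MemoryRegionOps/VFIORegion. */
typedef struct {
    const char *name;
} RegionOps;

typedef struct {
    const RegionOps *ops;
} Region;

/* Plays the role of the default ops used by all pre-existing callers. */
static const RegionOps default_region_ops = { .name = "default" };

/* Use the caller's ops when given, otherwise fall back to the default. */
static int region_setup_with_ops(Region *region, const RegionOps *ops)
{
    if (!region) {
        return -1;
    }
    region->ops = ops ? ops : &default_region_ops;
    return 0;
}

/* Existing callers keep their signature; the wrapper simply passes NULL. */
static int region_setup(Region *region)
{
    return region_setup_with_ops(region, NULL);
}
```

Keeping the old entry point as a wrapper means no existing call site changes, and only the caller that needs custom behaviour uses the new function.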

…n setup

When VFIO_DEVICE_FLAGS_CXL is set, the kernel has identified a CXL Type-2 device and populated the capability chain with a vfio_device_info_cap_cxl entry. Read that entry to locate the DPA and CXL Component Register shadow regions, then call vfio_region_setup() for each.

DPA covers the device's host-managed memory and is faulted in lazily by the VMM. The CXL Component Register shadow gives the VMM access to the HDM Decoder Capability block so it can intercept decoder commits without touching the hardware register page directly.

vfio_cxl_derive_hdm_info() walks the CXL Capability Array inside the Component Register shadow to find the HDM Decoder capability (ID 0x5) and extracts hdm_decoder_offset and hdm_count. All reads use le32_to_cpu() since the capability array is little-endian per the CXL spec. Dword 0 is the array header; capability entries start at dword 1, which is why the loop begins at i = 1.

CXL register constants are defined here using names that mirror <linux/cxl.h> to make cross-referencing straightforward.

Add the VFIOCXL struct embedded in VFIOPCIDevice.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve include path conflict (nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h not yet renamed to hw/core/); add VMChangeStateEntry vmstate field from base]
Signed-off-by: Jiandi An <jan@nvidia.com>
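The capability-array walk described above might look roughly like the following standalone sketch. The bit layout in the macros (array size in header bits 31:24, capability ID in entry bits 15:0, pointer in entry bits 31:20, decoder count in the low 4 bits of the HDM capability register) is an assumption modelled on the Linux CXL driver's masks and should be checked against the CXL specification. `find_hdm_cap()` and `read_le32()` are hypothetical names, and the count is returned as the raw encoded field, not a decoded decoder count.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field layout, modelled on the Linux CXL driver's masks. */
#define CAP_HDR_ARRAY_SIZE(v)  (((v) >> 24) & 0xffu)   /* header bits 31:24 */
#define CAP_ENTRY_ID(v)        ((v) & 0xffffu)         /* entry bits 15:0 */
#define CAP_ENTRY_PTR(v)       (((v) >> 20) & 0xfffu)  /* entry bits 31:20 */
#define CAP_ID_HDM             0x5u
#define HDM_CAP_COUNT_FIELD(v) ((v) & 0xfu)            /* raw encoded count */

/* Read a little-endian dword from bytes, independent of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Walk the capability array in a component-register shadow buffer and
 * report the HDM Decoder capability offset and its raw count field.
 */
static bool find_hdm_cap(const uint8_t *shadow, size_t len,
                         uint32_t *hdm_offset, uint32_t *count_field)
{
    if (len < 4) {
        return false;
    }
    uint32_t n = CAP_HDR_ARRAY_SIZE(read_le32(shadow));

    /* Dword 0 is the array header, so entries occupy dwords 1..n. */
    for (uint32_t i = 1; i <= n; i++) {
        if ((size_t)i * 4 + 4 > len) {
            return false;
        }
        uint32_t entry = read_le32(shadow + (size_t)i * 4);
        if (CAP_ENTRY_ID(entry) != CAP_ID_HDM) {
            continue;
        }
        uint32_t off = CAP_ENTRY_PTR(entry);
        if ((size_t)off + 4 > len) {
            return false;
        }
        *hdm_offset = off;
        *count_field = HDM_CAP_COUNT_FIELD(read_le32(shadow + off));
        return true;
    }
    return false;
}
```

Bounds-checking both the entry index and the capability pointer against the buffer length keeps a malformed shadow region from causing out-of-range reads.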

…_REGS overlay

The CXL Component Register BAR contains two types of ranges that need different handling:

- Accelerator register windows: passed through as direct hardware mmaps for performance. The kernel reports the real BAR size and lists mmappable windows via VFIO_REGION_INFO_CAP_SPARSE_MMAP, excluding the HDM Decoder Capability block. vfio_region_mmap() creates hardware-backed sub-regions for each sparse area.
- HDM Decoder Capability block: guest accesses must go through emulated ops so QEMU can observe and program decoder state. The kernel blocks direct mmap of this range.

vfio_bar_register(): after the normal mmap path, overlay the COMP_REGS emulation region at hdm_regs_offset with priority 1. In QEMU's MemoryRegion model, overlapping subregions are resolved by priority; the default is 0. Priority 1 ensures guest accesses to the HDM range always dispatch through the emulated COMP_REGS ops regardless of any hardware-backed sub-region at a neighbouring offset.

vfio_pci_bars_exit(): remove the COMP_REGS overlay before the normal BAR teardown path.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve pci_register_bar conflict; nvidia_unstable-10.1 uses &vdev->pdev not local pdev variable]
Signed-off-by: Jiandi An <jan@nvidia.com>

… firmware-committed devices

setup_locked_hdm() runs as a machine_done notifier after all devices have been realized. It programs HDM decoder 0 with the CFMWS base address so the guest can fault into device memory from the first instruction.

The notifier is only registered when the kernel reports the device as firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED). The host is responsible for HDM decoder programming; the guest has no mechanism to remap host physical address mappings.

The function uses cxl->fmws_base (set by the optional cxl-fmws-base device property) if non-zero; otherwise it falls back to the cxl_fmws_base global captured by cxl_fmws_set_memmap() during machine memory-map init. If neither is set, it warns and returns without programming anything.

If COMMIT_LOCK is set in decoder 0 CTRL at machine_done time (left-over from a prior FLR?), it is cleared before writing BASE so the subsequent write is not blocked. COMMIT_LOCK is re-set after programming so the hardware enforces the committed base.

read_region() return is checked; failure aborts programming rather than leaving ctrl uninitialized. All write_region() failures are propagated. The function exits cleanly rather than leaving the decoder half-programmed.

Add cxl_fmws_base as a hwaddr global in cxl-host.c (and a stub in cxl-host-stubs.c). It is set once by cxl_fmws_set_memmap() and read later at machine_done time.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
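The unlock, program, re-lock ordering described above can be illustrated against a mock register file. Everything here is hypothetical: the register indices, the `CTRL_COMMIT_LOCK` bit position, `MockDecoder`, and `program_decoder0()` are invented for the sketch and do not reflect the real HDM decoder layout or the series' helper functions. The mock only models the behaviour the commit message relies on, namely that BASE writes are ignored while the lock bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register indices and lock bit, for illustration only. */
enum { REG_BASE_LO, REG_BASE_HI, REG_CTRL, REG_COUNT };
#define CTRL_COMMIT_LOCK (1u << 8)

typedef struct {
    uint32_t regs[REG_COUNT];
    int fail_write_at;  /* index of the write that fails, -1 for never */
    int writes;         /* number of writes attempted so far */
} MockDecoder;

static bool mock_read(MockDecoder *d, int reg, uint32_t *val)
{
    if (reg < 0 || reg >= REG_COUNT) {
        return false;
    }
    *val = d->regs[reg];
    return true;
}

static bool mock_write(MockDecoder *d, int reg, uint32_t val)
{
    if (reg < 0 || reg >= REG_COUNT || d->writes++ == d->fail_write_at) {
        return false;
    }
    /* Model the lock: BASE writes are silently dropped while it is set. */
    if (reg != REG_CTRL && (d->regs[REG_CTRL] & CTRL_COMMIT_LOCK)) {
        return true;
    }
    d->regs[reg] = val;
    return true;
}

/* Clear the lock if set, write BASE, then set the lock again. */
static bool program_decoder0(MockDecoder *d, uint64_t fmws_base)
{
    uint32_t ctrl;

    if (fmws_base == 0) {
        return false;  /* no base configured: program nothing */
    }
    if (!mock_read(d, REG_CTRL, &ctrl)) {
        return false;  /* never continue with an uninitialized ctrl */
    }
    if (ctrl & CTRL_COMMIT_LOCK) {
        ctrl &= ~CTRL_COMMIT_LOCK;
        if (!mock_write(d, REG_CTRL, ctrl)) {
            return false;
        }
    }
    if (!mock_write(d, REG_BASE_LO, (uint32_t)fmws_base) ||
        !mock_write(d, REG_BASE_HI, (uint32_t)(fmws_base >> 32))) {
        return false;  /* propagate every write failure to the caller */
    }
    return mock_write(d, REG_CTRL, ctrl | CTRL_COMMIT_LOCK);
}
```

Checking the read before using `ctrl` matters because the final write is derived from it; a failed read would otherwise write back an indeterminate value.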

The SMMUv3 primary bus check only accepted pxb-pcie as a valid root. pxb-cxl uses the same PCIe-compatible bus implementation; when it is rejected, CXL devices behind it cannot reach the IOMMU.

Extend the check to also accept CXL buses so SMMUv3 translation applies to passthrough CXL devices. Update the comment above the check to mention pxb-cxl alongside pxb-pcie.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>

…ice regions

vfio_container_region_add() attempts an IOMMU DMA mapping for every RAM section that enters the guest address space. For VFIO mmap-backed regions (PCI BAR windows, CXL.mem regions), this mapping always fails: the backing VMAs carry VM_IO | VM_PFNMAP flags and pin_user_pages() refuses to pin VM_IO pages, so IOMMU_IOAS_MAP returns -EFAULT. CPU access to these regions goes through KVM Stage-2 page faults independently of the SMMU/IOMMU, so no IOMMU entry is required for correct operation.

Add an early return for RAM-device sections owned by a VFIO device. vfio_get_vfio_device(memory_region_owner(section->mr)) returns non-NULL for any mmap subregion created by vfio_region_mmap(), since memory_region_init_ram_device_ptr() propagates the VFIOPCIDevice owner from the containing region.

Matching on ownership covers both normal PCI BAR windows and CXL.mem regions uniformly; non-VFIO RAM-device regions such as NVDIMMs are unaffected and continue through the normal mapping path.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>

nvmochs · 2026-05-22T21:00:48Z

@JiandiAnNVIDIA

bf63a03 vfio: Add Error ** parameter to vfio_region_setup()

Can you change the trailer from “cherry picked from” to “backported from”?

Also, instead of [jan: [jan: …]] can you just use [jan: …]?

Codex found 2 "high" findings (I'm not considering anything lower than high since this series is still WIP and this is an internal branch):

Non-Firmware-Committed Devices Are Accepted But Not Functional

The code sets up CXL VFIO regions for any device with VFIO_DEVICE_FLAGS_CXL. In vfio_cxl_setup(), it creates the DPA memory region and the COMP_REGS region, then derives HDM decoder info. That part happens regardless of
whether the kernel reports VFIO_CXL_CAP_FIRMWARE_COMMITTED.

The actual DPA mapping into guest physical memory only happens in setup_locked_hdm():

memory_region_add_subregion(sys_mem, cxl->fmws_base, cxl->region.mem);
cxl->dpa_in_system_mem = true;

But setup_locked_hdm() is only registered here:

if (cap->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED) {
    cxl->machine_done.notify = setup_locked_hdm;
    qemu_add_machine_init_done_notifier(&cxl->machine_done);
}

So for a non-firmware-committed CXL device, QEMU accepts the device and exposes the component-register overlay, but never inserts the DPA region into the guest address space. Guest HDM decoder writes go through the
COMP_REGS ops and are forwarded with pwrite(), but QEMU does not intercept COMMIT to map DPA later.

If the supported scope is firmware-committed only, this should be enforced explicitly. Otherwise users can boot a VM with a device that appears partially configured but whose memory is not reachable.

Recommended resolution: in vfio_cxl_setup(), fail early unless VFIO_CXL_CAP_FIRMWARE_COMMITTED is set.
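A minimal sketch of the suggested early rejection, written as standalone C so it can be compiled outside QEMU. The flag value, `cxl_setup_check()`, and the error-buffer convention are hypothetical; in QEMU the check would sit in vfio_cxl_setup() and would presumably report through errp with error_setg().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag value; the real one comes from the kernel UAPI header. */
#define CAP_FIRMWARE_COMMITTED (1u << 0)

/*
 * Refuse devices that are not firmware-committed before any region is
 * created, so the guest never sees a partially configured device.
 */
static bool cxl_setup_check(uint32_t cap_flags, char *err, size_t errlen)
{
    if (!(cap_flags & CAP_FIRMWARE_COMMITTED)) {
        snprintf(err, errlen,
                 "CXL device is not firmware-committed; "
                 "passthrough is not supported");
        return false;
    }
    return true;
}
```

Failing at setup time turns an unreachable-memory condition inside the guest into a clear realize-time error on the host.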

CXL machine_done Notifier Is Registered Without Cleanup

Firmware-committed devices register a machine-init-done notifier:

cxl->machine_done.notify = setup_locked_hdm;
qemu_add_machine_init_done_notifier(&cxl->machine_done);

That notifier is stored in QEMU’s global machine_init_done_notifiers list. QEMU provides a matching cleanup API:

qemu_remove_machine_init_done_notifier(&cxl->machine_done);

but this series never calls it.

That matters because vfio_cxl_setup() runs relatively early in vfio_pci_realize(). After the notifier is registered, later realize steps can still fail: config setup, IOMMU attachment, PCI capability setup, quirks,
interrupt setup, display probing, migration setup, and so on. If any of those fail before machine-ready, the device can be torn down while the global notifier list still contains &vdev->cxl.machine_done.

When machine init later completes, QEMU walks that global list and calls the stale notifier. setup_locked_hdm() then does:

VFIOCXL *cxl = container_of(notifier, VFIOCXL, machine_done);
VFIORegion *region = &cxl->comp_regs_region;
...
region->vbasedev->name

If the VFIO device was finalized, that is a use-after-free risk. If it was partially torn down, it can still access invalid VFIO region/device state.

The same lifecycle problem exists for hot-unplug or normal device exit: vfio_exitfn() tears down BARs, interrupts, migration, and IOMMU state, but does not remove the CXL notifier.

Recommended resolution: add a boolean such as machine_done_registered, set it after registration, and remove the notifier in all teardown paths before the CXL/VFIO state is finalized.
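The registration-tracking idea can be sketched with a toy notifier list. The list, `CxlState`, `cxl_register()`, and `cxl_teardown()` are hypothetical stand-ins; in QEMU the add and remove steps would be the qemu_add_machine_init_done_notifier() and qemu_remove_machine_init_done_notifier() calls quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal notifier list standing in for QEMU's global list. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n);
    Notifier *next;
};

static Notifier *done_list;
static int fired;

static void list_add(Notifier *n)
{
    n->next = done_list;
    done_list = n;
}

static void list_remove(Notifier *n)
{
    for (Notifier **p = &done_list; *p; p = &(*p)->next) {
        if (*p == n) {
            *p = n->next;
            n->next = NULL;
            return;
        }
    }
}

static void list_notify_all(void)
{
    for (Notifier *n = done_list; n; n = n->next) {
        n->notify(n);
    }
}

static void on_done(Notifier *n)
{
    (void)n;
    fired++;
}

typedef struct {
    Notifier machine_done;
    bool machine_done_registered;
} CxlState;

static void cxl_register(CxlState *s)
{
    s->machine_done.notify = on_done;
    list_add(&s->machine_done);
    s->machine_done_registered = true;
}

/* Safe to call from every teardown path, including after a failed realize. */
static void cxl_teardown(CxlState *s)
{
    if (s->machine_done_registered) {
        list_remove(&s->machine_done);
        s->machine_done_registered = false;
    }
}
```

The flag makes teardown idempotent, so the error path of realize and the normal exit path can both call it without checking which one ran first.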

ankita-nv and others added 11 commits May 21, 2026 21:04

JiandiAnNVIDIA wants to merge 11 commits into NVIDIA:nvidia_unstable-10.1 from JiandiAnNVIDIA:vfio-cxl-2026-05-21