CXL Type-2 device passthrough via vfio-pci #17

Open
JiandiAnNVIDIA wants to merge 11 commits into NVIDIA:nvidia_unstable-10.1 from JiandiAnNVIDIA:vfio-cxl-2026-05-21

@JiandiAnNVIDIA JiandiAnNVIDIA commented May 22, 2026

CXL Type-2 Device Passthrough via VFIO-PCI

Summary

Port RFC series for CXL Type-2 device passthrough to the
nvidia_unstable-10.1 branch. This enables VFIO-based passthrough of CXL
Type-2 (accelerator) devices with host-managed device memory (DPA) to
arm64 virtual machines.

Upstream series: https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/

Base branch: nvidia_unstable-10.1 (1:10.1.0+nvidia-unstable8-1)

Prerequisites cherry-picked from upstream QEMU

Two upstream commits were cherry-picked as prerequisites because Manish's
series depends on the Error **errp parameter in vfio_region_setup():

| # | Upstream SHA | Subject |
|---|--------------|---------|
| 1 | da02b21cc7 | hw/vfio: sort and validate sparse mmap regions by offset |
| 2 | c42010197e | vfio: Add Error ** parameter to vfio_region_setup() |

Backport note for commit 2: Upstream removed vfio-platform (762c855439)
before this commit landed, so platform.c was not modified in the original.
nvidia_unstable-10.1 still has platform.c, so the vfio_region_setup()
caller there was updated to pass errp.

Patch series (9 patches from the RFC)

| # | Subject | Key files |
|---|---------|-----------|
| 1 | hw/arm/virt: Add CXL FMWS PA window for device memory | hw/arm/virt.c, hw/pci-host/gpex-acpi.c, include/hw/arm/virt.h |
| 2 | cxl: Add preserve_config to pxb-cxl OSC method | hw/acpi/cxl.c, hw/pci-host/gpex-acpi.c, include/hw/acpi/cxl.h |
| 3 | linux-headers: Update vfio.h for CXL Type-2 device passthrough | linux-headers/linux/vfio.h |
| 4 | hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops | hw/vfio/region.c, include/hw/vfio/vfio-region.h |
| 5 | hw/vfio/pci: Add CXL Type-2 device detection and region setup | hw/vfio/pci.c, hw/vfio/pci.h, hw/vfio/trace-events |
| 6 | hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay | hw/vfio/pci.c |
| 7 | hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices | hw/vfio/pci.c, hw/cxl/cxl-host.c, include/hw/cxl/cxl_host.h |
| 8 | hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus | hw/arm/smmu-common.c |
| 9 | vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions | hw/vfio/listener.c |

Conflict resolutions during backport

The following conflicts were manually resolved when applying to nvidia_unstable-10.1:

Patch 1 — hw/arm/virt: Add CXL FMWS PA window

  • File: hw/pci-host/gpex-acpi.c
  • Issue: #include "hw/acpi/acpi_egm_memory.h" added by nvidia_unstable-10.1
    conflicts with new #include "hw/acpi/cxl.h".
  • Resolution: Keep both includes.

Patch 5 — hw/vfio/pci: Add CXL Type-2 device detection

  • File: hw/vfio/pci.c

  • Issue: nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h (not yet
    renamed to hw/core/hw-error.h and hw/core/iommu.h).

  • Resolution: Use old include names matching the base branch.

  • File: hw/vfio/pci.h

  • Issue: nvidia_unstable-10.1 has VMChangeStateEntry *vmstate in
    VFIOPCIDevice; upstream does not.

  • Resolution: Keep vmstate from base, add VFIOCXL cxl from patch.

Patch 6 — Wire CXL component-register BAR

  • File: hw/vfio/pci.c
  • Issue: nvidia_unstable-10.1 uses &vdev->pdev in pci_register_bar()
    calls, not a local pdev variable.
  • Resolution: Use &vdev->pdev to match the base.

Testing

  • Build on arm64 (dpkg-buildpackage -b -uc)
  • Boot arm64 VM with CXL Type-2 device passed through via VFIO
  • Verify CFMWS window appears in guest ACPI CEDT
  • Verify HDM decoder 0 is programmed at boot for firmware-committed devices
  • Verify DPA memory is accessible from guest
  • Verify SMMUv3 translation works for CXL devices behind pxb-cxl

ankita-nv and others added 11 commits May 21, 2026 21:04
Sort sparse mmap regions by offset during region setup to ensure a
predictable mapping order, avoid overlaps, and properly handle the
gaps between sub-regions.

Add validation to detect overlapping sparse regions early during
setup before any mapping operations begin.

The sorting is performed on the subregion ranges during
vfio_setup_region_sparse_mmaps(). This also ensures that subsequent
mapping code can rely on subregions being in ascending offset order.
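
The sort-then-validate step can be sketched in standalone C. This is only an illustration of the idea, not the code from the commit: `SparseArea`, `sparse_cmp()`, and `sparse_sort_and_validate()` are hypothetical names, and the real change operates on QEMU's own VFIO region structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one sparse mmap area (offset and size in bytes). */
typedef struct {
    uint64_t offset;
    uint64_t size;
} SparseArea;

/* qsort() comparator: ascending by offset, avoiding overflow-prone subtraction. */
static int sparse_cmp(const void *a, const void *b)
{
    const SparseArea *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort the areas by offset, then reject any area that starts before the
 * previous one ends. Returns true when the layout has no overlaps.
 */
static bool sparse_sort_and_validate(SparseArea *areas, size_t n)
{
    assert(areas != NULL || n == 0);
    if (n == 0) {
        return true;
    }
    qsort(areas, n, sizeof(*areas), sparse_cmp);
    for (size_t i = 1; i < n; i++) {
        /* Sorted order makes this difference non-negative, so it cannot wrap. */
        if (areas[i].offset - areas[i - 1].offset < areas[i - 1].size) {
            return false;
        }
    }
    return true;
}
```

Validating after the sort means only adjacent pairs need to be compared, and any later mapping code can walk the array in ascending order.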

This is preparatory work for alignment adjustments needed to support
hugepfnmap on systems where device memory (e.g., Grace-based systems)
may have non-power-of-2 sizes.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-2-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit da02b21)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add an Error **errp parameter to vfio_region_setup() and
vfio_setup_region_sparse_mmaps() to allow proper error handling
instead of just returning error codes.

The functions set errors via error_setg() when a failure occurs.

Suggested-by: Cedric Le Goater <clg@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260217153010.408739-3-ankita@nvidia.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
(cherry picked from commit c420101)
[jan: [jan: Update vfio_region_setup() caller in platform.c; file was removed upstream (762c855) before this commit landed, but nvidia_unstable-10.1 still has it]]
Signed-off-by: Jiandi An <jan@nvidia.com>
CXL VFIO passthrough needs a stable guest physical address range for
device memory (DPA) that falls inside a CFMWS entry the guest discovers
from ACPI CEDT. Without a dedicated range in the address map, the HDM
decoder has nowhere to point.

Add VIRT_HIGH_CXL_MMIO immediately after the second PCIe MMIO window.
It gets its own highmem_cxl_mmio flag in VirtMachineState rather than
sharing highmem_cxl, so the two slots are independently controllable
even though both are currently tied to CXL bridge presence.

The base and size flow through GPEXConfig.cxl_mmio to
acpi_dsdt_add_gpex(), which carves out a QWord memory descriptor in the
first CXL root bridge's _CRS. The CFMWS window is system-wide, so only
the first CXL bridge gets the descriptor - subsequent ones would
produce duplicate resource claims for the same range.

build_crs() already emits the bridge's own 64-bit ranges into crs.
The CFMWS window is a separate system-wide range, so only that window
is appended as a new QWord descriptor; the bridge ranges are not
re-emitted. A warn_report() fires if the CFMWS window overlaps any
existing bridge 64-bit range, since that would indicate an address
layout conflict.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve #include conflict with hw/acpi/acpi_egm_memory.h added by nvidia_unstable-10.1 base; keep both includes]
Signed-off-by: Jiandi An <jan@nvidia.com>
Before this patch, pxb-cxl bridges had no _DSM method at all. When the
OS called _DSM on a CXL host bridge, ACPI returned an error and the OS
defaulted to reassigning resources across suspend/resume. On machines
where firmware pre-commits the HDM decoder, that reassignment breaks the
DPA mapping.

Wire preserve_config through GPEXConfig into build_cxl_osc_method() so
pxb-cxl host bridges get a _DSM method that signals the OS to keep
resource assignments stable when needed. The _DSM function 5 (preserve
firmware PCI configuration) is the mechanism used to convey this.

build_pci_host_bridge_dsm_method() is promoted from static to exported
so cxl.c can call it without duplicating the AML.

The x86 build_cxl_osc_method() call site passes false since x86 does
not use firmware-committed HDM decoders.

build_cxl_osc_method() is renamed to acpi_dsdt_add_cxl_host_bridge_methods().
The function now appends both the CXL _OSC method and the _DSM method,
so its old name is misleading. The new name matches the pxb-pcie analogue
acpi_dsdt_add_host_bridge_methods(), making the two root bridge code
paths symmetric. No AML change.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…sthrough

Sync the VFIO UAPI additions from the kernel CXL Type-2 passthrough
series.

VFIO_DEVICE_FLAGS_CXL (bit 9) marks a device as CXL Type-2 and
guarantees the capability chain includes a vfio_device_info_cap_cxl
entry (cap id 6). That capability carries the BAR index holding the
CXL component registers, flags for firmware-committed and cache-capable
devices, the byte offset to the HDM Decoder Capability block within
that BAR, and region indices for both the DPA memory region and the
Component Register shadow.

Two new region subtypes:
  VFIO_REGION_SUBTYPE_CXL (1): mmappable DPA memory
  VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2): HDM decoder shadow, r/w only

Note: UAPI headers are normally kept in sync via
scripts/update-linux-headers.sh once upstream kernel changes merge.
This patch manually adds the CXL Type-2 additions as a temporary
measure to unblock QEMU development. It should be dropped and
replaced with a proper header sync once the kernel series is accepted.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…ustom region ops

vfio_region_setup() always initializes the region MemoryRegion with
vfio_region_ops. CXL needs custom pread/pwrite ops for the Component
Register shadow region.

Add vfio_region_setup_with_ops() which accepts a const MemoryRegionOps *
parameter. When non-NULL it is passed to memory_region_init_io(); when
NULL the existing vfio_region_ops is used. vfio_region_setup() is
retained unchanged as a thin wrapper for all existing callers.
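
The wrapper pattern described here can be sketched outside QEMU as follows. `RegionOps`, `Region`, `region_setup()`, and `region_setup_with_ops()` are simplified, hypothetical stand-ins for `MemoryRegionOps`, `VFIORegion`, and the two setup functions; the real signatures take more parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for MemoryRegionOps and VFIORegion. */
typedef struct RegionOps {
    const char *name;
} RegionOps;

typedef struct Region {
    const RegionOps *ops;
} Region;

/* Plays the role of the default vfio_region_ops. */
static const RegionOps default_region_ops = { "default" };

/* New entry point: a non-NULL ops overrides the default, NULL keeps it. */
static int region_setup_with_ops(Region *r, const RegionOps *ops)
{
    assert(r != NULL);
    r->ops = ops ? ops : &default_region_ops;
    return 0;
}

/* Existing entry point, kept as a thin wrapper so callers are unchanged. */
static int region_setup(Region *r)
{
    return region_setup_with_ops(r, NULL);
}
```

Keeping the old function as a wrapper means only the CXL caller needs to know about the new parameter.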

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…n setup

When VFIO_DEVICE_FLAGS_CXL is set, the kernel has identified a CXL
Type-2 device and populated the capability chain with a
vfio_device_info_cap_cxl entry. Read that entry to locate the DPA
and CXL Component Register shadow regions, then call vfio_region_setup()
for each.

DPA covers the device's host-managed memory and is faulted in lazily
by the VMM. The CXL Component Register shadow gives the VMM access to
the HDM Decoder Capability block so it can intercept decoder commits
without touching the hardware register page directly.

vfio_cxl_derive_hdm_info() walks the CXL Capability Array inside the
Component Register shadow to find the HDM Decoder capability (ID 0x5)
and extracts hdm_decoder_offset and hdm_count. All reads use
le32_to_cpu() since the capability array is little-endian per the CXL
spec. Dword 0 is the array header; capability entries start at dword 1,
which is why the loop begins at i = 1.
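
A minimal sketch of such a capability-array walk is shown below. The commit message only states the HDM capability ID (0x5), the little-endian layout, and that entries start at dword 1; the bit positions used here (array size in bits 31:24 of the header, capability ID in bits 15:0 and pointer in bits 31:20 of each entry) are an assumption based on the CXL component register layout, and `read_le32()` and `find_hdm_offset()` are hypothetical helpers rather than the functions in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field positions; check against the CXL spec before relying on them. */
#define CAP_HDR_ARRAY_SIZE(hdr) (((hdr) >> 24) & 0xffu)  /* bits 31:24 */
#define CAP_ENTRY_ID(e)         ((e) & 0xffffu)          /* bits 15:0  */
#define CAP_ENTRY_PTR(e)        (((e) >> 20) & 0xfffu)   /* bits 31:20 */
#define CAP_ID_HDM              0x5u

/* Decode a little-endian dword byte by byte, independent of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Walk the capability array in a register snapshot of len bytes.
 * Dword 0 is the header, so entries are scanned starting at index 1.
 * On success, stores the HDM capability byte offset in *offset.
 */
static bool find_hdm_offset(const uint8_t *regs, size_t len, uint32_t *offset)
{
    assert(regs != NULL && offset != NULL);
    if (len < 4) {
        return false;
    }
    uint32_t count = CAP_HDR_ARRAY_SIZE(read_le32(regs));
    for (uint32_t i = 1; i <= count; i++) {
        if ((size_t)i * 4 + 4 > len) {
            break;  /* entry would lie outside the snapshot */
        }
        uint32_t entry = read_le32(regs + (size_t)i * 4);
        if (CAP_ENTRY_ID(entry) == CAP_ID_HDM) {
            *offset = CAP_ENTRY_PTR(entry);
            return true;
        }
    }
    return false;
}
```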

CXL register constants are defined here using names that mirror
<linux/cxl.h> to make cross-referencing straightforward.

Add the VFIOCXL struct embedded in VFIOPCIDevice.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve include path conflict (nvidia_unstable-10.1 uses hw/hw.h and hw/iommu.h not yet renamed to hw/core/); add VMChangeStateEntry vmstate field from base]
Signed-off-by: Jiandi An <jan@nvidia.com>
…_REGS overlay

The CXL Component Register BAR contains two types of ranges that need
different handling:

  - Accelerator register windows: passed through as direct hardware
    mmaps for performance. The kernel reports the real BAR size and
    lists mmappable windows via VFIO_REGION_INFO_CAP_SPARSE_MMAP,
    excluding the HDM Decoder Capability block. vfio_region_mmap()
    creates hardware-backed sub-regions for each sparse area.

  - HDM Decoder Capability block: guest accesses must go through
    emulated ops so QEMU can observe and program decoder state. The
    kernel blocks direct mmap of this range.

vfio_bar_register(): after the normal mmap path, overlay the COMP_REGS
emulation region at hdm_regs_offset with priority 1. In QEMU's
MemoryRegion model, overlapping subregions are resolved by priority;
the default is 0. Priority 1 ensures guest accesses to the HDM range
always dispatch through the emulated COMP_REGS ops regardless of any
hardware-backed sub-region at a neighbouring offset.
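
The priority rule can be illustrated with a toy model. This is a deliberately simplified sketch, not QEMU's MemoryRegion implementation: `SubRegion` and `resolve()` are hypothetical, and the model only captures "highest priority wins among the subregions covering an address".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat model of a subregion inside a BAR container. */
typedef struct {
    const char *name;
    uint64_t offset;
    uint64_t size;
    int priority;
} SubRegion;

/* Return the highest-priority subregion covering addr, or NULL if none does. */
static const SubRegion *resolve(const SubRegion *subs, size_t n, uint64_t addr)
{
    const SubRegion *best = NULL;

    assert(subs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const SubRegion *s = &subs[i];
        if (addr >= s->offset && addr - s->offset < s->size) {
            if (best == NULL || s->priority > best->priority) {
                best = s;
            }
        }
    }
    return best;
}
```

With a priority-0 hardware-backed window covering the BAR and a priority-1 overlay placed at the HDM offset, an access inside the overlay range resolves to the emulated region and every other access resolves to the hardware window.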

vfio_pci_bars_exit(): remove the COMP_REGS overlay before the normal
BAR teardown path.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
[jan: Resolve pci_register_bar conflict; nvidia_unstable-10.1 uses &vdev->pdev not local pdev variable]
Signed-off-by: Jiandi An <jan@nvidia.com>
… firmware-committed devices

setup_locked_hdm() runs as a machine_done notifier after all devices
have been realized. It programs HDM decoder 0 with the CFMWS base
address so the guest can fault into device memory from the first
instruction.

The notifier is only registered when the kernel reports the device as
firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED). The host is
responsible for HDM decoder programming; the guest has no mechanism to
remap host physical address mappings.

The function uses cxl->fmws_base (set by the optional cxl-fmws-base
device property) if non-zero; otherwise it falls back to the
cxl_fmws_base global captured by cxl_fmws_set_memmap() during machine
memory-map init. If neither is set, it warns and returns without
programming anything.

If COMMIT_LOCK is set in decoder 0 CTRL at machine_done time (possibly
left over from a prior FLR), it is cleared before writing BASE so the
subsequent write is not blocked. COMMIT_LOCK is re-set after programming
so the hardware enforces the committed base.

The read_region() return value is checked; failure aborts programming
rather than leaving ctrl uninitialized. All write_region() failures are
propagated, so the function exits cleanly rather than leaving the decoder
half-programmed.

Add cxl_fmws_base as a hwaddr global in cxl-host.c (and a stub in
cxl-host-stubs.c). It is set once by cxl_fmws_set_memmap() and read
later at machine_done time.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
The SMMUv3 primary bus check only accepted pxb-pcie as a valid root.
pxb-cxl uses the same PCIe-compatible bus implementation; if it is
rejected, CXL devices behind it cannot reach the IOMMU.

Extend the check to also accept CXL buses so SMMUv3 translation applies
to passthrough CXL devices. Update the comment above the check to
mention pxb-cxl alongside pxb-pcie.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…ice regions

vfio_container_region_add() attempts an IOMMU DMA mapping for every
RAM section that enters the guest address space. For VFIO mmap-backed
regions (PCI BAR windows, CXL.mem regions), this mapping always fails:
the backing VMAs carry VM_IO | VM_PFNMAP flags and pin_user_pages()
refuses to pin VM_IO pages, so IOMMU_IOAS_MAP returns -EFAULT.

CPU access to these regions goes through KVM Stage-2 page faults
independently of the SMMU/IOMMU, so no IOMMU entry is required for
correct operation.

Add an early return for RAM-device sections owned by a VFIO device.
vfio_get_vfio_device(memory_region_owner(section->mr)) returns non-NULL
for any mmap subregion created by vfio_region_mmap(), since
memory_region_init_ram_device_ptr() propagates the VFIOPCIDevice owner
from the containing region. Matching on ownership covers both normal
PCI BAR windows and CXL.mem regions uniformly; non-VFIO RAM-device
regions such as NVDIMMs are unaffected and continue through the normal
mapping path.
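
The early-return condition reduces to a two-part predicate, sketched here with a hypothetical `Section` type. In the patch the two inputs come from the memory region itself (whether it is a RAM-device region, and whether vfio_get_vfio_device() finds a VFIO owner); the boolean fields below are stand-ins for those checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the two properties the listener inspects. */
typedef struct {
    bool is_ram_device;  /* region is a RAM-device region */
    bool owner_is_vfio;  /* region owner resolves to a VFIO device */
} Section;

/* Skip the IOMMU mapping only when both conditions hold. */
static bool skip_dma_map(const Section *s)
{
    assert(s != NULL);
    return s->is_ram_device && s->owner_is_vfio;
}
```

Requiring both conditions is what keeps non-VFIO RAM-device regions on the normal mapping path.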

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/all/20260427181235.3003865-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>

nvmochs commented May 22, 2026

@JiandiAnNVIDIA

bf63a03 vfio: Add Error ** parameter to vfio_region_setup()

Can you change the trailer from “cherry picked from” to “backported from” ?

Also, instead of [jan: [jan: …]] can you just use [jan: …] ?


Codex found 2 "high" findings (I'm not considering anything lower than high since this series is still WIP and this is an internal branch):

  1. Non-Firmware-Committed Devices Are Accepted But Not Functional

The code sets up CXL VFIO regions for any device with VFIO_DEVICE_FLAGS_CXL. In vfio_cxl_setup(), it creates the DPA memory region and the COMP_REGS region, then derives HDM decoder info. That part happens regardless of whether the kernel reports VFIO_CXL_CAP_FIRMWARE_COMMITTED.

The actual DPA mapping into guest physical memory only happens in setup_locked_hdm():

memory_region_add_subregion(sys_mem, cxl->fmws_base, cxl->region.mem);
cxl->dpa_in_system_mem = true;

But setup_locked_hdm() is only registered here:

if (cap->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED) {
    cxl->machine_done.notify = setup_locked_hdm;
    qemu_add_machine_init_done_notifier(&cxl->machine_done);
}

So for a non-firmware-committed CXL device, QEMU accepts the device and exposes the component-register overlay, but never inserts the DPA region into the guest address space. Guest HDM decoder writes go through the COMP_REGS ops and are forwarded with pwrite(), but QEMU does not intercept COMMIT to map DPA later.

If the supported scope is firmware-committed only, this should be enforced explicitly. Otherwise users can boot a VM with a device that appears partially configured but whose memory is not reachable.

Recommended resolution: in vfio_cxl_setup(), fail early unless VFIO_CXL_CAP_FIRMWARE_COMMITTED is set.
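
One possible shape for that early check, as a standalone sketch: `cxl_check_supported()` and `SKETCH_CAP_FIRMWARE_COMMITTED` are hypothetical, the bit position is a placeholder rather than the value from vfio.h, and real QEMU code would report the failure through `error_setg()` on `errp` instead of a string pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit; the real value comes from the synced vfio.h. */
#define SKETCH_CAP_FIRMWARE_COMMITTED (1u << 0)

/*
 * Reject devices outside the supported scope before any region is
 * created, so no partially configured device reaches the guest.
 */
static bool cxl_check_supported(uint32_t cap_flags, const char **errmsg)
{
    assert(errmsg != NULL);
    if (!(cap_flags & SKETCH_CAP_FIRMWARE_COMMITTED)) {
        *errmsg = "CXL device is not firmware-committed; passthrough unsupported";
        return false;
    }
    *errmsg = NULL;
    return true;
}
```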

  2. CXL machine_done Notifier Is Registered Without Cleanup

Firmware-committed devices register a machine-init-done notifier:

cxl->machine_done.notify = setup_locked_hdm;
qemu_add_machine_init_done_notifier(&cxl->machine_done);

That notifier is stored in QEMU’s global machine_init_done_notifiers list. QEMU provides a matching cleanup API:

qemu_remove_machine_init_done_notifier(&cxl->machine_done);

but this series never calls it.

That matters because vfio_cxl_setup() runs relatively early in vfio_pci_realize(). After the notifier is registered, later realize steps can still fail: config setup, IOMMU attachment, PCI capability setup, quirks, interrupt setup, display probing, migration setup, and so on. If any of those fail before machine-ready, the device can be torn down while the global notifier list still contains &vdev->cxl.machine_done.

When machine init later completes, QEMU walks that global list and calls the stale notifier. setup_locked_hdm() then does:

VFIOCXL *cxl = container_of(notifier, VFIOCXL, machine_done);
VFIORegion *region = &cxl->comp_regs_region;
...
region->vbasedev->name

If the VFIO device was finalized, that is a use-after-free risk. If it was partially torn down, it can still access invalid VFIO region/device state.

The same lifecycle problem exists for hot-unplug or normal device exit: vfio_exitfn() tears down BARs, interrupts, migration, and IOMMU state, but does not remove the CXL notifier.

Recommended resolution: add a boolean such as machine_done_registered, set it after registration, and remove the notifier in all teardown paths before the CXL/VFIO state is finalized.
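
The flag-guarded register/remove pairing could be sketched like this. The notifier list below is a self-contained toy that stands in for QEMU's machine-init-done list, and `CxlState`, `cxl_register()`, and `cxl_teardown()` are hypothetical names; only the `machine_done_registered` flag comes from the suggestion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy notifier list standing in for QEMU's global machine-init-done list. */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n);
    Notifier *next;
};

static Notifier *done_list;

static void list_add(Notifier *n)
{
    n->next = done_list;
    done_list = n;
}

static void list_remove(Notifier *n)
{
    for (Notifier **pp = &done_list; *pp; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return;
        }
    }
}

static void list_fire(void)
{
    for (Notifier *n = done_list; n; n = n->next) {
        n->notify(n);
    }
}

/* Hypothetical per-device CXL state with the suggested tracking flag. */
typedef struct {
    Notifier machine_done;
    bool machine_done_registered;
    int fired;
} CxlState;

static void on_machine_done(Notifier *n)
{
    CxlState *s = (CxlState *)((char *)n - offsetof(CxlState, machine_done));
    s->fired++;
}

static void cxl_register(CxlState *s)
{
    assert(s != NULL && !s->machine_done_registered);
    s->machine_done.notify = on_machine_done;
    list_add(&s->machine_done);
    s->machine_done_registered = true;
}

/* Safe to call from any teardown path, including more than once. */
static void cxl_teardown(CxlState *s)
{
    assert(s != NULL);
    if (s->machine_done_registered) {
        list_remove(&s->machine_done);
        s->machine_done_registered = false;
    }
}
```

Because the flag makes teardown idempotent, the same helper can be called from the realize error path and from the normal exit path without tracking which one ran first.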
