
[linux-nvidia-6.17-next] CXL VFIO: Add CXL Type-2 device passthrough support #407

Open
JiandiAnNVIDIA wants to merge 51 commits into NVIDIA:24.04_linux-nvidia-6.17-next from JiandiAnNVIDIA:cxl-vfio_2026-04-23

@JiandiAnNVIDIA JiandiAnNVIDIA commented May 6, 2026

Description

This patch series adds VFIO CXL Type-2 device passthrough support to the nvidia-6.17 kernel, enabling CXL-capable accelerator devices to be assigned to virtual machines via VFIO. It includes:

  1. VFIO get_region_info refactoring - Upstream series that splits VFIO_DEVICE_GET_REGION_INFO into its own driver op and introduces get_region_info_caps, which is a prerequisite for the CXL VFIO region implementation
  2. VFIO CXL Type-2 passthrough - Manish Honap's series adding CXL awareness to vfio-pci-core, including HDM decoder register emulation, DPA region mapping with demand-fault mmap, CXL DVSEC config virtualization, and CXL region management
  3. VFIO CXL guest-initiated reset - Manish Honap's RFC-v2 series enabling guest-initiated CXL protocol reset with HDM decoder base address preservation and DVSEC STATUS2 virtualization

Key Features Added:

  • CXL Type-2 device detection and initialization within vfio-pci-core
  • HDM decoder register emulation framework for guest access
  • DPA (Device Physical Address) VFIO region with demand-fault mmap and reset zap
  • CXL DVSEC configuration space write virtualization
  • CXL component BAR sparse mmap advertisement to userspace
  • Guest-initiated CXL protocol reset (cxl_dev_reset)
  • HDM decoder base address preservation across reset
  • DVSEC STATUS2 register virtualization in vconfig shadow
  • Module parameter disable_cxl for per-device opt-out
  • UAPI header (include/uapi/cxl/cxl_regs.h) for CXL register defines
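
The sparse mmap advertisement listed above reaches userspace as a capability chain attached to the region info reply. As a rough illustration of how a consumer might walk such a chain, here is a minimal userspace sketch. It assumes the usual VFIO layout, where each capability starts with an id/version/next header and `next` is an offset from the start of the info buffer, with 0 ending the chain. The `demo_` names are hypothetical and are not taken from this series; the header struct is a local mirror of `struct vfio_info_cap_header`, not the real definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header (assumption: same layout). */
struct demo_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next capability, 0 ends the chain */
};

/* Mirrors VFIO_REGION_INFO_CAP_SPARSE_MMAP; hypothetical local name. */
#define DEMO_CAP_SPARSE_MMAP 1

/*
 * Walk the capability chain stored in buf, starting at first_off.
 * Returns the offset of the first capability whose id matches
 * wanted_id, or 0 if the chain ends or runs out of bounds first.
 */
static uint32_t demo_find_cap(const uint8_t *buf, size_t len,
			      uint32_t first_off, uint16_t wanted_id)
{
	uint32_t off = first_off;
	size_t hops = 0;

	assert(buf != NULL);

	/* The hop counter bounds the walk if the chain contains a cycle. */
	while (off != 0 && hops++ < len) {
		struct demo_cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == wanted_id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

In real code the starting offset would come from the `cap_offset` field of the region info reply, and the chain is only present when the reply flags say so.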

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2152222


Justification

VFIO CXL passthrough is required for assigning CXL Type-2 accelerator devices (GPUs, SmartNICs) to virtual machines.


Source

Patch Breakdown (51 patches):

| # | Category | Count | Source |
|---|----------|-------|--------|
| 1 | Upstream VFIO prerequisites (Hisilicon + nvgrace-gpu fix) | 3 | Upstream torvalds/master (merged) |
| 2 | VFIO get_region_info series | 22 | Upstream torvalds/master (merged in v6.19) |
| 3 | Manish Honap's VFIO CXL Type-2 series v2 (19/20, selftest skipped) | 19 | LKML (v2, not yet merged) |
| 4 | Manish Honap's VFIO CXL reset series RFC-v2 | 6 | Internal (RFC-v2, not yet merged) |
| 5 | Config annotations update | 1 | OOT (build config) |
|   | TOTAL | 51 | |

Notes on upstream prerequisites (item 1):

Three upstream commits cherry-picked:

  • 4868d2d52df6 — crypto: hisilicon - qm updates BAR configuration
  • 2131c1517f30 — hisi_acc_vfio_pci: adapt to new migration configuration
  • 767b1ed8b980 — vfio/nvgrace-gpu: fix grammatical error

The first two resolve a dependency for e238f147d517 ("vfio/hisi: Convert
to the get_region_info op"). The third fixes a pre-existing comment typo in
the nvgrace-gpu driver that would otherwise cause a patch-ID mismatch with
upstream 1b0ecb5baf4a ("vfio/pci: Convert all PCI drivers to
get_region_info_caps").

Notes on the VFIO get_region_info series (item 2):

22 upstream commits from Jason Gunthorpe's series, already merged in v6.19:

https://lore.kernel.org/all/0-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com/

These refactor the VFIO region info infrastructure that the CXL VFIO
passthrough series depends on.
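
To make the shape of that refactor concrete, the following is a simplified userspace model, not the kernel code. It assumes only what the commit messages in this PR state: the core decodes the region-info request itself and calls a dedicated op, which fills in the info and may return a capability chain. All `demo_` types, the command number, and the error choices are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GET_REGION_INFO 108u	/* hypothetical command number */

struct demo_region_info {
	uint32_t index;
	uint64_t size;
};

struct demo_caps {
	size_t size;	/* stand-in for a capability chain buffer */
};

struct demo_device;

struct demo_ops {
	/* Generic hook, still used for every other command. */
	int (*ioctl)(struct demo_device *dev, unsigned int cmd, void *arg);
	/* Dedicated op: fill in info, optionally append capabilities. */
	int (*get_region_info_caps)(struct demo_device *dev,
				    struct demo_region_info *info,
				    struct demo_caps *caps);
};

struct demo_device {
	const struct demo_ops *ops;
};

/* Core dispatch: region info no longer passes through the generic hook. */
static int demo_core_ioctl(struct demo_device *dev, unsigned int cmd, void *arg)
{
	assert(dev != NULL && dev->ops != NULL);

	if (cmd == DEMO_GET_REGION_INFO) {
		struct demo_caps caps = { 0 };

		if (!dev->ops->get_region_info_caps)
			return -EINVAL;	/* choice made for this sketch */
		return dev->ops->get_region_info_caps(dev, arg, &caps);
	}
	if (!dev->ops->ioctl)
		return -ENOTTY;
	return dev->ops->ioctl(dev, cmd, arg);
}

/* Example driver op: one 4 KiB region at index 0, no capabilities. */
static int demo_drv_region_info(struct demo_device *dev,
				struct demo_region_info *info,
				struct demo_caps *caps)
{
	(void)dev;
	(void)caps;
	if (info->index != 0)
		return -EINVAL;
	info->size = 0x1000;
	return 0;
}
```

The point of the split in this model is that a driver adding a new region type only has to touch its own op, which is presumably why the CXL region work builds on it.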

Notes on Manish's VFIO CXL series (item 3):

19 out of 20 patches ported from:

https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/

Patch 20/20 (selftests) was skipped as the upstream VFIO selftest
infrastructure (tools/testing/selftests/vfio/) is not present in
the NV-Kernels base.

Conflict resolutions were required for 10 of 19 patches due to the
NV-Kernels base diverging from upstream in two ways:

  1. CXL PCI function declarations (cxl_find_regblock,
    cxl_probe_component_regs, cxl_await_range_active,
    cxl_regblock_get_bar_info) are in include/cxl/pci.h
    unconditionally (per the Srirangan/Alejandro convention from PR #342),
    rather than in include/cxl/cxl.h with CONFIG_CXL_BUS guards
    as Manish's patches expect.
  2. Missing upstream xe driver, dmabuf, and p2pdma support causes
    context mismatches in Kconfig, Makefiles, and VFIO headers.

Notes on Manish's CXL reset series (item 4):

6 patches from internal RFC-v2 posting:

  • [RFC-v2 0/6] vfio/cxl: Guest-initiated CXL protocol reset

Patch 1/6 needed the same conflict resolution as item 3 (declarations
added to include/cxl/pci.h instead of include/cxl/cxl.h).
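
The series summary mentions HDM decoder base address preservation across reset. The sketch below is a toy userspace model of that idea only: it assumes that a reset clears the decoder base registers and that the fix is to snapshot them beforehand and write them back afterwards. It is not the code from the RFC, and every `demo_` name and the decoder limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_DECODERS 4	/* hypothetical limit for this sketch */

/* Hypothetical stand-in for one HDM decoder's base registers. */
struct demo_hdm_decoder {
	uint32_t base_lo;
	uint32_t base_hi;
};

/* Hypothetical device model: live registers plus a saved copy. */
struct demo_cxl_dev {
	struct demo_hdm_decoder live[DEMO_MAX_DECODERS];
	struct demo_hdm_decoder saved[DEMO_MAX_DECODERS];
	size_t nr_decoders;
};

/* Models the assumed hardware side effect: reset clears the bases. */
static void demo_hw_reset(struct demo_cxl_dev *dev)
{
	memset(dev->live, 0, sizeof(dev->live));
}

/* Snapshot the bases, reset, then write the saved values back. */
static void demo_reset_preserving_bases(struct demo_cxl_dev *dev)
{
	assert(dev != NULL);
	assert(dev->nr_decoders <= DEMO_MAX_DECODERS);

	memcpy(dev->saved, dev->live,
	       dev->nr_decoders * sizeof(dev->live[0]));
	demo_hw_reset(dev);
	memcpy(dev->live, dev->saved,
	       dev->nr_decoders * sizeof(dev->live[0]));
}

/* Reassemble the 64-bit base from its two 32-bit halves. */
static uint64_t demo_decoder_base(const struct demo_cxl_dev *dev, size_t i)
{
	assert(dev != NULL && i < dev->nr_decoders);
	return ((uint64_t)dev->live[i].base_hi << 32) | dev->live[i].base_lo;
}
```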

Lore Links:

Upstream Status:

| Series | Status |
|--------|--------|
| 3 upstream prerequisites (Hisilicon + nvgrace-gpu) | ✅ Merged in torvalds/master |
| 22 VFIO get_region_info commits | ✅ Merged in torvalds/master (v6.19) |
| Manish VFIO CXL v2 (19 patches) | ⏳ Under review, not yet merged |
| Manish VFIO CXL reset RFC-v2 (6 patches) | ⏳ Internal, not yet posted upstream |

Testing

Build Validation:

  • Built successfully for ARM64 4K page size kernel
  • Built successfully for ARM64 64K page size kernel

Config Verification:

CXL VFIO config enabled:

CONFIG_VFIO_CXL_CORE=y

Runtime Testing:

  • Boot test on ARM64 system
  • CXL Type-2 device enumeration via VFIO
  • CXL DPA region mmap from guest
  • CXL guest-initiated reset test

Notes

  • CONFIG_VFIO_CXL_CORE is a new bool config enabled for both amd64 and
    arm64. It depends on VFIO_PCI_CORE (module), CXL_BUS (built-in), and
    CXL_MEM (built-in). As a bool, it compiles into the vfio-pci-core module.
  • This series depends on the CXL infrastructure established in PR #342,
    "Add CXL Type-2 device support, RAS error handling, reset, state
    save/restore, and interleaving support"
    (Alejandro's v23, Srirangan's save/restore and reset series).
  • A new UAPI header include/uapi/cxl/cxl_regs.h is introduced for CXL
    component and HDM register defines, using UAPI-safe macros (__GENMASK,
    _BITUL) and raw hex sizes instead of kernel-internal SZ_* macros.
  • Patch 20/20 of Manish's series (CXL Type-2 VFIO assignment selftest) was
    intentionally skipped as the upstream VFIO selftest infrastructure is not
    present in the NV-Kernels base.
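
As an illustration of the UAPI-safe style described in the notes above, here is a small sketch of register defines built from bit and mask helpers and a raw hex size. The `DEMO_` macros are local stand-ins written for this example so that it compiles on its own; they imitate the kernel's `_BITUL()` and `__GENMASK()` helpers but are not copied from them, and the bit positions and names are illustrative rather than taken from `include/uapi/cxl/cxl_regs.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the UAPI helpers _BITUL() and __GENMASK(). */
#define DEMO_BITUL(n)		(1UL << (n))
#define DEMO_GENMASK(h, l) \
	(((~0UL) << (l)) & (~0UL >> (sizeof(unsigned long) * 8 - 1 - (h))))

/* Hypothetical control register layout, for illustration only. */
#define DEMO_CTRL_IG_MASK	DEMO_GENMASK(3, 0)
#define DEMO_CTRL_COMMIT	DEMO_BITUL(9)
#define DEMO_BLOCK_SIZE		0x1000	/* raw hex rather than SZ_4K */

/* Replace the low field of ctrl with ig, leaving other bits alone. */
static uint32_t demo_ctrl_set_ig(uint32_t ctrl, uint32_t ig)
{
	assert((ig & ~DEMO_CTRL_IG_MASK) == 0);
	return (ctrl & ~(uint32_t)DEMO_CTRL_IG_MASK) | ig;
}

/* Report whether the commit bit is set in ctrl. */
static int demo_ctrl_commit_set(uint32_t ctrl)
{
	return (ctrl & DEMO_CTRL_COMMIT) != 0;
}
```

Spelling sizes as hex and avoiding kernel-internal helpers keeps such a header usable from userspace programs, which cannot include kernel-only headers.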

Longfang Liu added 2 commits May 4, 2026 20:44
On new platforms greater than QM_HW_V3, the configuration region for the
live migration function of the accelerator device is no longer
placed in the VF, but is instead placed in the PF.

Therefore, the configuration region of the live migration function
needs to be opened when the QM driver is loaded. When the QM driver
is uninstalled, the driver needs to clear this configuration.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 4868d2d)
Signed-off-by: Jiandi An <jan@nvidia.com>
On new platforms greater than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.

On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.

However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 2131c15)
Signed-off-by: Jiandi An <jan@nvidia.com>

github-actions Bot commented May 6, 2026

PR Validation Report

PR Lint ✅ All checks passed

Details
Checking 51 commits...

Cherry-pick digest:
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 74b6b99bcd80 │ [SAUCE] config: enable config_vfio_cxl_core for cxl type-2 passt │ N/A        │ N/A     │ jan                       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 67c66e735df5 │ [SAUCE] vfio/cxl: implement vfio_cxl_reset()                     │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 14fbdcb4d592 │ [SAUCE] vfio/cxl: virtualize dvsec status2 register in vconfig s │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9e0e291bfc29 │ [SAUCE] vfio/cxl: preserve hdm decoder base addresses across res │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5071d3b07627 │ [SAUCE] vfio/cxl: ensure pci memory space is enabled before post │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0bd9c4c7ab7c │ [SAUCE] vfio/pci: wire cxl dpa reset handling                    │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2d40efbb4f42 │ [SAUCE] cxl: export the cxl reset helpers for vfio users         │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 696f0b100f03 │ docs: vfio-pci: document cxl type-2 device passthrough           │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 595c1ad9c3cf │ vfio/cxl: provide opt-out for cxl feature                        │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9cd924807287 │ vfio/pci: advertise cxl cap and sparse component bar to userspac │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6e2d9e5f273d │ vfio/cxl: register regions with vfio layer                       │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 3ff6c19fc517 │ vfio/cxl: virtualize cxl dvsec config writes                     │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ f5e419121227 │ vfio/cxl: dpa vfio region with demand fault mmap and reset zap   │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 799c46dc1495 │ vfio/cxl: cxl region management support                          │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 537d8a2414cf │ vfio/cxl: wait for hdm ranges and create memdev                  │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 4ab495542be1 │ vfio/cxl: introduce hdm decoder register emulation framework     │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 07d714144702 │ vfio/pci: export config access helpers                           │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 939ebb73d430 │ vfio/cxl: detect cxl dvsec and probe hdm block                   │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 336a1448463a │ vfio/pci: add config_vfio_cxl_core and stub cxl hooks            │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 87b80cc08c26 │ vfio/pci: add cxl state to vfio_pci_core_device                  │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c0f4d247a0e7 │ vfio: uapi for cxl-capable pci device assignment                 │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 947749bd1b8d │ cxl: record bir and bar offset in cxl_register_map               │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 023bae337329 │ cxl: split cxl_await_range_active() from media-ready wait        │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 52ead24ed8ad │ cxl: move component/hdm register defines to uapi/cxl/cxl_regs.h  │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e02c1b7ac02a │ cxl: declare cxl_find_regblock and cxl_probe_component_regs in p │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fd317b86093e │ cxl: add cxl_get_hdm_info() for hdm decoder metadata             │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 54d50bbc6111 │ 56c069307dfd vfio: Remove the get_region_info op                 │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 21085759fbcd │ dc10734610e2 vfio: Move the remaining drivers to get_region_info │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c0ad388ba741 │ 182c62861ba5 vfio/platform: Convert to get_region_info_caps      │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2bf5a2cbb154 │ 1b0ecb5baf4a vfio/pci: Convert all PCI drivers to get_region_inf │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ bc1c993e783d │ 973af0c40eaf vfio/ccw: Convert to get_region_info_caps           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0282af066b10 │ 93165757c023 vfio/gvt: Convert to get_region_info_caps           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 29e1217fd909 │ 45f9fa18109d vfio/mbochs: Convert mbochs to use vfio_info_add_ca │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 7dd77b841190 │ 775f726a742a vfio: Add get_region_info_caps op                   │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e7da10685f7f │ f97859503859 vfio: Require drivers to implement get_region_info  │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6c250ce18f9e │ e664067b6035 vfio/gvt: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 76b5171d117d │ 61b3f7b5a729 vfio/ccw: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 619333df0ce8 │ b9827eff6b4a vfio/cdx: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 8ba94bf6a94e │ 6cdae5d0c326 vfio/fsl: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 073f13c17982 │ d4635df279f5 vfio/platform: Provide a get_region_info op         │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 554dca9a1de1 │ 8339fccda837 vfio/mbochs: Provide a get_region_info op           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0fbfd736592c │ cf16acc0af09 vfio/mdpy: Provide a get_region_info op             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 4df20815cb64 │ 078775527109 vfio/mtty: Provide a get_region_info op             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e54b8e086acd │ f3fddb71dd50 vfio/pci: Fill in the missing get_region_info ops   │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 702622746ce4 │ 5ac720647477 vfio/nvgrace: Convert to the get_region_info op     │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fad0d0d38ca4 │ c044eefa4786 vfio/virtio: Convert to the get_region_info op      │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6b97c1b33bef │ e238f147d517 vfio/hisi: Convert to the get_region_info op        │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 897cefa739f7 │ 113557b04068 vfio: Provide a get_region_info op                  │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 449e051b54c2 │ 767b1ed8b980 vfio/nvgrace-gpu: fix grammatical error             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0c7d38232410 │ 2131c1517f30 hisi_acc_vfio_pci: adapt to new migration configura │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 38c6eb3eed52 │ 4868d2d52df6 crypto: hisilicon - qm updates BAR configuration    │ match      │ match   │ preserved + jan added     │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@JiandiAnNVIDIA JiandiAnNVIDIA changed the title CXL VFIO: Add CXL Type-2 device passthrough support [linux-nvidia-6.17-next] CXL VFIO: Add CXL Type-2 device passthrough support May 6, 2026
Morduan Zang and others added 25 commits May 6, 2026 02:03
The word "as" in the comment should be replaced with "is",
and there is an extra space in the comment.

Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/54E1ED6C5A2682C8+20250814110358.285412-1-zhangdandan@uniontech.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit 767b1ed)
Signed-off-by: Jiandi An <jan@nvidia.com>
Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.

This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO into a function. Later patches will improve
the function signature to consolidate more code.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 113557b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Change the function signature of hisi_acc_vfio_pci_ioctl()
and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(backported from commit e238f14)
[jan: resolve minor conflict in hisi_acc_vfio_pci_ioctl()]
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove virtiovf_vfio_pci_core_ioctl() and change the signature of
virtiovf_pci_ioctl_get_region_info().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit c044eef)
Signed-off-by: Jiandi An <jan@nvidia.com>
Change the signature of nvgrace_gpu_ioctl_get_region_info()

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 5ac7206)
Signed-off-by: Jiandi An <jan@nvidia.com>
Now that every variant driver provides a get_region_info op remove the
ioctl based dispatch from vfio_pci_core_ioctl().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit f3fddb7)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of mtty_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 0787755)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of mdpy_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit cf16acc)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of mbochs_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 8339fcc)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of vfio_platform_ioctl() and re-indent it. Add it to all
platform drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit d4635df)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of vfio_fsl_mc_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 6cdae5d)
Signed-off-by: Jiandi An <jan@nvidia.com>
Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to
the op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit b9827ef)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of vfio_ccw_mdev_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/12-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 61b3f7b)
Signed-off-by: Jiandi An <jan@nvidia.com>
Move it out of intel_vgpu_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit e664067)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the fallback through the ioctl callback, no drivers use this now.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit f978595)
Signed-off-by: Jiandi An <jan@nvidia.com>
This op does the copy to/from user for the info and can return back
a cap chain through a vfio_info_cap * result.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 775f726)
Signed-off-by: Jiandi An <jan@nvidia.com>
This driver open codes the cap chain manipulations. Instead use
vfio_info_add_capability() and the get_region_info_caps() op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/16-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 45f9fa1)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and change info to a pointer.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/17-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 9316575)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and flatten the call chain.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/18-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 973af0c)
Signed-off-by: Jiandi An <jan@nvidia.com>
Since the core function signature changes it has to flow up to all
drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 1b0ecb5)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 182c628)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit dc10734)
Signed-off-by: Jiandi An <jan@nvidia.com>
No driver uses it now, all are using get_region_info_caps().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 56c0693)
Signed-off-by: Jiandi An <jan@nvidia.com>
cxl_probe_component_regs() finds the HDM decoder block during device probe
and caches its location, but does not record the decoder count and does
not expose the result outside drivers/cxl/.

vfio-cxl needs the decoder count and the byte offset and size of the HDM
block without re-running the probe sequence. Record decoder_cnt in
rmap->count when parsing the HDM capability in cxl_probe_component_regs(),
extend struct cxl_reg_map with a count member, and add cxl_get_hdm_info()
to return offset, size, and count from the cached map.

Export under the CXL namespace; stub to -EOPNOTSUPP when CONFIG_CXL_BUS
is off.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…nent_regs in public header

vfio-cxl lives outside drivers/cxl/ but still needs to locate the
component register block and fill cxl_component_reg_map. Those
prototypes were stuck in the internal drivers/cxl/cxl.h.

Move the declarations to include/cxl/cxl.h next to the other
vfio-facing hooks, with stubs when CXL bus support is disabled.
Drop the duplicate prototypes from the private header.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Move cxl_probe_component_regs() to include/cxl/pci.h instead of include/cxl/cxl.h to align with existing Srirangan/Alejandro convention; skip cxl_find_regblock() move as it is already in include/cxl/pci.h; add struct cxl_component_reg_map forward declaration]
Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA force-pushed the cxl-vfio_2026-04-23 branch from 3eede80 to 502020b on May 6, 2026 07:04
@nirmoy
Collaborator

nirmoy commented May 13, 2026

I didn't spot anything concerning.
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>

@nirmoy
Collaborator

nirmoy commented May 15, 2026

Boro watcher review skipped

The GitHub watcher skips automatic boro reviews for PRs with more than 50 commits. This PR currently has 51 commits.

To run the review anyway, ask BaseOS_Kernel_Bot in #baseos-kernel:

review https://github.com/NVIDIA/NV-Kernels/pull/407

Head: 74b6b99bcd80

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher sees a newer PR head.

mmhonap and others added 24 commits May 20, 2026 19:00
…xl/cxl_regs.h

VFIO and other code outside the CXL core needs the same offset/mask
constants the core uses for the component register block and HDM
decoders.

Pull them into a new include/uapi/cxl/cxl_regs.h
(GPL-2.0 WITH Linux-syscall-note) and include it from
include/cxl/cxl.h. Use the uapi-friendly __GENMASK helpers where
needed. Section comments in the new file reference CXL spec r4.0 numbering.

For the UAPI header, replace SZ_64K with its literal value, since the
macro is not available to userspace programs.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Remove defines from include/cxl/cxl.h instead of drivers/cxl/cxl.h as they were already moved there by Srirangan's SAUCE commit, Add #include <asm/bitsperlong.h> needed by __GENMASK() in uapi header]
Signed-off-by: Jiandi An <jan@nvidia.com>
…dy wait

Before accessing CXL device memory after reset/power-on, the driver
must ensure media is ready. Not every CXL device implements the CXL
Memory Device register group (many Type-2 devices do not).
cxl_await_media_ready() reads cxlds->regs.memdev, and accessing the
memory device registers on a Type-2 device may result in a kernel
panic.

Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new function, cxl_await_range_active(). Type-2 devices often
lack the CXLMDEV status register, so they need the range check
without the memdev read. cxl_await_media_ready() now calls
cxl_await_range_active() for the DVSEC poll, then reads the memory
device status as before.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Add cxl_await_range_active() declaration to include/cxl/pci.h unconditionally instead of include/cxl/cxl.h with CONFIG_CXL_BUS guards, consistent with existing convention]
Signed-off-by: Jiandi An <jan@nvidia.com>
The Register Locator DVSEC (CXL 4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR. CXL core currently only
stores the resolved HPA (resource + offset) in struct cxl_register_map,
so callers that need to use pci_iomap() or report the BAR to userspace
must reverse-engineer the BAR from the HPA.

Add bar_index and bar_offset to struct cxl_register_map and fill them
in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can get the BAR
index and offset directly and use pci_iomap() instead of ioremap(HPA). It
returns -EINVAL if the map is not BAR-backed.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Add cxl_regblock_get_bar_info() declaration to include/cxl/pci.h unconditionally instead of include/cxl/cxl.h with CONFIG_CXL_BUS guards, consistent with existing convention, Add BIR range validation (reject BIR >= PCI_STD_NUM_BARS) and bar_index bounds check in cxl_regblock_get_bar_info()]
Signed-off-by: Jiandi An <jan@nvidia.com>
Vendor GPUs and accelerators can expose CXL.mem (HDM-D or HDM-DB)
without using PCI class code 0x0502. VMMs need a stable way to learn
DPA sizing, firmware commit state, and where the extra VFIO regions live.

Add VFIO_DEVICE_FLAGS_CXL (bit 9) and VFIO_DEVICE_INFO_CAP_CXL (cap ID 6).
The capability struct carries:

  hdm_regs_bar_index       PCI BAR containing the component register block
  hdm_regs_offset          byte offset within that BAR to the CXL.mem area
                           (comp_reg_offset + CXL_CM_OFFSET)
  dpa_region_index         VFIO region index for the DPA window
  comp_regs_region_index   VFIO region index for the emulated COMP_REGS

HDM decoder count and the HDM block offset within COMP_REGS are
intentionally absent; both are derivable from the CXL Capability Array at
COMP_REGS offset 0. Locate cap ID 0x5 (HDM) and read bits[31:20] of its
entry for the byte offset. Then read bits[3:0] of the HDM Decoder Capability
register for the count: count = (field == 0) ? 1 : field * 2.
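
As a rough userspace sketch of that derivation: the helper names below are hypothetical and not part of the series, and the cap ID position in bits[15:0] of each entry is an assumption. The offset and count bit positions are the ones quoted above. The caller is assumed to have already read the capability entries out of the COMP_REGS region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CXL_CAP_ID_HDM 0x5u	/* cap ID quoted in the text above */

/* Assumption: the capability ID occupies bits[15:0] of each array entry. */
static uint32_t cap_entry_id(uint32_t entry)
{
	return entry & 0xffffu;
}

/* bits[31:20] of the matching entry: byte offset of the HDM block. */
static uint32_t cap_entry_offset(uint32_t entry)
{
	return (entry >> 20) & 0xfffu;
}

/* bits[3:0] of the HDM Decoder Capability register encode the count. */
static unsigned int hdm_decoder_count(uint32_t hdm_cap)
{
	uint32_t field = hdm_cap & 0xfu;

	return field == 0 ? 1 : field * 2;
}

/*
 * Scan capability entries already read from COMP_REGS and return the HDM
 * block offset, or -1 if no HDM capability is present.
 */
static long find_hdm_offset(const uint32_t *entries, size_t n)
{
	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (cap_entry_id(entries[i]) == CXL_CAP_ID_HDM)
			return (long)cap_entry_offset(entries[i]);
	return -1;
}
```

A VMM would read the HDM Decoder Capability register at the returned offset and pass its value to hdm_decoder_count().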

Two flags accompany the capability:

  VFIO_CXL_CAP_FIRMWARE_COMMITTED
    A decoder covering @dpa_size bytes was programmed and committed by
    platform firmware before device open. The VMM can use the DPA region
    immediately without re-committing.

  VFIO_CXL_CAP_CACHE_CAPABLE
    The device is HDM-DB (CXL.mem + CXL.cache). HDM-DB requires a
    Write-Back Invalidation sequence before FLR to flush dirty cache
    lines; HDM-D (CXL.mem only) does not. QEMU uses this flag to
    schedule WBI and to report Back-Invalidation capability accurately
    in the virtual CXL topology. Mirrors the Cache_Capable bit from
    the CXL DVSEC Capability register.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add struct vfio_pci_cxl_state and hang a pointer to it off
vfio_pci_core_device.  vdev->cxl stays NULL for non-CXL devices, so
existing vfio-pci-core paths just pay a NULL check.

The new struct embeds struct cxl_dev_state by value (CXL core uses
container_of() against this field) and stores pointers to the
cxl_memdev, root decoder, and endpoint decoder that the CXL core
owns.  cxl_region is not introduced here; it is added later when
region management lands.

The series builds the CXL Type-2 passthrough path inside
vfio-pci-core rather than in a separate variant driver.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatch in vfio_pci_core.h; add #include <cxl/pci.h> to vfio_cxl_priv.h for cxl_find_regblock/cxl_probe_component_regs declarations]
Signed-off-by: Jiandi An <jan@nvidia.com>
Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
build rules to compile CXL.mem passthrough infrastructure for
vendor-specific CXL devices into the vfio-pci-core module.  The new
option depends on VFIO_PCI_CORE, CXL_BUS and CXL_MEM.

Wire up the detection and cleanup entry-point stubs in
vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
so that subsequent patches can fill in the CXL-specific logic without
touching the vfio-pci-core flow again.

The vfio_cxl_core.c file added here is an empty skeleton; the actual
CXL detection and initialisation code is introduced in the following
patch to keep this build-system patch reviewable on its own.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in Kconfig, Makefile, and vfio_pci_priv.h due to missing upstream xe/dmabuf support in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
Detect a vendor-specific CXL device at vfio-pci bind time and probe
its HDM decoder register block.

vfio_cxl_create_device_state() allocates per-device state via devm and
reads MEM_CAPABLE and CACHE_CAPABLE from the CXL DVSEC.

vfio_cxl_setup_regs() locates the component register block, temporarily
maps it, calls cxl_probe_component_regs() to find the HDM block, then
releases the mapping.

vfio_pci_cxl_detect_and_init() chains these two steps. If either fails,
vdev->cxl stays NULL and the device falls back to plain vfio-pci.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Use pci_get_dsn() instead of pdev->dev.id for cxlds serial; expand comment explaining why]
Signed-off-by: Jiandi An <jan@nvidia.com>
Promote vfio_raw_config_write() and vfio_raw_config_read() to non-static so
that the CXL DVSEC write handler in the next patch can call them.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
… framework

Add HDM decoder register emulation for CXL devices assigned to a guest.

New file vfio_cxl_emu.c allocates comp_reg_virt[] covering the full
component register block (CXL_COMPONENT_REG_BLOCK_SIZE), snapshots it
from MMIO after probe, and registers a VFIO device region
(VFIO_REGION_SUBTYPE_CXL_COMP_REGS) with read/write ops but no mmap,
so every access hits the emulated buffer and write dispatchers.

vfio_cxl_setup_virt_regs() is called from the tail of
vfio_cxl_setup_regs(); vfio_cxl_clean_virt_regs() runs on cleanup.

HDM decoder register defines come from include/uapi/cxl/cxl_regs.h.
Bits with no hardware equivalent stay in vfio_cxl_priv.h.

hdm_decoder_n_ctrl_write() allows the guest to clear the LOCK bit.
A firmware-committed decoder arrives with LOCK=1; the guest driver
must clear it before reprogramming BASE and SIZE with the VM's GPA.
Such a write clears the bit in the shadow while preserving all other
fields.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve Makefile context mismatch due to missing upstream dmabuf support in NV-Kernels base, Add CTRL LOCK enforcement in BASE_LO/SIZE_LO writes, BI bit masking for non-cache-capable devices, pass max_size to vfio_cxl_setup_virt_regs() for bounds check, add vfio_pci_cxl_cleanup() in registration error path]
Signed-off-by: Jiandi An <jan@nvidia.com>
After the HDM registers are mapped, call cxl_await_range_active() so we
only proceed once the DVSEC ranges report active, without touching the
memdev register group that Type-2 devices may lack.

Re-snapshot the component registers (vfio_cxl_reinit_comp_regs()) once
MEM_ACTIVE is set, so that the firmware's final values (SIZE_HIGH etc.)
land in comp_reg_virt.

Read the committed decoder size from hardware, set the capacity via
cxl_set_capacity(), and call devm_cxl_add_memdev().

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Line offset adjustments only (cascading from 0011 changes)]
Signed-off-by: Jiandi An <jan@nvidia.com>
Region management uses the following CXL core APIs:

CREATE_REGION flow:
1. Validate request (size, decoder availability)
2. Allocate HPA via cxl_get_hpa_freespace()
3. Allocate DPA via cxl_request_dpa()
4. Create region via cxl_create_region() - commits HDM decoder
5. Get HPA range via cxl_get_region_range()

DESTROY_REGION flow:
1. Detach decoder via cxl_decoder_detach()
2. Free DPA via cxl_dpa_free()
3. Release root decoder via cxl_put_root_decoder()

Use DEFINE_FREE scope helpers so error paths unwind cleanly.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Add borrowed-reference comment for precommitted decoders, init region to NULL, don't unregister precommitted regions in teardown]
Signed-off-by: Jiandi An <jan@nvidia.com>
…nd reset zap

Wire the CXL DPA range up as a VFIO demand-paged region so QEMU can
mmap guest device memory directly. Faults call vmf_insert_pfn() to
insert one PFN at a time rather than mapping the full range upfront.

CXL region lifecycle:
- The CXL memory region is registered with the VFIO layer during
  vfio_pci_open_device()
- mmap() establishes the VMA with vm_ops but inserts no PTEs
- Each guest page fault calls vfio_cxl_region_page_fault() which
  inserts a single PFN under the memory_lock read side
- On device reset, vfio_cxl_zap_region_locked() sets region_active=false
  and calls unmap_mapping_range() to invalidate all DPA PTEs atomically
  while holding memory_lock for writing
- Faults racing with reset see region_active==false and return
  VM_FAULT_SIGBUS
- vfio_cxl_reactivate_region() restores region_active after successful
  hardware reset
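
The lifecycle above can be modeled as a small state machine. This is only an illustrative userspace sketch: the struct, enum, and function names are hypothetical, locking is reduced to comments, and a page counter stands in for the installed PTEs.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_result { FAULT_PFN_INSERTED, FAULT_SIGBUS };

struct dpa_region {
	bool region_active;
	unsigned long mapped_pages;	/* stand-in for installed PTEs */
};

/* Fault path: would run under the read side of memory_lock. */
static enum fault_result region_fault(struct dpa_region *r)
{
	assert(r);
	if (!r->region_active)
		return FAULT_SIGBUS;
	r->mapped_pages++;		/* one PFN per fault */
	return FAULT_PFN_INSERTED;
}

/* Reset entry: would run with memory_lock held for writing. */
static void region_zap(struct dpa_region *r)
{
	assert(r);
	r->region_active = false;
	r->mapped_pages = 0;		/* models unmap_mapping_range() */
}

/* Reset exit: re-enable faults only if the hardware reset succeeded. */
static void region_reactivate(struct dpa_region *r, bool reset_ok)
{
	assert(r);
	if (reset_ok)
		r->region_active = true;
}
```

Because the flag flips under the write lock and faults check it under the read lock, a fault can never install a PTE between the zap and the reactivation.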

Also integrate the zap/reactivate calls into vfio_pci_ioctl_reset() so
that FLR correctly invalidates DPA mappings and restores them on success.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in vfio_pci_core.c and vfio_pci_priv.h due to missing upstream dmabuf support in NV-Kernels base, Add vdev back-pointer in cxl_state, hold memory_lock read-side in fault/rw paths, advance *ppos in region rw, add vfio_direct_config_read export and use it instead of vfio_raw_config_read in DVSEC fallback]
Signed-off-by: Jiandi An <jan@nvidia.com>
CXL devices expose DVSEC registers in PCI configuration space.  Several
of them affect device behavior (CXL.io/CXL.mem/CXL.cache enables, lock
state, range bases) and must be virtualised so the guest cannot disturb
host-owned policy.

Add CXL-aware read and write handlers that operate on vdev->vconfig:

  - DVSEC reads come back from the vconfig shadow that vfio_config_init()
    already populates via vfio_ecap_init().
  - DVSEC writes go through per-register handlers (cxl_dvsec_*_write)
    which apply the spec-defined reserved-bit and lock-bit masking
    before updating the shadow.
  - The handlers are wired in via vdev->dvsec_readfn / dvsec_writefn,
    which the global ecap_perms[PCI_EXT_CAP_ID_DVSEC] dispatcher routes
    to when the device is a CXL device.  Non-CXL devices with a DVSEC
    capability fall through to direct hardware access.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in Makefile and vfio_pci_core.h due to missing upstream dmabuf/p2pdma forward declarations in NV-Kernels base, Carry Disable_Caching into Cache WBI hardware write, use vfio_direct_config_read fallback, add byte-aligned read/write routing for DVSEC registers, handle partial-byte W1C writes for STATUS/STATUS2, add PM_INIT_COMPLETION RW1CS handling]
Signed-off-by: Jiandi An <jan@nvidia.com>
Register the DPA and component register regions with the VFIO layer.
The indices of both regions are cached for quick lookup.

vfio_cxl_register_cxl_region()
- memremap(WB) the region HPA (treat CXL.mem as RAM, not MMIO)
- Register VFIO_REGION_SUBTYPE_CXL
- Records dpa_region_idx.

vfio_cxl_register_comp_regs_region()
- Registers VFIO_REGION_SUBTYPE_CXL_COMP_REGS with size
  hdm_reg_offset + hdm_reg_size
- Records comp_reg_region_idx.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Check HDM COMMITTED bit before activating DPA region on precommitted decoders, add pm_runtime/memory-enabled gate in fault and rw paths, split vfio_cxl_zap_dpa() from prepare_reset(), add DPA zap in vfio_pci_zap_and_down_write_memory_lock(), add hot-reset CXL prepare/finish passes]
Signed-off-by: Jiandi An <jan@nvidia.com>
…AR to userspace

Expose CXL device capability through the VFIO device info ioctl and give
userspace mmap access to the GPU/accelerator register windows in the
component BAR while keeping the CXL component register block off-limits
to user mappings.

vfio_cxl_get_info() fills VFIO_DEVICE_INFO_CAP_CXL with the HDM register
BAR index and byte offset, commit flags, and VFIO region indices for the
DPA and COMP_REGS regions.  HDM decoder count and the HDM block offset
within COMP_REGS are not populated; both are derivable from the CXL
Capability Array in the COMP_REGS region itself.

vfio_cxl_get_region_info() handles VFIO_DEVICE_GET_REGION_INFO for the
component register BAR.  It builds a sparse-mmap capability that
advertises only the GPU/accelerator register windows, carving out the
CXL component register block.  Three physical layouts are handled:

  Topology A  comp block at BAR end:    one area [0, comp_reg_offset)
  Topology B  comp block at BAR start:  one area [comp_end, bar_len)
  Topology C  comp block in the middle: two areas, one on each side
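
A minimal sketch of how the three layouts fall out of a single carve-out computation, assuming the component block is already known to lie inside the BAR. The struct and function names are hypothetical and do not reflect the actual vfio_cxl_get_region_info() code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sparse_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Fill out[] with the BAR ranges that stay mmap-able once the component
 * register block [comp_off, comp_off + comp_size) is carved out.
 * Returns the number of areas written: 0, 1, or 2.
 */
static size_t carve_comp_block(uint64_t bar_len, uint64_t comp_off,
			       uint64_t comp_size, struct sparse_area out[2])
{
	uint64_t comp_end;
	size_t n = 0;

	/* Assumed precondition: the block lies entirely inside the BAR. */
	assert(comp_off <= bar_len && comp_size <= bar_len - comp_off);
	comp_end = comp_off + comp_size;

	if (comp_off > 0)		/* window below the block (A, C) */
		out[n++] = (struct sparse_area){ 0, comp_off };
	if (comp_end < bar_len)		/* window above the block (B, C) */
		out[n++] = (struct sparse_area){ comp_end, bar_len - comp_end };

	return n;
}
```

A return value of 0 corresponds to a component block that covers the whole BAR, in which case nothing is left to advertise.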

vfio_cxl_mmap_overlaps_comp_regs() checks whether an mmap request overlaps
[comp_reg_offset, comp_reg_offset + comp_reg_size).  vfio_pci_core_mmap()
calls it to reject mmap of the component register block while allowing
mmap of the GPU register windows in the sparse capability.  This replaces
the earlier blanket rejection of any mmap on the component BAR index.
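
The overlap test itself is a half-open interval intersection. The sketch below illustrates the idea under the assumption that neither range wraps around; the function name and signature are hypothetical and are not the signature of vfio_cxl_mmap_overlaps_comp_regs().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [req_start, req_start + req_len) intersects
 * [comp_off, comp_off + comp_size). Ranges that merely touch do not overlap.
 */
static bool overlaps_comp_regs(uint64_t req_start, uint64_t req_len,
			       uint64_t comp_off, uint64_t comp_size)
{
	/* Assumed precondition: neither range wraps past UINT64_MAX. */
	assert(req_len <= UINT64_MAX - req_start);
	assert(comp_size <= UINT64_MAX - comp_off);

	if (req_len == 0 || comp_size == 0)
		return false;

	return req_start < comp_off + comp_size &&
	       comp_off < req_start + req_len;
}
```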

vfio_pci_bar_rw() applies the same overlap check, so fd pread()/pwrite()
on the component BAR is also rejected when it would touch the component
register subrange.  All access to those registers goes through the
dedicated COMP_REGS region, where the emulated HDM shadow lives.

Hook both helpers into vfio_pci_ioctl_get_info() and
vfio_pci_ioctl_get_region_info() in vfio_pci_core.c.

The component BAR cannot be claimed exclusively since the CXL subsystem
holds persistent sub-range iomem claims during HDM decoder setup.
pci_request_selected_regions() returns EBUSY; pass bars=0 to skip the
request and map directly via pci_iomap().  Physical ownership is assured
by driver binding.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Add BAR bounds check for component block, handle full-BAR component reg case, add bar_mmap_supported gate, block BAR fd read/write and ioeventfd in component reg subrange]
Signed-off-by: Jiandi An <jan@nvidia.com>
Provide an opt-out mechanism to disable CXL support in the vfio module.
The opt-out is available both at build time and at module load time.

At build time, CONFIG_VFIO_CXL_CORE enables or disables CXL support in
the vfio-pci module.

At runtime, use the module parameter disable_cxl. This is a per-device
opt-out on the core device, set by the driver before registration.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatch in vfio_pci.c probe function due to missing upstream pci_ops assignment in NV-Kernels base, Wrap disable_cxl field in #if IS_ENABLED(CONFIG_VFIO_CXL_CORE), update MODULE_PARM_DESC wording]
Signed-off-by: Jiandi An <jan@nvidia.com>
…ough

Add Documentation/driver-api/vfio-pci-cxl.rst describing the architecture,
VFIO interfaces, and operational constraints for CXL Type-2 (cache-coherent
accelerator) passthrough via vfio-pci-core, and link it from the driver-api
index.

The document covers:
- VFIO_DEVICE_FLAGS_CXL and VFIO_DEVICE_INFO_CAP_CXL: what the capability
  struct contains and what the FIRMWARE_COMMITTED and CACHE_CAPABLE flags mean
- How to derive hdm_decoder_offset and hdm_count from the COMP_REGS region
  by traversing the CXL Capability Array to find cap ID 0x5 and reading the
  HDM Decoder Capability register
- Topology-aware sparse mmap on the component BAR (topologies A, B, C
  covering comp block at end, start, or middle of the BAR)
- Two extra VFIO device regions: COMP_REGS for the emulated HDM register
  state and the DPA memory window
- DVSEC config write virtualization: what the guest sees vs. hardware
- FLR coordination: DPA PTEs zapped before reset, restored after

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Rename vfio_cxl_zap_region_locked to vfio_cxl_prepare_reset and vfio_cxl_reactivate_region to vfio_cxl_finish_reset in docs]
Signed-off-by: Jiandi An <jan@nvidia.com>
Export two helpers for VFIO:
  - pci_cxl_reset_capable()
  - cxl_dev_reset()

The change does not alter the reset flow itself, the capability checks,
or the sysfs ABI. It only lifts the helper out of the private path so
later VFIO patches can call the same code.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
This change adds and renames vfio-cxl helpers to better suit the
cxl-reset handling mechanism in later patches.

- Rename the CXL DPA region helpers to prepare_reset() and finish_reset()
  so call sites read as a matched pair around pci_try_reset_function().
  Also call prepare_reset()/finish_reset() around
  pci_try_reset_function() in both the PCIe BCR FLR path and the
  Function FLR path, matching the logic already used on the
  VFIO_DEVICE_RESET ioctl path.

- When pci_try_reset_function() fails: finish_reset() consults the
  hardware COMMITTED state before re-enabling the DPA mapping, so it is
  safe on error and avoids leaving the DPA region wedged off after a
  transient reset failure.

- Add vfio_cxl_reset_capable(), a small wrapper over
  pci_cxl_reset_capable()

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
…e post-reset BAR access

A reset caller may disable Memory Space to quiesce device DMA before
issuing the reset. pci_try_reset_function() saves and restores
PCI_COMMAND around the FLR. If the memory space was disabled before FLR,
it will be restored in disabled state.

vfio_cxl_finish_reset() reads HDM decoder registers through the
component register BAR immediately after reset. Accessing a BAR with
Memory Space disabled produces an Unsupported Request completion; on
platforms that promote UR to a fatal error this triggers DPC.

Add vfio_cxl_enable_memory_space() and call it at the start of
vfio_cxl_finish_reset() before touching any BAR.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
…ss reset

After FLR, reinit_comp_regs() re-reads HDM decoder registers from
hardware into comp_reg_virt[].  Hardware is not all-zeros at this
point: pci_dev_restore() ran first and re-committed the pre-reset
host-physical decoder bases into the registers.  reinit_comp_regs()
therefore overwrites the emulated guest-physical bases that the device
manager programmed with the host-physical bases used by the host CXL
core.  The kernel provides no notification that BASE was overwritten,
so the emulated GPA bases are silently lost.

The same issue affects the CTRL LOCK bit: FLR clears it in hardware
and pci_dev_restore() does not re-apply it, so a decoder that the
guest had locked re-emerges from reset with LOCK clear in shadow.

Add vfio_cxl_reinit_hdm_shadow() which snapshots BASE_LOW, BASE_HIGH,
and the CTRL LOCK bit from the shadow before calling
reinit_comp_regs(), then writes them back after, keeping the emulated
decoder consistent with what the guest programmed.
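
The save, reinit, and restore sequence can be sketched as follows. This is an illustration only: the per-decoder struct is a hypothetical simplification of comp_reg_virt[], a plain struct copy stands in for reinit_comp_regs(), and LOCK being bit 8 of CTRL is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: LOCK is bit 8 of the decoder CTRL register. */
#define HDM_CTRL_LOCK (1u << 8)

struct hdm_decoder_regs {
	uint32_t base_lo;
	uint32_t base_hi;
	uint32_t size_lo;
	uint32_t size_hi;
	uint32_t ctrl;
};

static void reinit_hdm_shadow(struct hdm_decoder_regs *shadow,
			      const struct hdm_decoder_regs *hw, size_t count)
{
	assert(shadow && hw);

	for (size_t i = 0; i < count; i++) {
		/* Snapshot the guest-programmed fields from the shadow. */
		uint32_t base_lo = shadow[i].base_lo;
		uint32_t base_hi = shadow[i].base_hi;
		uint32_t lock = shadow[i].ctrl & HDM_CTRL_LOCK;

		shadow[i] = hw[i];	/* models reinit_comp_regs() */

		/* Write the guest view back over the host values. */
		shadow[i].base_lo = base_lo;
		shadow[i].base_hi = base_hi;
		shadow[i].ctrl = (shadow[i].ctrl & ~HDM_CTRL_LOCK) | lock;
	}
}
```

All other fields, such as SIZE and the remaining CTRL bits, still take the post-reset hardware values.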

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
…nfig shadow

STATUS2 was read directly from hardware while all other DVSEC registers
were served from the vconfig shadow. This created two problems:

1. VOLATILE_HDM_PRES_ERROR (RW1CS, bit 3): guest writes cleared the
   hardware bit but the shadow was not updated, so subsequent reads still
   returned the set bit from hardware (which the hardware had cleared).

2. CXL_RESET_COMPLETE and CXL_RESET_ERROR (bits 1-2): these outcome bits
   will be written by vfio_cxl_reset() into the shadow after a protocol
   reset. Hardware does not update them on its own; serving reads from
   hardware would hide the outcome from the guest.

Add STATUS2 to the read switch so reads come from the shadow, and update
cxl_dvsec_status2_write() to mirror VOLATILE_HDM_PRES_ERROR clears into
the shadow after forwarding to hardware.
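
A minimal sketch of the write-1-to-clear mirroring, covering only the shadow update (forwarding the write to hardware is omitted). The macro and function names are hypothetical; the bit position is the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position quoted in the text above. */
#define STATUS2_VOLATILE_HDM_PRES_ERROR (1u << 3)

/* Mirror a guest write to STATUS2 into the 16-bit shadow copy. */
static void status2_shadow_write(uint16_t *shadow, uint16_t val)
{
	assert(shadow);
	/* Writing 1 clears the bit; writing 0 leaves it unchanged. */
	*shadow &= (uint16_t)~(val & STATUS2_VOLATILE_HDM_PRES_ERROR);
}
```

Masking val first means a guest write cannot disturb the other bits in the shadow, including the reset outcome bits.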

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
Add vfio_cxl_reset() to drive a CXL protocol reset on behalf of a guest.

Unlike cxl_do_reset(), this path skips host memory offlining since the
DPA region is guest memory.  The function takes memory_lock for the full
sequence, calls vfio_cxl_prepare_reset() to zap DPA region PTEs, drives
the hardware via pci_dev_save_and_disable() + cxl_dev_reset() +
pci_dev_restore(), then calls vfio_cxl_finish_reset() to reinitialise
emulated state.

STATUS2 outcome bits (CXL_RESET_COMPLETE / CXL_RESET_ERROR) are written
back to vconfig after the reset so the guest can poll for the result
without reading hardware.  cxl_save_dvsec() / cxl_restore_dvsec() cover
CTRL, CTRL2, range_base_*, and LOCK; STATUS2 is not saved or restored
across the reset, so the hardware value is re-read after restore (it
will have both outcome bits clear) and the outcome is stamped on top.

When the guest writes INIT_CXL_RST into DVSEC CONTROL2, invoke
vfio_cxl_reset() to perform a CXL protocol reset.  The bit is not
forwarded to hardware; cxl_dev_reset() drives the reset sequence
directly.  Silently drop writes on devices that do not advertise
RST_CAPABLE to avoid log noise for the reserved-bit case.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
… passthrough

Enable VFIO CXL core support on amd64 and arm64 to allow CXL Type-2
device passthrough via vfio-pci.

Signed-off-by: Jiandi An <jan@nvidia.com>
@jamieNguyenNVIDIA
Collaborator

The code looks good to me now, so just a few commit-hygiene comments.

These commits seem to all be missing the "(backported from..." or "(cherry picked from..." information:

  2d40efbb4f42  cxl: Export the CXL reset helpers for VFIO users                         
  0bd9c4c7ab7c  vfio/pci: Wire CXL DPA reset handling                          
  5071d3b07627  vfio/cxl: Ensure PCI Memory Space enabled before post-reset BAR access                                  
  9e0e291bfc29  vfio/cxl: preserve HDM decoder base addresses across reset                                              
  14fbdcb4d592  vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow                                           
  67c66e735df5  vfio/cxl: Implement vfio_cxl_reset()    

@JiandiAnNVIDIA
Author

> The code looks good to me now, so just a few commit-hygiene comments.
>
> These commits seem to all be missing the "(backported from..." or "(cherry picked from..." information:
>
>   2d40efbb4f42  cxl: Export the CXL reset helpers for VFIO users
>   0bd9c4c7ab7c  vfio/pci: Wire CXL DPA reset handling
>   5071d3b07627  vfio/cxl: Ensure PCI Memory Space enabled before post-reset BAR access
>   9e0e291bfc29  vfio/cxl: preserve HDM decoder base addresses across reset
>   14fbdcb4d592  vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow
>   67c66e735df5  vfio/cxl: Implement vfio_cxl_reset()

These 6 patches are Manish's vfio cxl reset series, which has not been posted upstream. He sent them to me as a tarball. I believe the series has been posted for internal review via the linux-upstream email alias. There is no source I can specify yet, so I am treating them as out-of-tree NVIDIA-developed patches for now.

@nvmochs
Collaborator

nvmochs commented May 21, 2026

I re-reviewed with codex comparing the latest branch with the snapshot from my prior review. No issues or concerns from me; my prior ack still stands.


6 participants