Fix/ttm deadlock and rdma pin limit #212

Open
chun-wan wants to merge 2670 commits into ROCm:master from chun-wan:fix/ttm-deadlock-and-rdma-pin-limit

@chun-wan

Motivation

Fix two critical issues on MI300X/MI308X with multi-process RDMA workloads under VRAM pressure (items 1 and 2), and add two supporting changes (items 3 and 4):

  1. TTM gtt_window_lock live-lock (amdgpu_ttm.c): Replace mutex_lock() with mutex_lock_interruptible() in amdgpu_ttm_copy_mem_to_mem() to prevent permanent D2H hang.
  2. Unbounded RDMA BO pinning (amdgpu_amdkfd_gpuvm.c, amdgpu_amdkfd.h): Add dedicated rdma_pinned_bytes counter and enforce dmabuf_pin_max_mb to cap RDMA-pinned VRAM per GPU.
  3. PeerDirect error propagation (kfd_peerdirect.c): Distinguish quota rejections (-ENOSPC) from other pin errors.
  4. Debug parameter (amdgpu_drv.c, amdgpu.h): Add rdma_pin_debug module parameter.

Technical Details

Root Cause

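  • amdgpu_ttm_copy_mem_to_mem() uses non-interruptible mutex_lock() on gtt_window_lock. When multiple processes trigger KFD BO eviction and restore at the same time, one process can hold the lock while waiting for another to release a BO, and that other process is waiting for the same lock. Neither can back off → permanent D2H hang, typically after 30-40 min of sustained operation.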
  • amdgpu_amdkfd_gpuvm_pin_bo() has no RDMA-specific accounting or limit → unbounded pin growth starves other processes.

Test Plan

  • Test 1 (dmabuf_pin_max_mb=0): 0 hangs, 0 stalls, 15.7M D2H ops
  • Test 2 (dmabuf_pin_max_mb=512): RDMA pins capped at 512 MB, 126 excess rejected with -ENOSPC, 0 hangs

Test Result

Submission Checklist

Chengjun Yao and others added 30 commits October 28, 2025 02:27
It is caused by commit 1fcbc0b6a8
"drm/amd: Fix hybrid sleep"

Signed-off-by: Chengjun Yao <Chengjun.Yao@amd.com>
Reviewed-by: Bob Zhou <Bob.Zhou@amd.com>
Signed-off-by: Yang Su <Yang.Su2@amd.com>
Signed-off-by: Yang Su <Yang.Su2@amd.com>
Add gpu metrics definition which is only a set of gpu metrics
attributes. A field is encoded by its id, type and number of instances.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
a. hmm_range is either NULL or a valid pointer so we
do not need to set range to NULL ever.

b. keep the hmm_range_free in the end irrespective of
the other conditions to avoid some additional checks
and also avoid double free issue.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
These were not set so soft recovery was inadvertently
disabled.

Fixes: 6ac55ea ("drm/amdgpu: move reset support type checks into the caller")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Update the asic_invalidate_hdp and asic_flush_hdp functions to check
whether the IP function exists; if it does not, return without doing anything.

v2: Use else/if (Kevin)
    Update function name (Lijo)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Move everything to the supported resets masks rather than
having an explicit misc checks for this.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove the NULL check from amdgpu_hmm_range_free for hmm_pfns
as caller is responsible not to call amdgpu_hmm_range_free
more than once.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
…EDID"

This reverts commit 11b66b2.
It caused the Jira ticket SWDEV-563655, so revert it temporarily.

Signed-off-by: Chengjun Yao <Chengjun.Yao@amd.com>
Read CPER raw data from debugfs node "/sys/kernel/debug/dri/*/
amdgpu_ring_cper".

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Change-Id: I01753bf4a1052a22144f6c2758a39d2b91c2212d
Remove amdgpu_asic_flush_hdp & amdgpu_asic_invalidate_hdp functions and
directly use the mapped ones

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Fix the error:
drivers/gpu/drm/amd/amdgpu/../ras/ras_mgr/amdgpu_ras_mgr.c:132:undefined reference to `__udivdi3'

Fixes: b5bae0f01786d ("drm/amd/ras: Add amdgpu ras management function")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202510272144.6SUHUoWx-lkp@intel.com/
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Fix error injection parameter error.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
The IPID register value for bad page threshold CPER holds socket_id info
now according to the latest definition.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Change-Id: I847509f0282e246a171194c4fdbe1dfe0b297bb0
When the EDID of an analog display is not available, we can't
know the possible modes supported by the display. However, we
still need to offer the user to select from a variety of common
modes. It will be up to the user to select the best one, though.

This is how it works on other operating systems as well as the
legacy display code path in amdgpu.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Check if we have an amdgpu_dm_connector->dc_sink first before
adding common modes for analog outputs. If we don't have a
sink yet we can safely skip this.

Fixes: 0c9f9ca99238 ("drm/amd/display: Add common modes to analog displays without EDID")
Signed-off-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Yang Su <Yang.Su2@amd.com>
Signed-off-by: Yang Su <Yang.Su2@amd.com>
v1:
the driver should handle return value of smu_v13_0_6_printk_clk_levels()
to return the correct size for sysfs reads.

v2:
fix the issue of size calculation error in smu_v13_0_6_print_clks()

Fixes: 0354cd650daa ("drm/amd/pm: Avoid writing nulls into `pp_od_clk_voltage`")

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Use the correct label to complete all cleanup work.

Fixes: 4d154b1 ("drm/amd/pm: Add support for DPM policies")
Fixes: d2e690ff5d3cf ("drm/amd/pm: Add temperature metrics sysfs entry")

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Add helper macros to define metrics struct definitions. It will define
structs with field type followed by actual field. A helper macro is also
added to initialize the field encoding for all fields and to initialize
the field members to 0xFFs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Fill and publish GPU metrics in v1.9 format for SMUv13.0.6 SOCs

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
The contents of si_dpm.h seem to have been copied from the
old radeon driver, including a lot of structs and fields which
were only relevant to GPU generations even older than SI.

A lot of these can be deleted without causing much churn to the
actual SI DPM code. Let's delete them to make the code easier
to understand.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No functional modification involved.

./drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c:1674:3-4: Unneeded semicolon.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No functional modification involved.

./drivers/gpu/drm/amd/display/dc/resource/dcn32/dcn32_resource.c:1850:3-4: Unneeded semicolon.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No functional modification involved.

./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:7392:3-4: Unneeded semicolon.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Commit e6a8a00 ("drm/amd/display: Rename dml2 to dml2_0 folder")
renames the directory dml2 to dml2_0 in ./drivers/gpu/drm/amd/display/dc,
but does not adjust the file entry in AMD DISPLAY CORE - DML.

Adjust the file entry after this directory renaming.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Correct valid_bits and ms_chk_bits of section info field for bad page
threshold exceed CPER to match OOB's behavior.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
The severity of CPER for BP threshold exceed event should be set as
FATAL to match the OOB implementation.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Perry Yuan and others added 29 commits February 6, 2026 14:46
Ensure GFX engine is idle before switching PTL state to prevent
register access violations and CP hang. This addresses the race
condition where in-flight GPU commands could conflict with PTL
state changes.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
KIQ access is not guaranteed to work reliably under all reset
situations. Avoid flooding dmesg with HDP flush failure messages.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
sdma ring reset is not supported in SRIOV. kfd driver does not check
reset mask, and could queue sdma ring reset during unmap_queues_cpsch.

Avoid the ring reset for sriov.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
v4:
use func "amdgpu_gfx_get_hdp_flush_mask" to get ref_and_mask for
gfx9 through gfx12.

v3:
Unify the get_ref_and_mask function in amdgpu_gfx_funcs,
to support both GFX11 and earlier generations

v2:
place "get_ref_and_mask" in amdgpu_gfx_funcs instead of amdgpu_ring,
since this function only assigns the cp entry.

v1:
both gfx ring and mes ring use cp0 to flush hdp, cause conflict.

use function get_ref_and_mask to assign the cp entry.
reassign mes to use cp8 instead.

Signed-off-by: chong li <chongli2@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
This patch allows the kfd driver to function correctly when AMD GPU devices
are unplugged and replugged at run time.

When an AMD GPU device is unplugged, the kfd driver stops all queues and then
gracefully terminates existing kfd processes by sending SIGBUS to the user
process. After that, user space can still use the remaining AMD GPU devices.
When all AMD GPU devices in the system have been removed, the kfd driver will
not respond to new requests.

Unplugged AMD GPU devices can be replugged. The kfd driver will use the added
devices to function as usual.

The purpose of this patch is to make the kfd driver behave as expected during
and after AMD GPU device unplug/replug at run time.

Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a482d054b7e7c7f2a35161d79e6629fa0f7f29d1)

Change-Id: Ie33ea428914708546f7f96a627747f01bc6fcfdd
… paths

Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU
register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all()
and amdgpu_fence_driver_hw_fini().

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
(cherry picked from commit 33fe740db26e0c94791dde8b2926d3ab36c9e6ae)

Change-Id: I09895beaff20b3caf125b15e17bc330392552393
This reverts commit c61fab0.

Reason for revert: the patch causes a performance drop, as tested by the CE team. The revert was requested by the management team.

Change-Id: I0db3f9f819554566e259bbb1292e7690db958ced
…ntation

Separate the PTL (Peak Tops Limiter) control logic into a stable public
API layer and an internal implementation layer.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Add F8 and VECTOR to amdgpu_ptl_fmt and PSP format mapping.
Update PTL format strings and GFX format enum to keep PSP/KFD in sync.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Use a bitmap to track PTL disable requests from sysfs and profiler.
PTL is only re-enabled once all sources have released their disable
requests, avoiding premature enablement.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Add a new kernel module parameter 'amdgpu.ptl' to allow
users to enable or disable PTL feature at driver loading time.

Parameter values:
  *) 0 or -1: disable PTL (default)
  *) 1: enable PTL
  *) 2: permanently disable PTL

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Add an explicit check on cmd->resp.status after psp_cmd_submit_buf()
returns to ensure PTL state is only updated on actual success.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
…ces"

This reverts commit 1562589.

Reason for revert: <this patch breaks dkms install>

Change-Id: If54b6c5f703eb136db61770d5ddafcba22bf4620
Invalidating a dmabuf will impact other users of the shared BO.
In the scenario where process A moves the BO, it needs to inform
process B about the move and process B will need to update its
page table.

The commit fixes a synchronisation bug caused by the use of the
ticket: it made amdgpu_vm_handle_moved behave as if updating
the page table immediately was correct but in this case it's not.

An example is the following scenario, with 2 GPUs and glxgears
running on GPU0 and Xorg running on GPU1, on a system where P2P
PCI isn't supported:

glxgears:
  export linear buffer from GPU0 and import using GPU1
  submit frame rendering to GPU0
  submit tiled->linear blit
Xorg:
  copy of linear buffer

The sequence of jobs would be:
  drm_sched_job_run                       # GPU0, frame rendering
  drm_sched_job_queue                     # GPU0, blit
  drm_sched_job_done                      # GPU0, frame rendering
  drm_sched_job_run                       # GPU0, blit
  move linear buffer for GPU1 access      #
  amdgpu_dma_buf_move_notify -> update pt # GPU0

At this point the blit job on GPU0 is still running and would
likely produce a page fault.

Cc: stable@vger.kernel.org
Fixes: a448cb0 ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Firmware and monitoring tools may not be ready to receive a CPER when we
read the bad pages, so send the CPERs at the end of RAS initialization
to ensure that the FW is ready to receive and process the CPER. This
removes the previous CPER submission that was added during bad page
load, and sends both in-band and out-of-band at the same time.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
This patch allows the kfd driver to function correctly when AMD GPU devices
are unplugged and replugged at run time.

When an AMD GPU device is unplugged, the kfd driver stops all queues and then
gracefully terminates existing kfd processes by sending SIGBUS to the user
process. After that, user space can still use the remaining AMD GPU devices.
When all AMD GPU devices in the system have been removed, the kfd driver will
not respond to new requests.

Unplugged AMD GPU devices can be replugged. The kfd driver will use the added
devices to function as usual.

The purpose of this patch is to make the kfd driver behave as expected during
and after AMD GPU device unplug/replug at run time.

Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
GC 9.4.4 uses SPI busy status for idle detection instead of GRBM GUI_ACTIVE.
Add version check to use SPI_BUSY for 9.4.4 while keeping GRBM_STATUS
GUI_ACTIVE check for other GC versions.

v2: move this check into amdgpu_ptl_perf_monitor_ctrl(Lijo)

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Move amdgpu_amdkfd_stop/start_sched calls from kfd_ptl_control()
into amdgpu_ptl_perf_monitor_ctrl() so all PTL callers (KFD ioctl,
sysfs, GFX init) get consistent scheduling management.

Add amdgpu_amdkfd_stop/start_sched_all() wrappers to stop and
restart KFD scheduling on all nodes without assuming node ID ordering.

v3:
 * call start/stop for PTL Set Only
v2:
 * move the stop/start sched function to
   amdgpu_ptl_perf_monitor_ctrl(Lijo)
 * add wrapper amdgpu_amdkfd_stop_sched_all and
   amdgpu_amdkfd_start_sched_all (Lijo)

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Set the AMDGPU_PTL_DISABLE_SYSFS bit in adev->psp.disable_bitmap during
gfx_v9_4_3_perf_monitor_ptl_init(). This ensures that PTL is initially
disabled via the SYSFS mechanism, matching the intended default state
and preventing unintended PTL enablement before explicit user action.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Create PTL sysfs in xgmi_reset_on_init restore path for MINIMAL_XGMI

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Downgrade unhalt_cpsch warning to dev_dbg when sched is already stopped

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Only set the bit when PTL is actually being disabled (state=0)

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
End the function flow when the RAS table checksum is incorrect.

Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Handle RAS eeprom record when UMC_CHANNEL_IDX_V2 is set.

v2: get UMC_CHANNEL_IDX_V2 flag before the clear of it.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Make eeprom data and its counter consistent.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Fix this by skipping the sysfs disable mapping when the GPU is
currently undergoing a reset or suspend flow.
Additionally, add debug logging in psp_ptl_invoke() to better
trace PTL state and format queries/updates cmd.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
…mem_to_mem

When multiple processes trigger KFD BO eviction and subsequent restore
simultaneously, the non-interruptible mutex_lock() on gtt_window_lock in
amdgpu_ttm_copy_mem_to_mem() can cause a live-lock: Process A holds the
lock waiting for Process B to release a BO, while Process B waits for
the same lock. Since mutex_lock() is not interruptible, neither process
can back off, resulting in a permanent D2H hang.

This was observed on MI300X systems running multi-process RDMA workloads
under VRAM pressure, where the hang typically occurs after 30-40 minutes
of sustained operation.

Replace mutex_lock() with mutex_lock_interruptible() so the wait can be
interrupted by signals, returning -ERESTARTSYS to allow the TTM
subsystem to retry or abort gracefully.

Tested on MI300X/MI308X (gfx942) with 8 GPUs under extreme VRAM
pressure (192/196 GB utilized) for 40+ minutes with zero hangs.

Signed-off-by: Chun Wan <chun-wan@amd.com>
RDMA PeerDirect operations pin GPU buffer objects in VRAM, making them
ineligible for eviction. Without any limit, a misbehaving or compromised
RDMA peer can pin all available VRAM, starving other processes and
triggering cascading eviction failures that lead to system hangs.

Add a dedicated atomic64_t rdma_pinned_bytes counter in amdgpu_kfd_dev
to track RDMA-pinned VRAM independently from the general vram_pinned
counter. In amdgpu_amdkfd_gpuvm_pin_bo(), enforce the existing
dmabuf_pin_max_mb module parameter using atomic64_add_return() for
race-free accounting. If the total would exceed the configured limit,
roll back and return -ENOSPC.

Also improve PeerDirect error logging in kfd_peerdirect.c to distinguish
quota rejections (-ENOSPC) from other pin errors, and add a
rdma_pin_debug module parameter for optional runtime logging.

Tested on MI300X/MI308X with 128 RDMA pin attempts (32 GB total):
- Without limit (dmabuf_pin_max_mb=0): 120 pins succeed (30 GB pinned)
- With limit (dmabuf_pin_max_mb=512): only 2 pins succeed (512 MB),
  126 correctly rejected with -ENOSPC in dmesg
- No hangs or GPU resets in either configuration over 40 minutes

Signed-off-by: Chun Wan <chun-wan@amd.com>
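
The add-then-roll-back accounting described in the commit message above can be illustrated in userspace. The following is a minimal sketch, not the driver code: it uses C11 atomics in place of the kernel's atomic64_t, and the names rdma_quota_reserve() and rdma_quota_release() are hypothetical. It assumes, as the test description suggests, that a limit of 0 means no cap.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-GPU rdma_pinned_bytes counter. */
static _Atomic int64_t rdma_pinned_bytes;

/*
 * Reserve `size` bytes against a cap of `limit_mb` megabytes (0 means no cap).
 * Returns 0 on success or -ENOSPC if the new total would exceed the cap.
 */
static int rdma_quota_reserve(int64_t size, int64_t limit_mb)
{
    /* atomic_fetch_add() returns the old value, so old + size is the new
     * total, which is what atomic64_add_return() reports in the kernel. */
    int64_t total = atomic_fetch_add(&rdma_pinned_bytes, size) + size;

    if (limit_mb > 0 && total > limit_mb * 1024 * 1024) {
        /* Over the cap: undo our own contribution and reject. */
        atomic_fetch_sub(&rdma_pinned_bytes, size);
        return -ENOSPC;
    }
    return 0;
}

/* Release bytes reserved earlier, for example when the BO is unpinned. */
static void rdma_quota_release(int64_t size)
{
    atomic_fetch_sub(&rdma_pinned_bytes, size);
}
```

Adding first and checking the returned total means two concurrent callers can never both pass a check against a stale counter value. With a 512 MB cap and 128 requests, assuming equal requests of 256 MB each (32 GB in total), this sketch admits two requests and rejects the other 126.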
amdgpu_amdkfd_gpuvm_pin_bo() increments rdma_pinned_bytes before
amdgpu_bo_pin() when the domain includes VRAM. If pinning succeeds but
the buffer ends up outside VRAM, unpin_bo() never subtracts from
rdma_pinned_bytes (it only does so for TTM_PL_VRAM), leaking quota.

Roll back the pre-accounted bytes in that case.
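
The quota leak described in this last commit message, and its rollback, can be modelled in a few lines. This is a simplified userspace sketch under stated assumptions, not the driver code: struct sim_bo, the SIM_PL_* placements, sim_pin_bo() and sim_unpin_bo() are all hypothetical, and the final placement is passed in by the caller instead of being decided by amdgpu_bo_pin().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TTM placements. */
enum sim_placement { SIM_PL_SYSTEM, SIM_PL_VRAM };

struct sim_bo {
    int64_t size;
    enum sim_placement placed;
};

/* Hypothetical userspace stand-in for the rdma_pinned_bytes counter. */
static _Atomic int64_t rdma_pinned_bytes;

/* `landed` simulates where the pin actually placed the buffer. */
static void sim_pin_bo(struct sim_bo *bo, bool domain_has_vram,
                       enum sim_placement landed)
{
    /* Pre-account before pinning when the requested domain includes VRAM. */
    if (domain_has_vram)
        atomic_fetch_add(&rdma_pinned_bytes, bo->size);

    bo->placed = landed; /* simulated pin result */

    /* The fix: the unpin path only subtracts for VRAM placements, so give
     * the bytes back here if the buffer ended up somewhere else. */
    if (domain_has_vram && bo->placed != SIM_PL_VRAM)
        atomic_fetch_sub(&rdma_pinned_bytes, bo->size);
}

static void sim_unpin_bo(struct sim_bo *bo)
{
    /* Mirrors the described behaviour: subtract only for VRAM. */
    if (bo->placed == SIM_PL_VRAM)
        atomic_fetch_sub(&rdma_pinned_bytes, bo->size);
}
```

Without the rollback branch in sim_pin_bo(), a buffer pinned outside VRAM would leave its size in the counter after unpin, and repeated pins of that kind would eventually exhaust the quota.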