
Add 64-bit indexing fallback for large multi_tensor_l2norm kernels #1989

Open
SongXiaoXi wants to merge 3 commits into NVIDIA:master from SongXiaoXi:master

Conversation

@SongXiaoXi

Summary

This PR adds a 64-bit indexing fallback for the multi_tensor_l2norm kernel family when any input tensor has numel() above INT_MAX.

The existing int32 fast path is preserved for normal tensor sizes, while large tensors are dispatched to an int64-indexed path.

Problem

Apex's multi-tensor metadata stores tensor sizes in int64, but the l2norm family still narrows sizes and chunk indexing to int32 inside the device functors.

For tensors larger than INT_MAX elements, this can produce incorrect norm results and may also lead to out-of-bounds accesses once chunk offsets overflow 32-bit indexing.

Fix

  • add a shared helper to detect when tensor lists require 64-bit indexing
  • template the l2norm family functors on index type
  • dispatch to int64 indexing only for large tensors
  • preserve the existing int32 fast path for the common case

Affected ops

  • multi_tensor_l2norm
  • multi_tensor_l2norm_mp
  • multi_tensor_l2norm_scale

Testing

Added large-tensor regression tests covering:

  • multi_tensor_l2norm
  • multi_tensor_l2norm_mp
  • multi_tensor_l2norm_scale

The new tests verify correctness for tensors larger than INT_MAX elements while keeping the existing small/normal tensor path unchanged.

Comment thread tests/L0/run_optimizers/test_large_tensor_l2norm.py
Copilot AI (Contributor) left a comment


Pull request overview

Adds a safe 64-bit indexing fallback for Apex’s multi-tensor L2-norm CUDA kernels when tensor sizes exceed INT_MAX, preventing incorrect results and potential OOB accesses while preserving the existing int32 fast path for typical tensor sizes.

Changes:

  • Introduces a shared host-side helper to detect when any tensor list requires 64-bit indexing.
  • Templates the L2-norm kernel functors on an index_t and dispatches to int64_t only when needed.
  • Adds large-tensor regression tests covering multi_tensor_l2norm, multi_tensor_l2norm_mp, and multi_tensor_l2norm_scale.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/L0/run_optimizers/test_large_tensor_l2norm.py: Adds regression tests for tensors with numel() > INT_MAX across the L2-norm kernel family.
  • csrc/multi_tensor_l2norm_scale_kernel.cu: Adds an index_t-templated functor plus runtime dispatch to int64 indexing for large tensors.
  • csrc/multi_tensor_l2norm_kernel_mp.cu: Adds an index_t-templated functor plus runtime dispatch to int64 indexing for large tensors.
  • csrc/multi_tensor_l2norm_kernel.cu: Adds index_t-templated functors plus runtime dispatch (including the unscale and norm_out paths).
  • csrc/multi_tensor_apply.cuh: Adds the tensor_lists_require_64bit_indexing(...) helper used by the updated kernels.


Comment thread tests/L0/run_optimizers/test_large_tensor_l2norm.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


3 participants