Fix Half2 UVM performance regression with vectorized store #5499

Open
q10 wants to merge 1 commit into pytorch:main from q10:export-D96381300

q10 commented Mar 19, 2026

Summary:
Apply the vectorized store optimization pattern to the fbgemm_gpu::rocm::Half2 class. This ensures that Half2 store operations use efficient 32-bit memory operations instead of scalar element-by-element accesses.

With UVM (managed memory), each separate store can trigger a page fault, causing significant slowdown. Using vectorized operations reduces this overhead.
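
For illustration only, here is a minimal host-side sketch of the difference between the two store patterns. It is not the code in this diff: `Half2Sketch`, `store_scalar`, and `store_vectorized` are hypothetical names, and the struct is a stand-in that holds raw 16-bit patterns rather than the real fbgemm_gpu::rocm::Half2 definition. Device code would more likely issue the 32-bit write through an aligned pointer cast; `std::memcpy` is used here only to keep the sketch well-defined in portable C++.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a packed pair of 16-bit floats.
// NOT the real fbgemm_gpu::rocm::Half2 definition.
struct alignas(4) Half2Sketch {
  std::uint16_t x;  // raw bits of the first half
  std::uint16_t y;  // raw bits of the second half
};

static_assert(sizeof(Half2Sketch) == sizeof(std::uint32_t),
              "two 16-bit halves must pack into exactly 32 bits");

// Scalar pattern: two separate 16-bit writes to the destination.
inline void store_scalar(std::uint16_t* dst, const Half2Sketch& v) {
  dst[0] = v.x;
  dst[1] = v.y;
}

// Vectorized pattern: pack both halves, then write them as one 32-bit value.
// Assumes dst is 4-byte aligned.
inline void store_vectorized(std::uint16_t* dst, const Half2Sketch& v) {
  std::uint32_t packed;
  std::memcpy(&packed, &v, sizeof(packed));   // reinterpret the pair as 32 bits
  std::memcpy(dst, &packed, sizeof(packed));  // single 32-bit store
}
```

Both functions leave the same bytes at the destination; the vectorized form simply touches memory once instead of twice.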

Reviewed By: henrylhtsang

Differential Revision: D96381300

meta-cla bot added the cla signed label Mar 19, 2026

meta-codesync bot commented Mar 19, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96381300.
