Fix Half2 UVM performance regression with vectorized store by q10 · Pull Request #5499 · pytorch/FBGEMM

q10 · 2026-03-19T23:40:37Z

Summary:
Apply the vectorized store optimization pattern to the fbgemm_gpu::rocm::Half2 class. This ensures that a Half2 store is issued as a single 32-bit memory operation instead of two scalar, element-by-element writes.

With UVM (managed memory), each separate store can trigger a page fault, causing a significant slowdown. Storing both halves in one vectorized operation reduces the number of stores and therefore this overhead.
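
For illustration only, the difference between the two store patterns can be sketched on the host in plain C++. This is not the code from the diff: `Half2Sketch`, `store_scalar`, and `store_vectorized` are hypothetical names, and raw 16-bit patterns stand in for the device half type. The sketch assumes the destination is 4-byte aligned and that the compiler lowers the fixed-size `memcpy` to one 32-bit store, which is typical but not guaranteed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for a pair of half-precision values.
// Raw 16-bit patterns replace the device half type so this compiles anywhere.
struct Half2Sketch {
  std::uint16_t lo;
  std::uint16_t hi;
};
static_assert(sizeof(Half2Sketch) == sizeof(std::uint32_t),
              "the pair must pack into exactly 32 bits");

// Scalar pattern: two separate 16-bit writes to the destination.
inline void store_scalar(std::uint16_t* dst, const Half2Sketch& v) {
  dst[0] = v.lo;
  dst[1] = v.hi;
}

// Vectorized pattern: pack both halves, then write all 32 bits at once.
inline void store_vectorized(std::uint16_t* dst, const Half2Sketch& v) {
  // Assumption: the caller provides a 4-byte-aligned destination.
  assert(reinterpret_cast<std::uintptr_t>(dst) % alignof(std::uint32_t) == 0);
  std::uint32_t packed;
  std::memcpy(&packed, &v, sizeof(packed));   // pack lo and hi into one word
  std::memcpy(dst, &packed, sizeof(packed));  // one fixed-size 32-bit copy
}
```

Both functions leave the same bytes in memory; only the number of store operations differs, which is the property the summary relies on for managed memory.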

Reviewed By: henrylhtsang

Differential Revision: D96381300


meta-codesync · 2026-03-19T23:40:44Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96381300.

meta-cla bot added the cla signed label Mar 19, 2026

facebook-github-tools bot added the module: rocm label Mar 19, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 19, 2026

q10 wants to merge 1 commit into pytorch:main from q10:export-D96381300
