Add: port comm + deferred completion to a5 onboard #823

Open
jvjhfhg wants to merge 1 commit into hw-native-sys:main from jvjhfhg:feat/comm-a5

@jvjhfhg jvjhfhg (Collaborator) commented May 19, 2026

  • Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY IPC windows). SDMA workspace overlay is added in the follow-up commit so this base alone does not depend on PTO_ISA_ROOT or libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at comm_init -- which keeps non-SDMA comm demos unaffected by the current CANN-9.x SDMA-on-a5 gap.
  • Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on acl_ready_ in finalize(); preserve raw rtDeviceReset for pure rt-layer callers.
  • Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding implementations; comm_* C ABI now comes from comm_hccl.cpp.
  • Upgrade a5 trb deferred-completion runtime from counter-only to pluggable backend-ops design: CompletionCondition gains completion_type/addr/retired fields, CompletionBackendOps table routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler invalidates counter cache lines before polling and retires satisfied conditions.
  • Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant until a kernel registers an SDMA condition; a5 pto-isa already exposes SDMA via PTO_NPU_ARCH_A5).
  • a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on miss).
  • Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier to disambiguate from bisheng's enum class Stride).
  • Enable a5 in allreduce_distributed and test_platform_comm platform marks; parametrize the latter via st_platform.
  • Convert ported runtime headers to #pragma once on both arches so aicore_completion_mailbox.h / pto_completion_token.h / pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-identical across a2a3 and a5.
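
For readers unfamiliar with the backend-ops pattern named in the deferred-completion bullet, here is a minimal host-side sketch of how a per-type ops table might route polling. Only `CompletionCondition`, `CompletionBackendOps`, `COMPLETION_TYPE_COUNTER`, `COMPLETION_TYPE_SDMA_EVENT_RECORD` and the `completion_type`/`addr`/`retired` fields come from this PR description; the `expected` field, the `is_satisfied` hook, the polling predicates and `poll_conditions` are hypothetical stand-ins, and the real scheduler would also invalidate the relevant cache line before each read.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical numbering; only the two enumerator names come from the PR text.
enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT
};

struct CompletionCondition {
    uint32_t completion_type;  // selects the backend entry in the ops table
    uintptr_t addr;            // location the backend polls
    uint64_t expected;         // assumed: target value for counter conditions
    bool retired;              // set once the condition has been satisfied
};

// Hypothetical single-hook table; the real table may carry more operations.
struct CompletionBackendOps {
    bool (*is_satisfied)(const CompletionCondition&);
};

// Assumed counter semantics: satisfied once the counter reaches `expected`.
static bool counter_satisfied(const CompletionCondition& c) {
    const volatile uint64_t* p = reinterpret_cast<const volatile uint64_t*>(c.addr);
    return *p >= c.expected;
}

// Assumed event-record semantics: satisfied once the record is non-zero.
static bool sdma_event_satisfied(const CompletionCondition& c) {
    const volatile uint64_t* p = reinterpret_cast<const volatile uint64_t*>(c.addr);
    return *p != 0;
}

static const CompletionBackendOps kBackendOps[COMPLETION_TYPE_COUNT] = {
    {counter_satisfied},     // COMPLETION_TYPE_COUNTER
    {sdma_event_satisfied},  // COMPLETION_TYPE_SDMA_EVENT_RECORD
};

// Polls every non-retired condition through its backend, retires the ones
// that are satisfied, and returns how many are still pending.
size_t poll_conditions(CompletionCondition* conds, size_t n) {
    size_t pending = 0;
    for (size_t i = 0; i < n; ++i) {
        CompletionCondition& c = conds[i];
        if (c.retired) continue;
        if (c.completion_type >= COMPLETION_TYPE_COUNT) { ++pending; continue; }
        if (kBackendOps[c.completion_type].is_satisfied(c)) {
            c.retired = true;
        } else {
            ++pending;
        }
    }
    return pending;
}
```

Keeping the dispatch in a table indexed by `completion_type` means a new backend only adds one row and one predicate, without touching the polling loop.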

Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all clean. No hardware tests run.
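
The Stride fix listed above follows a general C++ lookup rule, which the sketch below reproduces in isolation. The two `Stride` declarations are hypothetical stand-ins (the real bisheng enum and the real `pto::Stride` are not shown in this PR description); the point is only that a using-directive makes the unqualified name ambiguous, while the `pto::` qualifier resolves it.

```cpp
// Stand-in for a toolchain-provided global enum (hypothetical enumerators).
enum class Stride { Unit, Strided };

namespace pto {
// Stand-in for the library's stride type (hypothetical layout).
struct Stride {
    int row;
    int col;
};
}  // namespace pto

using namespace pto;

// Stride s{1, 1};  // would not compile: reference to 'Stride' is ambiguous

// Qualifying with pto:: selects the library type explicitly.
inline pto::Stride make_stride(int row, int col) {
    return pto::Stride{row, col};
}

// The global enum stays reachable through the :: qualifier.
inline ::Stride global_stride() {
    return ::Stride::Unit;
}
```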

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
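
As a rough illustration of the cache-invalidation concern, the sketch below walks every cache line that overlaps an array before it is read. It is host-runnable and purely illustrative: the 64-byte line size, the `invalidate_range` helper and the `InvalidateLineFn` hook are assumptions, and on the device the hook would be replaced by whatever invalidate primitive the platform provides.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; the real value depends on the target core.
constexpr uintptr_t kCacheLineSize = 64;

// Hypothetical per-line hook standing in for the platform's invalidate call.
using InvalidateLineFn = void (*)(uintptr_t line_addr, void* ctx);

// Invokes `fn` once for each cache line overlapping [base, base + size) and
// returns the number of lines visited. The start address is rounded down to
// a line boundary so a partially covered first line is not skipped.
inline size_t invalidate_range(const void* base, size_t size,
                               InvalidateLineFn fn, void* ctx) {
    if (size == 0) return 0;
    const uintptr_t first = reinterpret_cast<uintptr_t>(base);
    const uintptr_t start = first & ~(kCacheLineSize - 1);
    const uintptr_t end = first + size;
    size_t lines = 0;
    for (uintptr_t a = start; a < end; a += kCacheLineSize) {
        fn(a, ctx);
        ++lines;
    }
    return lines;
}
```

A scheduler following this pattern would call the helper over the whole entries array (base pointer and `count * sizeof(entry)`) before inspecting any entry, so that no entry is read from a stale line.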

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
@jvjhfhg jvjhfhg force-pushed the feat/comm-a5 branch 9 times, most recently from c6524eb to 3f5d0d5 on May 20, 2026 07:50