Feat: AICPU launch via dispatcher upload + Mode A type 2 #537
Open
puddingfjz wants to merge 1 commit into
Code Review
This pull request introduces an AicpuLoader abstraction to support both legacy and new CANN 7.0+ interfaces for launching AICPU kernels across the a2a3 and a5 platforms. The implementation includes build system updates, runtime JSON descriptor generation, and integration into the DeviceRunner. Feedback focuses on improving build portability by avoiding hardcoded architecture paths and enhancing the robustness of manual JSON construction. Additionally, the removal of a default parameter in the a2a3 platform's header is identified as a breaking change that violates cross-platform consistency. Suggestions were also made to reduce coupling in the kernel name mapping.
puddingfjz
added a commit
to puddingfjz/simpler
that referenced
this pull request
Apr 13, 2026
- Revert hardcoded aarch64-linux path in CMakeLists.txt, use portable paths
- Restore default parameter for launch_aicpu_num in device_runner.h
- Add documentation explaining JSON construction and name_mapping design
The JSON construction uses manual string concatenation without a library. This is safe because kernel names are controlled strings without special characters, matching pypto's approach for similar AICPU op descriptors. The name_mapping from opType to functionName is specific to the Ascend tile framework kernels and is unlikely to change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
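The safety argument in this commit message (names contain no characters that need escaping) can be enforced rather than assumed. The following is a minimal sketch, not the PR's code: the helper names is_plain_name and make_descriptor are hypothetical, and the descriptor layout is only illustrative. The key names kernelSo and functionName are borrowed from later commit messages in this thread.

```cpp
#include <cassert>
#include <cctype>
#include <initializer_list>
#include <stdexcept>
#include <string>

// True when the name needs no JSON escaping: letters, digits, '_', '.', '-'.
inline bool is_plain_name(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!(std::isalnum(c) || c == '_' || c == '.' || c == '-')) return false;
    }
    return true;
}

// Builds a descriptor by string concatenation, but only after checking that
// every name is plain. The JSON shape here is an assumption for illustration.
inline std::string make_descriptor(const std::string& op_type,
                                   const std::string& kernel_so,
                                   const std::string& function_name) {
    for (const std::string* s : {&op_type, &kernel_so, &function_name}) {
        if (!is_plain_name(*s)) {
            throw std::invalid_argument("name needs escaping: " + *s);
        }
    }
    return "{\"" + op_type + "\":{\"kernelSo\":\"" + kernel_so +
           "\",\"functionName\":\"" + function_name + "\"}}";
}
```

With a check like this, a name containing a quote or backslash fails loudly at construction time instead of silently producing malformed JSON.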
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
…cher
Migrates host-side AICPU launches from Mode A (rtAicpuKernelLaunchExWithArgs) to Mode B (rtsBinaryLoadFromFile + rtsFuncGetByName + rtsLaunchCpuKernel), and removes the tar.gz / sudo pre-deployment step for the AICPU SO.
Bootstrap (one Mode A call per DeviceRunner)
============================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single rtAicpuKernelLaunchExWithArgs targeting CANN's preinstalled libaicpu_extend_kernels.so. libaicpu_extend_kernels writes the dispatcher to its own private path, dlopens it, dlsym's the three CANN contract symbols (Static + DynInit + Dyn) and invokes our DynInit. Our dispatcher Init reads the runtime SO bytes from the extended DeviceArgs (new fields inner_so_bin/inner_so_len at offsets 120/128, which libaicpu_extend_kernels ignores) and writes them to /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so using sched-thread (HwHiAiUser) write permission. The dispatcher SO itself is never persisted to disk, only its transient libaicpu_extend_kernels dlopen.
Per-task launches (direct Mode B, no dispatcher hop)
====================================================
Host computes the same FNV-1a fingerprint locally, generates a JSON descriptor with kernelSo=simpler_inner_<fp>.so and functionName=simpler_aicpu_init / simpler_aicpu_exec (the runtime SO's actual exports), and calls rtsBinaryLoadFromFile + rtsFuncGetByName. LaunchBuiltInOp invokes the runtime SO's symbols directly via rtsLaunchCpuKernel; there is no per-task dispatcher hop and the dispatcher SO is never referenced again.
Multi-runtime in one host process: each DeviceRunner bootstraps with the same dispatcher bytes + its own runtime SO bytes. The dispatcher upload path hits libaicpu_extend_kernels' firstCreatSo_ one-shot latch only once (subsequent calls reuse the cached dlopen, since the content fingerprint is the same); each runtime gets its own JSON registration with a unique opType (symbol_name + fingerprint suffix) so CANN's global op registry doesn't collide.
Reference: PR hw-native-sys#537.
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment, and without per-task indirection through
the dispatcher SO.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself never lands at preinstall.
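The offset contract described above can be pinned down at compile time. This is a minimal sketch under stated assumptions: only the field names inner_so_bin / inner_so_len and their offsets 120/128 come from the commit message; the struct name ExtendedDeviceArgs, the opaque 120-byte prefix, the 64-bit field widths, and the helper read_inner_so are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of the extended DeviceArgs. The first 120 bytes are
// treated as opaque here; only the two trailing fields are modelled.
struct ExtendedDeviceArgs {
    std::uint8_t  cann_prefix[120];  // fields consumed by libaicpu_extend_kernels (opaque in this sketch)
    std::uint64_t inner_so_bin;      // location of the runtime SO bytes (interpretation assumed)
    std::uint64_t inner_so_len;      // runtime SO length in bytes
};

static_assert(offsetof(ExtendedDeviceArgs, inner_so_bin) == 120,
              "inner_so_bin must sit at offset 120");
static_assert(offsetof(ExtendedDeviceArgs, inner_so_len) == 128,
              "inner_so_len must sit at offset 128");

struct InnerSoView {
    std::uint64_t bin;
    std::uint64_t len;
};

// Reads both fields from a raw argument buffer, rejecting buffers that are
// too short to carry the extension or that carry an empty SO.
inline bool read_inner_so(const void* args, std::size_t args_size, InnerSoView* out) {
    if (args == nullptr || out == nullptr || args_size < sizeof(ExtendedDeviceArgs)) {
        return false;
    }
    const auto* bytes = static_cast<const std::uint8_t*>(args);
    std::memcpy(&out->bin, bytes + 120, sizeof(out->bin));
    std::memcpy(&out->len, bytes + 128, sizeof(out->len));
    return out->len != 0;
}
```

Keeping the offsets in static_assert form means a later change to the prefix layout breaks the build instead of silently shifting the fields the dispatcher reads.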
The runtime SO basename embeds an FNV-1a content fingerprint, so two
host processes uploading the same runtime SO produce the same file
(idempotent writes via atomic tmp+rename, no truncation window
visible to concurrent aicpu_scheduler readers). A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
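The three mechanisms in this paragraph (content fingerprint, atomic tmp+rename, process-level cache) can be sketched together. This is an illustration only, with several assumptions: the commit message does not state the FNV-1a width or how the fingerprint is rendered, so the 64-bit variant and 16-digit hex form are guesses, and the helper names fnv1a64, inner_so_basename, atomic_write and first_bootstrap are hypothetical. The rename step assumes POSIX semantics, where rename replaces the target atomically.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

// 64-bit FNV-1a over a byte buffer (width is an assumption).
inline std::uint64_t fnv1a64(const std::uint8_t* data, std::size_t len) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;                // FNV 64-bit prime
    }
    return h;
}

// Builds "simpler_inner_<fp>.so"; the hex rendering is an assumption.
inline std::string inner_so_basename(std::uint64_t fp) {
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(fp));
    return "simpler_inner_" + std::string(buf) + ".so";
}

// Writes to a temporary sibling, then renames it over the target, so a
// concurrent reader sees either the old file or the complete new one.
// A real implementation would want a temp name unique to each writer.
inline bool atomic_write(const std::string& path, const std::vector<std::uint8_t>& bytes) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(bytes.data()),
                  static_cast<std::streamsize>(bytes.size()));
        out.flush();
        if (!out) return false;
    }
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}

// Process-level cache: returns true only the first time a fingerprint is seen.
inline bool first_bootstrap(std::uint64_t fp) {
    static std::mutex mu;
    static std::unordered_set<std::uint64_t> seen;
    std::lock_guard<std::mutex> lock(mu);
    return seen.insert(fp).second;
}
```

Because the file name is derived from the content, two writers racing on the same runtime SO write identical bytes to the same path, which is what makes the rename idempotent.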
Per-task launches (direct Mode A type 2, no dispatcher hop)
===========================================================
Host calls rtAicpuKernelLaunchExWithArgs with kernel_type =
KERNEL_TYPE_AICPU, so_name = "simpler_inner_<fp>.so",
kernel_name = "simpler_aicpu_init" / "simpler_aicpu_exec". The main
aicpu_scheduler dlopens the preinstall file on first invocation and
caches the handle; subsequent launches reuse it. No JSON descriptors,
no rtsBinaryLoadFromFile / rtsFuncGetByName lifecycle, no global op
registry, no per-launch handle bookkeeping.
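The per-task launch needs only three values, which the sketch below assembles. It deliberately does not call rtAicpuKernelLaunchExWithArgs, whose signature is not shown in this thread. The enum KernelType, the enum Phase, the struct LaunchParams and the function make_task_launch are hypothetical stand-ins; only the so_name and kernel_name strings come from the commit message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the CANN kernel-type constants.
enum class KernelType { Aicpu, AicpuKfc };

// Which of the two runtime SO exports a launch targets.
enum class Phase { Init, Exec };

struct LaunchParams {
    KernelType  kernel_type;
    std::string so_name;
    std::string kernel_name;
};

// Assembles the per-task parameters from the fingerprint string used in the
// uploaded file name. No descriptor file or function handle is involved.
inline LaunchParams make_task_launch(const std::string& fp_hex, Phase phase) {
    if (fp_hex.empty()) {
        throw std::invalid_argument("fingerprint must not be empty");
    }
    LaunchParams p;
    p.kernel_type = KernelType::Aicpu;  // per-task launches use the plain AICPU type
    p.so_name     = "simpler_inner_" + fp_hex + ".so";
    p.kernel_name = (phase == Phase::Init) ? "simpler_aicpu_init" : "simpler_aicpu_exec";
    return p;
}
```

The point of the design is visible in how little state this carries: the same three values are recomputed for each launch, and the scheduler side owns the dlopen handle.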
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
- Deletes the legacy AicpuLoader stub (src/{a2a3,a5}/platform/onboard/
host/aicpu_loader.{cpp,h}) — its only role was the OFF-path
fallback and nothing tested that path.
- Skips so_info_ allocation on the new path (the runtime SO no longer
reads device_args.aicpu_so_bin / aicpu_so_len). Saves ~inner-SO-size
device memory per DeviceRunner; previously this accumulated across
many ChipWorker/DeviceRunner instances and triggered AICORE OOM in
long test sessions.
- Widens the aicpu_op_timeout regression test to accept the new error
code surfaced by Mode A type 2 (the dispatcher / main aicpu_scheduler
path can race the STARS watchdog and return 507018/507000 before the
AICore stream sync emits 507046).
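The widened timeout check in the last bullet amounts to accepting a small set of codes instead of one. A minimal sketch, assuming the test compares a returned integer code; the names kAcceptedTimeoutCodes and is_accepted_timeout_code are hypothetical, and the actual regression test may be written in another language.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Codes named in the commit message: 507046 was the original expectation,
// 507018 and 507000 can now surface first when the launch path races the watchdog.
constexpr std::array<int, 3> kAcceptedTimeoutCodes = {507018, 507000, 507046};

// True when the code is one of the outcomes the timeout test tolerates.
inline bool is_accepted_timeout_code(int code) {
    return std::find(kAcceptedTimeoutCodes.begin(), kAcceptedTimeoutCodes.end(), code) !=
           kAcceptedTimeoutCodes.end();
}
```

Listing the codes explicitly keeps the test strict about everything else: a success code or an unrelated failure still fails the check.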
Reference: PR hw-native-sys#537.
Summary
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without tar.gz / sudo pre-deployment, and without per-task indirection through the dispatcher SO.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
Host bundles dispatcher SO bytes + runtime SO bytes into a single rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC) targeting CANN's preinstalled libaicpu_extend_kernels.so. libaicpu_extend_kernels dlopens our dispatcher and invokes its Init; the dispatcher reads the runtime SO bytes from extended DeviceArgs (new inner_so_bin / inner_so_len fields at offsets 120/128, which libaicpu_extend_kernels ignores) and writes them to:
…
using sched-thread (HwHiAiUser) write permission. The dispatcher SO itself never lands at preinstall.
The runtime SO basename embeds an FNV-1a content fingerprint, so two host processes uploading the same runtime SO produce the same file. Writes go via atomic tmp+rename, so no truncation window is visible to concurrent aicpu_scheduler readers. A process-level fingerprint cache in LoadAicpuOp skips redundant libaicpu_extend_kernels invocations within a single host process: each runtime is bootstrapped at most once per process.
Per-task launches (direct Mode A type 2, no dispatcher hop)
Host calls rtAicpuKernelLaunchExWithArgs with kernel_type = KERNEL_TYPE_AICPU, so_name = "simpler_inner_<fp>.so", kernel_name = "simpler_aicpu_init" / "simpler_aicpu_exec". The main aicpu_scheduler dlopens the preinstall file on first invocation and caches the handle; subsequent launches reuse it.
No JSON descriptors, no rtsBinaryLoadFromFile / rtsFuncGetByName lifecycle, no global op registry, no per-launch handle bookkeeping.
Cleanup
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
- Deletes the legacy AicpuLoader stub (src/{a2a3,a5}/platform/onboard/host/aicpu_loader.{cpp,h}); its only role was the OFF-path fallback and nothing tested that path.
- Skips so_info_ allocation on the new path (the runtime SO no longer reads device_args.aicpu_so_bin / aicpu_so_len). Saves ~inner-SO-size device memory per DeviceRunner; previously this accumulated across many ChipWorker/DeviceRunner instances and triggered AICORE OOM in long test sessions.
- Widens the aicpu_op_timeout regression test to accept the new error code surfaced by Mode A type 2 (the dispatcher / main aicpu_scheduler path can race the STARS watchdog and return 507018/507000 before the AICore stream sync emits 507046).
Why this design (vs. earlier Mode B)
Earlier revisions of this PR routed per-task launches through Mode B (rtsBinaryLoadFromFile + rtsFuncGetByName + rtsLaunchCpuKernel). Making Mode B work across multi-process / multi-runtime / long-test scenarios required several global-state workarounds (per-process JSON paths, opType collision avoidance via fingerprint suffix, atomic rename on preinstall writes). With Mode A type 2:
- binary_handle_ / rtFuncHandle lifecycle
The dispatcher SO + Mode A KFC bootstrap is the only thing we keep from the previous approach; it remains the cleanest way to get our runtime SO into the preinstall path without sudo.
Testing
- BUILD_WITH_NEW_CANN grep results across src/ = 0
Fixes #356.