feat(runtime): add L3/L2 host-device communication by PKUZHOU · Pull Request #803 · hw-native-sys/simpler

PKUZHOU · 2026-05-18T10:04:46Z

Summary

add L3/L2 send/recv channels with multi-lane SPSC shared-ring semantics
add host/device shared-memory read/write regions with software 64-bit notify/wait signals
expose both features through the runtime C ABI, ChipWorker, Worker, Orchestrator, and Python bindings
wire a2a3/a5 sim backends, a2a3 onboard host-mapped memory, and a5 onboard unsupported stubs
add focused C++ coverage for channel and shared-memory core behavior

Tests

git diff --check
python3 -m py_compile python/simpler/worker.py python/simpler/orchestrator.py python/simpler/task_interface.py
cmake -S tests/ut/cpp -B /tmp/simpler-ut-cpp-hdmem
cmake --build /tmp/simpler-ut-cpp-hdmem --target test_host_device_memory -j4
/tmp/simpler-ut-cpp-hdmem/test_host_device_memory
cmake --build /tmp/simpler-ut-cpp-hdmem --target test_host_device_channel -j4 && /tmp/simpler-ut-cpp-hdmem/test_host_device_channel
cmake -S . -B /tmp/simpler-bind-hdmem -Dnanobind_DIR=/home/ntlab/miniconda3/lib/python3.13/site-packages/nanobind/cmake
cmake --build /tmp/simpler-bind-hdmem --target _task_interface -j4
python3 simpler_setup/build_runtimes.py --lib-dir /tmp/simpler-hdmem-lib --cache-dir /tmp/simpler-hdmem-cache
PYTHONPATH=python python3 -S a2a3 sim shared-memory smoke

gemini-code-assist

Code Review

This pull request implements a bounded host-device message channel (HDCH) using a ring-buffer design in shared memory, providing platform-specific backends for hardware and simulation along with Python bindings for Orchestrator and ChipWorker. The review feedback identifies critical security vulnerabilities, specifically potential buffer overflows and TOCTOU issues when reading from shared memory. Additionally, recommendations were made to optimize performance by eliminating redundant data copies in the binding and orchestration layers and to resolve a data race in the HAL library initialization using atomic synchronization.

gemini-code-assist · 2026-05-18T10:09:40Z

+    uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us
+) {
+    if (lanes == nullptr || cursor == nullptr || (data == nullptr && nbytes != 0)) return HDCH_ERR_INVALID;
+    if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE;


The nbytes check only validates against max_message_bytes, which is read from shared memory and could be corrupted. To prevent a buffer overflow in the subsequent memcpy into slot->inline_data, you must also explicitly validate nbytes against the hard-coded HDCH_MAX_INLINE_BYTES capacity.

Suggested change

if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE;

if (nbytes > max_message_bytes || nbytes > HDCH_MAX_INLINE_BYTES) return HDCH_ERR_MSG_TOO_LARGE;

References

Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows caused by potential memory corruption or builder bypasses.

gemini-code-assist · 2026-05-18T10:09:41Z

+            if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE;
+            if (slot->payload_bytes != 0) {
+                memcpy(dst, slot->inline_data, slot->payload_bytes);
+            }
+            *out_nbytes = slot->payload_bytes;


Reading slot->payload_bytes multiple times from shared memory is vulnerable to TOCTOU issues if the device-side modifies it. Furthermore, it must be validated against HDCH_MAX_INLINE_BYTES to prevent reading past the end of the inline_data buffer. In this polling context, clamping to a safe range is preferred to avoid noisy failures.

Suggested change

if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE;

if (slot->payload_bytes != 0) {

memcpy(dst, slot->inline_data, slot->payload_bytes);

}

*out_nbytes = slot->payload_bytes;

uint32_t n = slot->payload_bytes;

if (n > HDCH_MAX_INLINE_BYTES || n > dst_capacity) n = 0;

if (n != 0) {

memcpy(dst, slot->inline_data, n);

}

*out_nbytes = n;

References

Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows.

In a polling loop that reads from shared memory, prefer clamping invalid values to a safe range over throwing an exception for transient inconsistencies.

gemini-code-assist · 2026-05-18T10:09:41Z

+namespace {
+void *g_hdch_hal_handle = nullptr;
+constexpr unsigned int HDCH_DEV_SVM_MAP_HOST = 2U;
+
+using HdchHalHostRegisterFn = int (*)(void *dev_ptr, size_t size, unsigned int flags, int device_id, void **host_ptr);
+using HdchHalHostUnregisterFn = int (*)(void *host_ptr, int device_id);
+
+int hdch_load_hal_if_needed() {
+    if (g_hdch_hal_handle != nullptr) return 0;
+    g_hdch_hal_handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_LOCAL);
+    return g_hdch_hal_handle == nullptr ? -1 : 0;
+}
+
+HdchHalHostRegisterFn hdch_get_halHostRegister() {
+    if (g_hdch_hal_handle == nullptr) return nullptr;
+    return reinterpret_cast<HdchHalHostRegisterFn>(dlsym(g_hdch_hal_handle, "halHostRegister"));
+}
+
+HdchHalHostUnregisterFn hdch_get_halHostUnregister() {
+    if (g_hdch_hal_handle == nullptr) return nullptr;
+    return reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(g_hdch_hal_handle, "halHostUnregister"));
+}
+
+}  // namespace


The initialization of g_hdch_hal_handle lacks synchronization, leading to a data race. Use compare_exchange_strong for concurrent idempotent initialization. Per repository rules, use RTLD_GLOBAL with dlopen for symbol visibility and avoid scope-bound RAII for cached handles, managing their lifecycle explicitly. Also, ensure dlclose is called if initialization fails and cache the results of dlsym calls to minimize overhead.

static std::atomic<void*> g_hdch_hal_handle{nullptr}; static HdchHalHostRegisterFn g_halHostRegister = nullptr; static HdchHalHostUnregisterFn g_halHostUnregister = nullptr; int hdch_load_hal_if_needed() { if (g_hdch_hal_handle.load() != nullptr) return 0; void* handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_GLOBAL); if (!handle) return -1; try { auto reg = reinterpret_cast<HdchHalHostRegisterFn>(dlsym(handle, "halHostRegister")); auto unreg = reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(handle, "halHostUnregister")); void* expected = nullptr; if (g_hdch_hal_handle.compare_exchange_strong(expected, handle)) { g_halHostRegister = reg; g_halHostUnregister = unreg; return 0; } dlclose(handle); return 0; } catch (...) { dlclose(handle); return -1; } }

References

Use compare_exchange_strong for concurrent idempotent initialization of shared members to ensure the value is published exactly once.

Use the RTLD_GLOBAL flag with dlopen when symbols from the loaded library need to be resolved by other shared libraries that are loaded later.

Do not use scope-bound RAII for resources (such as dlopen handles) that are intentionally cached for reuse across multiple execution runs.

To prevent resource leaks with handles acquired via dlopen, use a try-catch block to ensure dlclose is called if any subsequent initialization steps fail.

Cache the results of extern "C" function calls outside of performance-critical loops to minimize overhead from repeated calls across the language boundary.

gemini-code-assist · 2026-05-18T10:09:41Z

+                std::string payload(data.c_str(), data.size());
+                self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);


Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send.

Suggested change

std::string payload(data.c_str(), data.size());

self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);

self.channel_send(ch, route, data.c_str(), data.size(), correlation_id, timeout_us);

gemini-code-assist · 2026-05-18T10:09:41Z

+                std::string payload(data.c_str(), data.size());
+                self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);


Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send_l2_for_test.

Suggested change

std::string payload(data.c_str(), data.size());

self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);

self.channel_send_l2_for_test(ch, route, data.c_str(), data.size(), correlation_id, timeout_us);

gemini-code-assist · 2026-05-18T10:09:41Z

+                std::string payload(data.c_str(), data.size());
+                std::vector<uint8_t> bytes(payload.begin(), payload.end());
+                self.channel_send(worker_id, channel, route, bytes, correlation_id);


This implementation performs multiple redundant copies (nb::bytes -> std::string -> std::vector<uint8_t>). By updating the Orchestrator::channel_send signature to take a raw pointer and size, you can pass the data directly from nb::bytes without any intermediate allocations.

self.channel_send(worker_id, channel, route, data.c_str(), data.size(), correlation_id);

gemini-code-assist · 2026-05-18T10:09:41Z

+        uint32_t max_message_bytes
+    );
+    void close_channel(int worker_id, uint64_t ch);
+    void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id);


Passing std::vector<uint8_t> by value (or even by const reference when constructed from a binding) forces unnecessary data copies. Changing the signature to take const void* and size_t allows for zero-copy pass-through from the Python bindings to the underlying WorkerThread control command.

Suggested change

void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id);

void channel_send(int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id);

gemini-code-assist · 2026-05-18T10:09:41Z

+void Orchestrator::channel_send(
+    int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id
+) {
+    auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);
+    if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");
+    wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id);
+}


Update the implementation to match the more efficient const void* signature, avoiding the overhead of std::vector construction.

Suggested change

void Orchestrator::channel_send(

int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id

) {

auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);

if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");

wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id);

}

void Orchestrator::channel_send(

int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id

) {

auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id);

if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id");

wt->control_channel_send(ch, route, data, size, correlation_id);

}

- Hide child host pointers from hierarchical shared-memory callers - Chunk mailbox shared-memory reads and writes beyond payload capacity - Validate host/device memory and channel layout overflow cases

feat(runtime): add L3 L2 host-device send recv channel

d3be82c

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

feat(runtime): add host device shared memory

589033b

PKUZHOU changed the title ~~feat(runtime): add L3 L2 host-device send recv channel~~ feat(runtime): add L3/L2 host-device communication May 19, 2026

feat(runtime): harden L3 shared memory access

1756c1a

- Hide child host pointers from hierarchical shared-memory callers - Chunk mailbox shared-memory reads and writes beyond payload capacity - Validate host/device memory and channel layout overflow cases

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(runtime): add L3/L2 host-device communication#803

feat(runtime): add L3/L2 host-device communication#803
PKUZHOU wants to merge 3 commits into
hw-native-sys:mainfrom
PKUZHOU:feat/l3-l2-send-recv-v1

PKUZHOU commented May 18, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE;
	if (nbytes > max_message_bytes \|\| nbytes > HDCH_MAX_INLINE_BYTES) return HDCH_ERR_MSG_TOO_LARGE;

-            if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE;
-            if (slot->payload_bytes != 0) {
-                memcpy(dst, slot->inline_data, slot->payload_bytes);
-            }
-            *out_nbytes = slot->payload_bytes;
+            uint32_t n = slot->payload_bytes;
+            if (n > HDCH_MAX_INLINE_BYTES || n > dst_capacity) n = 0;
+            if (n != 0) {
+                memcpy(dst, slot->inline_data, n);
+            }
+            *out_nbytes = n;

		std::string payload(data.c_str(), data.size());
		self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);

	std::string payload(data.c_str(), data.size());
	self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);
	self.channel_send(ch, route, data.c_str(), data.size(), correlation_id, timeout_us);

		std::string payload(data.c_str(), data.size());
		self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us);

	void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id);
	void channel_send(int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id);

Conversation

PKUZHOU commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PKUZHOU commented May 18, 2026 •

edited

Loading