feat(runtime): add L3/L2 host-device communication#803
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a bounded host-device message channel (HDCH) using a ring-buffer design in shared memory, providing platform-specific backends for hardware and simulation along with Python bindings for Orchestrator and ChipWorker. The review feedback identifies critical security vulnerabilities, specifically potential buffer overflows and TOCTOU issues when reading from shared memory. Additionally, recommendations were made to optimize performance by eliminating redundant data copies in the binding and orchestration layers and to resolve a data race in the HAL library initialization using atomic synchronization.
| uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us | ||
| ) { | ||
| if (lanes == nullptr || cursor == nullptr || (data == nullptr && nbytes != 0)) return HDCH_ERR_INVALID; | ||
| if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE; |
There was a problem hiding this comment.
The nbytes check only validates against max_message_bytes, which is read from shared memory and could be corrupted. To prevent a buffer overflow in the subsequent memcpy into slot->inline_data, you must also explicitly validate nbytes against the hard-coded HDCH_MAX_INLINE_BYTES capacity.
| if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE; | |
| if (nbytes > max_message_bytes || nbytes > HDCH_MAX_INLINE_BYTES) return HDCH_ERR_MSG_TOO_LARGE; |
References
- Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows caused by potential memory corruption or builder bypasses.
| if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE; | ||
| if (slot->payload_bytes != 0) { | ||
| memcpy(dst, slot->inline_data, slot->payload_bytes); | ||
| } | ||
| *out_nbytes = slot->payload_bytes; |
There was a problem hiding this comment.
Reading slot->payload_bytes multiple times from shared memory is vulnerable to TOCTOU issues if the device-side modifies it. Furthermore, it must be validated against HDCH_MAX_INLINE_BYTES to prevent reading past the end of the inline_data buffer. In this polling context, clamping to a safe range is preferred to avoid noisy failures.
| if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE; | |
| if (slot->payload_bytes != 0) { | |
| memcpy(dst, slot->inline_data, slot->payload_bytes); | |
| } | |
| *out_nbytes = slot->payload_bytes; | |
| uint32_t n = slot->payload_bytes; | |
| if (n > HDCH_MAX_INLINE_BYTES || n > dst_capacity) n = 0; | |
| if (n != 0) { | |
| memcpy(dst, slot->inline_data, n); | |
| } | |
| *out_nbytes = n; |
References
- Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows.
- In a polling loop that reads from shared memory, prefer clamping invalid values to a safe range over throwing an exception for transient inconsistencies.
| namespace { | ||
| void *g_hdch_hal_handle = nullptr; | ||
| constexpr unsigned int HDCH_DEV_SVM_MAP_HOST = 2U; | ||
|
|
||
| using HdchHalHostRegisterFn = int (*)(void *dev_ptr, size_t size, unsigned int flags, int device_id, void **host_ptr); | ||
| using HdchHalHostUnregisterFn = int (*)(void *host_ptr, int device_id); | ||
|
|
||
| int hdch_load_hal_if_needed() { | ||
| if (g_hdch_hal_handle != nullptr) return 0; | ||
| g_hdch_hal_handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_LOCAL); | ||
| return g_hdch_hal_handle == nullptr ? -1 : 0; | ||
| } | ||
|
|
||
| HdchHalHostRegisterFn hdch_get_halHostRegister() { | ||
| if (g_hdch_hal_handle == nullptr) return nullptr; | ||
| return reinterpret_cast<HdchHalHostRegisterFn>(dlsym(g_hdch_hal_handle, "halHostRegister")); | ||
| } | ||
|
|
||
| HdchHalHostUnregisterFn hdch_get_halHostUnregister() { | ||
| if (g_hdch_hal_handle == nullptr) return nullptr; | ||
| return reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(g_hdch_hal_handle, "halHostUnregister")); | ||
| } | ||
|
|
||
| } // namespace |
There was a problem hiding this comment.
The initialization of g_hdch_hal_handle lacks synchronization, leading to a data race. Use compare_exchange_strong for concurrent idempotent initialization. Per repository rules, use RTLD_GLOBAL with dlopen for symbol visibility and avoid scope-bound RAII for cached handles, managing their lifecycle explicitly. Also, ensure dlclose is called if initialization fails and cache the results of dlsym calls to minimize overhead.
static std::atomic<void*> g_hdch_hal_handle{nullptr};
static HdchHalHostRegisterFn g_halHostRegister = nullptr;
static HdchHalHostUnregisterFn g_halHostUnregister = nullptr;
int hdch_load_hal_if_needed() {
if (g_hdch_hal_handle.load() != nullptr) return 0;
void* handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_GLOBAL);
if (!handle) return -1;
try {
auto reg = reinterpret_cast<HdchHalHostRegisterFn>(dlsym(handle, "halHostRegister"));
auto unreg = reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(handle, "halHostUnregister"));
void* expected = nullptr;
if (g_hdch_hal_handle.compare_exchange_strong(expected, handle)) {
g_halHostRegister = reg;
g_halHostUnregister = unreg;
return 0;
}
dlclose(handle);
return 0;
} catch (...) {
dlclose(handle);
return -1;
}
}References
- Use compare_exchange_strong for concurrent idempotent initialization of shared members to ensure the value is published exactly once.
- Use the RTLD_GLOBAL flag with dlopen when symbols from the loaded library need to be resolved by other shared libraries that are loaded later.
- Do not use scope-bound RAII for resources (such as dlopen handles) that are intentionally cached for reuse across multiple execution runs.
- To prevent resource leaks with handles acquired via dlopen, use a try-catch block to ensure dlclose is called if any subsequent initialization steps fail.
- Cache the results of extern "C" function calls outside of performance-critical loops to minimize overhead from repeated calls across the language boundary.
| std::string payload(data.c_str(), data.size()); | ||
| self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); |
There was a problem hiding this comment.
Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send.
| std::string payload(data.c_str(), data.size()); | |
| self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); | |
| self.channel_send(ch, route, data.c_str(), data.size(), correlation_id, timeout_us); |
| std::string payload(data.c_str(), data.size()); | ||
| self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); |
There was a problem hiding this comment.
Creating an intermediate std::string from nb::bytes is redundant and incurs an unnecessary copy. nb::bytes provides direct access to the underlying buffer via c_str() and size(), which can be passed directly to self.channel_send_l2_for_test.
| std::string payload(data.c_str(), data.size()); | |
| self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); | |
| self.channel_send_l2_for_test(ch, route, data.c_str(), data.size(), correlation_id, timeout_us); |
| std::string payload(data.c_str(), data.size()); | ||
| std::vector<uint8_t> bytes(payload.begin(), payload.end()); | ||
| self.channel_send(worker_id, channel, route, bytes, correlation_id); |
There was a problem hiding this comment.
This implementation performs multiple redundant copies (nb::bytes -> std::string -> std::vector<uint8_t>). By updating the Orchestrator::channel_send signature to take a raw pointer and size, you can pass the data directly from nb::bytes without any intermediate allocations.
self.channel_send(worker_id, channel, route, data.c_str(), data.size(), correlation_id);| uint32_t max_message_bytes | ||
| ); | ||
| void close_channel(int worker_id, uint64_t ch); | ||
| void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id); |
There was a problem hiding this comment.
Passing std::vector<uint8_t> by value (or even by const reference when constructed from a binding) forces unnecessary data copies. Changing the signature to take const void* and size_t allows for zero-copy pass-through from the Python bindings to the underlying WorkerThread control command.
| void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id); | |
| void channel_send(int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id); |
| void Orchestrator::channel_send( | ||
| int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id | ||
| ) { | ||
| auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); | ||
| if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id"); | ||
| wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id); | ||
| } |
There was a problem hiding this comment.
Update the implementation to match the more efficient const void* signature, avoiding the overhead of std::vector construction.
| void Orchestrator::channel_send( | |
| int worker_id, uint64_t ch, uint32_t route, const std::vector<uint8_t> &data, uint64_t correlation_id | |
| ) { | |
| auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); | |
| if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id"); | |
| wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id); | |
| } | |
| void Orchestrator::channel_send( | |
| int worker_id, uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id | |
| ) { | |
| auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); | |
| if (!wt) throw std::runtime_error("Orchestrator::channel_send: invalid worker_id"); | |
| wt->control_channel_send(ch, route, data, size, correlation_id); | |
| } |
- Hide child host pointers from hierarchical shared-memory callers - Chunk mailbox shared-memory reads and writes beyond payload capacity - Validate host/device memory and channel layout overflow cases
Summary
Tests