diff --git a/docs/L3-L2_host-device_communication.md b/docs/L3-L2_host-device_communication.md
new file mode 100644
index 000000000..7b320b8eb
--- /dev/null
+++ b/docs/L3-L2_host-device_communication.md
@@ -0,0 +1,401 @@
+# L3/L2 Host-Device Communication
+
+A second L3/L2 communication model is added beside the existing bounded
+message channel. The runtime now exposes both **send/recv** for small control
+messages and **shared-memory read/write + notify/wait** for data regions that
+need explicit producer-consumer synchronization.
+
+In this document, L3 means the host-side Worker / Orchestrator runtime that
+issues control commands. L2 means the device-side execution boundary reached
+through the chip child process and platform runtime. Today the L2-facing
+implementation covers the AICPU broker / sim test paths, while the memory
+layout is kept plain POD so future device-side participants can use the same
+region format.
+
+Current L3 shared-memory access is intentionally mailbox-mediated. The L3
+parent does not mmap the chip child's host mapping and does not access the
+device pointer directly. L3 `shared_memory_read` and `shared_memory_write`
+are chunked mailbox RPC copy helpers over the child-owned region; they are
+not a device-direct zero-copy data plane.
+
+For the WorkerThread mailbox used to reach chip children, see
+[worker-manager.md](worker-manager.md). For where the Orchestrator fits
+in the hierarchical runtime, see
+[hierarchical_level_runtime.md](hierarchical_level_runtime.md). For the
+platform/runtime split, see [chip-level-arch.md](chip-level-arch.md).
+
+---
+
+## 1. Communication model
+
+The PR deliberately keeps three semantics separate:
+
+```text
+Control: send(route, payload, correlation_id) / recv()
+Data: read(offset, nbytes) / write(offset, bytes)
+Sync: notify(signal_id, value) / wait(signal_id, target)
+```
+
+### Responsibilities
+
+- **Message channel**: small commands, events, and completion notifications.
+  Messages carry `route`, `correlation_id`, and a byte payload.
+- **Shared memory**: a CPU/NPU-visible data region addressed by offset.
+  Reads and writes copy bytes only; they do not carry route metadata.
+- **Signal slots**: software synchronization words associated with a shared
+  memory region. Signals publish phase or sequence progress.
+
+This split avoids forcing bulk data through the message path and avoids
+hiding synchronization inside raw memory copies. Callers choose the primitive
+that matches their protocol.
+
+---
+
+## 2. Message channel
+
+The message channel is implemented by
+`src/common/worker/host_device_channel.{h,cpp}` and exported through
+`pto_runtime_c_api.h`.
+
+```cpp
+typedef struct {
+    uint32_t lane_count_cpu_to_l2;
+    uint32_t lane_count_l2_to_cpu;
+    uint32_t lane_depth;
+    uint32_t max_message_bytes;
+    uint32_t flags;
+} HostDeviceChannelConfig;
+```
+
+Each channel owns a shared control region:
+
+```text
+HostDeviceChannelHeader
+  +-- cpu_to_l2 lanes[0..N-1]
+  |     +-- HostDeviceLaneHeader
+  |     +-- HostDeviceDesc[lane_depth]
+  +-- l2_to_cpu lanes[0..M-1]
+        +-- HostDeviceLaneHeader
+        +-- HostDeviceDesc[lane_depth]
+```
+
+Each lane is a bounded SPSC ring. The CPU send path writes into
+`cpu_to_l2` lanes; the CPU receive path consumes from `l2_to_cpu` lanes.
+The L2 test helpers use the opposite direction so unit tests can exercise
+both endpoints without real device code.
+
+### Channel constraints
+
+- `lane_depth` must be a non-zero power of two.
+- `max_message_bytes` must be non-zero and no larger than
+  `HDCH_MAX_INLINE_BYTES` (`256`).
+- `timeout_us == 0` means non-blocking probe. If the selected operation
+  cannot complete immediately, it returns a would-block error.
+- Payloads are inline. This channel is not a streaming data plane.
+
+---
+
+## 3. Shared memory
+
+The shared-memory model is implemented by
+`src/common/worker/host_device_memory.{h,cpp}` and exported through
+`pto_runtime_c_api.h`.
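As a minimal sketch of the bounds rule this section lists under its memory constraints (`offset <= data_bytes` and `nbytes <= data_bytes - offset`), the check below accepts a zero-length access at the very end of the region and rejects anything that would run past it. The name `range_is_valid` is hypothetical and not part of the runtime. The subtraction form is kept because a fixed-width C implementation would presumably want to avoid computing `offset + nbytes`, which could wrap around.

```python
def range_is_valid(offset: int, nbytes: int, data_bytes: int) -> bool:
    """Return True if [offset, offset + nbytes) lies inside the data region.

    Hypothetical helper for illustration only; it is not a runtime API.
    """
    # Negative values cannot occur with unsigned C types; reject them here.
    if offset < 0 or nbytes < 0:
        return False
    # An offset equal to data_bytes is allowed (end of region).
    if offset > data_bytes:
        return False
    # Written as a subtraction so no sum of offset and nbytes is needed.
    return nbytes <= data_bytes - offset


region_bytes = 4096
whole_region_ok = range_is_valid(0, 4096, region_bytes)
empty_at_end_ok = range_is_valid(4096, 0, region_bytes)
overrun_ok = range_is_valid(4090, 7, region_bytes)
print(whole_region_ok, empty_at_end_ok, overrun_ok)
```

The zero-length case at `offset == data_bytes` is the one the source calls out explicitly as valid.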
+
+```cpp
+typedef struct {
+    uint64_t data_bytes;
+    uint32_t signal_count;
+    uint32_t flags;
+} HostDeviceMemoryConfig;
+
+typedef struct {
+    uint64_t host_ptr;
+    uint64_t device_ptr;
+    uint64_t data_bytes;
+    uint32_t signal_count;
+    uint32_t flags;
+} HostDeviceMemoryInfo;
+```
+
+The region starts with metadata, followed by cache-line-sized signal slots,
+followed by the data region:
+
+```text
+offset 0:
+  HostDeviceMemoryHeader
+    magic
+    version
+    flags
+    signal_count
+    data_offset
+    data_bytes
+    total_bytes
+    fatal_status
+
+offset sizeof(HostDeviceMemoryHeader):
+  HostDeviceSignalSlot[signal_count] // each slot is 64 B
+
+offset data_offset:
+  data[data_bytes] // read/write offset 0 starts here
+```
+
+`HostDeviceMemoryInfo::host_ptr` and `device_ptr` both point at the data
+region, not the header. All public `shared_memory_read` and
+`shared_memory_write` offsets are relative to the data region.
+
+### Memory constraints
+
+- `data_bytes` must be non-zero.
+- `signal_count` must be non-zero.
+- The computed layout is 64-byte aligned and checked for integer overflow.
+- Bounds checks use `offset <= data_bytes` and
+  `nbytes <= data_bytes - offset`, so zero-length operations at the end of
+  the region are valid.
+
+---
+
+## 4. Synchronization contract
+
+Shared-memory read/write does not provide an implicit producer-consumer
+protocol. The caller owns phase ordering.
+
+### CPU produces, L2 consumes
+
+```text
+L3 CPU:
+  shared_memory_write(mem, offset, bytes)
+  shared_memory_notify(mem, signal_id, seq)
+
+L2:
+  wait(signal_id, seq)
+  read/process data region
+```
+
+### L2 produces, CPU consumes
+
+```text
+L2:
+  write/process data region
+  notify(signal_id, seq)
+
+L3 CPU:
+  shared_memory_wait(mem, signal_id, seq, timeout_us)
+  shared_memory_read(mem, offset, nbytes)
+```
+
+`notify` stores the signal value with release semantics. `wait` polls the
+signal with acquire semantics until the observed value is greater than or
+equal to the target. Signal values are intended to be monotonic sequence or
+phase numbers; the implementation does not enforce monotonicity.
+
+Use separate `signal_id` values for independent directions or pipeline
+stages. V1 does not provide an MPMC queue or ownership protocol on top of
+the data region.
+
+---
+
+## 5. Runtime and platform path
+
+The public ABI lives in `src/common/worker/pto_runtime_c_api.h`. The relevant
+entry points are:
+
+```cpp
+HostDeviceChannelHandle open_host_device_channel_ctx(
+    DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg);
+int host_device_send_ctx(...);
+int host_device_recv_ctx(...);
+
+HostDeviceMemoryHandle open_host_device_memory_ctx(
+    DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg);
+int host_device_memory_info_ctx(...);
+int host_device_memory_read_ctx(...);
+int host_device_memory_write_ctx(...);
+int host_device_memory_notify_ctx(...);
+int host_device_memory_wait_ctx(...);
+```
+
+`ChipWorker` resolves these symbols unconditionally via `dlsym`. Runtime
+variants that do not support the feature must export stubs rather than omit
+symbols, so the worker has one stable ABI surface.
+
+### Platform behavior
+
+- `a2a3/sim`: uses a 64-byte-aligned host buffer; host and device pointers
+  are the same simulated address.
+- `a5/sim`: uses the same simulated allocation model as `a2a3/sim`.
+- `a2a3/onboard`: allocates device-visible memory, registers the host mapping
+  with `halHostRegister`, and initializes the common region layout.
+- `a5/onboard`: exports unsupported stubs for this API surface.
+
+The common `host_device_*` helpers own layout validation, bounds checks, ring
+operations, read/write copies, and signal load/store semantics. Platform code
+is responsible for allocation and mapping.
+
+---
+
+## 6. L3 control path
+
+When Python code calls the API on a direct `ChipWorker`, the call reaches the
+runtime C ABI in the same child process and `host_ptr` is a valid address in
+that process.
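To illustrate what "a valid address in that process" means, the sketch below uses a plain `ctypes` buffer as a stand-in for the data region and reads it back through its integer address. Nothing here calls the runtime: `fake_region` and `read_at` are hypothetical names. Whether production code should ever dereference `host_ptr` this way, rather than call `shared_memory_read`, is an assumption the source does not settle.

```python
import ctypes


def read_at(host_ptr: int, nbytes: int) -> bytes:
    """Copy nbytes starting at an integer address in the current process."""
    return ctypes.string_at(host_ptr, nbytes)


# Stand-in for the data region; a real region would come from the runtime.
fake_region = ctypes.create_string_buffer(64)
ctypes.memmove(fake_region, b"payload", 7)

# The integer address is only meaningful inside the process that owns it.
host_ptr = ctypes.addressof(fake_region)
out = read_at(host_ptr, 7)
print(out)
```

Passing the same integer to another process would not refer to this buffer, which is the situation an L3 parent would be in if it received the child's raw address.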
+
+When Python code calls the API through an L3 `Orchestrator`, the L3 parent
+does **not** own the chip child's host mapping. The Orchestrator forwards the
+operation through the target `WorkerThread` mailbox:
+
+```text
+Python Orchestrator
+  |
+  v
+src/common/hierarchical/orchestrator.cpp
+  |
+  v
+WorkerManager::get_worker(NEXT_LEVEL, worker_id)
+  |
+  v
+WorkerThread::control_*()
+  |
+  v CONTROL_REQUEST mailbox command
+python/simpler/worker.py::_chip_process_loop
+  |
+  v
+ChipWorker::{open_channel, shared_memory_write, ...}
+  |
+  v
+runtime C ABI
+```
+
+`WorkerThread::mailbox_mu_` serializes task dispatch and control commands for
+the same child mailbox. Different WorkerThreads still operate independently.
+
+### L3 metadata semantics
+
+`shared_memory_info` returns different `host_ptr` semantics depending on the
+caller:
+
+- Direct `ChipWorker`: `host_ptr` is the current process's data-region
+  address.
+- L3 `Orchestrator`: `host_ptr` is always `0`; callers must use
+  `shared_memory_read` and `shared_memory_write`.
+
+L3 reads and writes are mailbox RPC copies. Large transfers are split into
+chunks no larger than `CTRL_PAYLOAD_CAPACITY`; the API materializes the full
+`bytes` result for reads. This is a convenient control-plane data copy path,
+not a zero-copy parent mapping and not a streaming transport.
+
+The mailbox protocol is intentionally kept stable while chunking is handled
+above it. A logical L3 read/write may issue multiple
+`CTRL_SHARED_MEMORY_READ` or `CTRL_SHARED_MEMORY_WRITE` child requests, but
+each individual mailbox payload still fits inside `CTRL_PAYLOAD_CAPACITY`.
+
+---
+
+## 7. Python API
+
+Direct chip-worker API:
+
+```python
+ch = chip.open_channel(cpu_to_l2_lanes=1, l2_to_cpu_lanes=1)
+chip.channel_send(ch, route=7, data=b"cmd", correlation_id=1)
+payload, route, cid = chip.channel_recv(ch, capacity=256, timeout_us=1000)
+
+mem = chip.open_shared_memory(data_bytes=4096, signal_count=2)
+host_ptr, device_ptr, data_bytes, signal_count, flags = (
+    chip.shared_memory_info(mem)
+)
+chip.shared_memory_write(mem, 0, b"payload")
+chip.shared_memory_notify(mem, 0, 1)
+chip.shared_memory_wait(mem, 1, 1, timeout_us=1000)
+out = chip.shared_memory_read(mem, 0, 7)
+```
+
+L3 Orchestrator API adds `worker_id` and forwards through the selected
+next-level worker:
+
+```python
+mem = orch.open_shared_memory(worker_id=0, data_bytes=4096, signal_count=2)
+host_ptr, device_ptr, _, _, _ = orch.shared_memory_info(0, mem)
+assert host_ptr == 0
+
+orch.shared_memory_write(0, mem, 0, b"payload")
+orch.shared_memory_notify(0, mem, 0, 1)
+orch.shared_memory_wait(0, mem, 1, 1, timeout_us=1000)
+out = orch.shared_memory_read(0, mem, 0, 7)
+```
+
+The `*_l2_for_test` methods exist only to simulate the L2 endpoint in tests.
+They are not the production CPU-facing protocol.
+
+---
+
+## 8. Test coverage
+
+Focused smoke tests cover both the protocol boundary and the supported
+hardware path:
+
+- `tests/ut/py/test_worker/test_host_device_comm_sim.py`
+  - Runs on `a2a3sim` and `a5sim`.
+  - Uses a real L3 `Worker` and child mailbox path.
+  - Covers `open_channel`, `channel_send`, `channel_recv`, L3
+    `open_shared_memory`, `shared_memory_info`, chunked
+    `shared_memory_write/read`, `shared_memory_notify`, and
+    `shared_memory_wait`.
+- `tests/ut/py/test_worker/test_host_device_comm_hw.py`
+  - Runs on `a2a3` onboard hardware.
+  - Covers CPU/L2 channel traffic in both directions, chunked shared-memory
+    round trip larger than `CTRL_PAYLOAD_CAPACITY`, notify/wait, and
+    diagnostic failures for invalid config or oversized channel payloads.
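The chunked round trip that these tests exercise can be pictured with a small sketch. It splits a logical write into pieces no larger than a capacity limit and reassembles them on read. `CAPACITY` is a made-up stand-in for `CTRL_PAYLOAD_CAPACITY`, whose real value is not stated here, and `chunk_ranges`, `chunked_write`, and `chunked_read` are hypothetical helpers operating on a local `bytearray`, not the Orchestrator's actual code.

```python
from typing import Iterator, Tuple

CAPACITY = 1024  # assumed value for illustration; not the real CTRL_PAYLOAD_CAPACITY


def chunk_ranges(offset: int, nbytes: int, capacity: int = CAPACITY) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs covering the range, each length <= capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    done = 0
    while done < nbytes:
        step = min(capacity, nbytes - done)
        yield offset + done, step
        done += step


def chunked_write(region: bytearray, offset: int, data: bytes) -> int:
    """Copy data into region one chunk at a time; return the number of chunks."""
    count = 0
    for chunk_off, length in chunk_ranges(offset, len(data)):
        src = chunk_off - offset
        region[chunk_off:chunk_off + length] = data[src:src + length]
        count += 1
    return count


def chunked_read(region: bytearray, offset: int, nbytes: int) -> bytes:
    """Materialize the full range by concatenating per-chunk copies."""
    parts = [bytes(region[o:o + n]) for o, n in chunk_ranges(offset, nbytes)]
    return b"".join(parts)


region = bytearray(4096)
payload = bytes(range(256)) * 10  # 2560 bytes, larger than CAPACITY
chunks = chunked_write(region, 100, payload)
roundtrip = chunked_read(region, 100, len(payload))
print(chunks, roundtrip == payload)
```

In the real path each chunk would be one mailbox request; here the per-chunk copy is just a slice assignment, so the sketch shows only the splitting and reassembly.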
+
+These tests prove the current parent, nanobind/C++ binding, child handler,
+and runtime backend agree on the mailbox command, offset, payload, and
+metadata semantics. They do not claim device-direct zero-copy correctness;
+that is outside the current L3 API contract.
+
+---
+
+## 9. Why this layering
+
+### 9.1 Why keep send/recv and shared memory separate?
+
+Small messages need routing metadata, correlation ids, and bounded queue
+backpressure. Bulk data exchange needs an addressable buffer and explicit
+phase synchronization. Combining both into one primitive would either make
+messages carry unnecessary memory protocol state or make data exchange look
+like a stream of oversized control packets.
+
+### 9.2 Why explicit notify/wait?
+
+Read/write only copies bytes. Different callers need different phase
+protocols: one-shot handoff, ping-pong buffers, or multi-stage pipelines.
+Explicit signal slots make the synchronization boundary visible and keep the
+memory primitive from inventing hidden fences or ownership rules.
+
+### 9.3 Why no direct L3 parent pointer?
+
+The shared-memory region is opened by the chip child that owns the runtime
+context. In PROCESS mode, the L3 parent talks to that child through a
+pre-existing mailbox and does not have a valid mapping to dereference.
+Returning `host_ptr = 0` prevents accidental parent-side pointer use and
+forces L3 callers onto mailbox-mediated read/write operations.
+
+### 9.4 Why unsupported stubs?
+
+`ChipWorker` resolves the runtime ABI as a stable set of symbols. Exporting
+unsupported stubs keeps loading deterministic on platforms that have not
+implemented a backend yet, while still producing a clear runtime error if the
+API is used.
+
+---
+
+## 10. Related
+
+- [worker-manager.md](worker-manager.md) - WorkerThread mailbox,
+  control commands, and child-process dispatch
+- [hierarchical_level_runtime.md](hierarchical_level_runtime.md) -
+  where L3/L2 workers fit in the level model
+- [task-flow.md](task-flow.md) - task data movement through the
+  hierarchical runtime
+- `src/common/worker/host_device_channel.{h,cpp}` - message channel layout
+  and SPSC ring implementation
+- `src/common/worker/host_device_memory.{h,cpp}` - shared-memory layout,
+  bounds checks, and signal semantics
diff --git a/python/bindings/CMakeLists.txt b/python/bindings/CMakeLists.txt
index ae392526e..9ef01f9ee 100644
--- a/python/bindings/CMakeLists.txt
+++ b/python/bindings/CMakeLists.txt
@@ -33,6 +33,8 @@ nanobind_add_module(_task_interface ${BINDING_SOURCES} ${HIERARCHICAL_SOURCES})
 target_sources(_task_interface PRIVATE
     ${CMAKE_SOURCE_DIR}/src/common/worker/chip_worker.cpp
+    ${CMAKE_SOURCE_DIR}/src/common/worker/host_device_channel.cpp
+    ${CMAKE_SOURCE_DIR}/src/common/worker/host_device_memory.cpp
 )
 target_include_directories(_task_interface PRIVATE
diff --git a/python/bindings/task_interface.cpp b/python/bindings/task_interface.cpp
index 6c01540cd..5ad3b8e3e 100644
--- a/python/bindings/task_interface.cpp
+++ b/python/bindings/task_interface.cpp
@@ -741,6 +741,126 @@ NB_MODULE(_task_interface, m) {
         .def("free", &ChipWorker::free, nb::arg("ptr"))
         .def("copy_to", &ChipWorker::copy_to, nb::arg("dst"), nb::arg("src"), nb::arg("size"))
         .def("copy_from", &ChipWorker::copy_from, nb::arg("dst"), nb::arg("src"), nb::arg("size"))
+        .def(
+            "open_channel",
+            [](ChipWorker &self, uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, uint32_t lane_depth,
+               uint32_t max_message_bytes, uint32_t flags) {
+                HostDeviceChannelConfig cfg{
+                    cpu_to_l2_lanes, l2_to_cpu_lanes, lane_depth, max_message_bytes, flags
+                };
+                return self.open_channel(cfg);
+            },
+            nb::arg("cpu_to_l2_lanes") = 1, nb::arg("l2_to_cpu_lanes") = 1, nb::arg("lane_depth") = 64,
nb::arg("max_message_bytes") = HDCH_MAX_INLINE_BYTES, nb::arg("flags") = 0, + "Open a bounded host/device message channel." + ) + .def("close_channel", &ChipWorker::close_channel, nb::arg("channel")) + .def( + "channel_send", + [](ChipWorker &self, uint64_t ch, uint32_t route, nb::bytes data, uint64_t correlation_id, + uint32_t timeout_us) { + std::string payload(data.c_str(), data.size()); + self.channel_send(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); + }, + nb::arg("channel"), nb::arg("route"), nb::arg("data"), nb::arg("correlation_id") = 0, + nb::arg("timeout_us") = 0 + ) + .def( + "channel_recv", + [](ChipWorker &self, uint64_t ch, size_t capacity, uint32_t timeout_us) { + uint32_t route = 0; + uint64_t correlation_id = 0; + auto data = self.channel_recv(ch, capacity, timeout_us, &route, &correlation_id); + return nb::make_tuple(nb::bytes(reinterpret_cast(data.data()), data.size()), route, correlation_id); + }, + nb::arg("channel"), nb::arg("capacity") = HDCH_MAX_INLINE_BYTES, nb::arg("timeout_us") = 0 + ) + .def( + "channel_send_l2_for_test", + [](ChipWorker &self, uint64_t ch, uint32_t route, nb::bytes data, uint64_t correlation_id, + uint32_t timeout_us) { + std::string payload(data.c_str(), data.size()); + self.channel_send_l2_for_test(ch, route, payload.data(), payload.size(), correlation_id, timeout_us); + }, + nb::arg("channel"), nb::arg("route"), nb::arg("data"), nb::arg("correlation_id") = 0, + nb::arg("timeout_us") = 0 + ) + .def( + "channel_recv_l2_for_test", + [](ChipWorker &self, uint64_t ch, size_t capacity, uint32_t timeout_us) { + uint32_t route = 0; + uint64_t correlation_id = 0; + auto data = self.channel_recv_l2_for_test(ch, capacity, timeout_us, &route, &correlation_id); + return nb::make_tuple(nb::bytes(reinterpret_cast(data.data()), data.size()), route, correlation_id); + }, + nb::arg("channel"), nb::arg("capacity") = HDCH_MAX_INLINE_BYTES, nb::arg("timeout_us") = 0 + ) + .def( + "open_shared_memory", + 
[](ChipWorker &self, uint64_t data_bytes, uint32_t signal_count, uint32_t flags) { + HostDeviceMemoryConfig cfg{data_bytes, signal_count, flags}; + return self.open_shared_memory(cfg); + }, + nb::arg("data_bytes"), nb::arg("signal_count") = 2, nb::arg("flags") = 0, + "Open a host/device shared-memory region." + ) + .def("close_shared_memory", &ChipWorker::close_shared_memory, nb::arg("memory")) + .def( + "shared_memory_info", + [](ChipWorker &self, uint64_t mem) { + HostDeviceMemoryInfo info = self.shared_memory_info(mem); + return nb::make_tuple(info.host_ptr, info.device_ptr, info.data_bytes, info.signal_count, info.flags); + }, + nb::arg("memory"), "Return shared-memory metadata. host_ptr is a current-process address." + ) + .def( + "shared_memory_read", + [](ChipWorker &self, uint64_t mem, uint64_t offset, size_t nbytes) { + auto data = self.shared_memory_read(mem, offset, nbytes); + return nb::bytes(reinterpret_cast(data.data()), data.size()); + }, + nb::arg("memory"), nb::arg("offset"), nb::arg("nbytes") + ) + .def( + "shared_memory_write", + [](ChipWorker &self, uint64_t mem, uint64_t offset, nb::bytes data) { + std::string payload(data.c_str(), data.size()); + self.shared_memory_write(mem, offset, payload.data(), payload.size()); + }, + nb::arg("memory"), nb::arg("offset"), nb::arg("data") + ) + .def( + "shared_memory_notify", &ChipWorker::shared_memory_notify, nb::arg("memory"), nb::arg("signal_id"), + nb::arg("value") + ) + .def( + "shared_memory_wait", &ChipWorker::shared_memory_wait, nb::arg("memory"), nb::arg("signal_id"), + nb::arg("target"), nb::arg("timeout_us") = 0 + ) + .def( + "shared_memory_read_l2_for_test", + [](ChipWorker &self, uint64_t mem, uint64_t offset, size_t nbytes) { + auto data = self.shared_memory_read_l2_for_test(mem, offset, nbytes); + return nb::bytes(reinterpret_cast(data.data()), data.size()); + }, + nb::arg("memory"), nb::arg("offset"), nb::arg("nbytes") + ) + .def( + "shared_memory_write_l2_for_test", + [](ChipWorker &self, 
uint64_t mem, uint64_t offset, nb::bytes data) { + std::string payload(data.c_str(), data.size()); + self.shared_memory_write_l2_for_test(mem, offset, payload.data(), payload.size()); + }, + nb::arg("memory"), nb::arg("offset"), nb::arg("data") + ) + .def( + "shared_memory_notify_l2_for_test", &ChipWorker::shared_memory_notify_l2_for_test, + nb::arg("memory"), nb::arg("signal_id"), nb::arg("value") + ) + .def( + "shared_memory_wait_l2_for_test", &ChipWorker::shared_memory_wait_l2_for_test, + nb::arg("memory"), nb::arg("signal_id"), nb::arg("target"), nb::arg("timeout_us") = 0 + ) .def( "comm_init", &ChipWorker::comm_init, nb::arg("rank"), nb::arg("nranks"), nb::arg("rootinfo_path"), "Initialize a communicator for this rank. ChipWorker owns ACL + stream " diff --git a/python/bindings/worker_bind.h b/python/bindings/worker_bind.h index 024523c54..9713f38bf 100644 --- a/python/bindings/worker_bind.h +++ b/python/bindings/worker_bind.h @@ -29,6 +29,8 @@ #include #include +#include +#include #include "chip_bootstrap_channel.h" #include "ring.h" @@ -159,6 +161,106 @@ inline void bind_worker(nb::module_ &m) { }, nb::arg("worker_id"), nb::arg("dst"), nb::arg("src"), nb::arg("size"), "Copy worker src to host dst." ) + .def( + "open_channel", + [](Orchestrator &self, int worker_id, uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, + uint32_t lane_depth, uint32_t max_message_bytes) { + return self.open_channel(worker_id, cpu_to_l2_lanes, l2_to_cpu_lanes, lane_depth, max_message_bytes); + }, + nb::arg("worker_id"), nb::arg("cpu_to_l2_lanes") = 1, nb::arg("l2_to_cpu_lanes") = 1, + nb::arg("lane_depth") = 64, nb::arg("max_message_bytes") = 256, + "Open a host/device message channel on a next-level worker." + ) + .def( + "close_channel", + [](Orchestrator &self, int worker_id, uint64_t channel) { + self.close_channel(worker_id, channel); + }, + nb::arg("worker_id"), nb::arg("channel"), "Close a host/device message channel." 
+ ) + .def( + "channel_send", + [](Orchestrator &self, int worker_id, uint64_t channel, uint32_t route, nb::bytes data, + uint64_t correlation_id) { + std::string payload(data.c_str(), data.size()); + std::vector bytes(payload.begin(), payload.end()); + self.channel_send(worker_id, channel, route, bytes, correlation_id); + }, + nb::arg("worker_id"), nb::arg("channel"), nb::arg("route"), nb::arg("data"), nb::arg("correlation_id") = 0, + "Send a message through a host/device channel." + ) + .def( + "channel_recv", + [](Orchestrator &self, int worker_id, uint64_t channel, size_t capacity, uint32_t timeout_us) { + uint32_t route = 0; + uint64_t correlation_id = 0; + auto data = self.channel_recv(worker_id, channel, capacity, timeout_us, &route, &correlation_id); + return nb::make_tuple( + nb::bytes(reinterpret_cast(data.data()), data.size()), route, correlation_id + ); + }, + nb::arg("worker_id"), nb::arg("channel"), nb::arg("capacity") = 256, nb::arg("timeout_us") = 0, + "Receive a message through a host/device channel." + ) + .def( + "open_shared_memory", + [](Orchestrator &self, int worker_id, uint64_t data_bytes, uint32_t signal_count, uint32_t flags) { + return self.open_shared_memory(worker_id, data_bytes, signal_count, flags); + }, + nb::arg("worker_id"), nb::arg("data_bytes"), nb::arg("signal_count") = 2, nb::arg("flags") = 0, + "Open a host/device shared-memory region on a next-level worker." + ) + .def( + "close_shared_memory", + [](Orchestrator &self, int worker_id, uint64_t memory) { + self.close_shared_memory(worker_id, memory); + }, + nb::arg("worker_id"), nb::arg("memory"), "Close a host/device shared-memory region." 
+ ) + .def( + "shared_memory_info", + [](Orchestrator &self, int worker_id, uint64_t memory) { + HostDeviceMemoryInfo info = self.shared_memory_info(worker_id, memory); + return nb::make_tuple(info.host_ptr, info.device_ptr, info.data_bytes, info.signal_count, info.flags); + }, + nb::arg("worker_id"), nb::arg("memory"), + "Return shared-memory metadata. host_ptr is 0 for hierarchical mailbox callers." + ) + .def( + "shared_memory_read", + [](Orchestrator &self, int worker_id, uint64_t memory, uint64_t offset, size_t nbytes) { + auto data = self.shared_memory_read(worker_id, memory, offset, nbytes); + return nb::bytes(reinterpret_cast(data.data()), data.size()); + }, + nb::arg("worker_id"), nb::arg("memory"), nb::arg("offset"), nb::arg("nbytes"), + "Read shared-memory bytes via chunked mailbox RPC. Returns a full materialized bytes object; this is " + "not streaming or zero-copy." + ) + .def( + "shared_memory_write", + [](Orchestrator &self, int worker_id, uint64_t memory, uint64_t offset, nb::bytes data) { + std::string payload(data.c_str(), data.size()); + std::vector bytes(payload.begin(), payload.end()); + self.shared_memory_write(worker_id, memory, offset, bytes); + }, + nb::arg("worker_id"), nb::arg("memory"), nb::arg("offset"), nb::arg("data"), + "Write shared-memory bytes via chunked mailbox RPC. This is not streaming or zero-copy." 
+ ) + .def( + "shared_memory_notify", + [](Orchestrator &self, int worker_id, uint64_t memory, uint32_t signal_id, uint64_t value) { + self.shared_memory_notify(worker_id, memory, signal_id, value); + }, + nb::arg("worker_id"), nb::arg("memory"), nb::arg("signal_id"), nb::arg("value") + ) + .def( + "shared_memory_wait", + [](Orchestrator &self, int worker_id, uint64_t memory, uint32_t signal_id, uint64_t target, + uint32_t timeout_us) { + self.shared_memory_wait(worker_id, memory, signal_id, target, timeout_us); + }, + nb::arg("worker_id"), nb::arg("memory"), nb::arg("signal_id"), nb::arg("target"), nb::arg("timeout_us") = 0 + ) .def( "alloc", [](Orchestrator &self, const std::vector &shape, DataType dtype) { @@ -251,6 +353,35 @@ inline void bind_worker(nb::module_ &m) { m.attr("MAILBOX_SIZE") = static_cast(MAILBOX_SIZE); m.attr("MAILBOX_OFF_ERROR_MSG") = static_cast(MAILBOX_OFF_ERROR_MSG); m.attr("MAILBOX_ERROR_MSG_SIZE") = static_cast(MAILBOX_ERROR_MSG_SIZE); + m.attr("MAILBOX_OFF_ARGS") = static_cast(MAILBOX_OFF_ARGS); + m.attr("MAILBOX_ARGS_CAPACITY") = static_cast(MAILBOX_ARGS_CAPACITY); + m.attr("CTRL_MALLOC") = static_cast(CTRL_MALLOC); + m.attr("CTRL_FREE") = static_cast(CTRL_FREE); + m.attr("CTRL_COPY_TO") = static_cast(CTRL_COPY_TO); + m.attr("CTRL_COPY_FROM") = static_cast(CTRL_COPY_FROM); + m.attr("CTRL_PREPARE") = static_cast(CTRL_PREPARE); + m.attr("CTRL_REGISTER") = static_cast(CTRL_REGISTER); + m.attr("CTRL_UNREGISTER") = static_cast(CTRL_UNREGISTER); + m.attr("CTRL_OPEN_CHANNEL") = static_cast(CTRL_OPEN_CHANNEL); + m.attr("CTRL_CLOSE_CHANNEL") = static_cast(CTRL_CLOSE_CHANNEL); + m.attr("CTRL_CHANNEL_SEND") = static_cast(CTRL_CHANNEL_SEND); + m.attr("CTRL_CHANNEL_RECV") = static_cast(CTRL_CHANNEL_RECV); + m.attr("CTRL_OPEN_SHARED_MEMORY") = static_cast(CTRL_OPEN_SHARED_MEMORY); + m.attr("CTRL_CLOSE_SHARED_MEMORY") = static_cast(CTRL_CLOSE_SHARED_MEMORY); + m.attr("CTRL_SHARED_MEMORY_INFO") = static_cast(CTRL_SHARED_MEMORY_INFO); + 
m.attr("CTRL_SHARED_MEMORY_READ") = static_cast(CTRL_SHARED_MEMORY_READ); + m.attr("CTRL_SHARED_MEMORY_WRITE") = static_cast(CTRL_SHARED_MEMORY_WRITE); + m.attr("CTRL_SHARED_MEMORY_NOTIFY") = static_cast(CTRL_SHARED_MEMORY_NOTIFY); + m.attr("CTRL_SHARED_MEMORY_WAIT") = static_cast(CTRL_SHARED_MEMORY_WAIT); + m.attr("CTRL_OFF_ARG0") = static_cast(CTRL_OFF_ARG0); + m.attr("CTRL_OFF_ARG1") = static_cast(CTRL_OFF_ARG1); + m.attr("CTRL_OFF_ARG2") = static_cast(CTRL_OFF_ARG2); + m.attr("CTRL_OFF_RESULT") = static_cast(CTRL_OFF_RESULT); + m.attr("CTRL_OFF_ARG3") = static_cast(CTRL_OFF_ARG3); + m.attr("CTRL_OFF_ARG4") = static_cast(CTRL_OFF_ARG4); + m.attr("CTRL_OFF_PAYLOAD") = static_cast(CTRL_OFF_PAYLOAD); + m.attr("CTRL_PAYLOAD_CAPACITY") = static_cast(CTRL_PAYLOAD_CAPACITY); + m.attr("CTRL_SHM_NAME_BYTES") = static_cast(CTRL_SHM_NAME_BYTES); m.attr("MAX_RING_DEPTH") = static_cast(MAX_RING_DEPTH); m.attr("MAX_SCOPE_DEPTH") = static_cast(MAX_SCOPE_DEPTH); diff --git a/python/simpler/orchestrator.py b/python/simpler/orchestrator.py index 29bc84db6..db1570581 100644 --- a/python/simpler/orchestrator.py +++ b/python/simpler/orchestrator.py @@ -175,6 +175,91 @@ def copy_from(self, worker_id: int, dst: int, src: int, size: int) -> None: """Copy *size* bytes from worker *src* to host *dst*.""" self._o.copy_from(int(worker_id), int(dst), int(src), int(size)) + def open_channel( + self, + worker_id: int, + cpu_to_l2_lanes: int = 1, + l2_to_cpu_lanes: int = 1, + lane_depth: int = 64, + max_message_bytes: int = 256, + ) -> int: + """Open a bounded L3/L2 message channel on next-level worker *worker_id*.""" + return int( + self._o.open_channel( + int(worker_id), + int(cpu_to_l2_lanes), + int(l2_to_cpu_lanes), + int(lane_depth), + int(max_message_bytes), + ) + ) + + def close_channel(self, worker_id: int, channel: int) -> None: + """Close a channel returned by ``open_channel``.""" + self._o.close_channel(int(worker_id), int(channel)) + + def channel_send( + self, + worker_id: int, + 
channel: int, + route: int, + data: bytes, + correlation_id: int = 0, + ) -> None: + """Send one inline message from L3 CPU toward L2.""" + self._o.channel_send(int(worker_id), int(channel), int(route), bytes(data), int(correlation_id)) + + def channel_recv(self, worker_id: int, channel: int, capacity: int = 256, timeout_us: int = 0) -> tuple[bytes, int, int]: + """Receive one inline message from L2 toward L3 CPU.""" + data, route, correlation_id = self._o.channel_recv( + int(worker_id), int(channel), int(capacity), int(timeout_us) + ) + return bytes(data), int(route), int(correlation_id) + + def open_shared_memory(self, worker_id: int, data_bytes: int, signal_count: int = 2, flags: int = 0) -> int: + """Open a host/device shared-memory region on next-level worker *worker_id*.""" + return int(self._o.open_shared_memory(int(worker_id), int(data_bytes), int(signal_count), int(flags))) + + def close_shared_memory(self, worker_id: int, memory: int) -> None: + """Close a shared-memory region returned by ``open_shared_memory``.""" + self._o.close_shared_memory(int(worker_id), int(memory)) + + def shared_memory_info(self, worker_id: int, memory: int) -> tuple[int, int, int, int, int]: + """Return ``(host_ptr, device_ptr, data_bytes, signal_count, flags)``. + + ``host_ptr`` is always ``0`` because the L3 parent has no directly + dereferenceable host mapping for chip-child shared memory. + """ + host_ptr, device_ptr, data_bytes, signal_count, flags = self._o.shared_memory_info(int(worker_id), int(memory)) + return int(host_ptr), int(device_ptr), int(data_bytes), int(signal_count), int(flags) + + def shared_memory_read(self, worker_id: int, memory: int, offset: int, nbytes: int) -> bytes: + """Read bytes from a shared-memory data region. + + L3 access chunks large reads through mailbox RPC; it is not a direct + parent-process mapping or streaming data plane. The returned + ``bytes`` materializes the full requested range. 
+ """ + return bytes(self._o.shared_memory_read(int(worker_id), int(memory), int(offset), int(nbytes))) + + def shared_memory_write(self, worker_id: int, memory: int, offset: int, data: bytes) -> None: + """Write bytes into a shared-memory data region. + + L3 access chunks large writes through mailbox RPC; it is not a direct + parent-process mapping or streaming data plane. + """ + self._o.shared_memory_write(int(worker_id), int(memory), int(offset), bytes(data)) + + def shared_memory_notify(self, worker_id: int, memory: int, signal_id: int, value: int) -> None: + """Publish a software signal value for a shared-memory region.""" + self._o.shared_memory_notify(int(worker_id), int(memory), int(signal_id), int(value)) + + def shared_memory_wait( + self, worker_id: int, memory: int, signal_id: int, target: int, timeout_us: int = 0 + ) -> None: + """Wait until a shared-memory software signal reaches ``target``.""" + self._o.shared_memory_wait(int(worker_id), int(memory), int(signal_id), int(target), int(timeout_us)) + def alloc(self, shape: Sequence[int], dtype: DataType) -> ContinuousTensor: """Allocate a runtime-managed intermediate buffer. 
diff --git a/python/simpler/task_interface.py b/python/simpler/task_interface.py index 0acbc11b6..593726ea1 100644 --- a/python/simpler/task_interface.py +++ b/python/simpler/task_interface.py @@ -390,6 +390,110 @@ def copy_from(self, dst, src, size): """Copy *size* bytes from worker *src* to host *dst*.""" self._impl.copy_from(int(dst), int(src), int(size)) + def open_channel( + self, + cpu_to_l2_lanes: int = 1, + l2_to_cpu_lanes: int = 1, + lane_depth: int = 64, + max_message_bytes: int = 256, + flags: int = 0, + ) -> int: + """Open a bounded L3/L2 host-device message channel.""" + return int( + self._impl.open_channel( + int(cpu_to_l2_lanes), + int(l2_to_cpu_lanes), + int(lane_depth), + int(max_message_bytes), + int(flags), + ) + ) + + def close_channel(self, channel: int) -> None: + """Close a channel returned by ``open_channel``.""" + self._impl.close_channel(int(channel)) + + def channel_send( + self, + channel: int, + route: int, + data: bytes, + correlation_id: int = 0, + timeout_us: int = 0, + ) -> None: + """Send one inline message from L3 CPU toward L2.""" + self._impl.channel_send(int(channel), int(route), bytes(data), int(correlation_id), int(timeout_us)) + + def channel_recv(self, channel: int, capacity: int = 256, timeout_us: int = 0) -> tuple[bytes, int, int]: + """Receive one inline message from L2 toward L3 CPU.""" + data, route, correlation_id = self._impl.channel_recv(int(channel), int(capacity), int(timeout_us)) + return bytes(data), int(route), int(correlation_id) + + def channel_send_l2_for_test( + self, + channel: int, + route: int, + data: bytes, + correlation_id: int = 0, + timeout_us: int = 0, + ) -> None: + """Sim/test helper that injects a message from the L2 side.""" + self._impl.channel_send_l2_for_test(int(channel), int(route), bytes(data), int(correlation_id), int(timeout_us)) + + def channel_recv_l2_for_test(self, channel: int, capacity: int = 256, timeout_us: int = 0) -> tuple[bytes, int, int]: + """Sim/test helper that receives a 
CPU-to-L2 message from the L2 side.""" + data, route, correlation_id = self._impl.channel_recv_l2_for_test(int(channel), int(capacity), int(timeout_us)) + return bytes(data), int(route), int(correlation_id) + + def open_shared_memory(self, data_bytes: int, signal_count: int = 2, flags: int = 0) -> int: + """Open a host/device shared-memory region.""" + return int(self._impl.open_shared_memory(int(data_bytes), int(signal_count), int(flags))) + + def close_shared_memory(self, memory: int) -> None: + """Close a shared-memory region returned by ``open_shared_memory``.""" + self._impl.close_shared_memory(int(memory)) + + def shared_memory_info(self, memory: int) -> tuple[int, int, int, int, int]: + """Return ``(host_ptr, device_ptr, data_bytes, signal_count, flags)``. + + ``host_ptr`` is a current-process host address and may be directly + dereferenced only by this process. + """ + host_ptr, device_ptr, data_bytes, signal_count, flags = self._impl.shared_memory_info(int(memory)) + return int(host_ptr), int(device_ptr), int(data_bytes), int(signal_count), int(flags) + + def shared_memory_read(self, memory: int, offset: int, nbytes: int) -> bytes: + """Read bytes from a shared-memory data region.""" + return bytes(self._impl.shared_memory_read(int(memory), int(offset), int(nbytes))) + + def shared_memory_write(self, memory: int, offset: int, data: bytes) -> None: + """Write bytes into a shared-memory data region.""" + self._impl.shared_memory_write(int(memory), int(offset), bytes(data)) + + def shared_memory_notify(self, memory: int, signal_id: int, value: int) -> None: + """Publish a software signal value for a shared-memory region.""" + self._impl.shared_memory_notify(int(memory), int(signal_id), int(value)) + + def shared_memory_wait(self, memory: int, signal_id: int, target: int, timeout_us: int = 0) -> None: + """Wait until a shared-memory software signal reaches ``target``.""" + self._impl.shared_memory_wait(int(memory), int(signal_id), int(target), int(timeout_us)) 
+ + def shared_memory_read_l2_for_test(self, memory: int, offset: int, nbytes: int) -> bytes: + """Sim/test helper that reads the shared-memory region from the L2 side.""" + return bytes(self._impl.shared_memory_read_l2_for_test(int(memory), int(offset), int(nbytes))) + + def shared_memory_write_l2_for_test(self, memory: int, offset: int, data: bytes) -> None: + """Sim/test helper that writes the shared-memory region from the L2 side.""" + self._impl.shared_memory_write_l2_for_test(int(memory), int(offset), bytes(data)) + + def shared_memory_notify_l2_for_test(self, memory: int, signal_id: int, value: int) -> None: + """Sim/test helper that publishes an L2-side software signal.""" + self._impl.shared_memory_notify_l2_for_test(int(memory), int(signal_id), int(value)) + + def shared_memory_wait_l2_for_test(self, memory: int, signal_id: int, target: int, timeout_us: int = 0) -> None: + """Sim/test helper that waits on a shared-memory signal from the L2 side.""" + self._impl.shared_memory_wait_l2_for_test(int(memory), int(signal_id), int(target), int(timeout_us)) + def comm_init(self, rank: int, nranks: int, rootinfo_path: str) -> int: """Initialize a distributed communicator for this rank. 
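Reviewer note on the `task_interface.py` hunk above: the wrappers expose write, notify, wait, and read as separate calls, so ordering is left to the caller. The sketch below shows one plausible handshake (write the data, publish a signal, wait for the peer, read the reply). It runs against a hypothetical in-process stand-in, `FakeSharedMemory`, which is not part of the PR; its short method names mirror the `shared_memory_*` wrappers. Two behaviours are assumptions rather than facts from the diff: `wait` is taken to succeed once the signal value is greater than or equal to `target`, and `timeout_us == 0` is taken to be a single non-blocking probe.

```python
import threading
import time


class FakeSharedMemory:
    """Hypothetical in-process stand-in for one shared-memory region.

    Assumptions (not confirmed by the diff): wait() returns once the signal
    value is >= target, and timeout_us == 0 probes once without blocking.
    """

    def __init__(self, data_bytes: int, signal_count: int = 2) -> None:
        self._data = bytearray(data_bytes)
        self._signals = [0] * signal_count
        self._cond = threading.Condition()

    def write(self, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > len(self._data):
            raise ValueError("write out of range")
        self._data[offset : offset + len(data)] = data

    def read(self, offset: int, nbytes: int) -> bytes:
        if offset < 0 or nbytes < 0 or offset + nbytes > len(self._data):
            raise ValueError("read out of range")
        return bytes(self._data[offset : offset + nbytes])

    def notify(self, signal_id: int, value: int) -> None:
        # Publish the new value and wake every waiter on this region.
        with self._cond:
            self._signals[signal_id] = value
            self._cond.notify_all()

    def wait(self, signal_id: int, target: int, timeout_us: int = 0) -> None:
        deadline = time.monotonic() + timeout_us / 1e6
        with self._cond:
            while self._signals[signal_id] < target:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(f"signal {signal_id} below {target}")
                self._cond.wait(remaining)


def round_trip(payload: bytes) -> bytes:
    """Producer writes a request; a consumer thread replies with it reversed."""
    mem = FakeSharedMemory(data_bytes=64)
    request_signal, response_signal = 0, 1

    def consumer() -> None:
        mem.wait(request_signal, target=1, timeout_us=1_000_000)
        request = mem.read(0, len(payload))
        mem.write(0, request[::-1])
        mem.notify(response_signal, 1)

    worker = threading.Thread(target=consumer)
    worker.start()
    mem.write(0, payload)            # data first
    mem.notify(request_signal, 1)    # then publish the phase change
    mem.wait(response_signal, target=1, timeout_us=1_000_000)
    reply = mem.read(0, len(payload))
    worker.join()
    return reply
```

Writing before notifying is the point of keeping the primitives separate: the signal only says that the bytes are ready, so a caller who notifies first gives the peer no guarantee about what it reads.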
diff --git a/python/simpler/worker.py b/python/simpler/worker.py index dbb9eef70..40a8cb980 100644 --- a/python/simpler/worker.py +++ b/python/simpler/worker.py @@ -67,6 +67,35 @@ def my_l4_orch(orch, args, config): from _task_interface import ( # pyright: ignore[reportMissingImports] CHIP_BOOTSTRAP_MAILBOX_SIZE, + CTRL_CHANNEL_RECV, + CTRL_CHANNEL_SEND, + CTRL_CLOSE_CHANNEL, + CTRL_CLOSE_SHARED_MEMORY, + CTRL_COPY_FROM, + CTRL_COPY_TO, + CTRL_FREE, + CTRL_MALLOC, + CTRL_OFF_ARG0, + CTRL_OFF_ARG1, + CTRL_OFF_ARG2, + CTRL_OFF_ARG3, + CTRL_OFF_ARG4, + CTRL_OFF_PAYLOAD, + CTRL_OFF_RESULT, + CTRL_OPEN_CHANNEL, + CTRL_OPEN_SHARED_MEMORY, + CTRL_PAYLOAD_CAPACITY, + CTRL_PREPARE, + CTRL_REGISTER, + CTRL_SHARED_MEMORY_INFO, + CTRL_SHARED_MEMORY_NOTIFY, + CTRL_SHARED_MEMORY_READ, + CTRL_SHARED_MEMORY_WAIT, + CTRL_SHARED_MEMORY_WRITE, + CTRL_SHM_NAME_BYTES, + CTRL_UNREGISTER, + MAILBOX_ARGS_CAPACITY, + MAILBOX_OFF_ARGS, MAX_REGISTERED_CALLABLE_IDS, ChipBootstrapChannel, ChipBootstrapMailboxState, @@ -118,12 +147,14 @@ def my_l4_orch(orch, args, config): # Args region starts after CONFIG, rounded up to 8 bytes so the first # ContinuousTensor.data (uint64_t at OFF_ARGS+8) is 8-byte aligned, avoiding # SIGBUS on strict-alignment platforms (aarch64 atomics, some ARM cores). -_OFF_ARGS = (_OFF_CONFIG + _CFG_FMT.size + 7) & ~7 +_PY_OFF_ARGS = (_OFF_CONFIG + _CFG_FMT.size + 7) & ~7 +assert _PY_OFF_ARGS == MAILBOX_OFF_ARGS, "CallConfig mailbox layout drifted" +_OFF_ARGS = MAILBOX_OFF_ARGS assert _OFF_ARGS % 8 == 0, "_OFF_ARGS must be 8-aligned for ContinuousTensor.data" -# MAILBOX_ARGS_CAPACITY mirrors the C++ constexpr in worker_manager.h so the -# Python reader can bounds-check incoming args blobs. Source-of-truth for the -# constants on the right is the nanobind binding (cannot drift). 
-_MAILBOX_ARGS_CAPACITY = MAILBOX_SIZE - _OFF_ARGS - MAILBOX_ERROR_MSG_SIZE +# MAILBOX_ARGS_CAPACITY comes from the nanobind binding so the Python reader +# can bounds-check incoming args blobs against the C++ mailbox layout. +_MAILBOX_ARGS_CAPACITY = MAILBOX_ARGS_CAPACITY +assert _MAILBOX_ARGS_CAPACITY == MAILBOX_SIZE - _OFF_ARGS - MAILBOX_ERROR_MSG_SIZE # MAILBOX_OFF_ERROR_MSG / MAILBOX_ERROR_MSG_SIZE come from the C++ # nanobind module so the two sides cannot drift. @@ -135,29 +166,42 @@ def my_l4_orch(orch, args, config): _CONTROL_DONE = 5 # Control sub-commands (written at _OFF_CALLABLE as uint64) -_CTRL_MALLOC = 0 -_CTRL_FREE = 1 -_CTRL_COPY_TO = 2 -_CTRL_COPY_FROM = 3 +_CTRL_MALLOC = CTRL_MALLOC +_CTRL_FREE = CTRL_FREE +_CTRL_COPY_TO = CTRL_COPY_TO +_CTRL_COPY_FROM = CTRL_COPY_FROM # Pre-warm a chip child for cid=arg0 by calling # `prepare_callable(cid, registry[cid])` so the first run() does # not pay the H2D upload cost. Sent from the parent right after init() # (or whenever a new ChipCallable cid is registered). -_CTRL_PREPARE = 4 +_CTRL_PREPARE = CTRL_PREPARE # Dynamic post-init register of a ChipCallable. Parent stages the bytes # in a per-register POSIX shm and writes (cid, shm_name) into the mailbox; # the child mmaps the shm and calls prepare_callable_from_blob(cid, addr). # See docs/callable-ipc-dynamic-register.md for the design. -_CTRL_REGISTER = 5 +_CTRL_REGISTER = CTRL_REGISTER # Symmetric unregister: drop the cid from chip-child state so the AICPU # orch_so_table_ slot can be reused. Payload is just the cid; no shm. 
-_CTRL_UNREGISTER = 6 +_CTRL_UNREGISTER = CTRL_UNREGISTER +_CTRL_OPEN_CHANNEL = CTRL_OPEN_CHANNEL +_CTRL_CLOSE_CHANNEL = CTRL_CLOSE_CHANNEL +_CTRL_CHANNEL_SEND = CTRL_CHANNEL_SEND +_CTRL_CHANNEL_RECV = CTRL_CHANNEL_RECV +_CTRL_OPEN_SHARED_MEMORY = CTRL_OPEN_SHARED_MEMORY +_CTRL_CLOSE_SHARED_MEMORY = CTRL_CLOSE_SHARED_MEMORY +_CTRL_SHARED_MEMORY_INFO = CTRL_SHARED_MEMORY_INFO +_CTRL_SHARED_MEMORY_READ = CTRL_SHARED_MEMORY_READ +_CTRL_SHARED_MEMORY_WRITE = CTRL_SHARED_MEMORY_WRITE +_CTRL_SHARED_MEMORY_NOTIFY = CTRL_SHARED_MEMORY_NOTIFY +_CTRL_SHARED_MEMORY_WAIT = CTRL_SHARED_MEMORY_WAIT +_CTRL_TEST_CHANNEL_SEND_L2 = 0x545354000001 +_CTRL_TEST_CHANNEL_RECV_L2 = 0x545354000002 # Reserved 32-byte region at the start of OFF_ARGS used by _CTRL_REGISTER to # carry the NUL-terminated POSIX shm name. POSIX shm names on Linux are # bounded well below this, but the on-wire field is fixed-width to keep # the layout simple. -_CTRL_SHM_NAME_BYTES = 32 +_CTRL_SHM_NAME_BYTES = CTRL_SHM_NAME_BYTES # Control args layout (reuses task mailbox fields when state == _CONTROL_*): # offset 8 (_OFF_CALLABLE): uint64 sub-command @@ -165,10 +209,14 @@ def my_l4_orch(orch, args, config): # offset 24: uint64 arg1 (host_ptr for copy) # offset 32: uint64 arg2 (nbytes for copy) # offset 40: uint64 result (returned ptr from malloc) -_CTRL_OFF_ARG0 = 16 -_CTRL_OFF_ARG1 = 24 -_CTRL_OFF_ARG2 = 32 -_CTRL_OFF_RESULT = 40 +_CTRL_OFF_ARG0 = CTRL_OFF_ARG0 +_CTRL_OFF_ARG1 = CTRL_OFF_ARG1 +_CTRL_OFF_ARG2 = CTRL_OFF_ARG2 +_CTRL_OFF_RESULT = CTRL_OFF_RESULT +_CTRL_OFF_ARG3 = CTRL_OFF_ARG3 +_CTRL_OFF_ARG4 = CTRL_OFF_ARG4 +_CTRL_OFF_PAYLOAD = CTRL_OFF_PAYLOAD +_CTRL_PAYLOAD_CAPACITY = CTRL_PAYLOAD_CAPACITY def _mailbox_addr(shm: SharedMemory) -> int: @@ -414,6 +462,86 @@ def _run_chip_main_loop( # noqa: PLR0912 -- TASK_READY + 6 control sub-commands # CTRL_PREPARE for the same cid is treated as a fresh # registration (re-runs the H2D upload / AICPU dlopen). 
prepared.discard(int(cid)) + elif sub_cmd == _CTRL_OPEN_CHANNEL: + c2l = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0]) + l2c = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + depth = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + max_bytes = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG3)[0]) + ch = cw.open_channel(c2l, l2c, depth, max_bytes) + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, ch) + elif sub_cmd == _CTRL_CLOSE_CHANNEL: + ch = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + cw.close_channel(ch) + elif sub_cmd == _CTRL_CHANNEL_SEND: + ch = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + route = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + n = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + cid = struct.unpack_from("Q", buf, _CTRL_OFF_ARG3)[0] + cw.channel_send(ch, route, bytes(buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + n]), cid) + elif sub_cmd == _CTRL_CHANNEL_RECV: + ch = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + capacity = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + timeout_us = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + data, route, cid = cw.channel_recv(ch, capacity, timeout_us) + buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + len(data)] = data + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, len(data)) + struct.pack_into("Q", buf, _CTRL_OFF_ARG3, route) + struct.pack_into("Q", buf, _CTRL_OFF_ARG4, cid) + elif sub_cmd == _CTRL_TEST_CHANNEL_SEND_L2: + ch = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + route = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + n = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + cid = struct.unpack_from("Q", buf, _CTRL_OFF_ARG3)[0] + cw.channel_send_l2_for_test(ch, route, bytes(buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + n]), cid) + elif sub_cmd == _CTRL_TEST_CHANNEL_RECV_L2: + ch = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + capacity = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + timeout_us = int(struct.unpack_from("Q", 
buf, _CTRL_OFF_ARG2)[0]) + data, route, cid = cw.channel_recv_l2_for_test(ch, capacity, timeout_us) + buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + len(data)] = data + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, len(data)) + struct.pack_into("Q", buf, _CTRL_OFF_ARG3, route) + struct.pack_into("Q", buf, _CTRL_OFF_ARG4, cid) + elif sub_cmd == _CTRL_OPEN_SHARED_MEMORY: + data_bytes = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0]) + signal_count = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + flags = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + mem = cw.open_shared_memory(data_bytes, signal_count, flags) + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, mem) + elif sub_cmd == _CTRL_CLOSE_SHARED_MEMORY: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + cw.close_shared_memory(mem) + elif sub_cmd == _CTRL_SHARED_MEMORY_INFO: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + _host_ptr, device_ptr, data_bytes, signal_count, flags = cw.shared_memory_info(mem) + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, 0) + struct.pack_into("Q", buf, _CTRL_OFF_ARG1, device_ptr) + struct.pack_into("Q", buf, _CTRL_OFF_ARG2, data_bytes) + struct.pack_into("Q", buf, _CTRL_OFF_ARG3, signal_count) + struct.pack_into("Q", buf, _CTRL_OFF_ARG4, flags) + elif sub_cmd == _CTRL_SHARED_MEMORY_READ: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + offset = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + n = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + data = cw.shared_memory_read(mem, offset, n) + buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + len(data)] = data + struct.pack_into("Q", buf, _CTRL_OFF_RESULT, len(data)) + elif sub_cmd == _CTRL_SHARED_MEMORY_WRITE: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + offset = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + n = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + cw.shared_memory_write(mem, offset, bytes(buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + n])) + elif sub_cmd 
== _CTRL_SHARED_MEMORY_NOTIFY: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + signal_id = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + value = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + cw.shared_memory_notify(mem, signal_id, value) + elif sub_cmd == _CTRL_SHARED_MEMORY_WAIT: + mem = struct.unpack_from("Q", buf, _CTRL_OFF_ARG0)[0] + signal_id = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0]) + target = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0]) + timeout_us = int(struct.unpack_from("Q", buf, _CTRL_OFF_ARG3)[0]) + cw.shared_memory_wait(mem, signal_id, target, timeout_us) except Exception as e: # noqa: BLE001 code = 1 if sub_cmd in (_CTRL_REGISTER, _CTRL_UNREGISTER): @@ -1302,6 +1430,59 @@ def _check_chip_worker_id(self, worker_id: int) -> None: if worker_id < 0 or worker_id >= len(self._chip_shms): raise IndexError(f"worker_id {worker_id} out of range (have {len(self._chip_shms)} chips)") + def _chip_control( + self, worker_id: int, sub_cmd: int, arg0: int = 0, arg1: int = 0, arg2: int = 0, arg3: int = 0 + ) -> int: + """Send a control command without payload. Returns the uint64 result field.""" + return self._chip_control_payload(worker_id, sub_cmd, arg0=arg0, arg1=arg1, arg2=arg2, arg3=arg3)[0] + + def _chip_control_payload( + self, + worker_id: int, + sub_cmd: int, + arg0: int = 0, + arg1: int = 0, + arg2: int = 0, + arg3: int = 0, + payload: bytes = b"", + recv_capacity: int = 0, + ) -> tuple[int, bytes, int, int, int, int]: + """Send a control command with a mailbox payload. 
Returns result, payload, arg1, arg2, arg3, arg4.""" + if worker_id < 0 or worker_id >= len(self._chip_shms): + raise IndexError(f"worker_id {worker_id} out of range (have {len(self._chip_shms)} chips)") + if len(payload) > _CTRL_PAYLOAD_CAPACITY: + raise ValueError(f"control payload too large: {len(payload)} > {_CTRL_PAYLOAD_CAPACITY}") + if recv_capacity > _CTRL_PAYLOAD_CAPACITY: + raise ValueError(f"recv capacity too large: {recv_capacity} > {_CTRL_PAYLOAD_CAPACITY}") + shm = self._chip_shms[worker_id] + buf = shm.buf + assert buf is not None + state_addr = _buffer_field_addr(buf, _OFF_STATE) + _write_error(buf, 0, "") + struct.pack_into("Q", buf, _OFF_CALLABLE, sub_cmd) + struct.pack_into("Q", buf, _CTRL_OFF_ARG0, arg0) + struct.pack_into("Q", buf, _CTRL_OFF_ARG1, arg1) + struct.pack_into("Q", buf, _CTRL_OFF_ARG2, arg2) + struct.pack_into("Q", buf, _CTRL_OFF_ARG3, arg3) + if payload: + buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + len(payload)] = payload + _mailbox_store_i32(state_addr, _CONTROL_REQUEST) + while _mailbox_load_i32(state_addr) != _CONTROL_DONE: + pass + error = struct.unpack_from("i", buf, _OFF_ERROR)[0] + if error != 0: + err_msg = _read_error_msg(buf) + _mailbox_store_i32(state_addr, _IDLE) + raise RuntimeError(f"chip control command {sub_cmd} failed on worker {worker_id}: {err_msg}") + result = struct.unpack_from("Q", buf, _CTRL_OFF_RESULT)[0] + out_arg1 = struct.unpack_from("Q", buf, _CTRL_OFF_ARG1)[0] + out_arg2 = struct.unpack_from("Q", buf, _CTRL_OFF_ARG2)[0] + out_arg3 = struct.unpack_from("Q", buf, _CTRL_OFF_ARG3)[0] + out_arg4 = struct.unpack_from("Q", buf, _CTRL_OFF_ARG4)[0] + out_payload = bytes(buf[_CTRL_OFF_PAYLOAD : _CTRL_OFF_PAYLOAD + min(int(result), recv_capacity)]) + _mailbox_store_i32(state_addr, _IDLE) + return int(result), out_payload, int(out_arg1), int(out_arg2), int(out_arg3), int(out_arg4) + def malloc(self, size: int, worker_id: int = 0) -> int: """Allocate memory on next-level chip worker *worker_id*. 
Returns a pointer.""" if self.level == 2: @@ -1341,6 +1522,208 @@ def copy_from(self, dst: int, src: int, size: int, worker_id: int = 0) -> None: assert self._orch is not None self._orch._impl.copy_from(worker_id, dst, src, size) + def open_channel( + self, + worker_id: int = 0, + cpu_to_l2_lanes: int = 1, + l2_to_cpu_lanes: int = 1, + lane_depth: int = 64, + max_message_bytes: int = 256, + ) -> int: + """Open a bounded L3/L2 message channel on next-level worker *worker_id*.""" + if self.level == 2: + assert self._chip_worker is not None + return self._chip_worker.open_channel(cpu_to_l2_lanes, l2_to_cpu_lanes, lane_depth, max_message_bytes) + return self._chip_control_payload( + worker_id, + _CTRL_OPEN_CHANNEL, + arg0=cpu_to_l2_lanes, + arg1=l2_to_cpu_lanes, + arg2=lane_depth, + arg3=max_message_bytes, + )[0] + + def close_channel(self, channel: int, worker_id: int = 0) -> None: + """Close a channel returned by ``open_channel``.""" + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.close_channel(channel) + return + self._chip_control(worker_id, _CTRL_CLOSE_CHANNEL, arg0=channel) + + def channel_send( + self, + channel: int, + route: int, + data: bytes, + correlation_id: int = 0, + worker_id: int = 0, + ) -> None: + """Send one inline message from L3 CPU toward L2.""" + payload = bytes(data) + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.channel_send(channel, route, payload, correlation_id) + return + self._chip_control_payload( + worker_id, + _CTRL_CHANNEL_SEND, + arg0=channel, + arg1=route, + arg2=len(payload), + arg3=correlation_id, + payload=payload, + ) + + def channel_recv(self, channel: int, capacity: int = 256, timeout_us: int = 0, worker_id: int = 0) -> tuple[bytes, int, int]: + """Receive one inline message from L2 toward L3 CPU.""" + if self.level == 2: + assert self._chip_worker is not None + return self._chip_worker.channel_recv(channel, capacity, timeout_us) + nbytes, payload, _arg1, 
_arg2, route, correlation_id = self._chip_control_payload( + worker_id, + _CTRL_CHANNEL_RECV, + arg0=channel, + arg1=capacity, + arg2=timeout_us, + recv_capacity=capacity, + ) + return payload[:nbytes], route, correlation_id + + def open_shared_memory(self, data_bytes: int, signal_count: int = 2, flags: int = 0, worker_id: int = 0) -> int: + """Open a host/device shared-memory region on next-level worker *worker_id*.""" + if self.level == 2: + assert self._chip_worker is not None + return self._chip_worker.open_shared_memory(data_bytes, signal_count, flags) + return self._chip_control_payload( + worker_id, + _CTRL_OPEN_SHARED_MEMORY, + arg0=data_bytes, + arg1=signal_count, + arg2=flags, + )[0] + + def close_shared_memory(self, memory: int, worker_id: int = 0) -> None: + """Close a shared-memory region returned by ``open_shared_memory``.""" + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.close_shared_memory(memory) + return + self._chip_control(worker_id, _CTRL_CLOSE_SHARED_MEMORY, arg0=memory) + + def shared_memory_info(self, memory: int, worker_id: int = 0) -> tuple[int, int, int, int, int]: + """Return shared-memory metadata. + + For L3 mailbox access, ``host_ptr`` is always ``0`` because the + parent process has no directly dereferenceable host mapping. + """ + if self.level == 2: + assert self._chip_worker is not None + return self._chip_worker.shared_memory_info(memory) + _host_ptr, _payload, device_ptr, data_bytes, signal_count, flags = self._chip_control_payload( + worker_id, _CTRL_SHARED_MEMORY_INFO, arg0=memory + ) + return 0, int(device_ptr), int(data_bytes), int(signal_count), int(flags) + + def shared_memory_read(self, memory: int, offset: int, nbytes: int, worker_id: int = 0) -> bytes: + """Read bytes from a shared-memory data region. + + L3 mailbox access chunks large reads internally; it is still an RPC + copy helper, not a parent-process direct mapping or streaming data + plane. 
The returned ``bytes`` materializes the full requested range. + """ + if self.level == 2: + assert self._chip_worker is not None + return self._chip_worker.shared_memory_read(memory, offset, nbytes) + if nbytes < 0: + raise ValueError("shared_memory_read: nbytes must be non-negative") + if offset < 0: + raise ValueError("shared_memory_read: offset must be non-negative") + _host_ptr, _device_ptr, data_bytes, _signal_count, _flags = self.shared_memory_info(memory, worker_id=worker_id) + if offset > data_bytes or nbytes > data_bytes - offset: + raise ValueError( + f"shared_memory_read out of range: offset={offset}, nbytes={nbytes}, data_bytes={data_bytes}" + ) + out = bytearray() + remaining = int(nbytes) + current_offset = int(offset) + while remaining > 0 or (nbytes == 0 and not out): + chunk = min(remaining, _CTRL_PAYLOAD_CAPACITY) + out_nbytes, payload, _arg1, _arg2, _arg3, _arg4 = self._chip_control_payload( + worker_id, + _CTRL_SHARED_MEMORY_READ, + arg0=memory, + arg1=current_offset, + arg2=chunk, + recv_capacity=chunk, + ) + if int(out_nbytes) != chunk: + raise RuntimeError(f"shared_memory_read short read: expected {chunk}, got {out_nbytes}") + if len(payload) < chunk: + raise RuntimeError(f"shared_memory_read short payload: expected {chunk}, got {len(payload)}") + out.extend(payload[:out_nbytes]) + if nbytes == 0: + break + current_offset += chunk + remaining -= chunk + return bytes(out) + + def shared_memory_write(self, memory: int, offset: int, data: bytes, worker_id: int = 0) -> None: + """Write bytes into a shared-memory data region. + + L3 mailbox access chunks large writes internally; it is still an RPC + copy helper, not a parent-process direct mapping. 
+ """ + payload = bytes(data) + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.shared_memory_write(memory, offset, payload) + return + if len(payload) == 0: + self._chip_control_payload( + worker_id, + _CTRL_SHARED_MEMORY_WRITE, + arg0=memory, + arg1=offset, + arg2=0, + payload=b"", + ) + return + current_offset = int(offset) + written = 0 + while written < len(payload): + chunk = payload[written : written + _CTRL_PAYLOAD_CAPACITY] + self._chip_control_payload( + worker_id, + _CTRL_SHARED_MEMORY_WRITE, + arg0=memory, + arg1=current_offset, + arg2=len(chunk), + payload=chunk, + ) + current_offset += len(chunk) + written += len(chunk) + + def shared_memory_notify(self, memory: int, signal_id: int, value: int, worker_id: int = 0) -> None: + """Publish a software signal value for a shared-memory region.""" + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.shared_memory_notify(memory, signal_id, value) + return + self._chip_control(worker_id, _CTRL_SHARED_MEMORY_NOTIFY, arg0=memory, arg1=signal_id, arg2=value) + + def shared_memory_wait( + self, memory: int, signal_id: int, target: int, timeout_us: int = 0, worker_id: int = 0 + ) -> None: + """Wait until a shared-memory software signal reaches ``target``.""" + if self.level == 2: + assert self._chip_worker is not None + self._chip_worker.shared_memory_wait(memory, signal_id, target, timeout_us) + return + self._chip_control( + worker_id, _CTRL_SHARED_MEMORY_WAIT, arg0=memory, arg1=signal_id, arg2=target, arg3=timeout_us + ) + # ------------------------------------------------------------------ # run — uniform entry point # ------------------------------------------------------------------ diff --git a/src/a2a3/platform/onboard/host/CMakeLists.txt b/src/a2a3/platform/onboard/host/CMakeLists.txt index cdcbec780..3a9e2fdb3 100644 --- a/src/a2a3/platform/onboard/host/CMakeLists.txt +++ b/src/a2a3/platform/onboard/host/CMakeLists.txt @@ -57,6 +57,8 @@ list(APPEND 
HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/comm_hccl.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/pmu_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/dep_gen_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_channel.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_memory.cpp" ) if(DEFINED CUSTOM_SOURCE_DIRS) foreach(SRC_DIR ${CUSTOM_SOURCE_DIRS}) diff --git a/src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp b/src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp index 4b1e6e107..a1a20f7a3 100644 --- a/src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp +++ b/src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp @@ -21,6 +21,8 @@ #include "prepare_callable_common.h" #include "task_args.h" +#include +#include #include #include @@ -29,6 +31,8 @@ #include "common/unified_log.h" #include "device_runner.h" +#include "host_device_channel.h" +#include "host_device_memory.h" #include "host_log.h" #include "host/raii_scope_guard.h" #include "runtime.h" @@ -38,6 +42,31 @@ // time against libunified_dlog.so / libascendalog.so. extern "C" int dlog_setlevel(int moduleId, int level, int enableEvent); +namespace { +void *g_hdch_hal_handle = nullptr; +constexpr unsigned int HDCH_DEV_SVM_MAP_HOST = 2U; + +using HdchHalHostRegisterFn = int (*)(void *dev_ptr, size_t size, unsigned int flags, int device_id, void **host_ptr); +using HdchHalHostUnregisterFn = int (*)(void *host_ptr, int device_id); + +int hdch_load_hal_if_needed() { + if (g_hdch_hal_handle != nullptr) return 0; + g_hdch_hal_handle = dlopen("libascend_hal.so", RTLD_NOW | RTLD_LOCAL); + return g_hdch_hal_handle == nullptr ? 
-1 : 0; +} + +HdchHalHostRegisterFn hdch_get_halHostRegister() { + if (g_hdch_hal_handle == nullptr) return nullptr; + return reinterpret_cast<HdchHalHostRegisterFn>(dlsym(g_hdch_hal_handle, "halHostRegister")); +} + +HdchHalHostUnregisterFn hdch_get_halHostUnregister() { + if (g_hdch_hal_handle == nullptr) return nullptr; + return reinterpret_cast<HdchHalHostUnregisterFn>(dlsym(g_hdch_hal_handle, "halHostUnregister")); +} + +} // namespace + extern "C" { /* =========================================================================== @@ -190,6 +219,194 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +HostDeviceChannelHandle open_host_device_channel_ctx(DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg) { + if (ctx == NULL) { + errno = EINVAL; + return NULL; + } + size_t bytes = host_device_channel_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + auto *runner = static_cast<DeviceRunner *>(ctx); + void *dev_ptr = nullptr; + void *host_ptr = nullptr; + try { + dev_ptr = runner->allocate_tensor(bytes); + if (dev_ptr == nullptr) { + errno = ENOMEM; + return NULL; + } + if (hdch_load_hal_if_needed() != 0) { + runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } + HdchHalHostRegisterFn fn = hdch_get_halHostRegister(); + if (fn == nullptr || fn(dev_ptr, bytes, HDCH_DEV_SVM_MAP_HOST, runner->device_id(), &host_ptr) != 0 || + host_ptr == nullptr) { + runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } + HostDeviceChannel *ch = host_device_channel_wrap(dev_ptr, host_ptr, bytes, cfg, 0, nullptr); + if (ch == nullptr) { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr) unreg(host_ptr, runner->device_id()); + runner->free_tensor(dev_ptr); + errno = ENOMEM; + return NULL; + } + return static_cast<HostDeviceChannelHandle>(ch); + } catch (...) 
{ + if (host_ptr != nullptr) { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr) unreg(host_ptr, runner->device_id()); + } + if (dev_ptr != nullptr) runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } +} + +int close_host_device_channel_ctx(DeviceContextHandle ctx, HostDeviceChannelHandle raw_ch) { + if (ctx == NULL || raw_ch == NULL) return HDCH_ERR_INVALID; + auto *runner = static_cast<DeviceRunner *>(ctx); + auto *ch = static_cast<HostDeviceChannel *>(raw_ch); + try { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr && ch->host_base != nullptr) { + unreg(ch->host_base, runner->device_id()); + } + void *dev_ptr = ch->device_base; + host_device_channel_destroy(ch); + if (dev_ptr != nullptr) runner->free_tensor(dev_ptr); + return HDCH_OK; + } catch (...) { + return HDCH_ERR_BACKEND; + } +} + +int host_device_send_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, uint32_t route, const void *data, size_t nbytes, + uint64_t correlation_id, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_send_cpu( + static_cast<HostDeviceChannel *>(ch), route, data, nbytes, correlation_id, timeout_us + ); +} + +int host_device_recv_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, void *dst, size_t dst_capacity, size_t *out_nbytes, + uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_recv_cpu( + static_cast<HostDeviceChannel *>(ch), dst, dst_capacity, out_nbytes, out_correlation_id, out_route, timeout_us + ); +} +HostDeviceMemoryHandle open_host_device_memory_ctx(DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg) { + if (ctx == NULL) { + errno = EINVAL; + return NULL; + } + size_t bytes = host_device_memory_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + auto *runner = static_cast<DeviceRunner *>(ctx); + void *dev_ptr = nullptr; + void *host_ptr = nullptr; + try { + dev_ptr = runner->allocate_tensor(bytes); + if (dev_ptr == nullptr) { + errno = 
ENOMEM; + return NULL; + } + if (hdch_load_hal_if_needed() != 0) { + runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } + HdchHalHostRegisterFn fn = hdch_get_halHostRegister(); + if (fn == nullptr || fn(dev_ptr, bytes, HDCH_DEV_SVM_MAP_HOST, runner->device_id(), &host_ptr) != 0 || + host_ptr == nullptr) { + runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } + HostDeviceMemory *mem = host_device_memory_wrap(dev_ptr, host_ptr, bytes, cfg, 0, nullptr); + if (mem == nullptr) { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr) unreg(host_ptr, runner->device_id()); + runner->free_tensor(dev_ptr); + errno = ENOMEM; + return NULL; + } + return static_cast<HostDeviceMemoryHandle>(mem); + } catch (...) { + if (host_ptr != nullptr) { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr) unreg(host_ptr, runner->device_id()); + } + if (dev_ptr != nullptr) runner->free_tensor(dev_ptr); + errno = EIO; + return NULL; + } +} + +int close_host_device_memory_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle raw_mem) { + if (ctx == NULL || raw_mem == NULL) return HDMEM_ERR_INVALID; + auto *runner = static_cast<DeviceRunner *>(ctx); + auto *mem = static_cast<HostDeviceMemory *>(raw_mem); + try { + HdchHalHostUnregisterFn unreg = hdch_get_halHostUnregister(); + if (unreg != nullptr && mem->host_base != nullptr) { + unreg(mem->host_base, runner->device_id()); + } + void *dev_ptr = mem->device_base; + host_device_memory_destroy(mem); + if (dev_ptr != nullptr) runner->free_tensor(dev_ptr); + return HDMEM_OK; + } catch (...) 
{ + return HDMEM_ERR_BACKEND; + } +} + +int host_device_memory_info_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, HostDeviceMemoryInfo *info) { + (void)ctx; + return host_device_memory_info(static_cast<HostDeviceMemory *>(mem), info); +} + +int host_device_memory_read_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, void *dst, size_t nbytes +) { + (void)ctx; + return host_device_memory_read(static_cast<HostDeviceMemory *>(mem), offset, dst, nbytes); +} + +int host_device_memory_write_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, const void *src, size_t nbytes +) { + (void)ctx; + return host_device_memory_write(static_cast<HostDeviceMemory *>(mem), offset, src, nbytes); +} + +int host_device_memory_notify_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t value) { + (void)ctx; + return host_device_memory_notify(static_cast<HostDeviceMemory *>(mem), signal_id, value); +} + +int host_device_memory_wait_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us +) { + (void)ctx; + return host_device_memory_wait(static_cast<HostDeviceMemory *>(mem), signal_id, target, timeout_us); +} + int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { diff --git a/src/a2a3/platform/sim/host/CMakeLists.txt b/src/a2a3/platform/sim/host/CMakeLists.txt index bf229fb2f..27eb70f7c 100644 --- a/src/a2a3/platform/sim/host/CMakeLists.txt +++ b/src/a2a3/platform/sim/host/CMakeLists.txt @@ -48,6 +48,8 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/dep_gen_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../aicpu/platform_aicpu_affinity.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform_comm/comm_sim.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_channel.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_memory.cpp" ) if(DEFINED CUSTOM_SOURCE_DIRS) diff --git a/src/a2a3/platform/sim/host/pto_runtime_c_api.cpp 
b/src/a2a3/platform/sim/host/pto_runtime_c_api.cpp index 76b8f4f93..a676fdbab 100644 --- a/src/a2a3/platform/sim/host/pto_runtime_c_api.cpp +++ b/src/a2a3/platform/sim/host/pto_runtime_c_api.cpp @@ -21,7 +21,9 @@ #include "prepare_callable_common.h" #include "task_args.h" +#include #include +#include #include #include @@ -31,6 +33,8 @@ #include "host_log.h" #include "cpu_sim_context.h" #include "device_runner.h" +#include "host_device_channel.h" +#include "host_device_memory.h" #include "runtime.h" extern "C" { @@ -152,6 +156,112 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +HostDeviceChannelHandle open_host_device_channel_ctx(DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg) { + (void)ctx; + size_t bytes = host_device_channel_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + void *base = NULL; + int rc = posix_memalign(&base, 64, bytes); + if (rc != 0) { + errno = rc; + return NULL; + } + HostDeviceChannel *ch = host_device_channel_wrap(base, base, bytes, cfg, 1, free); + if (ch == nullptr) { + free(base); + errno = ENOMEM; + return NULL; + } + return static_cast(ch); +} + +int close_host_device_channel_ctx(DeviceContextHandle ctx, HostDeviceChannelHandle ch) { + (void)ctx; + host_device_channel_destroy(static_cast(ch)); + return HDCH_OK; +} + +int host_device_send_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, uint32_t route, const void *data, size_t nbytes, + uint64_t correlation_id, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_send_cpu( + static_cast(ch), route, data, nbytes, correlation_id, timeout_us + ); +} + +int host_device_recv_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, void *dst, size_t dst_capacity, size_t *out_nbytes, + uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_recv_cpu( + static_cast(ch), dst, dst_capacity, out_nbytes, out_correlation_id, 
out_route, timeout_us + ); +} +HostDeviceMemoryHandle open_host_device_memory_ctx(DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg) { + (void)ctx; + size_t bytes = host_device_memory_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + void *base = NULL; + int rc = posix_memalign(&base, 64, bytes); + if (rc != 0) { + errno = rc; + return NULL; + } + HostDeviceMemory *mem = host_device_memory_wrap(base, base, bytes, cfg, 1, free); + if (mem == nullptr) { + free(base); + errno = ENOMEM; + return NULL; + } + return static_cast<HostDeviceMemoryHandle>(mem); +} + +int close_host_device_memory_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem) { + (void)ctx; + host_device_memory_destroy(static_cast<HostDeviceMemory *>(mem)); + return HDMEM_OK; +} + +int host_device_memory_info_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, HostDeviceMemoryInfo *info) { + (void)ctx; + return host_device_memory_info(static_cast<HostDeviceMemory *>(mem), info); +} + +int host_device_memory_read_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, void *dst, size_t nbytes +) { + (void)ctx; + return host_device_memory_read(static_cast<HostDeviceMemory *>(mem), offset, dst, nbytes); +} + +int host_device_memory_write_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, const void *src, size_t nbytes +) { + (void)ctx; + return host_device_memory_write(static_cast<HostDeviceMemory *>(mem), offset, src, nbytes); +} + +int host_device_memory_notify_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t value) { + (void)ctx; + return host_device_memory_notify(static_cast<HostDeviceMemory *>(mem), signal_id, value); +} + +int host_device_memory_wait_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us +) { + (void)ctx; + return host_device_memory_wait(static_cast<HostDeviceMemory *>(mem), signal_id, target, timeout_us); +} + int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { diff --git
a/src/a5/platform/onboard/host/CMakeLists.txt b/src/a5/platform/onboard/host/CMakeLists.txt index 0232f9b4a..eb33d152b 100644 --- a/src/a5/platform/onboard/host/CMakeLists.txt +++ b/src/a5/platform/onboard/host/CMakeLists.txt @@ -42,6 +42,8 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/l2_perf_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/pmu_collector.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/../../src/host/tensor_dump_collector.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_channel.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_memory.cpp" ) if(DEFINED CUSTOM_SOURCE_DIRS) foreach(SRC_DIR ${CUSTOM_SOURCE_DIRS}) diff --git a/src/a5/platform/onboard/host/pto_runtime_c_api.cpp b/src/a5/platform/onboard/host/pto_runtime_c_api.cpp index ec5b0ebb7..2b514f423 100644 --- a/src/a5/platform/onboard/host/pto_runtime_c_api.cpp +++ b/src/a5/platform/onboard/host/pto_runtime_c_api.cpp @@ -21,6 +21,7 @@ #include "prepare_callable_common.h" #include "task_args.h" +#include #include #include @@ -29,6 +30,8 @@ #include "common/unified_log.h" #include "device_runner.h" +#include "host_device_channel.h" +#include "host_device_memory.h" #include "host_log.h" #include "host/raii_scope_guard.h" #include "runtime.h" @@ -157,6 +160,108 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +HostDeviceChannelHandle open_host_device_channel_ctx(DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg) { + (void)ctx; + (void)cfg; + errno = ENOTSUP; + return NULL; +} + +int close_host_device_channel_ctx(DeviceContextHandle ctx, HostDeviceChannelHandle ch) { + (void)ctx; + (void)ch; + return HDCH_ERR_BACKEND; +} + +int host_device_send_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, uint32_t route, const void *data, size_t nbytes, + uint64_t correlation_id, uint32_t timeout_us +) { + (void)ctx; + (void)ch; + (void)route; + (void)data; + (void)nbytes; + 
(void)correlation_id; + (void)timeout_us; + return HDCH_ERR_BACKEND; +} + +int host_device_recv_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, void *dst, size_t dst_capacity, size_t *out_nbytes, + uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +) { + (void)ctx; + (void)ch; + (void)dst; + (void)dst_capacity; + (void)out_nbytes; + (void)out_correlation_id; + (void)out_route; + (void)timeout_us; + return HDCH_ERR_BACKEND; +} +HostDeviceMemoryHandle open_host_device_memory_ctx(DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg) { + (void)ctx; + (void)cfg; + errno = ENOTSUP; + return NULL; +} + +int close_host_device_memory_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem) { + (void)ctx; + (void)mem; + return HDMEM_ERR_BACKEND; +} + +int host_device_memory_info_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, HostDeviceMemoryInfo *info) { + (void)ctx; + (void)mem; + (void)info; + return HDMEM_ERR_BACKEND; +} + +int host_device_memory_read_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, void *dst, size_t nbytes +) { + (void)ctx; + (void)mem; + (void)offset; + (void)dst; + (void)nbytes; + return HDMEM_ERR_BACKEND; +} + +int host_device_memory_write_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, const void *src, size_t nbytes +) { + (void)ctx; + (void)mem; + (void)offset; + (void)src; + (void)nbytes; + return HDMEM_ERR_BACKEND; +} + +int host_device_memory_notify_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t value) { + (void)ctx; + (void)mem; + (void)signal_id; + (void)value; + return HDMEM_ERR_BACKEND; +} + +int host_device_memory_wait_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us +) { + (void)ctx; + (void)mem; + (void)signal_id; + (void)target; + (void)timeout_us; + return HDMEM_ERR_BACKEND; +} + int finalize_device(DeviceContextHandle ctx) { if 
(ctx == NULL) return -1; try { diff --git a/src/a5/platform/sim/host/CMakeLists.txt b/src/a5/platform/sim/host/CMakeLists.txt index 3161f4f74..3415950bf 100644 --- a/src/a5/platform/sim/host/CMakeLists.txt +++ b/src/a5/platform/sim/host/CMakeLists.txt @@ -49,6 +49,8 @@ list(APPEND HOST_RUNTIME_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/../aicpu/platform_aicpu_affinity.cpp" # Shared POSIX-shm sim comm backend (same source as a2a3 sim). "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/platform_comm/comm_sim.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_channel.cpp" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../common/worker/host_device_memory.cpp" ) if(DEFINED CUSTOM_SOURCE_DIRS) diff --git a/src/a5/platform/sim/host/pto_runtime_c_api.cpp b/src/a5/platform/sim/host/pto_runtime_c_api.cpp index 5f5099142..05f949d89 100644 --- a/src/a5/platform/sim/host/pto_runtime_c_api.cpp +++ b/src/a5/platform/sim/host/pto_runtime_c_api.cpp @@ -21,7 +21,9 @@ #include "prepare_callable_common.h" #include "task_args.h" +#include #include +#include #include #include @@ -31,6 +33,8 @@ #include "host_log.h" #include "cpu_sim_context.h" #include "device_runner.h" +#include "host_device_channel.h" +#include "host_device_memory.h" #include "runtime.h" extern "C" { @@ -152,6 +156,112 @@ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *de } } +HostDeviceChannelHandle open_host_device_channel_ctx(DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg) { + (void)ctx; + size_t bytes = host_device_channel_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + void *base = NULL; + int rc = posix_memalign(&base, 64, bytes); + if (rc != 0) { + errno = rc; + return NULL; + } + HostDeviceChannel *ch = host_device_channel_wrap(base, base, bytes, cfg, 1, free); + if (ch == nullptr) { + free(base); + errno = ENOMEM; + return NULL; + } + return static_cast(ch); +} + +int close_host_device_channel_ctx(DeviceContextHandle ctx, 
HostDeviceChannelHandle ch) { + (void)ctx; + host_device_channel_destroy(static_cast<HostDeviceChannel *>(ch)); + return HDCH_OK; +} + +int host_device_send_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, uint32_t route, const void *data, size_t nbytes, + uint64_t correlation_id, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_send_cpu( + static_cast<HostDeviceChannel *>(ch), route, data, nbytes, correlation_id, timeout_us + ); +} + +int host_device_recv_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, void *dst, size_t dst_capacity, size_t *out_nbytes, + uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +) { + (void)ctx; + return host_device_channel_recv_cpu( + static_cast<HostDeviceChannel *>(ch), dst, dst_capacity, out_nbytes, out_correlation_id, out_route, timeout_us + ); +} +HostDeviceMemoryHandle open_host_device_memory_ctx(DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg) { + (void)ctx; + size_t bytes = host_device_memory_required_bytes(cfg); + if (bytes == 0) { + errno = EINVAL; + return NULL; + } + void *base = NULL; + int rc = posix_memalign(&base, 64, bytes); + if (rc != 0) { + errno = rc; + return NULL; + } + HostDeviceMemory *mem = host_device_memory_wrap(base, base, bytes, cfg, 1, free); + if (mem == nullptr) { + free(base); + errno = ENOMEM; + return NULL; + } + return static_cast<HostDeviceMemoryHandle>(mem); +} + +int close_host_device_memory_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem) { + (void)ctx; + host_device_memory_destroy(static_cast<HostDeviceMemory *>(mem)); + return HDMEM_OK; +} + +int host_device_memory_info_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, HostDeviceMemoryInfo *info) { + (void)ctx; + return host_device_memory_info(static_cast<HostDeviceMemory *>(mem), info); +} + +int host_device_memory_read_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, void *dst, size_t nbytes +) { + (void)ctx; + return host_device_memory_read(static_cast<HostDeviceMemory *>(mem), offset, dst, nbytes); +} + +int host_device_memory_write_ctx( + DeviceContextHandle ctx,
HostDeviceMemoryHandle mem, uint64_t offset, const void *src, size_t nbytes +) { + (void)ctx; + return host_device_memory_write(static_cast(mem), offset, src, nbytes); +} + +int host_device_memory_notify_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t value) { + (void)ctx; + return host_device_memory_notify(static_cast(mem), signal_id, value); +} + +int host_device_memory_wait_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us +) { + (void)ctx; + return host_device_memory_wait(static_cast(mem), signal_id, target, timeout_us); +} + int finalize_device(DeviceContextHandle ctx) { if (ctx == NULL) return -1; try { diff --git a/src/common/hierarchical/orchestrator.cpp b/src/common/hierarchical/orchestrator.cpp index 5a6e710f9..8a278d4a3 100644 --- a/src/common/hierarchical/orchestrator.cpp +++ b/src/common/hierarchical/orchestrator.cpp @@ -54,6 +54,80 @@ void Orchestrator::copy_from(int worker_id, uint64_t dst, uint64_t src, size_t s wt->control_copy_from(dst, src, size); } +uint64_t Orchestrator::open_channel( + int worker_id, uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, uint32_t lane_depth, + uint32_t max_message_bytes +) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::open_channel: invalid worker_id"); + return wt->control_open_channel(cpu_to_l2_lanes, l2_to_cpu_lanes, lane_depth, max_message_bytes); +} + +void Orchestrator::close_channel(int worker_id, uint64_t ch) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::close_channel: invalid worker_id"); + wt->control_close_channel(ch); +} + +void Orchestrator::channel_send( + int worker_id, uint64_t ch, uint32_t route, const std::vector &data, uint64_t correlation_id +) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw 
std::runtime_error("Orchestrator::channel_send: invalid worker_id"); + wt->control_channel_send(ch, route, data.data(), data.size(), correlation_id); +} + +std::vector Orchestrator::channel_recv( + int worker_id, uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *route, uint64_t *correlation_id +) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::channel_recv: invalid worker_id"); + return wt->control_channel_recv(ch, capacity, timeout_us, route, correlation_id); +} + + +uint64_t Orchestrator::open_shared_memory(int worker_id, uint64_t data_bytes, uint32_t signal_count, uint32_t flags) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::open_shared_memory: invalid worker_id"); + return wt->control_open_shared_memory(data_bytes, signal_count, flags); +} + +void Orchestrator::close_shared_memory(int worker_id, uint64_t mem) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::close_shared_memory: invalid worker_id"); + wt->control_close_shared_memory(mem); +} + +HostDeviceMemoryInfo Orchestrator::shared_memory_info(int worker_id, uint64_t mem) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::shared_memory_info: invalid worker_id"); + return wt->control_shared_memory_info(mem); +} + +std::vector Orchestrator::shared_memory_read(int worker_id, uint64_t mem, uint64_t offset, size_t nbytes) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::shared_memory_read: invalid worker_id"); + return wt->control_shared_memory_read(mem, offset, nbytes); +} + +void Orchestrator::shared_memory_write(int worker_id, uint64_t mem, uint64_t offset, const std::vector &data) { + auto *wt = 
manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::shared_memory_write: invalid worker_id"); + wt->control_shared_memory_write(mem, offset, data.data(), data.size()); +} + +void Orchestrator::shared_memory_notify(int worker_id, uint64_t mem, uint32_t signal_id, uint64_t value) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::shared_memory_notify: invalid worker_id"); + wt->control_shared_memory_notify(mem, signal_id, value); +} + +void Orchestrator::shared_memory_wait(int worker_id, uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + auto *wt = manager_->get_worker(WorkerType::NEXT_LEVEL, worker_id); + if (!wt) throw std::runtime_error("Orchestrator::shared_memory_wait: invalid worker_id"); + wt->control_shared_memory_wait(mem, signal_id, target, timeout_us); +} + TaskSlotState &Orchestrator::slot_state(TaskSlot s) { TaskSlotState *p = allocator_->slot_state(s); if (!p) throw std::runtime_error("Orchestrator::slot_state: invalid slot id"); diff --git a/src/common/hierarchical/orchestrator.h b/src/common/hierarchical/orchestrator.h index f8abdb424..e46ad341e 100644 --- a/src/common/hierarchical/orchestrator.h +++ b/src/common/hierarchical/orchestrator.h @@ -45,6 +45,7 @@ #include "scope.h" #include "tensormap.h" #include "types.h" +#include "../worker/pto_runtime_c_api.h" class WorkerManager; @@ -91,6 +92,21 @@ class Orchestrator { void free(int worker_id, uint64_t ptr); void copy_to(int worker_id, uint64_t dst, uint64_t src, size_t size); void copy_from(int worker_id, uint64_t dst, uint64_t src, size_t size); + uint64_t open_channel( + int worker_id, uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, uint32_t lane_depth, + uint32_t max_message_bytes + ); + void close_channel(int worker_id, uint64_t ch); + void channel_send(int worker_id, uint64_t ch, uint32_t route, const std::vector &data, uint64_t 
correlation_id); + std::vector + channel_recv(int worker_id, uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *route, uint64_t *correlation_id); + uint64_t open_shared_memory(int worker_id, uint64_t data_bytes, uint32_t signal_count, uint32_t flags); + void close_shared_memory(int worker_id, uint64_t mem); + HostDeviceMemoryInfo shared_memory_info(int worker_id, uint64_t mem); + std::vector shared_memory_read(int worker_id, uint64_t mem, uint64_t offset, size_t nbytes); + void shared_memory_write(int worker_id, uint64_t mem, uint64_t offset, const std::vector &data); + void shared_memory_notify(int worker_id, uint64_t mem, uint32_t signal_id, uint64_t value); + void shared_memory_wait(int worker_id, uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us); // Submit a NEXT_LEVEL task. `callable_id` is a cid registered via // Worker.register(): the chip child looks it up in its COW-inherited diff --git a/src/common/hierarchical/worker_manager.cpp b/src/common/hierarchical/worker_manager.cpp index 3f81e7eec..5f9c7425a 100644 --- a/src/common/hierarchical/worker_manager.cpp +++ b/src/common/hierarchical/worker_manager.cpp @@ -16,6 +16,7 @@ #include #include +#include #include #include #include @@ -320,11 +321,15 @@ WorkerThread *WorkerManager::pick_idle_excluding(WorkerType type, const std::vec // WorkerThread — memory control (orch thread, concurrent with worker thread) // ============================================================================= -static void write_control_args(char *mbox, uint64_t sub_cmd, uint64_t a0 = 0, uint64_t a1 = 0, uint64_t a2 = 0) { +static void write_control_args( + char *mbox, uint64_t sub_cmd, uint64_t a0 = 0, uint64_t a1 = 0, uint64_t a2 = 0, uint64_t a3 = 0, uint64_t a4 = 0 +) { std::memcpy(mbox + MAILBOX_OFF_CALLABLE, &sub_cmd, sizeof(uint64_t)); std::memcpy(mbox + CTRL_OFF_ARG0, &a0, sizeof(uint64_t)); std::memcpy(mbox + CTRL_OFF_ARG1, &a1, sizeof(uint64_t)); std::memcpy(mbox + CTRL_OFF_ARG2, &a2, 
sizeof(uint64_t)); + std::memcpy(mbox + CTRL_OFF_ARG3, &a3, sizeof(uint64_t)); + std::memcpy(mbox + CTRL_OFF_ARG4, &a4, sizeof(uint64_t)); } static uint64_t read_control_result(const char *mbox) { @@ -410,6 +415,147 @@ void WorkerThread::control_copy_from(uint64_t dst, uint64_t src, size_t size) { run_control_command("control_copy_from"); } +uint64_t WorkerThread::control_open_channel( + uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, uint32_t lane_depth, uint32_t max_message_bytes +) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_OPEN_CHANNEL, cpu_to_l2_lanes, l2_to_cpu_lanes, lane_depth, max_message_bytes); + run_control_command("control_open_channel"); + return read_control_result(mbox()); +} + +void WorkerThread::control_close_channel(uint64_t ch) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_CLOSE_CHANNEL, ch); + run_control_command("control_close_channel"); +} + +void WorkerThread::control_channel_send( + uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id +) { + if (data == nullptr && size != 0) throw std::invalid_argument("control_channel_send: data is null"); + if (size > CTRL_PAYLOAD_CAPACITY) throw std::invalid_argument("control_channel_send: payload too large"); + std::lock_guard lk(mailbox_mu_); + if (size != 0) std::memcpy(mbox() + CTRL_OFF_PAYLOAD, data, size); + write_control_args(mbox(), CTRL_CHANNEL_SEND, ch, route, size, correlation_id); + run_control_command("control_channel_send"); +} + +std::vector WorkerThread::control_channel_recv( + uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *route, uint64_t *correlation_id +) { + if (capacity > CTRL_PAYLOAD_CAPACITY) throw std::invalid_argument("control_channel_recv: capacity too large"); + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_CHANNEL_RECV, ch, capacity, timeout_us); + run_control_command("control_channel_recv"); + uint64_t nbytes = read_control_result(mbox()); + uint64_t r = 
0; + uint64_t cid = 0; + std::memcpy(&r, mbox() + CTRL_OFF_ARG3, sizeof(uint64_t)); + std::memcpy(&cid, mbox() + CTRL_OFF_ARG4, sizeof(uint64_t)); + std::vector out(static_cast(nbytes)); + if (nbytes != 0) std::memcpy(out.data(), mbox() + CTRL_OFF_PAYLOAD, static_cast(nbytes)); + if (route != nullptr) *route = static_cast(r); + if (correlation_id != nullptr) *correlation_id = cid; + return out; +} + +uint64_t WorkerThread::control_open_shared_memory(uint64_t data_bytes, uint32_t signal_count, uint32_t flags) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_OPEN_SHARED_MEMORY, data_bytes, signal_count, flags); + run_control_command("control_open_shared_memory"); + return read_control_result(mbox()); +} + +void WorkerThread::control_close_shared_memory(uint64_t mem) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_CLOSE_SHARED_MEMORY, mem); + run_control_command("control_close_shared_memory"); +} + +HostDeviceMemoryInfo WorkerThread::control_shared_memory_info(uint64_t mem) { + std::lock_guard lk(mailbox_mu_); + return control_shared_memory_info_locked(mem); +} + +HostDeviceMemoryInfo WorkerThread::control_shared_memory_info_locked(uint64_t mem) { + write_control_args(mbox(), CTRL_SHARED_MEMORY_INFO, mem); + run_control_command("control_shared_memory_info"); + HostDeviceMemoryInfo info{}; + info.host_ptr = 0; + std::memcpy(&info.device_ptr, mbox() + CTRL_OFF_ARG1, sizeof(uint64_t)); + std::memcpy(&info.data_bytes, mbox() + CTRL_OFF_ARG2, sizeof(uint64_t)); + uint64_t signal_count = 0; + uint64_t flags = 0; + std::memcpy(&signal_count, mbox() + CTRL_OFF_ARG3, sizeof(uint64_t)); + std::memcpy(&flags, mbox() + CTRL_OFF_ARG4, sizeof(uint64_t)); + info.signal_count = static_cast(signal_count); + info.flags = static_cast(flags); + return info; +} + +void WorkerThread::validate_shared_memory_read_range_locked(uint64_t mem, uint64_t offset, size_t nbytes) { + HostDeviceMemoryInfo info = control_shared_memory_info_locked(mem); + if 
(offset > info.data_bytes || nbytes > info.data_bytes - offset) { + throw std::out_of_range( + "control_shared_memory_read out of range: offset=" + std::to_string(offset) + + ", nbytes=" + std::to_string(nbytes) + ", data_bytes=" + std::to_string(info.data_bytes) + ); + } +} + +std::vector WorkerThread::control_shared_memory_read(uint64_t mem, uint64_t offset, size_t nbytes) { + std::lock_guard lk(mailbox_mu_); + // L3 shared-memory reads are chunked mailbox RPC copies. Validate the + // logical range before materializing the full return vector so invalid + // large reads fail clearly instead of attempting an unnecessary allocation. + validate_shared_memory_read_range_locked(mem, offset, nbytes); + std::vector out(nbytes); + size_t copied = 0; + do { + size_t chunk = std::min(nbytes - copied, CTRL_PAYLOAD_CAPACITY); + write_control_args(mbox(), CTRL_SHARED_MEMORY_READ, mem, offset + copied, static_cast(chunk)); + run_control_command("control_shared_memory_read"); + uint64_t out_nbytes = read_control_result(mbox()); + if (out_nbytes != chunk) { + throw std::runtime_error( + "control_shared_memory_read: short read, expected " + std::to_string(chunk) + ", got " + + std::to_string(out_nbytes) + ); + } + if (chunk != 0) std::memcpy(out.data() + copied, mbox() + CTRL_OFF_PAYLOAD, chunk); + copied += chunk; + } while (copied < nbytes); + return out; +} + +void WorkerThread::control_shared_memory_write(uint64_t mem, uint64_t offset, const void *data, size_t nbytes) { + if (data == nullptr && nbytes != 0) throw std::invalid_argument("control_shared_memory_write: data is null"); + std::lock_guard lk(mailbox_mu_); + size_t written = 0; + do { + size_t chunk = std::min(nbytes - written, CTRL_PAYLOAD_CAPACITY); + if (chunk != 0) { + std::memcpy(mbox() + CTRL_OFF_PAYLOAD, static_cast(data) + written, chunk); + } + write_control_args(mbox(), CTRL_SHARED_MEMORY_WRITE, mem, offset + written, static_cast(chunk)); + run_control_command("control_shared_memory_write"); + written += 
chunk; + } while (written < nbytes); +} + +void WorkerThread::control_shared_memory_notify(uint64_t mem, uint32_t signal_id, uint64_t value) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_SHARED_MEMORY_NOTIFY, mem, signal_id, value); + run_control_command("control_shared_memory_notify"); +} + +void WorkerThread::control_shared_memory_wait(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + std::lock_guard lk(mailbox_mu_); + write_control_args(mbox(), CTRL_SHARED_MEMORY_WAIT, mem, signal_id, target, timeout_us); + run_control_command("control_shared_memory_wait"); +} + bool WorkerManager::any_busy() const { for (auto &wt : next_level_threads_) if (!wt->idle()) return true; diff --git a/src/common/hierarchical/worker_manager.h b/src/common/hierarchical/worker_manager.h index 7f67292be..49db54cbd 100644 --- a/src/common/hierarchical/worker_manager.h +++ b/src/common/hierarchical/worker_manager.h @@ -41,6 +41,7 @@ #include "../task_interface/call_config.h" #include "types.h" +#include "../worker/pto_runtime_c_api.h" class Ring; // forward decl — owns the slot state pool class WorkerManager; @@ -107,6 +108,17 @@ static constexpr uint64_t CTRL_PREPARE = 4; // CTRL_UNREGISTER carries only the cid. 
static constexpr uint64_t CTRL_REGISTER = 5; static constexpr uint64_t CTRL_UNREGISTER = 6; +static constexpr uint64_t CTRL_OPEN_CHANNEL = 7; +static constexpr uint64_t CTRL_CLOSE_CHANNEL = 8; +static constexpr uint64_t CTRL_CHANNEL_SEND = 9; +static constexpr uint64_t CTRL_CHANNEL_RECV = 10; +static constexpr uint64_t CTRL_OPEN_SHARED_MEMORY = 11; +static constexpr uint64_t CTRL_CLOSE_SHARED_MEMORY = 12; +static constexpr uint64_t CTRL_SHARED_MEMORY_INFO = 13; +static constexpr uint64_t CTRL_SHARED_MEMORY_READ = 14; +static constexpr uint64_t CTRL_SHARED_MEMORY_WRITE = 15; +static constexpr uint64_t CTRL_SHARED_MEMORY_NOTIFY = 16; +static constexpr uint64_t CTRL_SHARED_MEMORY_WAIT = 17; // Control args reuse the task mailbox region (mutually exclusive with task dispatch): // offset 16: uint64 arg0 (size for malloc; ptr for free; dst for copy; cid for register) @@ -117,6 +129,11 @@ static constexpr ptrdiff_t CTRL_OFF_ARG0 = 16; static constexpr ptrdiff_t CTRL_OFF_ARG1 = 24; static constexpr ptrdiff_t CTRL_OFF_ARG2 = 32; static constexpr ptrdiff_t CTRL_OFF_RESULT = 40; +static constexpr ptrdiff_t CTRL_OFF_ARG3 = 48; +static constexpr ptrdiff_t CTRL_OFF_ARG4 = 56; +static constexpr ptrdiff_t CTRL_OFF_PAYLOAD = 64; +static constexpr size_t CTRL_PAYLOAD_CAPACITY = + MAILBOX_SIZE - static_cast<size_t>(CTRL_OFF_PAYLOAD) - MAILBOX_ERROR_MSG_SIZE; // CTRL_REGISTER puts the NUL-terminated POSIX shm name at MAILBOX_OFF_ARGS.
// Fixed-width so the wire layout stays simple; well above the encoded length @@ -185,6 +202,17 @@ class WorkerThread { void control_free(uint64_t ptr); void control_copy_to(uint64_t dst, uint64_t src, size_t size); void control_copy_from(uint64_t dst, uint64_t src, size_t size); + uint64_t control_open_channel(uint32_t cpu_to_l2_lanes, uint32_t l2_to_cpu_lanes, uint32_t lane_depth, uint32_t max_message_bytes); + void control_close_channel(uint64_t ch); + void control_channel_send(uint64_t ch, uint32_t route, const void *data, size_t size, uint64_t correlation_id); + std::vector control_channel_recv(uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *route, uint64_t *correlation_id); + uint64_t control_open_shared_memory(uint64_t data_bytes, uint32_t signal_count, uint32_t flags); + void control_close_shared_memory(uint64_t mem); + HostDeviceMemoryInfo control_shared_memory_info(uint64_t mem); + std::vector control_shared_memory_read(uint64_t mem, uint64_t offset, size_t nbytes); + void control_shared_memory_write(uint64_t mem, uint64_t offset, const void *data, size_t nbytes); + void control_shared_memory_notify(uint64_t mem, uint32_t signal_id, uint64_t value); + void control_shared_memory_wait(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us); // Pre-warm a chip child by triggering prepare_callable for `cid` in the // child via CTRL_PREPARE. Issued from the parent at end of init() so the @@ -224,6 +252,8 @@ class WorkerThread { // region and holds `mailbox_mu_`; this helper signals the child, // spin-polls CONTROL_DONE, and throws on a non-zero child error code. 
void run_control_command(const char *op_name); + HostDeviceMemoryInfo control_shared_memory_info_locked(uint64_t mem); + void validate_shared_memory_read_range_locked(uint64_t mem, uint64_t offset, size_t nbytes); char *mbox() const { return static_cast(mailbox_); } MailboxState read_mailbox_state() const; diff --git a/src/common/worker/chip_worker.cpp b/src/common/worker/chip_worker.cpp index 8971df7ad..5de5a45d1 100644 --- a/src/common/worker/chip_worker.cpp +++ b/src/common/worker/chip_worker.cpp @@ -11,6 +11,8 @@ #include "chip_worker.h" +#include +#include #include #include @@ -53,6 +55,23 @@ std::vector read_binary_file(const std::string &path) { return buf; } +std::string errno_suffix(int err) { + if (err == 0) return ""; + return std::string(": ") + std::strerror(err) + " (errno=" + std::to_string(err) + ")"; +} + +std::string channel_cfg_summary(const HostDeviceChannelConfig &cfg) { + return "cpu_to_l2_lanes=" + std::to_string(cfg.lane_count_cpu_to_l2) + + ", l2_to_cpu_lanes=" + std::to_string(cfg.lane_count_l2_to_cpu) + + ", lane_depth=" + std::to_string(cfg.lane_depth) + + ", max_message_bytes=" + std::to_string(cfg.max_message_bytes) + ", flags=" + std::to_string(cfg.flags); +} + +std::string memory_cfg_summary(const HostDeviceMemoryConfig &cfg) { + return "data_bytes=" + std::to_string(cfg.data_bytes) + ", signal_count=" + std::to_string(cfg.signal_count) + + ", flags=" + std::to_string(cfg.flags); +} + } // namespace ChipWorker::~ChipWorker() { finalize(); } @@ -99,6 +118,26 @@ void ChipWorker::init( device_free_ctx_fn_ = load_symbol(handle, "device_free_ctx"); copy_to_device_ctx_fn_ = load_symbol(handle, "copy_to_device_ctx"); copy_from_device_ctx_fn_ = load_symbol(handle, "copy_from_device_ctx"); + open_host_device_channel_ctx_fn_ = + load_symbol(handle, "open_host_device_channel_ctx"); + close_host_device_channel_ctx_fn_ = + load_symbol(handle, "close_host_device_channel_ctx"); + host_device_send_ctx_fn_ = load_symbol(handle, 
"host_device_send_ctx"); + host_device_recv_ctx_fn_ = load_symbol(handle, "host_device_recv_ctx"); + open_host_device_memory_ctx_fn_ = + load_symbol(handle, "open_host_device_memory_ctx"); + close_host_device_memory_ctx_fn_ = + load_symbol(handle, "close_host_device_memory_ctx"); + host_device_memory_info_ctx_fn_ = + load_symbol(handle, "host_device_memory_info_ctx"); + host_device_memory_read_ctx_fn_ = + load_symbol(handle, "host_device_memory_read_ctx"); + host_device_memory_write_ctx_fn_ = + load_symbol(handle, "host_device_memory_write_ctx"); + host_device_memory_notify_ctx_fn_ = + load_symbol(handle, "host_device_memory_notify_ctx"); + host_device_memory_wait_ctx_fn_ = + load_symbol(handle, "host_device_memory_wait_ctx"); get_runtime_size_fn_ = load_symbol(handle, "get_runtime_size"); simpler_init_fn_ = load_symbol(handle, "simpler_init"); prepare_callable_fn_ = load_symbol(handle, "prepare_callable"); @@ -165,6 +204,17 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + open_host_device_channel_ctx_fn_ = nullptr; + close_host_device_channel_ctx_fn_ = nullptr; + host_device_send_ctx_fn_ = nullptr; + host_device_recv_ctx_fn_ = nullptr; + open_host_device_memory_ctx_fn_ = nullptr; + close_host_device_memory_ctx_fn_ = nullptr; + host_device_memory_info_ctx_fn_ = nullptr; + host_device_memory_read_ctx_fn_ = nullptr; + host_device_memory_write_ctx_fn_ = nullptr; + host_device_memory_notify_ctx_fn_ = nullptr; + host_device_memory_wait_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; prepare_callable_fn_ = nullptr; @@ -201,6 +251,17 @@ void ChipWorker::init( device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + open_host_device_channel_ctx_fn_ = nullptr; + close_host_device_channel_ctx_fn_ = nullptr; + host_device_send_ctx_fn_ = nullptr; + host_device_recv_ctx_fn_ = nullptr; + open_host_device_memory_ctx_fn_ = 
nullptr; + close_host_device_memory_ctx_fn_ = nullptr; + host_device_memory_info_ctx_fn_ = nullptr; + host_device_memory_read_ctx_fn_ = nullptr; + host_device_memory_write_ctx_fn_ = nullptr; + host_device_memory_notify_ctx_fn_ = nullptr; + host_device_memory_wait_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; simpler_init_fn_ = nullptr; prepare_callable_fn_ = nullptr; @@ -252,6 +313,17 @@ void ChipWorker::finalize() { device_free_ctx_fn_ = nullptr; copy_to_device_ctx_fn_ = nullptr; copy_from_device_ctx_fn_ = nullptr; + open_host_device_channel_ctx_fn_ = nullptr; + close_host_device_channel_ctx_fn_ = nullptr; + host_device_send_ctx_fn_ = nullptr; + host_device_recv_ctx_fn_ = nullptr; + open_host_device_memory_ctx_fn_ = nullptr; + close_host_device_memory_ctx_fn_ = nullptr; + host_device_memory_info_ctx_fn_ = nullptr; + host_device_memory_read_ctx_fn_ = nullptr; + host_device_memory_write_ctx_fn_ = nullptr; + host_device_memory_notify_ctx_fn_ = nullptr; + host_device_memory_wait_ctx_fn_ = nullptr; get_runtime_size_fn_ = nullptr; prepare_callable_fn_ = nullptr; run_prepared_fn_ = nullptr; @@ -373,6 +445,217 @@ void ChipWorker::copy_from(uint64_t dst, uint64_t src, size_t size) { } } +uint64_t ChipWorker::open_channel(const HostDeviceChannelConfig &cfg) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + size_t required = host_device_channel_required_bytes(&cfg); + if (required == 0) { + throw std::runtime_error("open_channel invalid config or size overflow: " + channel_cfg_summary(cfg)); + } + errno = 0; + void *ch = open_host_device_channel_ctx_fn_(device_ctx_, &cfg); + if (ch == nullptr) { + int err = errno; + throw std::runtime_error( + "open_channel failed" + errno_suffix(err) + "; required_bytes=" + std::to_string(required) + "; " + + channel_cfg_summary(cfg) + ); + } + return reinterpret_cast(ch); +} + +void ChipWorker::close_channel(uint64_t ch) { + if (!initialized_) { + throw 
std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = close_host_device_channel_ctx_fn_(device_ctx_, reinterpret_cast(ch)); + if (rc != 0) { + throw std::runtime_error("close_channel failed with code " + std::to_string(rc)); + } +} + +void ChipWorker::channel_send( + uint64_t ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us +) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = host_device_send_ctx_fn_( + device_ctx_, reinterpret_cast(ch), route, data, nbytes, correlation_id, timeout_us + ); + if (rc != 0) { + throw std::runtime_error("channel_send failed with code " + std::to_string(rc)); + } +} + +std::vector ChipWorker::channel_recv( + uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *out_route, uint64_t *out_correlation_id +) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + std::vector buf(capacity); + size_t out_nbytes = 0; + uint64_t correlation_id = 0; + uint32_t route = 0; + int rc = host_device_recv_ctx_fn_( + device_ctx_, reinterpret_cast(ch), buf.data(), buf.size(), &out_nbytes, &correlation_id, &route, + timeout_us + ); + if (rc != 0) { + throw std::runtime_error("channel_recv failed with code " + std::to_string(rc)); + } + buf.resize(out_nbytes); + if (out_route != nullptr) *out_route = route; + if (out_correlation_id != nullptr) *out_correlation_id = correlation_id; + return buf; +} + +void ChipWorker::channel_send_l2_for_test( + uint64_t ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us +) { + int rc = host_device_channel_send_l2_for_test( + reinterpret_cast(ch), route, data, nbytes, correlation_id, timeout_us + ); + if (rc != 0) { + throw std::runtime_error("channel_send_l2_for_test failed with code " + std::to_string(rc)); + } +} + +std::vector ChipWorker::channel_recv_l2_for_test( + 
uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *out_route, uint64_t *out_correlation_id +) { + std::vector buf(capacity); + size_t out_nbytes = 0; + uint64_t correlation_id = 0; + uint32_t route = 0; + int rc = host_device_channel_recv_l2_for_test( + reinterpret_cast(ch), buf.data(), buf.size(), &out_nbytes, &correlation_id, &route, + timeout_us + ); + if (rc != 0) { + throw std::runtime_error("channel_recv_l2_for_test failed with code " + std::to_string(rc)); + } + buf.resize(out_nbytes); + if (out_route != nullptr) *out_route = route; + if (out_correlation_id != nullptr) *out_correlation_id = correlation_id; + return buf; +} +uint64_t ChipWorker::open_shared_memory(const HostDeviceMemoryConfig &cfg) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + size_t required = host_device_memory_required_bytes(&cfg); + if (required == 0) { + throw std::runtime_error("open_shared_memory invalid config or size overflow: " + memory_cfg_summary(cfg)); + } + errno = 0; + void *mem = open_host_device_memory_ctx_fn_(device_ctx_, &cfg); + if (mem == nullptr) { + int err = errno; + throw std::runtime_error( + "open_shared_memory failed" + errno_suffix(err) + "; required_bytes=" + std::to_string(required) + "; " + + memory_cfg_summary(cfg) + ); + } + return reinterpret_cast(mem); +} + +void ChipWorker::close_shared_memory(uint64_t mem) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = close_host_device_memory_ctx_fn_(device_ctx_, reinterpret_cast(mem)); + if (rc != 0) { + throw std::runtime_error("close_shared_memory failed with code " + std::to_string(rc)); + } +} + +HostDeviceMemoryInfo ChipWorker::shared_memory_info(uint64_t mem) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + HostDeviceMemoryInfo info{}; + int rc = host_device_memory_info_ctx_fn_(device_ctx_, reinterpret_cast(mem), 
&info); + if (rc != 0) { + throw std::runtime_error("shared_memory_info failed with code " + std::to_string(rc)); + } + return info; +} + +std::vector ChipWorker::shared_memory_read(uint64_t mem, uint64_t offset, size_t nbytes) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + std::vector buf(nbytes); + int rc = host_device_memory_read_ctx_fn_(device_ctx_, reinterpret_cast(mem), offset, buf.data(), buf.size()); + if (rc != 0) { + throw std::runtime_error("shared_memory_read failed with code " + std::to_string(rc)); + } + return buf; +} + +void ChipWorker::shared_memory_write(uint64_t mem, uint64_t offset, const void *data, size_t nbytes) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = host_device_memory_write_ctx_fn_(device_ctx_, reinterpret_cast(mem), offset, data, nbytes); + if (rc != 0) { + throw std::runtime_error("shared_memory_write failed with code " + std::to_string(rc)); + } +} + +void ChipWorker::shared_memory_notify(uint64_t mem, uint32_t signal_id, uint64_t value) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = host_device_memory_notify_ctx_fn_(device_ctx_, reinterpret_cast(mem), signal_id, value); + if (rc != 0) { + throw std::runtime_error("shared_memory_notify failed with code " + std::to_string(rc)); + } +} + +void ChipWorker::shared_memory_wait(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + if (!initialized_) { + throw std::runtime_error("ChipWorker not initialized; call init() first"); + } + int rc = host_device_memory_wait_ctx_fn_(device_ctx_, reinterpret_cast(mem), signal_id, target, timeout_us); + if (rc != 0) { + throw std::runtime_error("shared_memory_wait failed with code " + std::to_string(rc)); + } +} + +std::vector ChipWorker::shared_memory_read_l2_for_test(uint64_t mem, uint64_t offset, size_t nbytes) { + std::vector 
buf(nbytes); + int rc = host_device_memory_read_l2_for_test(reinterpret_cast(mem), offset, buf.data(), buf.size()); + if (rc != 0) { + throw std::runtime_error("shared_memory_read_l2_for_test failed with code " + std::to_string(rc)); + } + return buf; +} + +void ChipWorker::shared_memory_write_l2_for_test(uint64_t mem, uint64_t offset, const void *data, size_t nbytes) { + int rc = host_device_memory_write_l2_for_test(reinterpret_cast(mem), offset, data, nbytes); + if (rc != 0) { + throw std::runtime_error("shared_memory_write_l2_for_test failed with code " + std::to_string(rc)); + } +} + +void ChipWorker::shared_memory_notify_l2_for_test(uint64_t mem, uint32_t signal_id, uint64_t value) { + int rc = host_device_memory_notify_l2_for_test(reinterpret_cast(mem), signal_id, value); + if (rc != 0) { + throw std::runtime_error("shared_memory_notify_l2_for_test failed with code " + std::to_string(rc)); + } +} + +void ChipWorker::shared_memory_wait_l2_for_test(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + int rc = host_device_memory_wait_l2_for_test(reinterpret_cast(mem), signal_id, target, timeout_us); + if (rc != 0) { + throw std::runtime_error("shared_memory_wait_l2_for_test failed with code " + std::to_string(rc)); + } +} + uint64_t ChipWorker::comm_init(int rank, int nranks, const std::string &rootinfo_path) { if (!initialized_) { throw std::runtime_error("ChipWorker not initialized; call init() first"); diff --git a/src/common/worker/chip_worker.h b/src/common/worker/chip_worker.h index a6d0c77a4..b2c784b28 100644 --- a/src/common/worker/chip_worker.h +++ b/src/common/worker/chip_worker.h @@ -18,6 +18,8 @@ #include "../task_interface/call_config.h" #include "../task_interface/task_args.h" +#include "host_device_channel.h" +#include "host_device_memory.h" #include "types.h" class ChipWorker : public IWorker { @@ -76,6 +78,30 @@ class ChipWorker : public IWorker { void free(uint64_t ptr); void copy_to(uint64_t dst, uint64_t src, size_t 
size); void copy_from(uint64_t dst, uint64_t src, size_t size); + uint64_t open_channel(const HostDeviceChannelConfig &cfg); + void close_channel(uint64_t ch); + void channel_send(uint64_t ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us); + std::vector channel_recv( + uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *out_route, uint64_t *out_correlation_id + ); + void channel_send_l2_for_test( + uint64_t ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us + ); + std::vector channel_recv_l2_for_test( + uint64_t ch, size_t capacity, uint32_t timeout_us, uint32_t *out_route, uint64_t *out_correlation_id + ); + + uint64_t open_shared_memory(const HostDeviceMemoryConfig &cfg); + void close_shared_memory(uint64_t mem); + HostDeviceMemoryInfo shared_memory_info(uint64_t mem); + std::vector shared_memory_read(uint64_t mem, uint64_t offset, size_t nbytes); + void shared_memory_write(uint64_t mem, uint64_t offset, const void *data, size_t nbytes); + void shared_memory_notify(uint64_t mem, uint32_t signal_id, uint64_t value); + void shared_memory_wait(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us); + std::vector shared_memory_read_l2_for_test(uint64_t mem, uint64_t offset, size_t nbytes); + void shared_memory_write_l2_for_test(uint64_t mem, uint64_t offset, const void *data, size_t nbytes); + void shared_memory_notify_l2_for_test(uint64_t mem, uint32_t signal_id, uint64_t value); + void shared_memory_wait_l2_for_test(uint64_t mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us); /// Distributed communication primitives (optional — only available when /// the bound runtime exports comm_*). 
Wraps the backend-neutral C API @@ -110,6 +136,17 @@ class ChipWorker : public IWorker { using DeviceFreeCtxFn = void (*)(void *, void *); using CopyToDeviceCtxFn = int (*)(void *, void *, const void *, size_t); using CopyFromDeviceCtxFn = int (*)(void *, void *, const void *, size_t); + using OpenHostDeviceChannelCtxFn = void *(*)(void *, const HostDeviceChannelConfig *); + using CloseHostDeviceChannelCtxFn = int (*)(void *, void *); + using HostDeviceSendCtxFn = int (*)(void *, void *, uint32_t, const void *, size_t, uint64_t, uint32_t); + using HostDeviceRecvCtxFn = int (*)(void *, void *, void *, size_t, size_t *, uint64_t *, uint32_t *, uint32_t); + using OpenHostDeviceMemoryCtxFn = void *(*)(void *, const HostDeviceMemoryConfig *); + using CloseHostDeviceMemoryCtxFn = int (*)(void *, void *); + using HostDeviceMemoryInfoCtxFn = int (*)(void *, void *, HostDeviceMemoryInfo *); + using HostDeviceMemoryReadCtxFn = int (*)(void *, void *, uint64_t, void *, size_t); + using HostDeviceMemoryWriteCtxFn = int (*)(void *, void *, uint64_t, const void *, size_t); + using HostDeviceMemoryNotifyCtxFn = int (*)(void *, void *, uint32_t, uint64_t); + using HostDeviceMemoryWaitCtxFn = int (*)(void *, void *, uint32_t, uint64_t, uint32_t); using GetRuntimeSizeFn = size_t (*)(); // From host_runtime.so. 
Single platform-side init that does (a) thread // attach + device-id record, (b) executor binary takeover, (c) onboard @@ -137,6 +174,17 @@ class ChipWorker : public IWorker { DeviceFreeCtxFn device_free_ctx_fn_ = nullptr; CopyToDeviceCtxFn copy_to_device_ctx_fn_ = nullptr; CopyFromDeviceCtxFn copy_from_device_ctx_fn_ = nullptr; + OpenHostDeviceChannelCtxFn open_host_device_channel_ctx_fn_ = nullptr; + CloseHostDeviceChannelCtxFn close_host_device_channel_ctx_fn_ = nullptr; + HostDeviceSendCtxFn host_device_send_ctx_fn_ = nullptr; + HostDeviceRecvCtxFn host_device_recv_ctx_fn_ = nullptr; + OpenHostDeviceMemoryCtxFn open_host_device_memory_ctx_fn_ = nullptr; + CloseHostDeviceMemoryCtxFn close_host_device_memory_ctx_fn_ = nullptr; + HostDeviceMemoryInfoCtxFn host_device_memory_info_ctx_fn_ = nullptr; + HostDeviceMemoryReadCtxFn host_device_memory_read_ctx_fn_ = nullptr; + HostDeviceMemoryWriteCtxFn host_device_memory_write_ctx_fn_ = nullptr; + HostDeviceMemoryNotifyCtxFn host_device_memory_notify_ctx_fn_ = nullptr; + HostDeviceMemoryWaitCtxFn host_device_memory_wait_ctx_fn_ = nullptr; GetRuntimeSizeFn get_runtime_size_fn_ = nullptr; SimplerInitFn simpler_init_fn_ = nullptr; PrepareCallableFn prepare_callable_fn_ = nullptr; diff --git a/src/common/worker/host_device_channel.cpp b/src/common/worker/host_device_channel.cpp new file mode 100644 index 000000000..5398cfde2 --- /dev/null +++ b/src/common/worker/host_device_channel.cpp @@ -0,0 +1,292 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#include "host_device_channel.h" + +#include +#include + +#include +#include +#include +#include + +namespace { + +bool is_power_of_two(uint32_t v) { return v != 0 && (v & (v - 1U)) == 0; } + +bool checked_add_size(size_t a, size_t b, size_t *out) { + if (out == nullptr || a > std::numeric_limits<size_t>::max() - b) return false; + *out = a + b; + return true; +} + +bool checked_mul_size(size_t a, size_t b, size_t *out) { + if (out == nullptr || (a != 0 && b > std::numeric_limits<size_t>::max() / a)) return false; + *out = a * b; + return true; +} + +bool checked_align_up(size_t v, size_t alignment, size_t *out) { + if (alignment == 0 || (alignment & (alignment - 1U)) != 0) return false; + size_t biased = 0; + if (!checked_add_size(v, alignment - 1U, &biased)) return false; + *out = biased & ~(alignment - 1U); + return true; +} + +bool valid_cfg(const HostDeviceChannelConfig *cfg) { + return cfg != nullptr && cfg->lane_count_cpu_to_l2 > 0 && cfg->lane_count_l2_to_cpu > 0 && + is_power_of_two(cfg->lane_depth) && cfg->max_message_bytes > 0 && + cfg->max_message_bytes <= HDCH_MAX_INLINE_BYTES; +} + +bool compute_layout(const HostDeviceChannelConfig *cfg, size_t *total_bytes) { + if (!valid_cfg(cfg) || total_bytes == nullptr) return false; + + size_t desc_bytes = 0; + size_t lane_bytes = 0; + size_t total_lanes = 0; + size_t lanes_bytes = 0; + size_t raw_total = 0; + size_t required = 0; + if (!checked_mul_size(static_cast<size_t>(cfg->lane_depth), sizeof(HostDeviceDesc), &desc_bytes)) return false; + if (!checked_add_size(sizeof(HostDeviceLaneHeader), desc_bytes, 
&lane_bytes)) return false; + if (!checked_add_size( + static_cast(cfg->lane_count_cpu_to_l2), static_cast(cfg->lane_count_l2_to_cpu), &total_lanes + )) { + return false; + } + if (!checked_mul_size(lane_bytes, total_lanes, &lanes_bytes)) return false; + if (!checked_add_size(sizeof(HostDeviceChannelHeader), lanes_bytes, &raw_total)) return false; + if (!checked_align_up(raw_total, 64, &required)) return false; + + *total_bytes = required; + return true; +} + +HostDeviceLaneHeader *first_lane(HostDeviceChannelHeader *hdr) { + return reinterpret_cast(reinterpret_cast(hdr) + sizeof(HostDeviceChannelHeader)); +} + +HostDeviceLaneHeader *lane_at(HostDeviceLaneHeader *base, uint32_t lane_depth, size_t index) { + size_t stride = sizeof(HostDeviceLaneHeader) + static_cast(lane_depth) * sizeof(HostDeviceDesc); + return reinterpret_cast(reinterpret_cast(base) + stride * index); +} + +HostDeviceDesc *lane_slots(HostDeviceLaneHeader *lane) { + return reinterpret_cast(reinterpret_cast(lane) + sizeof(HostDeviceLaneHeader)); +} + +HostDeviceLaneHeader *cpu_to_l2_lanes(HostDeviceChannel *ch) { + return first_lane(reinterpret_cast(ch->host_base)); +} + +HostDeviceLaneHeader *l2_to_cpu_lanes(HostDeviceChannel *ch) { + auto *hdr = reinterpret_cast(ch->host_base); + return lane_at(cpu_to_l2_lanes(ch), hdr->lane_depth, hdr->lane_count_cpu_to_l2); +} + +uint32_t load_u32(const volatile uint32_t *p) { return __atomic_load_n(p, __ATOMIC_ACQUIRE); } + +void store_u32(volatile uint32_t *p, uint32_t v) { __atomic_store_n(p, v, __ATOMIC_RELEASE); } + +int send_one( + HostDeviceLaneHeader *lanes, uint32_t lane_count, uint32_t lane_depth, uint32_t max_message_bytes, uint32_t *cursor, + uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us +) { + if (lanes == nullptr || cursor == nullptr || (data == nullptr && nbytes != 0)) return HDCH_ERR_INVALID; + if (nbytes > max_message_bytes) return HDCH_ERR_MSG_TOO_LARGE; + + auto start = 
std::chrono::steady_clock::now(); + while (true) { + for (uint32_t i = 0; i < lane_count; ++i) { + uint32_t lane_index = (*cursor + i) % lane_count; + HostDeviceLaneHeader *lane = lane_at(lanes, lane_depth, lane_index); + uint32_t head = load_u32(&lane->head); + uint32_t tail = load_u32(&lane->tail); + if ((tail - head) >= lane_depth) { + continue; + } + + HostDeviceDesc *slot = lane_slots(lane) + (tail & lane->depth_mask); + slot->flags = 0; + slot->payload_bytes = static_cast(nbytes); + slot->seq = tail; + slot->correlation_id = correlation_id; + slot->route = route; + slot->reserved0 = 0; + if (nbytes != 0) { + memcpy(slot->inline_data, data, nbytes); + } + store_u32(&lane->tail, tail + 1U); + *cursor = (lane_index + 1U) % lane_count; + return HDCH_OK; + } + + if (timeout_us == 0) return HDCH_ERR_WOULD_BLOCK; + auto elapsed = std::chrono::duration_cast(std::chrono::steady_clock::now() - start); + if (elapsed.count() >= timeout_us) return HDCH_ERR_WOULD_BLOCK; + std::this_thread::yield(); + } +} + +int recv_one( + HostDeviceLaneHeader *lanes, uint32_t lane_count, uint32_t lane_depth, uint32_t *cursor, void *dst, + size_t dst_capacity, size_t *out_nbytes, uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +) { + if (lanes == nullptr || cursor == nullptr || dst == nullptr || out_nbytes == nullptr || + out_correlation_id == nullptr || out_route == nullptr) { + return HDCH_ERR_INVALID; + } + + auto start = std::chrono::steady_clock::now(); + while (true) { + for (uint32_t i = 0; i < lane_count; ++i) { + uint32_t lane_index = (*cursor + i) % lane_count; + HostDeviceLaneHeader *lane = lane_at(lanes, lane_depth, lane_index); + uint32_t head = load_u32(&lane->head); + uint32_t tail = load_u32(&lane->tail); + if (head == tail) { + continue; + } + + HostDeviceDesc *slot = lane_slots(lane) + (head & lane->depth_mask); + if (slot->payload_bytes > dst_capacity) return HDCH_ERR_MSG_TOO_LARGE; + if (slot->payload_bytes != 0) { + memcpy(dst, 
slot->inline_data, slot->payload_bytes); + } + *out_nbytes = slot->payload_bytes; + *out_correlation_id = slot->correlation_id; + *out_route = slot->route; + store_u32(&lane->head, head + 1U); + *cursor = (lane_index + 1U) % lane_count; + return HDCH_OK; + } + + if (timeout_us == 0) return HDCH_ERR_WOULD_BLOCK; + auto elapsed = std::chrono::duration_cast(std::chrono::steady_clock::now() - start); + if (elapsed.count() >= timeout_us) return HDCH_ERR_WOULD_BLOCK; + std::this_thread::yield(); + } +} + +} // namespace + +size_t host_device_channel_required_bytes(const HostDeviceChannelConfig *cfg) { + size_t total_bytes = 0; + if (!compute_layout(cfg, &total_bytes)) return 0; + return total_bytes; +} + +int host_device_channel_init_region(void *host_base, size_t bytes, const HostDeviceChannelConfig *cfg) { + if (host_base == nullptr) return HDCH_ERR_INVALID; + size_t required = 0; + if (!compute_layout(cfg, &required)) return HDCH_ERR_INVALID; + if (required == 0 || bytes < required) return HDCH_ERR_INVALID; + + memset(host_base, 0, required); + auto *hdr = reinterpret_cast(host_base); + hdr->magic = HDCH_MAGIC; + hdr->version = HDCH_VERSION; + hdr->flags = cfg->flags; + hdr->lane_count_cpu_to_l2 = cfg->lane_count_cpu_to_l2; + hdr->lane_count_l2_to_cpu = cfg->lane_count_l2_to_cpu; + hdr->lane_depth = cfg->lane_depth; + hdr->max_message_bytes = cfg->max_message_bytes; + hdr->control_bytes = required; + + HostDeviceLaneHeader *lane = first_lane(hdr); + size_t lane_count = static_cast(cfg->lane_count_cpu_to_l2) + cfg->lane_count_l2_to_cpu; + for (size_t i = 0; i < lane_count; ++i) { + HostDeviceLaneHeader *cur = lane_at(lane, cfg->lane_depth, i); + cur->depth = cfg->lane_depth; + cur->depth_mask = cfg->lane_depth - 1U; + } + return HDCH_OK; +} + +HostDeviceChannel *host_device_channel_wrap( + void *device_base, void *host_base, size_t bytes, const HostDeviceChannelConfig *cfg, uint32_t owns_host_allocation, + void (*free_host_allocation)(void *) +) { + if (device_base == 
nullptr || host_base == nullptr || !valid_cfg(cfg)) return nullptr; + int rc = host_device_channel_init_region(host_base, bytes, cfg); + if (rc != HDCH_OK) return nullptr; + HostDeviceChannel *ch = new (std::nothrow) HostDeviceChannel(); + if (ch == nullptr) return nullptr; + ch->device_base = device_base; + ch->host_base = host_base; + ch->bytes = host_device_channel_required_bytes(cfg); + ch->cpu_tx_cursor = 0; + ch->l2_tx_cursor = 0; + ch->owns_host_allocation = owns_host_allocation; + ch->free_host_allocation = free_host_allocation; + return ch; +} + +void host_device_channel_destroy(HostDeviceChannel *ch) { + if (ch == nullptr) return; + if (ch->owns_host_allocation && ch->free_host_allocation != nullptr && ch->host_base != nullptr) { + ch->free_host_allocation(ch->host_base); + } + delete ch; +} + +int host_device_channel_send_cpu( + HostDeviceChannel *ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us +) { + if (ch == nullptr || ch->host_base == nullptr) return HDCH_ERR_INVALID; + auto *hdr = reinterpret_cast<HostDeviceChannelHeader *>(ch->host_base); + if (hdr->magic != HDCH_MAGIC || hdr->version != HDCH_VERSION) return HDCH_ERR_INVALID; + return send_one( + cpu_to_l2_lanes(ch), hdr->lane_count_cpu_to_l2, hdr->lane_depth, hdr->max_message_bytes, &ch->cpu_tx_cursor, + route, data, nbytes, correlation_id, timeout_us + ); +} + +int host_device_channel_recv_cpu( + HostDeviceChannel *ch, void *dst, size_t dst_capacity, size_t *out_nbytes, uint64_t *out_correlation_id, + uint32_t *out_route, uint32_t timeout_us +) { + if (ch == nullptr || ch->host_base == nullptr) return HDCH_ERR_INVALID; + auto *hdr = reinterpret_cast<HostDeviceChannelHeader *>(ch->host_base); + if (hdr->magic != HDCH_MAGIC || hdr->version != HDCH_VERSION) return HDCH_ERR_INVALID; + return recv_one( + l2_to_cpu_lanes(ch), hdr->lane_count_l2_to_cpu, hdr->lane_depth, &ch->l2_tx_cursor, dst, dst_capacity, + out_nbytes, out_correlation_id, out_route, timeout_us + ); +} + +int 
host_device_channel_send_l2_for_test( + HostDeviceChannel *ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, uint32_t timeout_us +) { + if (ch == nullptr || ch->host_base == nullptr) return HDCH_ERR_INVALID; + auto *hdr = reinterpret_cast<HostDeviceChannelHeader *>(ch->host_base); + if (hdr->magic != HDCH_MAGIC || hdr->version != HDCH_VERSION) return HDCH_ERR_INVALID; + return send_one( + l2_to_cpu_lanes(ch), hdr->lane_count_l2_to_cpu, hdr->lane_depth, hdr->max_message_bytes, &ch->l2_tx_cursor, + route, data, nbytes, correlation_id, timeout_us + ); +} + +int host_device_channel_recv_l2_for_test( + HostDeviceChannel *ch, void *dst, size_t dst_capacity, size_t *out_nbytes, uint64_t *out_correlation_id, + uint32_t *out_route, uint32_t timeout_us +) { + if (ch == nullptr || ch->host_base == nullptr) return HDCH_ERR_INVALID; + auto *hdr = reinterpret_cast<HostDeviceChannelHeader *>(ch->host_base); + if (hdr->magic != HDCH_MAGIC || hdr->version != HDCH_VERSION) return HDCH_ERR_INVALID; + return recv_one( + cpu_to_l2_lanes(ch), hdr->lane_count_cpu_to_l2, hdr->lane_depth, &ch->cpu_tx_cursor, dst, dst_capacity, + out_nbytes, out_correlation_id, out_route, timeout_us + ); +} diff --git a/src/common/worker/host_device_channel.h b/src/common/worker/host_device_channel.h new file mode 100644 index 000000000..450b448bc --- /dev/null +++ b/src/common/worker/host_device_channel.h @@ -0,0 +1,102 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. 
+ * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#ifndef SRC_COMMON_WORKER_HOST_DEVICE_CHANNEL_H_ +#define SRC_COMMON_WORKER_HOST_DEVICE_CHANNEL_H_ + +#include +#include + +static constexpr uint32_t HDCH_MAGIC = 0x48444348U; // "HDCH" +static constexpr uint32_t HDCH_VERSION = 1; +static constexpr uint32_t HDCH_MAX_INLINE_BYTES = 256; + +static constexpr int HDCH_OK = 0; +static constexpr int HDCH_ERR_WOULD_BLOCK = -11; +static constexpr int HDCH_ERR_INVALID = -22; +static constexpr int HDCH_ERR_MSG_TOO_LARGE = -75; +static constexpr int HDCH_ERR_BACKEND = -5; + +#include "pto_runtime_c_api.h" + +struct alignas(64) HostDeviceDesc { + uint32_t flags; + uint32_t payload_bytes; + uint64_t seq; + uint64_t correlation_id; + uint32_t route; + uint32_t reserved0; + uint8_t inline_data[HDCH_MAX_INLINE_BYTES]; +}; + +struct alignas(64) HostDeviceLaneHeader { + volatile uint32_t head; + volatile uint32_t tail; + uint32_t depth; + uint32_t depth_mask; + uint64_t dropped_count; + uint64_t blocked_count; + uint64_t reserved[4]; +}; + +struct alignas(64) HostDeviceChannelHeader { + uint32_t magic; + uint32_t version; + uint32_t flags; + uint32_t lane_count_cpu_to_l2; + uint32_t lane_count_l2_to_cpu; + uint32_t lane_depth; + uint32_t max_message_bytes; + uint32_t reserved0; + uint64_t control_bytes; + uint64_t fatal_status; + uint64_t reserved[4]; +}; + +struct HostDeviceChannel { + void *device_base; + void *host_base; + size_t bytes; + uint32_t cpu_tx_cursor; + uint32_t l2_tx_cursor; + uint32_t owns_host_allocation; + void (*free_host_allocation)(void *); +}; + +size_t host_device_channel_required_bytes(const HostDeviceChannelConfig *cfg); +int host_device_channel_init_region(void *host_base, size_t bytes, const HostDeviceChannelConfig *cfg); +HostDeviceChannel *host_device_channel_wrap( + void *device_base, void 
*host_base, size_t bytes, const HostDeviceChannelConfig *cfg, uint32_t owns_host_allocation, + void (*free_host_allocation)(void *) +); +void host_device_channel_destroy(HostDeviceChannel *ch); + +int host_device_channel_send_cpu( + HostDeviceChannel *ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, + uint32_t timeout_us +); +int host_device_channel_recv_cpu( + HostDeviceChannel *ch, void *dst, size_t dst_capacity, size_t *out_nbytes, uint64_t *out_correlation_id, + uint32_t *out_route, uint32_t timeout_us +); + +// Test/sim endpoint for the L2 side. V2 AICPU broker should use the same POD +// layout and publish/consume protocol from device code. +int host_device_channel_send_l2_for_test( + HostDeviceChannel *ch, uint32_t route, const void *data, size_t nbytes, uint64_t correlation_id, + uint32_t timeout_us +); +int host_device_channel_recv_l2_for_test( + HostDeviceChannel *ch, void *dst, size_t dst_capacity, size_t *out_nbytes, uint64_t *out_correlation_id, + uint32_t *out_route, uint32_t timeout_us +); + +#endif // SRC_COMMON_WORKER_HOST_DEVICE_CHANNEL_H_ diff --git a/src/common/worker/host_device_memory.cpp b/src/common/worker/host_device_memory.cpp new file mode 100644 index 000000000..d0635b5ff --- /dev/null +++ b/src/common/worker/host_device_memory.cpp @@ -0,0 +1,231 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#include "host_device_memory.h" + +#include + +#include +#include +#include +#include + +namespace { + +bool checked_add_size(size_t a, size_t b, size_t *out) { + if (out == nullptr || a > std::numeric_limits<size_t>::max() - b) return false; + *out = a + b; + return true; +} + +bool checked_mul_size(size_t a, size_t b, size_t *out) { + if (out == nullptr || (a != 0 && b > std::numeric_limits<size_t>::max() / a)) return false; + *out = a * b; + return true; +} + +bool checked_align_up(size_t v, size_t alignment, size_t *out) { + if (alignment == 0 || (alignment & (alignment - 1U)) != 0) return false; + size_t biased = 0; + if (!checked_add_size(v, alignment - 1U, &biased)) return false; + *out = biased & ~(alignment - 1U); + return true; +} + +bool valid_cfg(const HostDeviceMemoryConfig *cfg) { + return cfg != nullptr && cfg->data_bytes > 0 && cfg->signal_count > 0; +} + +bool compute_layout(const HostDeviceMemoryConfig *cfg, size_t *data_offset, size_t *total_bytes) { + if (!valid_cfg(cfg) || data_offset == nullptr || total_bytes == nullptr) return false; + if (cfg->data_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) return false; + + size_t signal_bytes = 0; + size_t signals_end = 0; + size_t offset = 0; + size_t raw_total = 0; + size_t required = 0; + if (!checked_mul_size(static_cast<size_t>(cfg->signal_count), sizeof(HostDeviceSignalSlot), &signal_bytes)) { + return false; + } + if (!checked_add_size(sizeof(HostDeviceMemoryHeader), signal_bytes, &signals_end)) return false; + if (!checked_align_up(signals_end, 64, &offset)) return false; + if (!checked_add_size(offset, static_cast<size_t>(cfg->data_bytes), &raw_total)) return false; + if (!checked_align_up(raw_total, 64, &required)) return false; + + *data_offset = offset; + *total_bytes = required; + return true; +} + +HostDeviceMemoryHeader *header(HostDeviceMemory *mem) { + return reinterpret_cast<HostDeviceMemoryHeader *>(mem->host_base); +} + 
+const HostDeviceMemoryHeader *header_const(const HostDeviceMemory *mem) { + return reinterpret_cast(mem->host_base); +} + +HostDeviceSignalSlot *signals(HostDeviceMemory *mem) { + return reinterpret_cast(reinterpret_cast(mem->host_base) + sizeof(HostDeviceMemoryHeader)); +} + +void *host_data_ptr(HostDeviceMemory *mem) { + auto *hdr = header(mem); + return reinterpret_cast(mem->host_base) + hdr->data_offset; +} + +uint64_t device_data_addr(HostDeviceMemory *mem) { + auto *hdr = header(mem); + return reinterpret_cast(mem->device_base) + hdr->data_offset; +} + +bool valid_region(HostDeviceMemory *mem) { + if (mem == nullptr || mem->host_base == nullptr || mem->device_base == nullptr) return false; + auto *hdr = header(mem); + return hdr->magic == HDMEM_MAGIC && hdr->version == HDMEM_VERSION && hdr->total_bytes <= mem->bytes; +} + +bool valid_range(HostDeviceMemory *mem, uint64_t offset, size_t nbytes) { + if (!valid_region(mem)) return false; + auto *hdr = header(mem); + return offset <= hdr->data_bytes && nbytes <= hdr->data_bytes - offset; +} + +int read_impl(HostDeviceMemory *mem, uint64_t offset, void *dst, size_t nbytes) { + if ((dst == nullptr && nbytes != 0) || !valid_range(mem, offset, nbytes)) return HDMEM_ERR_INVALID; + if (nbytes != 0) memcpy(dst, reinterpret_cast(host_data_ptr(mem)) + offset, nbytes); + return HDMEM_OK; +} + +int write_impl(HostDeviceMemory *mem, uint64_t offset, const void *src, size_t nbytes) { + if ((src == nullptr && nbytes != 0) || !valid_range(mem, offset, nbytes)) return HDMEM_ERR_INVALID; + if (nbytes != 0) memcpy(reinterpret_cast(host_data_ptr(mem)) + offset, src, nbytes); + return HDMEM_OK; +} + +int notify_impl(HostDeviceMemory *mem, uint32_t signal_id, uint64_t value) { + if (!valid_region(mem)) return HDMEM_ERR_INVALID; + auto *hdr = header(mem); + if (signal_id >= hdr->signal_count) return HDMEM_ERR_INVALID; + __atomic_store_n(&signals(mem)[signal_id].value, value, __ATOMIC_RELEASE); + return HDMEM_OK; +} + +int 
wait_impl(HostDeviceMemory *mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + if (!valid_region(mem)) return HDMEM_ERR_INVALID; + auto *hdr = header(mem); + if (signal_id >= hdr->signal_count) return HDMEM_ERR_INVALID; + auto start = std::chrono::steady_clock::now(); + while (true) { + uint64_t value = __atomic_load_n(&signals(mem)[signal_id].value, __ATOMIC_ACQUIRE); + if (value >= target) return HDMEM_OK; + if (timeout_us == 0) return HDMEM_ERR_WOULD_BLOCK; + auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start); + if (elapsed.count() >= timeout_us) return HDMEM_ERR_WOULD_BLOCK; + std::this_thread::yield(); + } +} + +} // namespace + +size_t host_device_memory_required_bytes(const HostDeviceMemoryConfig *cfg) { + size_t data_offset = 0; + size_t total_bytes = 0; + if (!compute_layout(cfg, &data_offset, &total_bytes)) return 0; + return total_bytes; +} + +int host_device_memory_init_region(void *host_base, size_t bytes, const HostDeviceMemoryConfig *cfg) { + if (host_base == nullptr) return HDMEM_ERR_INVALID; + size_t data_offset = 0; + size_t required = 0; + if (!compute_layout(cfg, &data_offset, &required)) return HDMEM_ERR_INVALID; + if (required == 0 || bytes < required) return HDMEM_ERR_INVALID; + memset(host_base, 0, required); + auto *hdr = reinterpret_cast<HostDeviceMemoryHeader *>(host_base); + hdr->magic = HDMEM_MAGIC; + hdr->version = HDMEM_VERSION; + hdr->flags = cfg->flags; + hdr->signal_count = cfg->signal_count; + hdr->data_offset = data_offset; + hdr->data_bytes = cfg->data_bytes; + hdr->total_bytes = required; + return HDMEM_OK; +} + +HostDeviceMemory *host_device_memory_wrap( + void *device_base, void *host_base, size_t bytes, const HostDeviceMemoryConfig *cfg, uint32_t owns_host_allocation, + void (*free_host_allocation)(void *) +) { + if (device_base == nullptr || host_base == nullptr || !valid_cfg(cfg)) return nullptr; + int rc = host_device_memory_init_region(host_base, bytes, cfg); + if (rc != HDMEM_OK) return nullptr; + 
HostDeviceMemory *mem = new (std::nothrow) HostDeviceMemory(); + if (mem == nullptr) return nullptr; + mem->device_base = device_base; + mem->host_base = host_base; + mem->bytes = host_device_memory_required_bytes(cfg); + mem->owns_host_allocation = owns_host_allocation; + mem->free_host_allocation = free_host_allocation; + return mem; +} + +void host_device_memory_destroy(HostDeviceMemory *mem) { + if (mem == nullptr) return; + if (mem->owns_host_allocation && mem->free_host_allocation != nullptr && mem->host_base != nullptr) { + mem->free_host_allocation(mem->host_base); + } + delete mem; +} + +int host_device_memory_info(HostDeviceMemory *mem, HostDeviceMemoryInfo *info) { + if (!valid_region(mem) || info == nullptr) return HDMEM_ERR_INVALID; + const auto *hdr = header_const(mem); + info->host_ptr = reinterpret_cast(mem->host_base) + hdr->data_offset; + info->device_ptr = device_data_addr(mem); + info->data_bytes = hdr->data_bytes; + info->signal_count = hdr->signal_count; + info->flags = hdr->flags; + return HDMEM_OK; +} + +int host_device_memory_read(HostDeviceMemory *mem, uint64_t offset, void *dst, size_t nbytes) { + return read_impl(mem, offset, dst, nbytes); +} + +int host_device_memory_write(HostDeviceMemory *mem, uint64_t offset, const void *src, size_t nbytes) { + return write_impl(mem, offset, src, nbytes); +} + +int host_device_memory_notify(HostDeviceMemory *mem, uint32_t signal_id, uint64_t value) { + return notify_impl(mem, signal_id, value); +} + +int host_device_memory_wait(HostDeviceMemory *mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + return wait_impl(mem, signal_id, target, timeout_us); +} + +int host_device_memory_read_l2_for_test(HostDeviceMemory *mem, uint64_t offset, void *dst, size_t nbytes) { + return read_impl(mem, offset, dst, nbytes); +} + +int host_device_memory_write_l2_for_test(HostDeviceMemory *mem, uint64_t offset, const void *src, size_t nbytes) { + return write_impl(mem, offset, src, nbytes); +} + +int 
host_device_memory_notify_l2_for_test(HostDeviceMemory *mem, uint32_t signal_id, uint64_t value) { + return notify_impl(mem, signal_id, value); +} + +int host_device_memory_wait_l2_for_test(HostDeviceMemory *mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us) { + return wait_impl(mem, signal_id, target, timeout_us); +} diff --git a/src/common/worker/host_device_memory.h b/src/common/worker/host_device_memory.h new file mode 100644 index 000000000..ca255b16c --- /dev/null +++ b/src/common/worker/host_device_memory.h @@ -0,0 +1,71 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#ifndef SRC_COMMON_WORKER_HOST_DEVICE_MEMORY_H_ +#define SRC_COMMON_WORKER_HOST_DEVICE_MEMORY_H_ + +#include +#include + +#include "pto_runtime_c_api.h" + +static constexpr uint32_t HDMEM_MAGIC = 0x48444D45U; // "HDME" +static constexpr uint32_t HDMEM_VERSION = 1; + +static constexpr int HDMEM_OK = 0; +static constexpr int HDMEM_ERR_WOULD_BLOCK = -11; +static constexpr int HDMEM_ERR_INVALID = -22; +static constexpr int HDMEM_ERR_BACKEND = -5; + +struct alignas(64) HostDeviceMemoryHeader { + uint32_t magic; + uint32_t version; + uint32_t flags; + uint32_t signal_count; + uint64_t data_offset; + uint64_t data_bytes; + uint64_t total_bytes; + uint64_t fatal_status; + uint64_t reserved[3]; +}; + +struct alignas(64) HostDeviceSignalSlot { + volatile uint64_t value; + uint64_t reserved[7]; +}; + +struct HostDeviceMemory { + void *device_base; + void *host_base; + size_t bytes; + uint32_t owns_host_allocation; + void (*free_host_allocation)(void *); +}; + +size_t host_device_memory_required_bytes(const HostDeviceMemoryConfig *cfg); +int host_device_memory_init_region(void *host_base, size_t bytes, const HostDeviceMemoryConfig *cfg); +HostDeviceMemory *host_device_memory_wrap( + void *device_base, void *host_base, size_t bytes, const HostDeviceMemoryConfig *cfg, uint32_t owns_host_allocation, + void (*free_host_allocation)(void *) +); +void host_device_memory_destroy(HostDeviceMemory *mem); +int host_device_memory_info(HostDeviceMemory *mem, HostDeviceMemoryInfo *info); +int host_device_memory_read(HostDeviceMemory *mem, uint64_t offset, void *dst, size_t nbytes); +int host_device_memory_write(HostDeviceMemory *mem, uint64_t offset, const void *src, size_t nbytes); +int host_device_memory_notify(HostDeviceMemory *mem, uint32_t signal_id, uint64_t value); +int host_device_memory_wait(HostDeviceMemory *mem, uint32_t signal_id, uint64_t target, uint32_t 
timeout_us); + +int host_device_memory_read_l2_for_test(HostDeviceMemory *mem, uint64_t offset, void *dst, size_t nbytes); +int host_device_memory_write_l2_for_test(HostDeviceMemory *mem, uint64_t offset, const void *src, size_t nbytes); +int host_device_memory_notify_l2_for_test(HostDeviceMemory *mem, uint32_t signal_id, uint64_t value); +int host_device_memory_wait_l2_for_test(HostDeviceMemory *mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us); + +#endif // SRC_COMMON_WORKER_HOST_DEVICE_MEMORY_H_ diff --git a/src/common/worker/pto_runtime_c_api.h b/src/common/worker/pto_runtime_c_api.h index 0e2e28cb2..0b2e71c83 100644 --- a/src/common/worker/pto_runtime_c_api.h +++ b/src/common/worker/pto_runtime_c_api.h @@ -47,6 +47,30 @@ extern "C" { typedef void *RuntimeHandle; typedef void *DeviceContextHandle; +typedef void *HostDeviceChannelHandle; +typedef void *HostDeviceMemoryHandle; + +typedef struct { + uint32_t lane_count_cpu_to_l2; + uint32_t lane_count_l2_to_cpu; + uint32_t lane_depth; + uint32_t max_message_bytes; + uint32_t flags; +} HostDeviceChannelConfig; + +typedef struct { + uint64_t data_bytes; + uint32_t signal_count; + uint32_t flags; +} HostDeviceMemoryConfig; + +typedef struct { + uint64_t host_ptr; + uint64_t device_ptr; + uint64_t data_bytes; + uint32_t signal_count; + uint32_t flags; +} HostDeviceMemoryInfo; /* =========================================================================== * Public API (resolved by ChipWorker via dlsym) @@ -80,6 +104,57 @@ int copy_to_device_ctx(DeviceContextHandle ctx, void *dev_ptr, const void *host_ /** Copy device memory to a host pointer within the given device context. */ int copy_from_device_ctx(DeviceContextHandle ctx, void *host_ptr, const void *dev_ptr, size_t size); +/** Open a bounded host/device message channel backed by host-mapped device memory. 
*/ +HostDeviceChannelHandle open_host_device_channel_ctx( + DeviceContextHandle ctx, const HostDeviceChannelConfig *cfg +); + +/** Close a channel returned by open_host_device_channel_ctx. */ +int close_host_device_channel_ctx(DeviceContextHandle ctx, HostDeviceChannelHandle ch); + +/** Send one inline message from L3 CPU toward L2. */ +int host_device_send_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, uint32_t route, const void *data, size_t nbytes, + uint64_t correlation_id, uint32_t timeout_us +); + +/** Receive one inline message from L2 toward L3 CPU. */ +int host_device_recv_ctx( + DeviceContextHandle ctx, HostDeviceChannelHandle ch, void *dst, size_t dst_capacity, size_t *out_nbytes, + uint64_t *out_correlation_id, uint32_t *out_route, uint32_t timeout_us +); + +/** Open a host/device shared-memory region with software signal slots. */ +HostDeviceMemoryHandle open_host_device_memory_ctx(DeviceContextHandle ctx, const HostDeviceMemoryConfig *cfg); + +/** Close a memory region returned by open_host_device_memory_ctx. */ +int close_host_device_memory_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem); + +/** Return host/device data pointers and region metadata for a shared-memory region. + * + * host_ptr is valid only in the process that owns the host mapping. Hierarchical + * mailbox callers that do not own that mapping expose host_ptr as 0. + */ +int host_device_memory_info_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, HostDeviceMemoryInfo *info); + +/** Copy from shared-memory data region into host dst. */ +int host_device_memory_read_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, void *dst, size_t nbytes +); + +/** Copy from host src into shared-memory data region. */ +int host_device_memory_write_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint64_t offset, const void *src, size_t nbytes +); + +/** Publish a software signal value after caller-visible writes. 
*/ +int host_device_memory_notify_ctx(DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t value); + +/** Wait until a software signal value reaches target, or return -EAGAIN/-EWOULDBLOCK style error. */ +int host_device_memory_wait_ctx( + DeviceContextHandle ctx, HostDeviceMemoryHandle mem, uint32_t signal_id, uint64_t target, uint32_t timeout_us +); + /** * One-shot platform-side init. Called once by ChipWorker::init() right * after dlopen, before any other entry. Three responsibilities, in order: diff --git a/tests/ut/cpp/CMakeLists.txt b/tests/ut/cpp/CMakeLists.txt index cc6e0b494..58fd80b99 100644 --- a/tests/ut/cpp/CMakeLists.txt +++ b/tests/ut/cpp/CMakeLists.txt @@ -47,8 +47,8 @@ else() set(GTEST_INCLUDE_DIRS /usr/local/include) endif() -find_library(GTEST_LIB gtest PATHS ${GTEST_SEARCH_PATHS}) -find_library(GTEST_MAIN_LIB gtest_main PATHS ${GTEST_SEARCH_PATHS}) +find_library(GTEST_LIB gtest PATHS ${GTEST_SEARCH_PATHS} NO_DEFAULT_PATH) +find_library(GTEST_MAIN_LIB gtest_main PATHS ${GTEST_SEARCH_PATHS} NO_DEFAULT_PATH) if(NOT GTEST_LIB OR NOT GTEST_MAIN_LIB) message(STATUS "System GoogleTest not found — fetching via FetchContent") @@ -132,6 +132,8 @@ add_library(hierarchical_objs OBJECT ${HIERARCHICAL_SRC_DIR}/scheduler.cpp ${HIERARCHICAL_SRC_DIR}/worker.cpp ${WORKER_SRC_DIR}/chip_worker.cpp + ${WORKER_SRC_DIR}/host_device_channel.cpp + ${WORKER_SRC_DIR}/host_device_memory.cpp ) target_include_directories(hierarchical_objs PUBLIC ${HIERARCHICAL_SRC_DIR} @@ -224,6 +226,26 @@ function(add_common_utils_test name src) set_tests_properties(${name} PROPERTIES LABELS "no_hardware") endfunction() +function(add_common_worker_test name src) + add_executable(${name} + ${src} + ${CMAKE_SOURCE_DIR}/../../../src/common/worker/host_device_channel.cpp + ${CMAKE_SOURCE_DIR}/../../../src/common/worker/host_device_memory.cpp + ) + target_include_directories(${name} PRIVATE + ${GTEST_INCLUDE_DIRS} + ${CMAKE_SOURCE_DIR}/../../../src/common/worker + ) 
+ target_compile_options(${name} PRIVATE -D_GLIBCXX_USE_CXX11_ABI=0) + target_link_libraries(${name} PRIVATE + ${GTEST_MAIN_LIB} + ${GTEST_LIB} + pthread + ) + add_test(NAME ${name} COMMAND ${name}) + set_tests_properties(${name} PROPERTIES LABELS "no_hardware") +endfunction() + enable_testing() # --------------------------------------------------------------------------- @@ -266,6 +288,8 @@ set_tests_properties(test_chip_callable_upload_immutable PROPERTIES LABELS "no_h # --------------------------------------------------------------------------- add_common_utils_test(test_elf_build_id common/test_elf_build_id.cpp) add_common_utils_test(test_runtime_orch_so common/test_runtime_orch_so.cpp) +add_common_worker_test(test_host_device_channel common/test_host_device_channel.cpp) +add_common_worker_test(test_host_device_memory common/test_host_device_memory.cpp) # Per-callable_id orch SO file naming regression (see rtStreamSynchronize # 507018 root cause). Compiles the a2a3 onboard `create_orch_so_file` diff --git a/tests/ut/cpp/common/test_host_device_channel.cpp b/tests/ut/cpp/common/test_host_device_channel.cpp new file mode 100644 index 000000000..ef5bcaa97 --- /dev/null +++ b/tests/ut/cpp/common/test_host_device_channel.cpp @@ -0,0 +1,118 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#include + +#include +#include +#include + +#include "host_device_channel.h" + +namespace { + +HostDeviceChannelConfig cfg(uint32_t c2l = 2, uint32_t l2c = 2, uint32_t depth = 4, uint32_t bytes = 64) { + return HostDeviceChannelConfig{c2l, l2c, depth, bytes, 0}; +} + +HostDeviceChannel *make_channel(const HostDeviceChannelConfig &c) { + size_t bytes = host_device_channel_required_bytes(&c); + void *base = nullptr; + EXPECT_EQ(posix_memalign(&base, 64, bytes), 0); + auto *ch = host_device_channel_wrap(base, base, bytes, &c, 1, free); + EXPECT_NE(ch, nullptr); + return ch; +} + +} // namespace + +TEST(HostDeviceChannelTest, RejectsInvalidConfig) { + auto bad_depth = cfg(1, 1, 3, 64); + EXPECT_EQ(host_device_channel_required_bytes(&bad_depth), 0u); + + auto too_large = cfg(1, 1, 4, HDCH_MAX_INLINE_BYTES + 1); + EXPECT_EQ(host_device_channel_required_bytes(&too_large), 0u); +} + +TEST(HostDeviceChannelTest, RejectsOverflowingRequiredBytes) { + auto huge_layout = cfg(std::numeric_limits::max(), std::numeric_limits::max(), 1U << 31U, 64); + EXPECT_EQ(host_device_channel_required_bytes(&huge_layout), 0u); +} + +TEST(HostDeviceChannelTest, CpuToL2SendRecvRoundTrip) { + auto c = cfg(); + HostDeviceChannel *ch = make_channel(c); + + const char msg[] = "hello-l2"; + EXPECT_EQ(host_device_channel_send_cpu(ch, 7, msg, sizeof(msg) - 1, 42, 0), HDCH_OK); + + uint8_t out[64]{}; + size_t nbytes = 0; + uint64_t cid = 0; + uint32_t route = 0; + EXPECT_EQ(host_device_channel_recv_l2_for_test(ch, out, sizeof(out), &nbytes, &cid, &route, 0), HDCH_OK); + EXPECT_EQ(nbytes, sizeof(msg) - 1); + EXPECT_EQ(memcmp(out, msg, nbytes), 0); + EXPECT_EQ(route, 7u); + EXPECT_EQ(cid, 42u); + + host_device_channel_destroy(ch); +} + +TEST(HostDeviceChannelTest, L2ToCpuSendRecvRoundTrip) { + auto c = cfg(); + HostDeviceChannel *ch = make_channel(c); + + const char msg[] = "hello-cpu"; 
+ EXPECT_EQ(host_device_channel_send_l2_for_test(ch, 3, msg, sizeof(msg) - 1, 99, 0), HDCH_OK); + + uint8_t out[64]{}; + size_t nbytes = 0; + uint64_t cid = 0; + uint32_t route = 0; + EXPECT_EQ(host_device_channel_recv_cpu(ch, out, sizeof(out), &nbytes, &cid, &route, 0), HDCH_OK); + EXPECT_EQ(nbytes, sizeof(msg) - 1); + EXPECT_EQ(memcmp(out, msg, nbytes), 0); + EXPECT_EQ(route, 3u); + EXPECT_EQ(cid, 99u); + + host_device_channel_destroy(ch); +} + +TEST(HostDeviceChannelTest, FullAndEmptyReturnWouldBlock) { + auto c = cfg(1, 1, 2, 16); + HostDeviceChannel *ch = make_channel(c); + const char payload[] = "x"; + + EXPECT_EQ(host_device_channel_recv_cpu(ch, nullptr, 0, nullptr, nullptr, nullptr, 0), HDCH_ERR_INVALID); + EXPECT_EQ(host_device_channel_send_cpu(ch, 0, payload, 1, 0, 0), HDCH_OK); + EXPECT_EQ(host_device_channel_send_cpu(ch, 0, payload, 1, 1, 0), HDCH_OK); + EXPECT_EQ(host_device_channel_send_cpu(ch, 0, payload, 1, 2, 0), HDCH_ERR_WOULD_BLOCK); + + uint8_t out[16]{}; + size_t nbytes = 0; + uint64_t cid = 0; + uint32_t route = 0; + EXPECT_EQ(host_device_channel_recv_l2_for_test(ch, out, sizeof(out), &nbytes, &cid, &route, 0), HDCH_OK); + EXPECT_EQ(host_device_channel_send_cpu(ch, 0, payload, 1, 3, 0), HDCH_OK); + + host_device_channel_destroy(ch); +} + +TEST(HostDeviceChannelTest, MessageTooLarge) { + auto c = cfg(1, 1, 4, 4); + HostDeviceChannel *ch = make_channel(c); + const char payload[] = "12345"; + + EXPECT_EQ(host_device_channel_send_cpu(ch, 0, payload, 5, 0, 0), HDCH_ERR_MSG_TOO_LARGE); + + host_device_channel_destroy(ch); +} diff --git a/tests/ut/cpp/common/test_host_device_memory.cpp b/tests/ut/cpp/common/test_host_device_memory.cpp new file mode 100644 index 000000000..af477aabc --- /dev/null +++ b/tests/ut/cpp/common/test_host_device_memory.cpp @@ -0,0 +1,128 @@ +/* + * Copyright (c) PyPTO Contributors. 
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#include + +#include +#include +#include + +#include "host_device_memory.h" + +namespace { + +HostDeviceMemoryConfig cfg(uint64_t data_bytes = 128, uint32_t signal_count = 2, uint32_t flags = 0) { + return HostDeviceMemoryConfig{data_bytes, signal_count, flags}; +} + +HostDeviceMemory *make_memory(const HostDeviceMemoryConfig &c) { + size_t bytes = host_device_memory_required_bytes(&c); + void *base = nullptr; + EXPECT_EQ(posix_memalign(&base, 64, bytes), 0); + auto *mem = host_device_memory_wrap(base, base, bytes, &c, 1, free); + EXPECT_NE(mem, nullptr); + return mem; +} + +} // namespace + +TEST(HostDeviceMemoryTest, RejectsInvalidConfig) { + auto no_data = cfg(0, 1); + EXPECT_EQ(host_device_memory_required_bytes(&no_data), 0u); + + auto no_signals = cfg(64, 0); + EXPECT_EQ(host_device_memory_required_bytes(&no_signals), 0u); +} + +TEST(HostDeviceMemoryTest, RejectsOverflowingRequiredBytes) { + auto too_large_data = cfg(std::numeric_limits::max(), 1); + EXPECT_EQ(host_device_memory_required_bytes(&too_large_data), 0u); + + auto align_overflow = cfg(static_cast(std::numeric_limits::max() - 31U), 1); + EXPECT_EQ(host_device_memory_required_bytes(&align_overflow), 0u); +} + +TEST(HostDeviceMemoryTest, InfoReturnsPointersAndShape) { + auto c = cfg(256, 3, 7); + 
HostDeviceMemory *mem = make_memory(c); + + HostDeviceMemoryInfo info{}; + EXPECT_EQ(host_device_memory_info(mem, &info), HDMEM_OK); + EXPECT_NE(info.host_ptr, 0u); + EXPECT_NE(info.device_ptr, 0u); + EXPECT_EQ(info.host_ptr, info.device_ptr); + EXPECT_EQ(info.data_bytes, 256u); + EXPECT_EQ(info.signal_count, 3u); + EXPECT_EQ(info.flags, 7u); + + host_device_memory_destroy(mem); +} + +TEST(HostDeviceMemoryTest, CpuWriteReadRoundTrip) { + HostDeviceMemory *mem = make_memory(cfg()); + const char msg[] = "shared-memory"; + + EXPECT_EQ(host_device_memory_write(mem, 8, msg, sizeof(msg) - 1), HDMEM_OK); + + uint8_t out[32]{}; + EXPECT_EQ(host_device_memory_read(mem, 8, out, sizeof(msg) - 1), HDMEM_OK); + EXPECT_EQ(std::memcmp(out, msg, sizeof(msg) - 1), 0); + + host_device_memory_destroy(mem); +} + +TEST(HostDeviceMemoryTest, L2WriteCpuReadAndCpuWriteL2Read) { + HostDeviceMemory *mem = make_memory(cfg()); + const char l2_msg[] = "from-l2"; + const char cpu_msg[] = "from-cpu"; + + EXPECT_EQ(host_device_memory_write_l2_for_test(mem, 0, l2_msg, sizeof(l2_msg) - 1), HDMEM_OK); + uint8_t cpu_out[16]{}; + EXPECT_EQ(host_device_memory_read(mem, 0, cpu_out, sizeof(l2_msg) - 1), HDMEM_OK); + EXPECT_EQ(std::memcmp(cpu_out, l2_msg, sizeof(l2_msg) - 1), 0); + + EXPECT_EQ(host_device_memory_write(mem, 32, cpu_msg, sizeof(cpu_msg) - 1), HDMEM_OK); + uint8_t l2_out[16]{}; + EXPECT_EQ(host_device_memory_read_l2_for_test(mem, 32, l2_out, sizeof(cpu_msg) - 1), HDMEM_OK); + EXPECT_EQ(std::memcmp(l2_out, cpu_msg, sizeof(cpu_msg) - 1), 0); + + host_device_memory_destroy(mem); +} + +TEST(HostDeviceMemoryTest, BoundsReject) { + HostDeviceMemory *mem = make_memory(cfg(16, 1)); + const char payload[] = "abcd"; + uint8_t out[4]{}; + + EXPECT_EQ(host_device_memory_write(mem, 13, payload, 4), HDMEM_ERR_INVALID); + EXPECT_EQ(host_device_memory_read(mem, 13, out, 4), HDMEM_ERR_INVALID); + EXPECT_EQ(host_device_memory_write(mem, 16, payload, 0), HDMEM_OK); + EXPECT_EQ(host_device_memory_read(mem, 16, 
out, 0), HDMEM_OK); + + host_device_memory_destroy(mem); +} + +TEST(HostDeviceMemoryTest, NotifyWait) { + HostDeviceMemory *mem = make_memory(cfg(64, 2)); + + EXPECT_EQ(host_device_memory_wait(mem, 0, 1, 0), HDMEM_ERR_WOULD_BLOCK); + EXPECT_EQ(host_device_memory_notify_l2_for_test(mem, 0, 5), HDMEM_OK); + EXPECT_EQ(host_device_memory_wait(mem, 0, 5, 0), HDMEM_OK); + + EXPECT_EQ(host_device_memory_wait_l2_for_test(mem, 1, 7, 0), HDMEM_ERR_WOULD_BLOCK); + EXPECT_EQ(host_device_memory_notify(mem, 1, 7), HDMEM_OK); + EXPECT_EQ(host_device_memory_wait_l2_for_test(mem, 1, 7, 0), HDMEM_OK); + + EXPECT_EQ(host_device_memory_notify(mem, 2, 1), HDMEM_ERR_INVALID); + EXPECT_EQ(host_device_memory_wait(mem, 2, 1, 0), HDMEM_ERR_INVALID); + + host_device_memory_destroy(mem); +} diff --git a/tests/ut/cpp/hierarchical/test_scheduler.cpp b/tests/ut/cpp/hierarchical/test_scheduler.cpp index 2fc7ba8c1..90125cdf1 100644 --- a/tests/ut/cpp/hierarchical/test_scheduler.cpp +++ b/tests/ut/cpp/hierarchical/test_scheduler.cpp @@ -18,7 +18,9 @@ #include #include #include +#include #include +#include #include #include "call_config.h" @@ -56,9 +58,19 @@ struct MockMailboxWorker { uint64_t tensor_key; // first tensor's `data` field (unique per submit in tests) }; + struct ControlRecord { + uint64_t command; + uint64_t offset; + uint64_t nbytes; + std::vector payload; + }; + alignas(8) std::array mailbox{}; std::vector dispatched; std::mutex dispatched_mu; + std::vector controls; + std::vector shared_memory_data; + std::mutex control_mu; std::mutex run_mu; std::condition_variable run_cv; @@ -165,12 +177,46 @@ struct MockMailboxWorker { } else if (s == MailboxState::CONTROL_REQUEST) { // Acknowledge the control request so a future test using // WorkerThread::control_* doesn't hang on the spin-poll. - // No memory operation is simulated — result stays zero. 
                 int32_t zero_err = 0;
                 std::memcpy(mailbox.data() + MAILBOX_OFF_ERROR, &zero_err, sizeof(int32_t));
                 std::memset(mailbox.data() + MAILBOX_OFF_ERROR_MSG, 0, MAILBOX_ERROR_MSG_SIZE);
                 uint64_t zero_result = 0;
                 std::memcpy(mailbox.data() + CTRL_OFF_RESULT, &zero_result, sizeof(uint64_t));
+                uint64_t command = 0;
+                uint64_t offset = 0;
+                uint64_t nbytes = 0;
+                std::memcpy(&command, mailbox.data() + MAILBOX_OFF_CALLABLE, sizeof(uint64_t));
+                std::memcpy(&offset, mailbox.data() + CTRL_OFF_ARG1, sizeof(uint64_t));
+                std::memcpy(&nbytes, mailbox.data() + CTRL_OFF_ARG2, sizeof(uint64_t));
+                {
+                    std::lock_guard lk(control_mu);
+                    ControlRecord rec{command, offset, nbytes, {}};
+                    if (command == CTRL_SHARED_MEMORY_INFO) {
+                        uint64_t data_bytes = shared_memory_data.size();
+                        uint64_t signal_count = 2;
+                        uint64_t flags = 0;
+                        std::memcpy(mailbox.data() + CTRL_OFF_ARG2, &data_bytes, sizeof(uint64_t));
+                        std::memcpy(mailbox.data() + CTRL_OFF_ARG3, &signal_count, sizeof(uint64_t));
+                        std::memcpy(mailbox.data() + CTRL_OFF_ARG4, &flags, sizeof(uint64_t));
+                    } else if (command == CTRL_SHARED_MEMORY_WRITE && nbytes != 0) {
+                        rec.payload.resize(static_cast(nbytes));
+                        std::memcpy(rec.payload.data(), mailbox.data() + CTRL_OFF_PAYLOAD, static_cast(nbytes));
+                        if (offset + nbytes > shared_memory_data.size()) shared_memory_data.resize(offset + nbytes);
+                        std::memcpy(
+                            shared_memory_data.data() + offset, mailbox.data() + CTRL_OFF_PAYLOAD,
+                            static_cast(nbytes)
+                        );
+                    } else if (command == CTRL_SHARED_MEMORY_READ) {
+                        if (offset + nbytes <= shared_memory_data.size() && nbytes != 0) {
+                            std::memcpy(
+                                mailbox.data() + CTRL_OFF_PAYLOAD, shared_memory_data.data() + offset,
+                                static_cast(nbytes)
+                            );
+                        }
+                        std::memcpy(mailbox.data() + CTRL_OFF_RESULT, &nbytes, sizeof(uint64_t));
+                    }
+                    controls.push_back(std::move(rec));
+                }
                 write_state(MailboxState::CONTROL_DONE);
             } else if (s == MailboxState::SHUTDOWN) {
                 return;
@@ -265,6 +311,88 @@ struct SchedulerFixture : public ::testing::Test {
 // Tests
 // ---------------------------------------------------------------------------
 
+TEST_F(SchedulerFixture, SharedMemoryReadChunksOverMailboxPayload) {
+    const uint64_t offset = 5;
+    const size_t nbytes = CTRL_PAYLOAD_CAPACITY * 2 + 17;
+    mock_worker.shared_memory_data.resize(static_cast(offset) + nbytes);
+    for (size_t i = 0; i < mock_worker.shared_memory_data.size(); ++i) {
+        mock_worker.shared_memory_data[i] = static_cast(i % 251);
+    }
+
+    auto *wt = manager.get_worker(WorkerType::NEXT_LEVEL, 0);
+    ASSERT_NE(wt, nullptr);
+    std::vector out = wt->control_shared_memory_read(0x42, offset, nbytes);
+
+    ASSERT_EQ(out.size(), nbytes);
+    EXPECT_EQ(
+        out, std::vector(
+                 mock_worker.shared_memory_data.begin() + static_cast(offset),
+                 mock_worker.shared_memory_data.begin() + static_cast(offset + nbytes)
+             )
+    );
+    std::lock_guard lk(mock_worker.control_mu);
+    ASSERT_EQ(mock_worker.controls.size(), 4u);
+    EXPECT_EQ(mock_worker.controls[0].command, CTRL_SHARED_MEMORY_INFO);
+    EXPECT_EQ(mock_worker.controls[1].command, CTRL_SHARED_MEMORY_READ);
+    EXPECT_EQ(mock_worker.controls[1].offset, offset);
+    EXPECT_EQ(mock_worker.controls[1].nbytes, CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(mock_worker.controls[2].offset, offset + CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(mock_worker.controls[2].nbytes, CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(mock_worker.controls[3].offset, offset + CTRL_PAYLOAD_CAPACITY * 2);
+    EXPECT_EQ(mock_worker.controls[3].nbytes, 17u);
+}
+
+TEST_F(SchedulerFixture, SharedMemoryReadRejectsOutOfRangeBeforeChunkRead) {
+    mock_worker.shared_memory_data.resize(32);
+
+    auto *wt = manager.get_worker(WorkerType::NEXT_LEVEL, 0);
+    ASSERT_NE(wt, nullptr);
+    EXPECT_THROW(static_cast(wt->control_shared_memory_read(0x42, 28, 8)), std::out_of_range);
+
+    std::lock_guard lk(mock_worker.control_mu);
+    ASSERT_EQ(mock_worker.controls.size(), 1u);
+    EXPECT_EQ(mock_worker.controls[0].command, CTRL_SHARED_MEMORY_INFO);
+}
+
+TEST_F(SchedulerFixture, SharedMemoryWriteChunksOverMailboxPayload) {
+    const uint64_t offset = 7;
+    std::vector payload(CTRL_PAYLOAD_CAPACITY * 2 + 19);
+    for (size_t i = 0; i < payload.size(); ++i)
+        payload[i] = static_cast(i % 253);
+
+    auto *wt = manager.get_worker(WorkerType::NEXT_LEVEL, 0);
+    ASSERT_NE(wt, nullptr);
+    wt->control_shared_memory_write(0x43, offset, payload.data(), payload.size());
+
+    std::lock_guard lk(mock_worker.control_mu);
+    ASSERT_EQ(mock_worker.controls.size(), 3u);
+    EXPECT_EQ(mock_worker.controls[0].command, CTRL_SHARED_MEMORY_WRITE);
+    EXPECT_EQ(mock_worker.controls[0].offset, offset);
+    EXPECT_EQ(mock_worker.controls[0].nbytes, CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(
+        mock_worker.controls[0].payload, std::vector(payload.begin(), payload.begin() + CTRL_PAYLOAD_CAPACITY)
+    );
+    EXPECT_EQ(mock_worker.controls[1].offset, offset + CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(mock_worker.controls[1].nbytes, CTRL_PAYLOAD_CAPACITY);
+    EXPECT_EQ(
+        mock_worker.controls[1].payload,
+        std::vector(payload.begin() + CTRL_PAYLOAD_CAPACITY, payload.begin() + CTRL_PAYLOAD_CAPACITY * 2)
+    );
+    EXPECT_EQ(mock_worker.controls[2].offset, offset + CTRL_PAYLOAD_CAPACITY * 2);
+    EXPECT_EQ(mock_worker.controls[2].nbytes, 19u);
+    EXPECT_EQ(
+        mock_worker.controls[2].payload,
+        std::vector(payload.begin() + CTRL_PAYLOAD_CAPACITY * 2, payload.end())
+    );
+    EXPECT_EQ(
+        std::vector(
+            mock_worker.shared_memory_data.begin() + static_cast(offset),
+            mock_worker.shared_memory_data.begin() + static_cast(offset + payload.size())
+        ),
+        payload
+    );
+}
+
 TEST_F(SchedulerFixture, IndependentTaskDispatchedAndConsumed) {
     auto args_a = single_tensor_args(0xCAFE, TensorArgType::OUTPUT);
     auto res = orch.submit_next_level(42, args_a, cfg);
diff --git a/tests/ut/py/test_worker/test_host_device_comm_hw.py b/tests/ut/py/test_worker/test_host_device_comm_hw.py
new file mode 100644
index 000000000..d7ed6dcee
--- /dev/null
+++ b/tests/ut/py/test_worker/test_host_device_comm_hw.py
@@ -0,0 +1,176 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""a2a3 onboard host/device channel + shared-memory smoke tests."""
+
+import pytest
+
+import simpler.worker as worker_mod
+from simpler.task_interface import CallConfig
+from simpler.worker import Worker
+
+
+@pytest.fixture
+def l3_a2a3_worker(st_device_ids):
+    device_id = int(st_device_ids[0])
+    worker = Worker(
+        level=3,
+        platform="a2a3",
+        runtime="tensormap_and_ringbuffer",
+        device_ids=[device_id],
+        num_sub_workers=0,
+    )
+    try:
+        worker.init()
+    except FileNotFoundError as e:
+        pytest.skip(f"a2a3 runtime binaries unavailable: {e}")
+    try:
+        yield worker
+    finally:
+        worker.close()
+
+
+def _test_ctrl(name: str) -> int:
+    if not hasattr(worker_mod, name):
+        pytest.fail(f"missing {name}; a2a3 onboard smoke needs child-side queue seeding helper")
+    return int(getattr(worker_mod, name))
+
+
+def _child_channel_send_l2(worker: Worker, channel: int, route: int, payload: bytes, correlation_id: int) -> None:
+    worker._chip_control_payload(
+        0,
+        _test_ctrl("_CTRL_TEST_CHANNEL_SEND_L2"),
+        arg0=channel,
+        arg1=route,
+        arg2=len(payload),
+        arg3=correlation_id,
+        payload=payload,
+    )
+
+
+def _child_channel_recv_l2(worker: Worker, channel: int, capacity: int = 64) -> tuple[bytes, int, int]:
+    nbytes, payload, _arg1, _arg2, route, correlation_id = worker._chip_control_payload(
+        0,
+        _test_ctrl("_CTRL_TEST_CHANNEL_RECV_L2"),
+        arg0=channel,
+        arg1=capacity,
+        arg2=1000,
+        recv_capacity=capacity,
+    )
+    return payload[:nbytes], route, correlation_id
+
+
+@pytest.mark.requires_hardware
+@pytest.mark.platforms(["a2a3"])
+@pytest.mark.device_count(1)
+def test_a2a3_onboard_channel_round_trip(l3_a2a3_worker):
+    opened: dict[str, int] = {}
+
+    def open_and_send(orch, _args, _cfg):
+        channel = orch.open_channel(
+            worker_id=0,
+            cpu_to_l2_lanes=1,
+            l2_to_cpu_lanes=1,
+            lane_depth=4,
+            max_message_bytes=64,
+        )
+        opened["channel"] = channel
+        orch.channel_send(0, channel, route=7, data=b"cpu-to-l2", correlation_id=0x1234)
+
+    l3_a2a3_worker.run(open_and_send, args=None, config=CallConfig())
+
+    channel = opened["channel"]
+    payload, route, correlation_id = _child_channel_recv_l2(l3_a2a3_worker, channel)
+    assert payload == b"cpu-to-l2"
+    assert route == 7
+    assert correlation_id == 0x1234
+
+    _child_channel_send_l2(l3_a2a3_worker, channel, route=3, payload=b"l2-to-cpu", correlation_id=0x5678)
+
+    received: dict[str, tuple[bytes, int, int]] = {}
+
+    def recv_and_close(orch, _args, _cfg):
+        try:
+            received["message"] = orch.channel_recv(0, channel, capacity=64, timeout_us=1000)
+        finally:
+            orch.close_channel(0, channel)
+
+    l3_a2a3_worker.run(recv_and_close, args=None, config=CallConfig())
+
+    data, route, correlation_id = received["message"]
+    assert data == b"l2-to-cpu"
+    assert route == 3
+    assert correlation_id == 0x5678
+
+
+@pytest.mark.requires_hardware
+@pytest.mark.platforms(["a2a3"])
+@pytest.mark.device_count(1)
+def test_a2a3_onboard_shared_memory_chunked_round_trip(l3_a2a3_worker):
+    payload = bytes(i % 251 for i in range(worker_mod._CTRL_PAYLOAD_CAPACITY + 23))
+    seen: dict[str, object] = {}
+
+    def shm_round_trip(orch, _args, _cfg):
+        memory = orch.open_shared_memory(0, data_bytes=len(payload) + 16, signal_count=2, flags=7)
+        try:
+            seen["info"] = orch.shared_memory_info(0, memory)
+            orch.shared_memory_write(0, memory, 5, payload)
+            seen["readback"] = orch.shared_memory_read(0, memory, 5, len(payload))
+            orch.shared_memory_notify(0, memory, signal_id=1, value=9)
+            orch.shared_memory_wait(0, memory, signal_id=1, target=9, timeout_us=0)
+        finally:
+            orch.close_shared_memory(0, memory)
+
+    l3_a2a3_worker.run(shm_round_trip, args=None, config=CallConfig())
+
+    host_ptr, device_ptr, data_bytes, signal_count, flags = seen["info"]
+    assert host_ptr == 0
+    assert device_ptr != 0
+    assert data_bytes == len(payload) + 16
+    assert signal_count == 2
+    assert flags == 7
+    assert seen["readback"] == payload
+
+
+@pytest.mark.requires_hardware
+@pytest.mark.platforms(["a2a3"])
+@pytest.mark.device_count(1)
+def test_a2a3_onboard_host_device_errors_are_diagnostic(l3_a2a3_worker):
+    def invalid_open(orch, _args, _cfg):
+        orch.open_channel(
+            worker_id=0,
+            cpu_to_l2_lanes=1,
+            l2_to_cpu_lanes=1,
+            lane_depth=3,
+            max_message_bytes=64,
+        )
+
+    with pytest.raises(RuntimeError, match="open_channel.*invalid config"):
+        l3_a2a3_worker.run(invalid_open, args=None, config=CallConfig())
+
+    def oversized_send(orch, _args, _cfg):
+        channel = orch.open_channel(
+            worker_id=0,
+            cpu_to_l2_lanes=1,
+            l2_to_cpu_lanes=1,
+            lane_depth=4,
+            max_message_bytes=64,
+        )
+        try:
+            orch.channel_send(
+                0,
+                channel,
+                route=1,
+                data=b"x" * (worker_mod._CTRL_PAYLOAD_CAPACITY + 1),
+                correlation_id=0,
+            )
+        finally:
+            orch.close_channel(0, channel)
+
+    with pytest.raises(ValueError, match="payload too large"):
+        l3_a2a3_worker.run(oversized_send, args=None, config=CallConfig())
diff --git a/tests/ut/py/test_worker/test_host_device_comm_sim.py b/tests/ut/py/test_worker/test_host_device_comm_sim.py
new file mode 100644
index 000000000..26282dac4
--- /dev/null
+++ b/tests/ut/py/test_worker/test_host_device_comm_sim.py
@@ -0,0 +1,144 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""Sim L3 host/device channel + shared-memory mailbox protocol smoke tests."""
+
+import pytest
+
+import simpler.worker as worker_mod
+from simpler.task_interface import CallConfig
+from simpler.worker import Worker
+
+
+_SIM_PLATFORMS = ("a2a3sim", "a5sim")
+
+
+@pytest.fixture(params=_SIM_PLATFORMS, ids=_SIM_PLATFORMS)
+def sim_platform(request):
+    selected = request.config.getoption("--platform", default=None)
+    if selected and selected != request.param:
+        pytest.skip(f"requires --platform {request.param}")
+    return request.param
+
+
+@pytest.fixture
+def l3_sim_worker(sim_platform):
+    worker = Worker(
+        level=3,
+        platform=sim_platform,
+        runtime="tensormap_and_ringbuffer",
+        device_ids=[0],
+        num_sub_workers=0,
+    )
+    try:
+        worker.init()
+    except FileNotFoundError as e:
+        pytest.skip(f"{sim_platform} runtime binaries unavailable: {e}")
+    try:
+        yield worker
+    finally:
+        worker.close()
+
+
+def _test_ctrl(name: str) -> int:
+    if not hasattr(worker_mod, name):
+        pytest.fail(f"missing {name}; L3 channel smoke needs child-side queue seeding helper")
+    return int(getattr(worker_mod, name))
+
+
+def _child_channel_send_l2(worker: Worker, channel: int, route: int, payload: bytes, correlation_id: int) -> None:
+    worker._chip_control_payload(
+        0,
+        _test_ctrl("_CTRL_TEST_CHANNEL_SEND_L2"),
+        arg0=channel,
+        arg1=route,
+        arg2=len(payload),
+        arg3=correlation_id,
+        payload=payload,
+    )
+
+
+def _child_channel_recv_l2(worker: Worker, channel: int, capacity: int = 64) -> tuple[bytes, int, int]:
+    nbytes, payload, _arg1, _arg2, route, correlation_id = worker._chip_control_payload(
+        0,
+        _test_ctrl("_CTRL_TEST_CHANNEL_RECV_L2"),
+        arg0=channel,
+        arg1=capacity,
+        arg2=1000,
+        recv_capacity=capacity,
+    )
+    return payload[:nbytes], route, correlation_id
+
+
+def test_l3_channel_mailbox_protocol_round_trip(l3_sim_worker):
+    opened: dict[str, int] = {}
+
+    def open_and_send(orch, _args, _cfg):
+        channel = orch.open_channel(
+            worker_id=0,
+            cpu_to_l2_lanes=1,
+            l2_to_cpu_lanes=1,
+            lane_depth=4,
+            max_message_bytes=64,
+        )
+        opened["channel"] = channel
+        orch.channel_send(0, channel, route=7, data=b"cpu-to-l2", correlation_id=0x1234)
+
+    l3_sim_worker.run(open_and_send, args=None, config=CallConfig())
+
+    channel = opened["channel"]
+    payload, route, correlation_id = _child_channel_recv_l2(l3_sim_worker, channel)
+    assert payload == b"cpu-to-l2"
+    assert route == 7
+    assert correlation_id == 0x1234
+
+    _child_channel_send_l2(l3_sim_worker, channel, route=3, payload=b"l2-to-cpu", correlation_id=0x5678)
+
+    received: dict[str, tuple[bytes, int, int]] = {}
+
+    def recv_from_l2(orch, _args, _cfg):
+        received["message"] = orch.channel_recv(0, channel, capacity=64, timeout_us=1000)
+
+    l3_sim_worker.run(recv_from_l2, args=None, config=CallConfig())
+
+    data, route, correlation_id = received["message"]
+    assert data == b"l2-to-cpu"
+    assert route == 3
+    assert correlation_id == 0x5678
+
+    def close_channel(orch, _args, _cfg):
+        orch.close_channel(0, channel)
+
+    l3_sim_worker.run(close_channel, args=None, config=CallConfig())
+
+
+def test_l3_shared_memory_mailbox_protocol_smoke(l3_sim_worker):
+    payload = bytes(i % 251 for i in range(worker_mod._CTRL_PAYLOAD_CAPACITY + 23))
+    seen: dict[str, object] = {}
+
+    def shm_round_trip(orch, _args, _cfg):
+        memory = orch.open_shared_memory(0, data_bytes=len(payload) + 16, signal_count=2, flags=7)
+        try:
+            info = orch.shared_memory_info(0, memory)
+            seen["info"] = info
+            orch.shared_memory_write(0, memory, 5, payload)
+            seen["readback"] = orch.shared_memory_read(0, memory, 5, len(payload))
+            orch.shared_memory_notify(0, memory, signal_id=1, value=9)
+            orch.shared_memory_wait(0, memory, signal_id=1, target=9, timeout_us=0)
+        finally:
+            orch.close_shared_memory(0, memory)
+
+    l3_sim_worker.run(shm_round_trip, args=None, config=CallConfig())
+
+    host_ptr, device_ptr, data_bytes, signal_count, flags = seen["info"]
+    assert host_ptr == 0
+    assert device_ptr != 0
+    assert data_bytes == len(payload) + 16
+    assert signal_count == 2
+    assert flags == 7
+    assert seen["readback"] == payload
diff --git a/tests/ut/py/test_worker/test_host_worker.py b/tests/ut/py/test_worker/test_host_worker.py
index 4a5d11079..f8e546659 100644
--- a/tests/ut/py/test_worker/test_host_worker.py
+++ b/tests/ut/py/test_worker/test_host_worker.py
@@ -17,9 +17,17 @@ from multiprocessing.shared_memory import SharedMemory
 
 import pytest
 
+import _task_interface as ti  # pyright: ignore[reportMissingImports]
 from _task_interface import MAX_REGISTERED_CALLABLE_IDS  # pyright: ignore[reportMissingImports]
 from simpler.task_interface import ChipCallable, DataType, TaskArgs, TensorArgType
-from simpler.worker import Worker
+import simpler.worker as worker_mod
+from simpler.worker import (
+    _CTRL_PAYLOAD_CAPACITY,
+    _CTRL_SHARED_MEMORY_INFO,
+    _CTRL_SHARED_MEMORY_READ,
+    _CTRL_SHARED_MEMORY_WRITE,
+    Worker,
+)
 
 # ---------------------------------------------------------------------------
 # Helpers
@@ -44,6 +52,149 @@ def _increment_counter(buf) -> None:
     struct.pack_into("i", buf, 0, v + 1)
 
 
+# ---------------------------------------------------------------------------
+# Test: L3 shared-memory metadata
+# ---------------------------------------------------------------------------
+
+
+def test_worker_control_protocol_constants_match_binding():
+    names = (
+        "CTRL_MALLOC CTRL_FREE CTRL_COPY_TO CTRL_COPY_FROM CTRL_PREPARE CTRL_REGISTER CTRL_UNREGISTER "
+        "CTRL_OPEN_CHANNEL CTRL_CLOSE_CHANNEL CTRL_CHANNEL_SEND CTRL_CHANNEL_RECV CTRL_OPEN_SHARED_MEMORY "
+        "CTRL_CLOSE_SHARED_MEMORY CTRL_SHARED_MEMORY_INFO CTRL_SHARED_MEMORY_READ CTRL_SHARED_MEMORY_WRITE "
+        "CTRL_SHARED_MEMORY_NOTIFY CTRL_SHARED_MEMORY_WAIT CTRL_OFF_ARG0 CTRL_OFF_ARG1 CTRL_OFF_ARG2 "
+        "CTRL_OFF_RESULT CTRL_OFF_ARG3 CTRL_OFF_ARG4 CTRL_OFF_PAYLOAD CTRL_PAYLOAD_CAPACITY "
+        "CTRL_SHM_NAME_BYTES"
+    ).split()
+    for name in names:
+        assert getattr(worker_mod, f"_{name}") == getattr(ti, name)
+    assert worker_mod._OFF_ARGS == ti.MAILBOX_OFF_ARGS
+    assert worker_mod._MAILBOX_ARGS_CAPACITY == ti.MAILBOX_ARGS_CAPACITY
+
+
+class TestSharedMemoryInfo:
+    def test_l3_shared_memory_info_hides_child_host_ptr(self, monkeypatch):
+        hw = Worker(level=3, device_ids=[0], num_sub_workers=0)
+
+        def fake_chip_control_payload(*args, **kwargs):
+            return 0x12345678, b"", 0xABCDEF00, 256, 3, 7
+
+        monkeypatch.setattr(hw, "_chip_control_payload", fake_chip_control_payload)
+
+        host_ptr, device_ptr, data_bytes, signal_count, flags = hw.shared_memory_info(0x42)
+
+        assert host_ptr == 0
+        assert device_ptr == 0xABCDEF00
+        assert data_bytes == 256
+        assert signal_count == 3
+        assert flags == 7
+
+    def test_l3_shared_memory_read_chunks_over_mailbox_payload(self, monkeypatch):
+        hw = Worker(level=3, device_ids=[0], num_sub_workers=0)
+        nbytes = _CTRL_PAYLOAD_CAPACITY * 2 + 17
+        offset = 11
+        payload_data = bytes((i % 251 for i in range(offset + nbytes)))
+        calls = []
+
+        def fake_chip_control_payload(
+            worker_id,
+            sub_cmd,
+            arg0=0,
+            arg1=0,
+            arg2=0,
+            arg3=0,
+            payload=b"",
+            recv_capacity=0,
+        ):
+            assert worker_id == 0
+            assert arg0 == 0x42
+            assert arg3 == 0
+            assert payload == b""
+            if sub_cmd == _CTRL_SHARED_MEMORY_INFO:
+                calls.append((arg1, arg2))
+                return 0, b"", 0xABCDEF00, len(payload_data), 2, 0
+            assert sub_cmd == _CTRL_SHARED_MEMORY_READ
+            assert 0 <= arg2 <= _CTRL_PAYLOAD_CAPACITY
+            assert recv_capacity == arg2
+            calls.append((arg1, arg2))
+            chunk = payload_data[arg1 : arg1 + arg2]
+            return len(chunk), chunk, 0, 0, 0, 0
+
+        monkeypatch.setattr(hw, "_chip_control_payload", fake_chip_control_payload)
+
+        assert hw.shared_memory_read(0x42, offset, nbytes) == payload_data[offset : offset + nbytes]
+
+        assert calls == [
+            (0, 0),
+            (11, _CTRL_PAYLOAD_CAPACITY),
+            (11 + _CTRL_PAYLOAD_CAPACITY, _CTRL_PAYLOAD_CAPACITY),
+            (11 + _CTRL_PAYLOAD_CAPACITY * 2, 17),
+        ]
+
+    def test_l3_shared_memory_read_rejects_out_of_range_before_chunk_read(self, monkeypatch):
+        hw = Worker(level=3, device_ids=[0], num_sub_workers=0)
+        calls = []
+
+        def fake_chip_control_payload(
+            worker_id,
+            sub_cmd,
+            arg0=0,
+            arg1=0,
+            arg2=0,
+            arg3=0,
+            payload=b"",
+            recv_capacity=0,
+        ):
+            calls.append(sub_cmd)
+            if sub_cmd == _CTRL_SHARED_MEMORY_INFO:
+                return 0, b"", 0xABCDEF00, 32, 2, 0
+            if sub_cmd == _CTRL_SHARED_MEMORY_READ:
+                return arg2, bytes(arg2), 0, 0, 0, 0
+            raise AssertionError(f"unexpected control command {sub_cmd}")
+
+        monkeypatch.setattr(hw, "_chip_control_payload", fake_chip_control_payload)
+
+        with pytest.raises(ValueError, match="shared_memory_read out of range"):
+            hw.shared_memory_read(0x42, 28, 8)
+
+        assert calls == [_CTRL_SHARED_MEMORY_INFO]
+
+    def test_l3_shared_memory_write_chunks_over_mailbox_payload(self, monkeypatch):
+        hw = Worker(level=3, device_ids=[0], num_sub_workers=0)
+        payload = bytes((i % 253 for i in range(_CTRL_PAYLOAD_CAPACITY * 2 + 19)))
+        calls = []
+
+        def fake_chip_control_payload(
+            worker_id,
+            sub_cmd,
+            arg0=0,
+            arg1=0,
+            arg2=0,
+            arg3=0,
+            payload=b"",
+            recv_capacity=0,
+        ):
+            assert worker_id == 0
+            assert sub_cmd == _CTRL_SHARED_MEMORY_WRITE
+            assert arg0 == 0x43
+            assert arg2 == len(payload)
+            assert arg3 == 0
+            assert recv_capacity == 0
+            assert len(payload) <= _CTRL_PAYLOAD_CAPACITY
+            calls.append((arg1, payload))
+            return 0, b"", 0, 0, 0, 0
+
+        monkeypatch.setattr(hw, "_chip_control_payload", fake_chip_control_payload)
+
+        hw.shared_memory_write(0x43, 23, payload)
+
+        assert calls == [
+            (23, payload[:_CTRL_PAYLOAD_CAPACITY]),
+            (23 + _CTRL_PAYLOAD_CAPACITY, payload[_CTRL_PAYLOAD_CAPACITY : _CTRL_PAYLOAD_CAPACITY * 2]),
+            (23 + _CTRL_PAYLOAD_CAPACITY * 2, payload[_CTRL_PAYLOAD_CAPACITY * 2 :]),
+        ]
+
+
 # ---------------------------------------------------------------------------
 # Test: lifecycle (init / close without submitting any tasks)
 # ---------------------------------------------------------------------------