113. [ ] Build a native graph-compatible capture backend, done when the tracing/capture hot path is implemented in C++/CUDA or Triton rather than Python, eliminating Python interception overhead, extra GPU syncs, and host-side copies. This is the concrete path to lowering the current ~12–14% online overhead. The current design explicitly keeps the serving kernel unchanged and pays overhead in tracing *around* it (`sidecar/verilm/capture.py`), not in replacing model math. The native backend should be fused into, or sit adjacent to, the existing serving kernels so that accumulator capture happens inside the kernel launch rather than as a separate Python-orchestrated step. This also directly enables CUDA graph compatibility (#93): if capture hooks live at the CUDA/driver level rather than in Python wrappers, they survive graph replay. Milestones:
     - (a) profile the current capture path to identify the dominant cost (Python interception vs. sync vs. copy vs. materialization);
     - (b) prototype a native capture hook for `cutlass_scaled_mm` that writes accumulators to a pre-allocated ring buffer without a Python round-trip;
     - (c) benchmark the prototype against the Python path on the same model/workload;
     - (d) integrate with the existing retained-state pipeline.

     This is distinct from the canonical deterministic attention kernel (#22/#23), which targets exactness rather than performance and may well be *slower* than FlashAttention. The two projects are complementary but should not be conflated: the capture backend lowers overhead on the kept approximate-attention path; the deterministic kernel optionally eliminates the approximate region.
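     The ring-buffer idea in milestone (b) can be sketched on the host side. This is a hypothetical illustration, not code from the repo: `CaptureRing` and `push_tile` are invented names, and a real implementation would write from a CUDA epilogue or a device-side hook rather than a CPU loop. The key property it demonstrates is that the hot path does no allocation and no synchronization: storage is pre-allocated once, and each capture is a slot claim plus a copy, with the oldest slots overwritten when the ring wraps.

     ```cpp
     #include <atomic>
     #include <cassert>
     #include <cstddef>
     #include <iostream>
     #include <vector>

     // Sketch of a lossy, pre-allocated ring buffer for accumulator tiles.
     // A native capture hook would call push_tile from inside / adjacent to
     // the kernel launch path, with no Python round-trip and no allocation.
     struct CaptureRing {
         CaptureRing(std::size_t slots, std::size_t tile_elems)
             : slots_(slots), tile_elems_(tile_elems),
               storage_(slots * tile_elems),  // allocated once, reused forever
               head_(0) {}

         // Hot path: claim the next slot and copy one accumulator tile in.
         // Overwrites the oldest tile once the ring is full.
         void push_tile(const float* acc) {
             std::size_t slot =
                 head_.fetch_add(1, std::memory_order_relaxed) % slots_;
             float* dst = storage_.data() + slot * tile_elems_;
             for (std::size_t i = 0; i < tile_elems_; ++i) dst[i] = acc[i];
         }

         std::size_t tiles_written() const { return head_.load(); }

         // Read back the slot that the t-th capture landed in.
         const float* slot_for(std::size_t t) const {
             return storage_.data() + (t % slots_) * tile_elems_;
         }

       private:
         std::size_t slots_, tile_elems_;
         std::vector<float> storage_;
         std::atomic<std::size_t> head_;
     };

     int main() {
         CaptureRing ring(/*slots=*/4, /*tile_elems=*/8);
         float tile[8];
         // Simulate six GEMM tile captures; tile t is filled with value t.
         for (int t = 0; t < 6; ++t) {
             for (int i = 0; i < 8; ++i) tile[i] = static_cast<float>(t);
             ring.push_tile(tile);
         }
         assert(ring.tiles_written() == 6);
         // Captures 4 and 5 wrapped around and overwrote slots 0 and 1.
         assert(ring.slot_for(5)[0] == 5.0f);
         std::cout << ring.tiles_written() << "\n";
         return 0;
     }
     ```

     A device-side version would make the copy part of the GEMM epilogue (or a stream-ordered callback) and drain the ring asynchronously into the retained-state pipeline, which is what keeps the hook compatible with CUDA graph replay.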