113. [ ] Build a native graph-compatible capture backend, done when the tracing/capture hot path is implemented in C++/CUDA or Triton rather than Python, eliminating Python interception overhead, extra GPU syncs, and host-side copies. This is the concrete path to lowering the current ~12–14% online overhead. The current design explicitly keeps the serving kernel unchanged and pays overhead in tracing *around* it (`sidecar/verilm/capture.py`), not in replacing model math. The native backend should be fused into, or sit adjacent to, the existing serving kernels so that accumulator capture happens inside the kernel launch rather than as a separate Python-orchestrated step. This also directly enables CUDA graph compatibility (#93): if capture hooks live at the CUDA/driver level rather than in Python wrappers, they survive graph replay. Milestones:
     - (a) profile the current capture path to identify the dominant cost (Python interception vs. sync vs. copy vs. materialization);
     - (b) prototype a native capture hook for `cutlass_scaled_mm` that writes accumulators to a pre-allocated ring buffer without a Python round-trip;
     - (c) benchmark the prototype against the Python path on the same model/workload;
     - (d) integrate with the existing retained-state pipeline.

     This is distinct from the canonical deterministic attention kernel (#22/#23), which targets exactness rather than performance and may well be *slower* than FlashAttention. The two projects are complementary but should not be conflated: the capture backend lowers overhead on the kept approximate-attention path; the deterministic kernel optionally eliminates the approximate region.
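     The ring-buffer idea in milestone (b) can be sketched on the host side. This is a hypothetical illustration, not code from the repo: `CaptureRing` and `push_tile` are invented names, and a real implementation would write from a CUDA epilogue or a device-side hook rather than a CPU loop. The key property it demonstrates is that the hot path does no allocation and no synchronization: storage is pre-allocated once, and each capture is a slot claim plus a copy, with the oldest slots overwritten when the ring wraps.

     ```cpp
     #include <atomic>
     #include <cassert>
     #include <cstddef>
     #include <iostream>
     #include <vector>

     // Sketch of a lossy, pre-allocated ring buffer for accumulator tiles.
     // A native capture hook would call push_tile from inside / adjacent to
     // the kernel launch path, with no Python round-trip and no allocation.
     struct CaptureRing {
         CaptureRing(std::size_t slots, std::size_t tile_elems)
             : slots_(slots), tile_elems_(tile_elems),
               storage_(slots * tile_elems),  // allocated once, reused forever
               head_(0) {}

         // Hot path: claim the next slot and copy one accumulator tile in.
         // Overwrites the oldest tile once the ring is full.
         void push_tile(const float* acc) {
             std::size_t slot =
                 head_.fetch_add(1, std::memory_order_relaxed) % slots_;
             float* dst = storage_.data() + slot * tile_elems_;
             for (std::size_t i = 0; i < tile_elems_; ++i) dst[i] = acc[i];
         }

         std::size_t tiles_written() const { return head_.load(); }

         // Read back the slot that the t-th capture landed in.
         const float* slot_for(std::size_t t) const {
             return storage_.data() + (t % slots_) * tile_elems_;
         }

       private:
         std::size_t slots_, tile_elems_;
         std::vector<float> storage_;
         std::atomic<std::size_t> head_;
     };

     int main() {
         CaptureRing ring(/*slots=*/4, /*tile_elems=*/8);
         float tile[8];
         // Simulate six GEMM tile captures; tile t is filled with value t.
         for (int t = 0; t < 6; ++t) {
             for (int i = 0; i < 8; ++i) tile[i] = static_cast<float>(t);
             ring.push_tile(tile);
         }
         assert(ring.tiles_written() == 6);
         // Captures 4 and 5 wrapped around and overwrote slots 0 and 1.
         assert(ring.slot_for(5)[0] == 5.0f);
         std::cout << ring.tiles_written() << "\n";
         return 0;
     }
     ```

     A device-side version would make the copy part of the GEMM epilogue (or a stream-ordered callback) and drain the ring asynchronously into the retained-state pipeline, which is what keeps the hook compatible with CUDA graph replay.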