
Commit a45fd83

Add native capture backend item (#113) to roadmap
Concrete performance path to lower the ~12–14% online overhead: a C++/CUDA/Triton capture backend fused into the serving kernels, eliminating Python interception, extra GPU syncs, and host copies. Also enables CUDA graph compatibility (#93). Distinct from the deterministic attention kernel (#22/#23), which targets exactness.
1 parent bef4f75

1 file changed: roadmap.md (1 addition, 0 deletions)
@@ -78,6 +78,7 @@ The kept canonical sampled path already exists in the live server. The shell/tai
45. [ ] Reduce audit payload structurally where possible, done when unnecessary full vectors are removed without weakening the strongest routine/deep-audit semantics. Priority here is lossless shell compression that preserves the full exact Freivalds/bridge checks: width-packed accumulator encodings, tighter binary layouts, and proof-layout improvements. Do not treat dropping shell matrices as the default bandwidth solution.
46. [ ] Add streaming or incremental audit open if it materially helps, done when it measurably reduces peak memory or audit-open time and is benchmarked against the baseline.
47. [ ] Lower online inference overhead further if it remains worthwhile, done when the remaining online-cost work has either been completed and benchmarked or consciously stopped because the returns are too small.
+113. [ ] Build a native graph-compatible capture backend, done when the tracing/capture hot path is implemented in C++/CUDA or Triton rather than Python, eliminating Python interception overhead, extra GPU syncs, and host-side copies. This is the concrete path to lowering the current ~12–14% online overhead. The current design explicitly keeps the serving kernel unchanged and pays overhead in tracing *around* it (`sidecar/verilm/capture.py`), not in replacing model math. The native backend should be fused into or adjacent to the existing serving kernels so that accumulator capture happens inside the kernel launch rather than as a separate Python-orchestrated step. This also directly enables CUDA graph compatibility (#93): if capture hooks live at the CUDA/driver level rather than in Python wrappers, they survive graph replay. Milestones: (a) profile the current capture path to identify the dominant cost (Python interception vs. sync vs. copy vs. materialization); (b) prototype a native capture hook for `cutlass_scaled_mm` that writes accumulators to a pre-allocated ring buffer without a Python round-trip; (c) benchmark the prototype against the Python path on the same model/workload; (d) integrate with the existing retained-state pipeline. This is distinct from the canonical deterministic attention kernel (#22/#23), which targets exactness rather than performance and may well be *slower* than FlashAttention. The two projects are complementary but should not be conflated: the capture backend lowers overhead on the kept approximate-attention path; the deterministic kernel optionally eliminates the approximate region. (A minimal illustrative sketch of milestone (b) follows the diff below.)
48. [ ] Rebenchmark after each meaningful online-path change, done when every material serving-path change has fresh benchmark data attached to it.

## Docs / Publication
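
Milestone (b) of item 113 is concrete enough to sketch. Below is a minimal, hypothetical CUDA illustration, not code from this repo: a toy matmul kernel stands in for `cutlass_scaled_mm`, and its epilogue reserves a slot in a pre-allocated device ring buffer with a single `atomicAdd`, so accumulator capture happens inside the kernel launch with no Python round-trip and no added host sync. Because no host logic runs between the math and the capture, the same launch also records into a CUDA graph and keeps capturing across replays, which is the #93 compatibility argument. All names here (`AccumRing`, `ring_reserve`, `matmul_with_capture`) are illustrative assumptions.

```cuda
// Hypothetical sketch of roadmap item 113, milestone (b). Not repo code:
// the toy kernel below stands in for cutlass_scaled_mm's epilogue.
#include <cstdio>
#include <cuda_runtime.h>

struct AccumRing {
    float*        slots;    // pre-allocated device memory: capacity * slot_len floats
    unsigned int* head;     // monotonically increasing write cursor (device)
    unsigned int  capacity; // number of slots; a power of two so masking works
    unsigned int  slot_len; // floats per captured accumulator tile
};

// Device-side slot reservation: one atomicAdd per captured tile, no host involvement.
__device__ float* ring_reserve(AccumRing r) {
    unsigned int idx = atomicAdd(r.head, 1u) & (r.capacity - 1u);
    return r.slots + (size_t)idx * r.slot_len;
}

// Toy serving kernel: each block computes one output row and captures its
// accumulators into the ring in the same launch (no separate Python step).
__global__ void matmul_with_capture(const float* a, const float* b, float* c,
                                    int n, AccumRing ring) {
    int row = blockIdx.x, col = threadIdx.x;
    float acc = 0.f;
    if (row < n && col < n) {
        for (int k = 0; k < n; ++k) acc += a[row * n + k] * b[k * n + col];
        c[row * n + col] = acc;
    }
    __shared__ float* slot;              // one ring slot per block
    if (col == 0) slot = ring_reserve(ring);
    __syncthreads();
    if (col < (int)ring.slot_len) slot[col] = acc;
}

int main() {
    const int n = 32;
    AccumRing ring{};
    ring.capacity = 1024;
    ring.slot_len = n;
    cudaMalloc(&ring.slots, (size_t)ring.capacity * ring.slot_len * sizeof(float));
    cudaMalloc(&ring.head, sizeof(unsigned int));
    cudaMemset(ring.head, 0, sizeof(unsigned int));

    float *a, *b, *c;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { a[i] = 1.f; b[i] = 2.f; }

    // Direct launch: math and capture happen inside one kernel.
    matmul_with_capture<<<n, n>>>(a, b, c, n, ring);
    cudaDeviceSynchronize();

    // Because no host logic sits between math and capture, the same launch
    // records into a CUDA graph and still captures on replay (#93).
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaGraph_t g;
    cudaGraphExec_t ge;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    matmul_with_capture<<<n, n, 0, s>>>(a, b, c, n, ring);
    cudaStreamEndCapture(s, &g);
    cudaGraphInstantiate(&ge, g, 0);     // CUDA 12 signature
    cudaGraphLaunch(ge, s);
    cudaStreamSynchronize(s);

    unsigned int written = 0;
    cudaMemcpy(&written, ring.head, sizeof written, cudaMemcpyDeviceToHost);
    printf("captured %u accumulator tiles (expect %d), c[0]=%.1f\n",
           written, 2 * n, c[0]);
    return 0;
}
```

The power-of-two capacity plus a monotone head lets a consumer drain the ring lock-free on a side stream; the overflow policy (overwrite vs. drop vs. backpressure) is a real design decision the prototype would have to make and is elided here.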
