@@ -203,6 +203,162 @@ let config = MultiPhaseConfig::new()
203203
204204Run the PageRank example: `cargo run -p ringkernel --example pagerank_reduction --features cuda`
205205
206+ ### PTX Compilation Cache (in ringkernel-cuda)
207+
208+ Disk-based PTX caching for faster kernel loading:
209+
210+ - **`PtxCache`** - SHA-256 content-based caching with compute capability awareness
211+ - **`PtxCacheStats`** - Hit/miss statistics for cache performance monitoring
212+ - **`PtxCacheError`** - Descriptive error types for cache operations
213+ - Default location: `~/.cache/ringkernel/ptx/`
214+ - Environment variable: `RINGKERNEL_PTX_CACHE_DIR`
215+
216+ ```rust
217+ use ringkernel_cuda::compile::{PtxCache, PtxCacheStats};
218+
219+ let cache = PtxCache::new()?;
220+ let hash = PtxCache::hash_source(cuda_source);
221+
222+ if let Some(ptx) = cache.get(&hash, "sm_89")? {
223+     // Use cached PTX
224+ } else {
225+     let ptx = compile_ptx(cuda_source)?;
226+     cache.put(&hash, "sm_89", &ptx)?;
227+ }
228+ ```
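
The bullets above name a default cache location, an environment override, and a cache key that is aware of compute capability. A minimal standard-library sketch of how those pieces could fit together follows. The helper names `resolve_cache_dir` and `cache_entry_path`, and the `<hash>_<arch>.ptx` file layout, are assumptions made for illustration and are not taken from `ringkernel-cuda`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: choose the cache directory.
/// `env_override` stands in for the value of RINGKERNEL_PTX_CACHE_DIR,
/// `home` for the user's home directory.
pub fn resolve_cache_dir(env_override: Option<&str>, home: &str) -> PathBuf {
    match env_override {
        // A non-empty override takes precedence over the default location.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        // Default location from the list above: ~/.cache/ringkernel/ptx/
        _ => PathBuf::from(home)
            .join(".cache")
            .join("ringkernel")
            .join("ptx"),
    }
}

/// Hypothetical file layout (an assumption): one file per
/// (source hash, compute capability) pair, so the same source compiled
/// for two architectures never shares an entry.
pub fn cache_entry_path(dir: &Path, source_hash: &str, arch: &str) -> PathBuf {
    dir.join(format!("{}_{}.ptx", source_hash, arch))
}
```

Keying on both the source hash and the architecture string is what keeps a `sm_89` build from being served to a device with a different compute capability.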
229+
230+ ### GPU Stratified Memory Pool (in ringkernel-cuda)
231+
232+ Size-stratified GPU VRAM pooling with O(1) allocation:
233+
234+ - **`GpuStratifiedPool`** - 6 size classes (256B to 256KB) with free lists
235+ - **`GpuPoolConfig`** - Configuration with presets: `for_graph_analytics()`, `for_simulation()`
236+ - **`GpuSizeClass`** - Size class enum for bucket selection
237+ - **`GpuPoolDiagnostics`** - Utilization monitoring and statistics
238+
239+ ```rust
240+ use ringkernel_cuda::memory_pool::{GpuStratifiedPool, GpuPoolConfig, GpuSizeClass};
241+
242+ let config = GpuPoolConfig::for_graph_analytics();
243+ let mut pool = GpuStratifiedPool::new(&device, config)?;
244+ pool.warm_bucket(GpuSizeClass::Size1KB, 100)?; // Pre-allocate
245+
246+ let ptr = pool.allocate(512)?; // O(1) from free list
247+ pool.deallocate(ptr, 512)?;
248+ ```
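
To illustrate why size-stratified free lists give O(1) allocation, here is a host-only sketch. It assumes the six classes grow by a factor of four (256B, 1KB, 4KB, 16KB, 64KB, 256KB), which matches the stated range and count but is not confirmed by the text. `SizeClass` and `FreeLists` are hypothetical stand-ins, and plain `u64` values stand in for device pointers.

```rust
/// Hypothetical size classes. Only `Size1KB` appears in the original text;
/// the 4x spacing of the others is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SizeClass {
    Size256B,
    Size1KB,
    Size4KB,
    Size16KB,
    Size64KB,
    Size256KB,
}

impl SizeClass {
    pub const ALL: [SizeClass; 6] = [
        SizeClass::Size256B,
        SizeClass::Size1KB,
        SizeClass::Size4KB,
        SizeClass::Size16KB,
        SizeClass::Size64KB,
        SizeClass::Size256KB,
    ];

    /// Bucket size in bytes: 256 * 4^index.
    pub fn bytes(self) -> usize {
        256usize << (2 * self as usize)
    }

    /// Smallest class that can hold `size` bytes, or None if it is too large.
    pub fn for_size(size: usize) -> Option<SizeClass> {
        SizeClass::ALL.iter().copied().find(|c| c.bytes() >= size)
    }
}

/// One free list per class; the `u64` values stand in for device addresses.
#[derive(Default)]
pub struct FreeLists {
    buckets: [Vec<u64>; 6],
}

impl FreeLists {
    /// Return a block to the free list of its class. O(1) push.
    pub fn release(&mut self, addr: u64, size: usize) -> bool {
        match SizeClass::for_size(size) {
            Some(class) => {
                self.buckets[class as usize].push(addr);
                true
            }
            None => false,
        }
    }

    /// Take a block from the matching free list. O(1) pop.
    /// A real pool would fall back to a device allocation when this is None.
    pub fn acquire(&mut self, size: usize) -> Option<u64> {
        let class = SizeClass::for_size(size)?;
        self.buckets[class as usize].pop()
    }
}
```

Under these assumptions a 512-byte request rounds up to the 1KB class, which is why warming that bucket ahead of time would let such a request be served without a fresh device allocation.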
249+
250+ ### Multi-Stream Execution (in ringkernel-cuda)
251+
252+ CUDA stream management for compute/transfer overlap:
253+
254+ - **`StreamManager`** - Multi-stream with compute and transfer streams
255+ - **`StreamConfig`** - Configuration with presets: `minimal()`, `performance()`
256+ - **`StreamId`** - `Compute(usize)`, `Transfer`, `Default`
257+ - **`StreamPool`** - Load-balanced stream assignment
258+ - **`OverlapMetrics`** - Compute/transfer overlap measurement
259+
260+ ```rust
261+ use ringkernel_cuda::stream::{StreamManager, StreamConfig, StreamId};
262+
263+ let manager = StreamManager::new(&device, StreamConfig::performance())?;
264+ manager.record_event("kernel_done", StreamId::Compute(0))?;
265+ manager.stream_wait_event(StreamId::Transfer, "kernel_done")?;
266+ ```
267+
268+ ### Benchmark Framework (in ringkernel-core)
269+
270+ Comprehensive benchmarking with regression detection (feature-gated via `benchmark`):
271+
272+ - **`Benchmarkable` trait** - Generic interface for workloads
273+ - **`BenchmarkSuite`** - Orchestration with multiple report formats
274+ - **`BenchmarkConfig`** - Presets: `quick()`, `comprehensive()`, `ci()`
275+ - **`BenchmarkResult`** - Throughput, timing, custom metrics
276+ - **`RegressionReport`** - Baseline comparison with status tracking
277+ - **`Statistics`** - `ConfidenceInterval`, `DetailedStatistics`, `ScalingMetrics`
278+
279+ ```rust
280+ use ringkernel_core::benchmark::{BenchmarkSuite, BenchmarkConfig, Benchmarkable};
281+
282+ let mut suite = BenchmarkSuite::new(BenchmarkConfig::comprehensive());
283+ suite.run_all_sizes(&MyWorkload);
284+
285+ println!("{}", suite.generate_markdown_report());
286+ if let Some(report) = suite.compare_to_baseline() {
287+     println!("Regressions: {}", report.regression_count);
288+ }
289+ ```
290+
291+ ### Hybrid CPU-GPU Dispatcher (in ringkernel-core)
292+
293+ Intelligent workload routing with adaptive thresholds:
294+
295+ - **`HybridDispatcher`** - Automatic CPU/GPU routing with learning
296+ - **`HybridWorkload` trait** - `execute_cpu()` / `execute_gpu()` interface
297+ - **`ProcessingMode`** - `GpuOnly`, `CpuOnly`, `Hybrid`, `Adaptive`
298+ - **`HybridConfig`** - Presets: `cpu_only()`, `gpu_only()`, `adaptive()`
299+ - **`HybridStats`** - Execution counts and adaptive threshold history
300+
301+ ```rust
302+ use ringkernel_core::hybrid::{HybridDispatcher, HybridConfig};
303+
304+ let dispatcher = HybridDispatcher::new(HybridConfig::adaptive());
305+ let result = dispatcher.execute(&workload); // Automatic routing
306+ ```
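
The text does not say how the adaptive threshold is learned. One simple possibility, sketched here purely as an assumption, is to route by element count and move the cut-over point whenever a measurement contradicts the current routing. `Target` and `AdaptiveThreshold` are hypothetical names, not part of `ringkernel-core`.

```rust
/// Hypothetical routing target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Cpu,
    Gpu,
}

/// Illustrative adaptive threshold: workloads with at least
/// `min_gpu_elements` elements are sent to the GPU.
pub struct AdaptiveThreshold {
    min_gpu_elements: usize,
}

impl AdaptiveThreshold {
    pub fn new(initial: usize) -> Self {
        Self { min_gpu_elements: initial }
    }

    pub fn threshold(&self) -> usize {
        self.min_gpu_elements
    }

    pub fn route(&self, elements: usize) -> Target {
        if elements >= self.min_gpu_elements {
            Target::Gpu
        } else {
            Target::Cpu
        }
    }

    /// Feed back one paired measurement for a workload of `elements` items.
    pub fn observe(&mut self, elements: usize, cpu_nanos: u64, gpu_nanos: u64) {
        if gpu_nanos < cpu_nanos && elements < self.min_gpu_elements {
            // GPU won below the threshold: lower it to this size.
            self.min_gpu_elements = elements;
        } else if cpu_nanos < gpu_nanos && elements >= self.min_gpu_elements {
            // CPU won at or above the threshold: raise it just past this size.
            self.min_gpu_elements = elements.saturating_add(1);
        }
    }
}
```

A production dispatcher would likely smooth over many samples rather than react to a single measurement; this sketch only shows the direction in which the threshold moves.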
307+
308+ ### Resource Guard (in ringkernel-core)
309+
310+ Memory limit enforcement with reservations:
311+
312+ - **`ResourceGuard`** - Configurable limits with safety margin
313+ - **`ReservationGuard`** - RAII wrapper for guaranteed allocations
314+ - **`MemoryEstimator` trait** - Workload memory estimation
315+ - **`MemoryEstimate`** - Primary, auxiliary, peak bytes with confidence
316+ - **`LinearEstimator`** - Simple linear estimator
317+ - System utilities: `get_total_memory()`, `get_available_memory()`
318+
319+ ```rust
320+ use ringkernel_core::resource::{ResourceGuard, MemoryEstimate};
321+
322+ let guard = ResourceGuard::with_max_memory(4 * 1024 * 1024 * 1024);
323+ let reservation = guard.reserve(512 * 1024 * 1024)?;
324+ // Automatically released on drop
325+ ```
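
The "released on drop" behaviour is the usual RAII pattern. The sketch below shows one way it could work with a shared atomic counter. `SketchGuard` and `SketchReservation` are hypothetical names, the safety margin mentioned above is left out, and none of this is taken from the crate's source.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical guard: tracks reserved bytes against a fixed limit.
pub struct SketchGuard {
    limit: u64,
    used: Arc<AtomicU64>,
}

/// Hypothetical RAII reservation: gives its bytes back when dropped.
pub struct SketchReservation {
    bytes: u64,
    used: Arc<AtomicU64>,
}

impl SketchGuard {
    pub fn with_max_memory(limit: u64) -> Self {
        Self { limit, used: Arc::new(AtomicU64::new(0)) }
    }

    /// Bytes currently reserved.
    pub fn reserved(&self) -> u64 {
        self.used.load(Ordering::Acquire)
    }

    /// Try to reserve `bytes`; returns None if the limit would be exceeded.
    pub fn reserve(&self, bytes: u64) -> Option<SketchReservation> {
        let mut current = self.used.load(Ordering::Relaxed);
        loop {
            let next = current.checked_add(bytes)?;
            if next > self.limit {
                return None;
            }
            // Compare-and-swap so concurrent reservations cannot overshoot.
            match self.used.compare_exchange_weak(
                current,
                next,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => {
                    return Some(SketchReservation {
                        bytes,
                        used: Arc::clone(&self.used),
                    })
                }
                Err(actual) => current = actual,
            }
        }
    }
}

impl Drop for SketchReservation {
    fn drop(&mut self) {
        // Release the reserved bytes when the reservation goes out of scope.
        self.used.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

Because the release happens in `Drop`, an early return or a `?` in the calling code cannot leak a reservation.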
326+
327+ ### Kernel Mode Selection (in ringkernel-cuda)
328+
329+ Intelligent kernel launch configuration:
330+
331+ - **`KernelMode`** - `ElementCentric`, `SoA`, `Tiled`, `WarpCooperative`, `Auto`
332+ - **`AccessPattern`** - `Coalesced`, `Stencil`, `Irregular`, `Reduction`, `Scatter`, `Gather`
333+ - **`WorkloadProfile`** - Element count, bytes per element, access pattern
334+ - **`GpuArchitecture`** - Presets: `volta()`, `ampere()`, `ada()`, `hopper()`
335+ - **`KernelModeSelector`** - Optimal mode selection and launch config generation
336+ - **`LaunchConfig`** - Complete kernel launch configuration
337+
338+ ```rust
339+ use ringkernel_cuda::launch_config::{KernelModeSelector, WorkloadProfile, AccessPattern};
340+
341+ let selector = KernelModeSelector::with_defaults();
342+ let profile = WorkloadProfile::new(1_000_000, 64)
343+     .with_access_pattern(AccessPattern::Stencil { radius: 1 });
344+ let config = selector.launch_config(selector.select(&profile), 1_000_000);
345+ ```
346+
347+ ### Partitioned Queues (in ringkernel-core)
348+
349+ Multi-partition message queues for reduced contention:
350+
351+ - **`PartitionedQueue`** - Hash-based routing by source kernel ID
352+ - **`PartitionedQueueStats`** - Per-partition statistics with load imbalance metric
353+
354+ ```rust
355+ use ringkernel_core::queue::PartitionedQueue;
356+
357+ let queue = PartitionedQueue::new(4, 1024); // 4 partitions
358+ queue.try_enqueue(envelope)?; // Routed by source_kernel
359+ let msg = queue.try_dequeue_any(); // Round-robin dequeue
360+ ```
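
Hash-based routing with round-robin dequeue can be sketched in a few lines of standard-library Rust. The version below is single-threaded and takes `&mut self`, whereas the real queue is presumably shared between producers. `SketchPartitionedQueue`, the per-partition capacity, and the use of `DefaultHasher` are all assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

/// Hypothetical single-threaded model of a partitioned queue.
pub struct SketchPartitionedQueue<T> {
    partitions: Vec<VecDeque<T>>,
    capacity: usize, // assumed to be a per-partition limit
    next: usize,     // where the next round-robin scan starts
}

impl<T> SketchPartitionedQueue<T> {
    pub fn new(partition_count: usize, capacity: usize) -> Self {
        assert!(partition_count > 0, "need at least one partition");
        Self {
            partitions: (0..partition_count).map(|_| VecDeque::new()).collect(),
            capacity,
            next: 0,
        }
    }

    /// Messages from the same source kernel always map to the same partition.
    pub fn partition_for(&self, source_kernel: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        source_kernel.hash(&mut hasher);
        (hasher.finish() % self.partitions.len() as u64) as usize
    }

    /// Enqueue into the source's partition; hands the message back if full.
    pub fn try_enqueue(&mut self, source_kernel: u64, msg: T) -> Result<(), T> {
        let idx = self.partition_for(source_kernel);
        if self.partitions[idx].len() >= self.capacity {
            return Err(msg);
        }
        self.partitions[idx].push_back(msg);
        Ok(())
    }

    /// Scan partitions starting after the last one served.
    pub fn try_dequeue_any(&mut self) -> Option<T> {
        let n = self.partitions.len();
        for offset in 0..n {
            let idx = (self.next + offset) % n;
            if let Some(msg) = self.partitions[idx].pop_front() {
                self.next = (idx + 1) % n;
                return Some(msg);
            }
        }
        None
    }
}
```

Routing by source keeps per-source ordering intact within a partition, while the rotating start index prevents one busy partition from starving the others on dequeue.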
361+
206362### Enterprise Features (in ringkernel-core)
207363
208364The following enterprise-grade features provide production-ready infrastructure:
@@ -594,10 +750,10 @@ let handle = CudaPersistentHandle::new(simulation, "fdtd_3d");
594750
595751### Test Count Summary
596752
597- 900+ tests across the workspace:
598- - ringkernel-core: 457 tests (including memory pool, analytics context, pressure reactions, enterprise security, auth, RBAC, tenancy, rate limiting, TLS, logging, alerting, recovery)
753+ 950+ tests across the workspace:
754+ - ringkernel-core: 538 tests (including memory pool, analytics context, pressure reactions, enterprise security, auth, RBAC, tenancy, rate limiting, TLS, logging, alerting, recovery, benchmark framework, hybrid dispatcher, resource guard, partitioned queues)
599755- ringkernel-cpu: 11 tests
600- - ringkernel-cuda: 52 tests (reduction cache, phases, K2K, persistent actors)
756+ - ringkernel-cuda: 52+ tests (reduction cache, phases, K2K, persistent actors, PTX cache, GPU memory pool, stream manager, kernel mode selection)
601757- ringkernel-cuda-codegen: 190+ tests (loops, shared memory, ring kernels, K2K, envelope format, energy calculation, checksums, 120+ GPU intrinsics)
602758- ringkernel-wgpu-codegen: 55+ tests (types, intrinsics, transpiler, validation, 2D/3D/4D shared memory)
603759- ringkernel-ir: 40+ tests (IR nodes, CUDA lowering, MSL lowering, messaging nodes, HLC nodes)
@@ -806,7 +962,7 @@ let _ = device.poll(wgpu::PollType::wait_indefinitely());
806962
807963## Dependency Versions
808964
809- Key workspace dependencies (as of v0.3.2):
965+ Key workspace dependencies (as of v0.4.0):
810966
811967| Category | Package | Version | Notes |
812968| ----------| ---------| ---------| -------|
@@ -816,6 +972,7 @@ Key workspace dependencies (as of v0.3.2):
816972| **GPU** | cudarc | 0.18.2 | CUDA bindings |
817973| **GPU** | wgpu | 27.0 | WebGPU (Arc-based) |
818974| **GPU** | metal | 0.31 | Apple Metal |
975+ | **Crypto** | sha2 | 0.10 | PTX cache hashing |
819976| **Web** | axum | 0.8 | HTTP framework |
820977| **Web** | tower | 0.5 | Service abstractions |
821978| **gRPC** | tonic | 0.14 | gRPC framework |