Commit 0e3be33

mivertowski and claude committed
Release v0.4.0: GPU infrastructure generalization
Extract ~7,000 lines of GPU infrastructure from RustGraph into RingKernel, making proven capabilities available to all users.

## New Modules

### ringkernel-cuda
- PTX Compilation Cache: SHA-256 content-based disk caching with CC-awareness
- GPU Stratified Memory Pool: 6 size classes (256B-256KB) with O(1) allocation
- Multi-Stream Manager: Compute/transfer overlap with event synchronization
- Kernel Mode Selection: Intelligent launch configuration based on workload

### ringkernel-core
- Benchmark Framework: Generic benchmarking with regression detection
- Hybrid CPU-GPU Dispatcher: Adaptive threshold learning for routing
- Resource Guard: Memory limits with reservations and safety margins
- Partitioned Queues: Hash-based routing for reduced contention

## Changes
- Version bump: 0.3.2 → 0.4.0
- Test count: 900+ → 950+
- Added sha2 dependency for PTX cache hashing
- Updated all documentation (README, CLAUDE.md, CHANGELOG)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 987b982 commit 0e3be33

50 files changed

Lines changed: 9012 additions & 79 deletions

CHANGELOG.md

Lines changed: 388 additions & 1 deletion
Large diffs are not rendered by default.

CLAUDE.md

Lines changed: 161 additions & 4 deletions
@@ -203,6 +203,162 @@ let config = MultiPhaseConfig::new()
203203

204204
Run the PageRank example: `cargo run -p ringkernel --example pagerank_reduction --features cuda`
205205

206+
### PTX Compilation Cache (in ringkernel-cuda)
207+
208+
Disk-based PTX caching for faster kernel loading:
209+
210+
- **`PtxCache`** - SHA-256 content-based caching with compute capability awareness
211+
- **`PtxCacheStats`** - Hit/miss statistics for cache performance monitoring
212+
- **`PtxCacheError`** - Descriptive error types for cache operations
213+
- Default location: `~/.cache/ringkernel/ptx/`
214+
- Environment variable: `RINGKERNEL_PTX_CACHE_DIR`
215+
216+
```rust
217+
use ringkernel_cuda::compile::{PtxCache, PtxCacheStats};
218+
219+
let cache = PtxCache::new()?;
220+
let hash = PtxCache::hash_source(cuda_source);
221+
222+
if let Some(ptx) = cache.get(&hash, "sm_89")? {
223+
// Use cached PTX
224+
} else {
225+
let ptx = compile_ptx(cuda_source)?;
226+
cache.put(&hash, "sm_89", &ptx)?;
227+
}
228+
```
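To illustrate the lookup above, here is a minimal sketch of a content-addressed disk cache keyed by source hash and compute capability. It is not the `PtxCache` implementation: the function names, the `<hash>_<arch>.ptx` file layout, and the use of FNV-1a in place of SHA-256 (to keep the sketch dependency-free) are all assumptions.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the SHA-256 content hash: 64-bit FNV-1a.
/// It is deterministic but NOT collision-resistant; a real cache should use SHA-256.
fn hash_source(source: &str) -> String {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for byte in source.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
    format!("{hash:016x}")
}

/// Assumed layout: one file per (content hash, compute capability) pair,
/// so PTX built for "sm_80" is never served to an "sm_89" device.
fn entry_path(cache_dir: &Path, hash: &str, arch: &str) -> PathBuf {
    cache_dir.join(format!("{hash}_{arch}.ptx"))
}

/// Returns `Ok(None)` on a cache miss and propagates other I/O errors.
fn cache_get(cache_dir: &Path, hash: &str, arch: &str) -> io::Result<Option<String>> {
    match fs::read_to_string(entry_path(cache_dir, hash, arch)) {
        Ok(ptx) => Ok(Some(ptx)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}

/// Stores compiled PTX, creating the cache directory on first use.
fn cache_put(cache_dir: &Path, hash: &str, arch: &str, ptx: &str) -> io::Result<()> {
    fs::create_dir_all(cache_dir)?;
    fs::write(entry_path(cache_dir, hash, arch), ptx)
}
```

Keying on the content hash means any edit to the CUDA source yields a new entry, so stale PTX is never returned; including the architecture in the key gives the compute-capability awareness described above.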
229+
230+
### GPU Stratified Memory Pool (in ringkernel-cuda)
231+
232+
Size-stratified GPU VRAM pooling with O(1) allocation:
233+
234+
- **`GpuStratifiedPool`** - 6 size classes (256B to 256KB) with free lists
235+
- **`GpuPoolConfig`** - Configuration with presets: `for_graph_analytics()`, `for_simulation()`
236+
- **`GpuSizeClass`** - Size class enum for bucket selection
237+
- **`GpuPoolDiagnostics`** - Utilization monitoring and statistics
238+
239+
```rust
240+
use ringkernel_cuda::memory_pool::{GpuStratifiedPool, GpuPoolConfig, GpuSizeClass};
241+
242+
let config = GpuPoolConfig::for_graph_analytics();
243+
let mut pool = GpuStratifiedPool::new(&device, config)?;
244+
pool.warm_bucket(GpuSizeClass::Size1KB, 100)?; // Pre-allocate
245+
246+
let ptr = pool.allocate(512)?; // O(1) from free list
247+
pool.deallocate(ptr, 512)?;
248+
```
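The O(1) claim follows from the size-class design: a request is rounded up to one of a fixed number of buckets, and each bucket keeps a free list. The host-side sketch below illustrates that idea only. The class boundaries (each 4x the previous, which fits six classes between 256 B and 256 KB), the type name, and the fake `u64` addresses are assumptions, not the `GpuStratifiedPool` internals.

```rust
/// Assumed class sizes: six classes from 256 B to 256 KB, each 4x the previous.
const CLASS_SIZES: [usize; 6] = [256, 1 << 10, 4 << 10, 16 << 10, 64 << 10, 256 << 10];

/// Index of the smallest class that fits `bytes`, or `None` for oversized requests.
/// The scan covers a fixed six entries, so it is constant time.
fn size_class(bytes: usize) -> Option<usize> {
    CLASS_SIZES.iter().position(|&cap| bytes <= cap)
}

/// Hypothetical pool: one free list per size class.
struct StratifiedPoolSketch {
    free_lists: [Vec<u64>; 6],
    next_addr: u64, // stand-in for a real device allocator
}

impl StratifiedPoolSketch {
    fn new() -> Self {
        Self { free_lists: Default::default(), next_addr: 0x1000 }
    }

    /// Reuses a freed block of the right class if one exists,
    /// otherwise carves a new block of the full class size.
    fn allocate(&mut self, bytes: usize) -> Option<u64> {
        let class = size_class(bytes)?;
        if let Some(addr) = self.free_lists[class].pop() {
            return Some(addr);
        }
        let addr = self.next_addr;
        self.next_addr += CLASS_SIZES[class] as u64;
        Some(addr)
    }

    /// Returns a block to the free list of its class; nothing is released to the driver.
    fn deallocate(&mut self, addr: u64, bytes: usize) {
        if let Some(class) = size_class(bytes) {
            self.free_lists[class].push(addr);
        }
    }
}
```

Under these assumptions, a 512-byte request lands in the 1 KB class, so a block freed with size 512 can later satisfy a 1000-byte request. The cost is internal fragmentation: every block occupies its full class size.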
249+
250+
### Multi-Stream Execution (in ringkernel-cuda)
251+
252+
CUDA stream management for compute/transfer overlap:
253+
254+
- **`StreamManager`** - Multi-stream with compute and transfer streams
255+
- **`StreamConfig`** - Configuration with presets: `minimal()`, `performance()`
256+
- **`StreamId`** - `Compute(usize)`, `Transfer`, `Default`
257+
- **`StreamPool`** - Load-balanced stream assignment
258+
- **`OverlapMetrics`** - Compute/transfer overlap measurement
259+
260+
```rust
261+
use ringkernel_cuda::stream::{StreamManager, StreamConfig, StreamId};
262+
263+
let manager = StreamManager::new(&device, StreamConfig::performance())?;
264+
manager.record_event("kernel_done", StreamId::Compute(0))?;
265+
manager.stream_wait_event(StreamId::Transfer, "kernel_done")?;
266+
```
267+
268+
### Benchmark Framework (in ringkernel-core)
269+
270+
Comprehensive benchmarking with regression detection (feature-gated via `benchmark`):
271+
272+
- **`Benchmarkable` trait** - Generic interface for workloads
273+
- **`BenchmarkSuite`** - Orchestration with multiple report formats
274+
- **`BenchmarkConfig`** - Presets: `quick()`, `comprehensive()`, `ci()`
275+
- **`BenchmarkResult`** - Throughput, timing, custom metrics
276+
- **`RegressionReport`** - Baseline comparison with status tracking
277+
- **`Statistics`** - ConfidenceInterval, DetailedStatistics, ScalingMetrics
278+
279+
```rust
280+
use ringkernel_core::benchmark::{BenchmarkSuite, BenchmarkConfig, Benchmarkable};
281+
282+
let mut suite = BenchmarkSuite::new(BenchmarkConfig::comprehensive());
283+
suite.run_all_sizes(&MyWorkload);
284+
285+
println!("{}", suite.generate_markdown_report());
286+
if let Some(report) = suite.compare_to_baseline() {
287+
println!("Regressions: {}", report.regression_count);
288+
}
289+
```
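Regression detection against a baseline can be sketched as a relative-change check with a tolerance band. The names and the tolerance value below are hypothetical; the source does not show how `RegressionReport` classifies results.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Improved,
    Unchanged,
    Regressed,
}

/// Classifies one measurement against its baseline.
/// `tolerance` is a relative band, e.g. 0.05 for 5%. Assumes `baseline_ns > 0`.
fn classify(baseline_ns: f64, current_ns: f64, tolerance: f64) -> Status {
    let change = (current_ns - baseline_ns) / baseline_ns;
    if change > tolerance {
        Status::Regressed
    } else if change < -tolerance {
        Status::Improved
    } else {
        Status::Unchanged
    }
}

/// Counts regressions across `(baseline_ns, current_ns)` pairs.
fn regression_count(pairs: &[(f64, f64)], tolerance: f64) -> usize {
    pairs
        .iter()
        .filter(|&&(base, cur)| classify(base, cur, tolerance) == Status::Regressed)
        .count()
}
```

A tolerance band keeps run-to-run noise from being reported as a regression; a CI preset would typically pair it with enough iterations to make the mean stable.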
290+
291+
### Hybrid CPU-GPU Dispatcher (in ringkernel-core)
292+
293+
Intelligent workload routing with adaptive thresholds:
294+
295+
- **`HybridDispatcher`** - Automatic CPU/GPU routing with learning
296+
- **`HybridWorkload` trait** - `execute_cpu()` / `execute_gpu()` interface
297+
- **`ProcessingMode`** - `GpuOnly`, `CpuOnly`, `Hybrid`, `Adaptive`
298+
- **`HybridConfig`** - Presets: `cpu_only()`, `gpu_only()`, `adaptive()`
299+
- **`HybridStats`** - Execution counts and adaptive threshold history
300+
301+
```rust
302+
use ringkernel_core::hybrid::{HybridDispatcher, HybridConfig};
303+
304+
let dispatcher = HybridDispatcher::new(HybridConfig::adaptive());
305+
let result = dispatcher.execute(&workload); // Automatic routing
306+
```
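One way adaptive routing can work is to keep a size threshold and move it whenever a measurement shows the other backend would have been faster. The sketch below shows that rule in isolation. The type, its fields, and the update rule are assumptions for illustration; the source does not describe how `HybridDispatcher` learns its threshold.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

/// Hypothetical adaptive router: workloads with at least `threshold`
/// elements go to the GPU, smaller ones stay on the CPU.
struct AdaptiveThreshold {
    threshold: usize,
}

impl AdaptiveThreshold {
    fn new(initial_threshold: usize) -> Self {
        Self { threshold: initial_threshold }
    }

    fn route(&self, elements: usize) -> Backend {
        if elements >= self.threshold {
            Backend::Gpu
        } else {
            Backend::Cpu
        }
    }

    /// Feeds back timings for a workload of `elements` items measured on both backends.
    /// If the current routing picked the slower backend, the threshold moves just far
    /// enough that this size is routed to the faster one next time.
    fn observe(&mut self, elements: usize, cpu: Duration, gpu: Duration) {
        let faster = if gpu < cpu { Backend::Gpu } else { Backend::Cpu };
        if faster != self.route(elements) {
            self.threshold = match faster {
                Backend::Gpu => elements,     // GPU won below the threshold: lower it
                Backend::Cpu => elements + 1, // CPU won at or above it: raise it
            };
        }
    }
}
```

This rule reacts to a single sample, so a production dispatcher would more plausibly smooth over several measurements before moving the threshold.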
307+
308+
### Resource Guard (in ringkernel-core)
309+
310+
Memory limit enforcement with reservations:
311+
312+
- **`ResourceGuard`** - Configurable limits with safety margin
313+
- **`ReservationGuard`** - RAII wrapper for guaranteed allocations
314+
- **`MemoryEstimator` trait** - Workload memory estimation
315+
- **`MemoryEstimate`** - Primary, auxiliary, peak bytes with confidence
316+
- **`LinearEstimator`** - Simple linear estimator
317+
- System utilities: `get_total_memory()`, `get_available_memory()`
318+
319+
```rust
320+
use ringkernel_core::resource::{ResourceGuard, MemoryEstimate};
321+
322+
let guard = ResourceGuard::with_max_memory(4 * 1024 * 1024 * 1024);
323+
let reservation = guard.reserve(512 * 1024 * 1024)?;
324+
// Automatically released on drop
325+
```
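The "released on drop" behavior is the usual RAII pattern: the reservation object holds a handle to a shared counter and subtracts its bytes in `Drop`. A minimal sketch follows, with hypothetical type names and without the safety margin; it is not the `ResourceGuard` implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical memory budget shared between the guard and its reservations.
struct BudgetGuard {
    max_bytes: u64,
    reserved: Arc<AtomicU64>,
}

/// Holds `bytes` of the budget until dropped.
struct BudgetReservation {
    bytes: u64,
    reserved: Arc<AtomicU64>,
}

impl BudgetGuard {
    fn with_max_memory(max_bytes: u64) -> Self {
        Self { max_bytes, reserved: Arc::new(AtomicU64::new(0)) }
    }

    /// Reserves `bytes` if the total stays within the limit, otherwise returns `None`.
    /// The compare-exchange loop keeps check-and-add atomic under concurrent callers.
    fn reserve(&self, bytes: u64) -> Option<BudgetReservation> {
        let mut current = self.reserved.load(Ordering::Relaxed);
        loop {
            let updated = current.checked_add(bytes)?;
            if updated > self.max_bytes {
                return None;
            }
            match self.reserved.compare_exchange_weak(
                current,
                updated,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => {
                    return Some(BudgetReservation {
                        bytes,
                        reserved: Arc::clone(&self.reserved),
                    })
                }
                Err(actual) => current = actual,
            }
        }
    }

    fn reserved_bytes(&self) -> u64 {
        self.reserved.load(Ordering::Acquire)
    }
}

impl Drop for BudgetReservation {
    /// Returns the reserved bytes to the budget when the reservation goes out of scope.
    fn drop(&mut self) {
        self.reserved.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

Because release happens in `Drop`, an early return or a panic in the code holding the reservation still gives the bytes back to the budget.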
326+
327+
### Kernel Mode Selection (in ringkernel-cuda)
328+
329+
Intelligent kernel launch configuration:
330+
331+
- **`KernelMode`** - `ElementCentric`, `SoA`, `Tiled`, `WarpCooperative`, `Auto`
332+
- **`AccessPattern`** - `Coalesced`, `Stencil`, `Irregular`, `Reduction`, `Scatter`, `Gather`
333+
- **`WorkloadProfile`** - Element count, bytes per element, access pattern
334+
- **`GpuArchitecture`** - Presets: `volta()`, `ampere()`, `ada()`, `hopper()`
335+
- **`KernelModeSelector`** - Optimal mode selection and launch config generation
336+
- **`LaunchConfig`** - Complete kernel launch configuration
337+
338+
```rust
339+
use ringkernel_cuda::launch_config::{KernelModeSelector, WorkloadProfile, AccessPattern};
340+
341+
let selector = KernelModeSelector::with_defaults();
342+
let profile = WorkloadProfile::new(1_000_000, 64)
343+
.with_access_pattern(AccessPattern::Stencil { radius: 1 });
344+
let config = selector.launch_config(selector.select(&profile), 1_000_000);
345+
```
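Whichever mode is selected, a launch configuration has to cover the element count with whole thread blocks. The sketch below shows only that grid-sizing arithmetic for a one-thread-per-element layout. The struct, the function, and the limits it checks (1024 threads per block, a grid x-dimension of at most 2^31 - 1) are assumptions based on current CUDA hardware limits, not the `LaunchConfig` type from the crate.

```rust
#[derive(Debug, PartialEq)]
struct LaunchDims {
    grid: u32,
    block: u32,
}

/// Computes a 1D launch with one thread per element.
/// Returns `None` if the block size or the resulting grid is out of range.
fn launch_dims(element_count: u64, block: u32) -> Option<LaunchDims> {
    if block == 0 || block > 1024 {
        return None;
    }
    let b = u64::from(block);
    // Ceiling division without risking overflow on very large counts.
    let blocks = element_count / b + u64::from(element_count % b != 0);
    // Launch at least one block so an empty workload is still a valid launch.
    let blocks = blocks.max(1);
    if blocks > i32::MAX as u64 {
        return None;
    }
    Some(LaunchDims { grid: blocks as u32, block })
}
```

For the 1,000,000-element profile above and a block size of 256, this yields 3907 blocks; the last block is only partly used, so the kernel needs a bounds check on the element index.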
346+
347+
### Partitioned Queues (in ringkernel-core)
348+
349+
Multi-partition message queues for reduced contention:
350+
351+
- **`PartitionedQueue`** - Hash-based routing by source kernel ID
352+
- **`PartitionedQueueStats`** - Per-partition statistics with load imbalance metric
353+
354+
```rust
355+
use ringkernel_core::queue::PartitionedQueue;
356+
357+
let queue = PartitionedQueue::new(4, 1024); // 4 partitions
358+
queue.try_enqueue(envelope)?; // Routed by source_kernel
359+
let msg = queue.try_dequeue_any(); // Round-robin dequeue
360+
```
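The routing and dequeue behavior described above can be sketched with plain `VecDeque`s. This is a single-threaded illustration under assumed names; the modulo routing rule and the `Envelope` fields are hypothetical, and the real `PartitionedQueue` may hash and synchronize differently.

```rust
use std::collections::VecDeque;

/// Hypothetical message envelope; only the routing key matters here.
struct Envelope {
    source_kernel: u64,
    payload: u32,
}

struct PartitionedSketch {
    partitions: Vec<VecDeque<Envelope>>,
    capacity_per_partition: usize,
    cursor: usize, // next partition to try in round-robin order
}

impl PartitionedSketch {
    fn new(partition_count: usize, capacity_per_partition: usize) -> Self {
        assert!(partition_count > 0, "need at least one partition");
        Self {
            partitions: (0..partition_count).map(|_| VecDeque::new()).collect(),
            capacity_per_partition,
            cursor: 0,
        }
    }

    /// Assumed routing rule: source kernel ID modulo the partition count.
    fn partition_for(&self, source_kernel: u64) -> usize {
        (source_kernel % self.partitions.len() as u64) as usize
    }

    /// Enqueues into the sender's partition; hands the envelope back if it is full.
    fn try_enqueue(&mut self, envelope: Envelope) -> Result<(), Envelope> {
        let idx = self.partition_for(envelope.source_kernel);
        if self.partitions[idx].len() >= self.capacity_per_partition {
            return Err(envelope);
        }
        self.partitions[idx].push_back(envelope);
        Ok(())
    }

    /// Scans partitions round-robin from the cursor and pops the first message found.
    fn try_dequeue_any(&mut self) -> Option<Envelope> {
        let n = self.partitions.len();
        for offset in 0..n {
            let idx = (self.cursor + offset) % n;
            if let Some(envelope) = self.partitions[idx].pop_front() {
                self.cursor = (idx + 1) % n;
                return Some(envelope);
            }
        }
        None
    }
}
```

Messages from one source always land in the same partition, so per-source ordering is preserved. In a concurrent version each partition would carry its own synchronization, which is where the reduced contention comes from.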
361+
206362
### Enterprise Features (in ringkernel-core)
207363

208364
The following enterprise-grade features provide production-ready infrastructure:
@@ -594,10 +750,10 @@ let handle = CudaPersistentHandle::new(simulation, "fdtd_3d");
594750

595751
### Test Count Summary
596752

597-
900+ tests across the workspace:
598-
- ringkernel-core: 457 tests (including memory pool, analytics context, pressure reactions, enterprise security, auth, RBAC, tenancy, rate limiting, TLS, logging, alerting, recovery)
753+
950+ tests across the workspace:
754+
- ringkernel-core: 538 tests (including memory pool, analytics context, pressure reactions, enterprise security, auth, RBAC, tenancy, rate limiting, TLS, logging, alerting, recovery, benchmark framework, hybrid dispatcher, resource guard, partitioned queues)
599755
- ringkernel-cpu: 11 tests
600-
- ringkernel-cuda: 52 tests (reduction cache, phases, K2K, persistent actors)
756+
- ringkernel-cuda: 52+ tests (reduction cache, phases, K2K, persistent actors, PTX cache, GPU memory pool, stream manager, kernel mode selection)
601757
- ringkernel-cuda-codegen: 190+ tests (loops, shared memory, ring kernels, K2K, envelope format, energy calculation, checksums, 120+ GPU intrinsics)
602758
- ringkernel-wgpu-codegen: 55+ tests (types, intrinsics, transpiler, validation, 2D/3D/4D shared memory)
603759
- ringkernel-ir: 40+ tests (IR nodes, CUDA lowering, MSL lowering, messaging nodes, HLC nodes)
@@ -806,7 +962,7 @@ let _ = device.poll(wgpu::PollType::wait_indefinitely());
806962

807963
## Dependency Versions
808964

809-
Key workspace dependencies (as of v0.3.2):
965+
Key workspace dependencies (as of v0.4.0):
810966

811967
| Category | Package | Version | Notes |
812968
|----------|---------|---------|-------|
@@ -816,6 +972,7 @@ Key workspace dependencies (as of v0.3.2):
816972
| **GPU** | cudarc | 0.18.2 | CUDA bindings |
817973
| **GPU** | wgpu | 27.0 | WebGPU (Arc-based) |
818974
| **GPU** | metal | 0.31 | Apple Metal |
975+
| **Crypto** | sha2 | 0.10 | PTX cache hashing |
819976
| **Web** | axum | 0.8 | HTTP framework |
820977
| **Web** | tower | 0.5 | Service abstractions |
821978
| **gRPC** | tonic | 0.14 | gRPC framework |