rasptorch is an experimental deep learning library inspired by PyTorch, built with a single focus: making complex neural networks practical and efficient on resource-constrained hardware such as the Raspberry Pi 5, by leveraging its GPU via Vulkan.
The library operates on a multi-layered architecture to maximize hardware utilization:
- CPU Backend (Software): Uses a pure NumPy-backed autograd engine and `nn` module for reliable computation when GPU acceleration is unavailable.
- GPU Backend (Hardware): Features an experimental Vulkan backend for high-speed tensor operations (elementwise math, matmul, reductions) directly on the Pi 5's GPU.
- Interface: Provides a streamlined CLI/Streamlit UI for interactive model building, training, persistence, and inspection.
The Vulkan path relies on real compute shaders compiled to SPIR-V, giving deep control over the underlying hardware.
rasptorch now exposes a backend abstraction API so compute backends can be registered and connected at runtime:
```python
import rasptorch

# Inspect availability
print(rasptorch.available_backends())  # {'cpu': True, 'vulkan': ..., 'opencl': ..., 'cuda': ...}

# Try to connect a backend (falls back to CPU in non-strict mode)
active = rasptorch.connect_backend("vulkan", strict=False)
print(active.name)
```

Built-in backend adapters:

- `numpy` (NumPy adapter; internal key: `cpu`) - Pure NumPy autograd
- `vulkan` (rasptorch Vulkan kernels, with optional CPU fallback) - Optimized for Raspberry Pi 4/5
- `opencl` (pyopencl when available, optional CPU fallback)
- `cuda` (CuPy when available, with PyTorch CUDA fallback, optional CPU fallback)
CLI helpers:

```bash
rasptorch backend list
rasptorch backend connect numpy
rasptorch backend connect vulkan --strict

# Benchmark with auto-tuned Vulkan kernel and submission batching
rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --vulkan-kernel auto --vulkan-autotune-submit --seed 42
```

Note: The user-facing CLI/UI labels the CPU backend as `numpy`. Vulkan benchmark mode uses resident buffers (upload once, repeated on-device matmul, download once).

Performance (Optimized): Vulkan achieves ~564 GFLOPS (~78% of NumPy) on `matmul_vec4` with auto-tuning. For very large square workloads (for example, 4096x4096), the benchmark now prefers the stable `matmul_vec4_wide_tiled` path by default. To pin that path manually, use `--vulkan-kernel matmul_vec4_wide_tiled --vulkan-submit-every 8`. `--vulkan-kernel auto` probes `matmul_vec4_wide_tiled`, `matmul`, `matmul_vec4`, `matmul_vec4_tiled`, `matmul_a_bt`, and `matmul_a_bt_tiled` (when available) and keeps the fastest path. If Vulkan hits `VkErrorDeviceLost`, lower `--vulkan-submit-every` (for example, 4 or 1) or use auto-tuning.

Recommended: Use `--vulkan-autotune-submit` to jointly probe kernel + submit chunk and pick the fastest stable combo.

Optimizations: Command buffer batching, memory-mapped buffers, auto kernel selection.
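The GFLOPS figures above follow the standard 2·n³ FLOP count for an n×n matmul. As a rough, plain-NumPy sketch of how such a number is derived (illustrative only, not the rasptorch benchmark code itself):

```python
import time
import numpy as np

def matmul_gflops(n: int, iterations: int = 5) -> float:
    """Estimate matmul throughput using the standard 2*n^3 FLOP count."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warmup, so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(iterations):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iterations  # n^3 multiply-adds per matmul
    return flops / elapsed / 1e9

print(f"{matmul_gflops(512):.1f} GFLOPS")
```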
- Tensor Operations: Support for elementwise math, matrix multiplication (`matmul`), reductions, indexing, reshaping, stacking, and broadcasting.
- Layers: Includes standard neural network blocks: `Linear`, `MLP`, `CNN`, `GRU`, `Transformer`, normalization layers, activations, pooling, embeddings, and attention.
- Training Tools: Full suite of tools including optimizers (`SGD`), learning-rate schedulers, gradient clipping, and regularization helpers.
- Persistence: Ability to save and load checkpoint weights without needing the full `torch` dependency.
- Interfaces: CLI (`rasptorch chat`) and Streamlit UI (`rasptorch ui`).
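For intuition, the kind of pure NumPy reverse-mode autograd a CPU backend like this relies on can be sketched in a few lines. This is an illustrative toy, not rasptorch's actual `Tensor` implementation:

```python
import numpy as np

class Tensor:
    """Minimal reverse-mode autograd sketch (illustrative only)."""
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.parents = parents
        self.backward_fn = None

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, parents=(self, other))
        def backward_fn():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out.backward_fn = backward_fn
        return out

    def sum(self):
        out = Tensor(self.data.sum(), parents=(self,))
        def backward_fn():
            self.grad += out.grad  # scalar grad broadcasts over inputs
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then run each node's backward pass.
        topo, seen = [], set()
        def visit(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t.parents:
                    visit(p)
                topo.append(t)
        visit(self)
        self.grad = np.ones_like(self.data)
        for t in reversed(topo):
            if t.backward_fn:
                t.backward_fn()

x = Tensor([[1.0, 2.0], [3.0, 4.0]])
w = Tensor([[1.0], [1.0]])
loss = (x @ w).sum()
loss.backward()
print(w.grad)  # gradient of sum(x @ w) w.r.t. w: column sums of x
```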
A. Basic Install (CPU Only): To get the core library components running on the CPU:

```bash
pip install rasptorch
```

B. Development Install (Full Capability): For local development and access to all potential backends:

```bash
pip install -e ".[dev]"
```

C. GPU Mode Prerequisites: To use the GPU backend, install the platform prerequisites below first.

D. GPU Backend Dependencies: For systems with a proper GPU setup (e.g., Raspberry Pi 5), install the GPU backend dependencies using:

```bash
# Prerequisites for Vulkan backend (Raspberry Pi 4/5 and Linux)
sudo apt update
sudo apt install -y glslc

# For Windows, install the Vulkan SDK from LunarG:
./VulkanSDK-Installer.exe --accept-licenses --default-answer --confirm-command install
```
Then install rasptorch with GPU support after installing the prerequisites (on either platform):

```bash
pip install rasptorch[gpu]
```

You can also install only the backend(s) you need instead of the entire `gpu` extra. Examples:

```bash
# Vulkan-only (recommended for Raspberry Pi 4/5)
pip install rasptorch[vulkan]

# OpenCL-only
pip install rasptorch[opencl]

# CUDA-only (for NVIDIA systems with CUDA installed)
pip install rasptorch[cuda]

# Vulkan + OpenCL
pip install "rasptorch[vulkan,opencl]"
```
#### 2. Quick Run Examples
**Start the Interactive Shell:**
```bash
uv run rasptorch chat
```

**Launch the Web UI:**

```bash
uv run rasptorch ui
```

(This will usually open at http://localhost:8501)

**Viewing Help:** To see all available CLI subcommands:

```bash
uv run rasptorch --help
```

The `main.py` script controls the operational mode:
- `cpu`: Pure NumPy autograd execution on the CPU.
- `gpu`: Executes the training loop explicitly using the Vulkan backend kernels.
- `gpu-autograd`: An experimental mode for tracing gradients across the GPU pipeline.
Example Training Command:

```bash
uv run main.py --device gpu --epochs 50 --batch-size 32 --lr 0.01
```

rasptorch provides a built-in benchmark tool for comparing backend performance on matrix multiplication.
Quick Benchmark (Single Size):

```bash
# Benchmark with default settings (2048x2048 matmul, 100 iterations)
uv run rasptorch backend benchmark

# Benchmark with custom size and multiple backends
uv run rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --seed 42
```

Performance Results (Windows 11, NVIDIA GeForce RTX 3070 Ti):
Measured backend scaling with:

```bash
uv run rasptorch --json backend benchmark --backends numpy,opencl,vulkan,cuda --vulkan-kernel matmul_vec4_wide_tiled --vulkan-submit-every 8 --iterations 5 --warmup 1 --seed 42
```

| Backend | 256×256 (GFLOPS) | 512×512 (GFLOPS) | 1024×1024 (GFLOPS) | 2048×2048 (GFLOPS) | 4096×4096 (GFLOPS) |
|---|---|---|---|---|---|
| NumPy | 101.01 | 368.14 | 720.12 | 816.57 | 883.75 |
| OpenCL | 44.24 | 68.20 | 107.01 | 97.75 | 46.87 |
| Vulkan (`matmul_vec4_wide_tiled`) | 259.43 | 654.88 | 1043.37 | 809.62 | 534.92 |
| CUDA | 116.48 | 380.50 | 1063.82 | 2136.75 | 3990.81 |
Measured with:

```bash
uv run rasptorch --json backend benchmark --backends vulkan --vulkan-submit-every 8 --size 2048 --iterations 5 --warmup 1 --seed 42 --vulkan-kernel <kernel>
```

| Vulkan Kernel | GFLOPS | Status |
|---|---|---|
| `matmul` | 492.73 | ok |
| `matmul_tiled` | 497.81 | ok |
| `matmul_vec4` | 493.15 | ok |
| `matmul_vec4_tiled` | 484.03 | ok |
| `matmul_vec4_wide_tiled` | 778.27 | ok |
| `matmul_a_bt` | 150.90 | ok |
| `matmul_a_bt_tiled` | 173.23 | ok |
Vulkan Kernel Selection:
The `--vulkan-kernel auto` flag probes the available kernels:

- `matmul` - Basic baseline implementation
- `matmul_vec4` - SIMD-style vec4 operations
- `matmul_vec4_tiled` - vec4 plus shared-memory tiling
- `matmul_vec4_wide_tiled` - vec4 plus wider output tiling for better reuse
- `matmul_a_bt` - Matrix transpose optimization (for `A @ B.T`)
- `matmul_a_bt_tiled` - Tiled transpose optimization (fastest when applicable)
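The tiled variants cut global-memory traffic by computing the output in blocks and reusing each loaded tile many times. A plain NumPy sketch of the blocking idea behind them (the real kernels are SPIR-V compute shaders, where each tile would sit in shared/local memory):

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Block-tiled matmul: each C tile accumulates products of A-row
    and B-column tiles, so every loaded tile is reused `tile` times."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.rand(96, 80)
b = np.random.rand(80, 64)
assert np.allclose(tiled_matmul(a, b, tile=32), a @ b)
```

Slicing handles edge tiles automatically, which is why non-multiple sizes still work; a GPU kernel handles the same ragged edges with bounds checks.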
Advanced Tuning:
```bash
# Auto-tune both kernel AND submission batching strategy
uv run rasptorch --json backend benchmark --backends vulkan --size 2048 \
  --iterations 100 --warmup 20 \
  --vulkan-kernel auto \
  --vulkan-autotune-submit \
  --seed 42

# Manual kernel selection with custom batch submission
uv run rasptorch --json backend benchmark --backends vulkan \
  --vulkan-kernel matmul_a_bt_tiled \
  --vulkan-submit-every 4 \
  --size 2048 --iterations 100
```

Large Workload Guidance:
- For 4096x4096 matmul, use the default auto mode or pin `matmul_vec4_wide_tiled` directly.
- If a custom Vulkan kernel triggers `VkErrorDeviceLost`, reduce `--vulkan-submit-every` first.
- The benchmark path uses resident buffers, so the large-size slowdown reflects kernel quality rather than repeated host-device transfers.
Output Format:
Results are provided in JSON format (with the `--json` flag) and include:

- `status`: "ok" or "unavailable"
- `elapsed_seconds`: Total benchmark time
- `iterations_per_second`: Throughput metric
- `estimated_gflops`: Floating-point performance
- `checksum`: Verification result
- `kernel`: Selected kernel name (for auto mode)
- `submit_every`: Submission batch size (for Vulkan)
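Assuming the JSON payload is a list of per-backend records carrying the fields above (the `backend` key and the exact nesting here are illustrative guesses, not the documented schema), a downstream script could pick the fastest usable backend like so:

```python
import json

# Hypothetical result shape: one record per backend with the documented fields.
sample = json.loads("""
[
  {"backend": "numpy",  "status": "ok", "estimated_gflops": 816.57},
  {"backend": "vulkan", "status": "ok", "estimated_gflops": 809.62,
   "kernel": "matmul_vec4_wide_tiled", "submit_every": 8},
  {"backend": "cuda",   "status": "unavailable"}
]
""")

# Skip unavailable backends, then rank by measured throughput.
usable = [r for r in sample if r["status"] == "ok"]
best = max(usable, key=lambda r: r["estimated_gflops"])
print(best["backend"])  # the fastest backend in this sample
```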
Optimization Tips:
- Use `--vulkan-autotune-submit` for best results (probes kernel + batch combinations)
- If you see `VkErrorDeviceLost`, reduce `--vulkan-submit-every` (try 4 or 1)
- Larger problem sizes better amortize GPU setup overhead
- Command buffer batching (`--vulkan-submit-every`) balances latency and throughput
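The amortization point can be made concrete with a toy cost model: given a fixed per-run overhead t₀ and a peak rate R, effective throughput is 2n³ / (t₀ + 2n³/R), which approaches R as n grows. The numbers below are illustrative, not measurements:

```python
def effective_gflops(n: int, overhead_s: float = 1e-3, peak_gflops: float = 800.0) -> float:
    """Toy model: fixed launch/setup overhead plus compute time at peak rate."""
    flops = 2 * n**3
    compute_s = flops / (peak_gflops * 1e9)
    return flops / (overhead_s + compute_s) / 1e9

# Small sizes are overhead-dominated; large sizes approach the peak rate.
for n in (256, 1024, 4096):
    print(n, round(effective_gflops(n), 1))
```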
For detailed optimization guide, see VULKAN_OPTIMIZATION.md.
Basic tensor math is performed via:
```bash
# Create tensors
uv run rasptorch tensor random --shape 2,3,4
uv run rasptorch tensor ones --shape 5,10
```

The results show the low-level tensor capabilities.
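Since the CPU backend is NumPy-backed, the tensor semantics listed earlier (broadcasting, reductions, stacking) mirror NumPy's. A plain NumPy illustration of those semantics (not the rasptorch `tensor` CLI or its Tensor API):

```python
import numpy as np

a = np.random.rand(2, 3, 4)
bias = np.random.rand(4)               # broadcast across the leading dims
out = a + bias                         # elementwise math with broadcasting
pooled = out.sum(axis=-1)              # reduction -> shape (2, 3)
stacked = np.stack([pooled, pooled])   # stacking -> shape (2, 2, 3)
print(pooled.shape, stacked.shape)
```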
Models are defined using structured commands:
```bash
# Simple MLP
uv run rasptorch model mlp --layers "64,32,16,2"

# Complex CNN
uv run rasptorch model cnn --in-channels 3 --out-channels "32,64,128"
```

Managing the lifecycle:

```bash
uv run rasptorch model list
uv run rasptorch model save --model-id <id> --path model.pth
```

- Performance: The fastest paths keep the computation entirely on the GPU and minimize data transfer between host and GPU.
- Fallback: If GPU operations fail due to driver issues, the system gracefully falls back to the CPU NumPy path, but performance will suffer.
- Advanced Use: For a deep dive into custom kernel optimization, refer to the source code in the `rasptorch/gpu_demo.py` and `rasptorch/main.py` scripts.