rasptorch is an experimental deep learning library inspired by PyTorch, built with a single focus: making complex neural networks practical and efficient on resource-constrained hardware such as the Raspberry Pi 5, by leveraging its GPU via Vulkan.
The library operates on a multi-layered architecture to maximize hardware utilization:
- CPU Backend (Software): Uses a pure NumPy-backed autograd engine and `nn` module for reliable computation when GPU acceleration is unavailable.
- GPU Backend (Hardware): Features an experimental Vulkan backend for high-speed tensor operations (elementwise math, matmul, reductions) directly on the Pi 5's GPU.
- Interface: Provides a streamlined CLI/Streamlit UI for interactive model building, training, persistence, and inspection.
The Vulkan path relies on real compute shaders compiled to SPIR-V, giving deep control over the underlying hardware.
rasptorch now exposes a backend abstraction API so compute backends can be registered and connected at runtime:
```python
import rasptorch

# Inspect availability
print(rasptorch.available_backends())  # {'cpu': True, 'vulkan': ..., 'opencl': ..., 'cuda': ...}

# Try to connect a backend (falls back to CPU in non-strict mode)
active = rasptorch.connect_backend("vulkan", strict=False)
print(active.name)
```

Built-in backend adapters:

- `numpy` (NumPy adapter; internal key: `cpu`) - Pure NumPy autograd
- `vulkan` (rasptorch Vulkan kernels, with optional CPU fallback) - Optimized for Raspberry Pi 4/5
- `opencl` (pyopencl when available, optional CPU fallback)
- `cuda` (CuPy when available, with PyTorch CUDA fallback, optional CPU fallback)
CLI helpers:

```bash
rasptorch backend list
rasptorch backend connect numpy
rasptorch backend connect vulkan --strict

# Benchmark with auto-tuned Vulkan kernel and submission batching
rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --vulkan-kernel auto --vulkan-autotune-submit --seed 42
```

Note: The user-facing CLI/UI labels the CPU backend as `numpy`. Vulkan benchmark mode uses resident buffers (upload once, repeated on-device matmul, download once).

Performance (Optimized): Vulkan achieves ~564 GFLOPS (~78% of NumPy) on `matmul_vec4` with auto-tuning. For very large square workloads (for example, 4096x4096), the benchmark now prefers the stable `matmul_vec4_wide_tiled` path by default. To pin that path manually, use `--vulkan-kernel matmul_vec4_wide_tiled --vulkan-submit-every 8`. `--vulkan-kernel auto` probes `matmul_vec4_wide_tiled`, `matmul`, `matmul_vec4`, `matmul_vec4_tiled`, `matmul_a_bt`, and `matmul_a_bt_tiled` (when available) and keeps the fastest path. If Vulkan hits `VkErrorDeviceLost`, lower `--vulkan-submit-every` (for example, 4 or 1) or use auto-tuning.

Recommended: Use `--vulkan-autotune-submit` to jointly probe kernel + submit chunk and pick the fastest stable combo.

Optimizations: Command buffer batching, memory-mapped buffers, auto kernel selection.
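The GFLOPS figures above follow the standard 2·n³ FLOP count for an n×n matmul. As a rough, plain-NumPy sketch of how such a number is derived (illustrative only, not the rasptorch benchmark code itself):

```python
import time
import numpy as np

def matmul_gflops(n: int, iterations: int = 5) -> float:
    """Estimate matmul throughput using the standard 2*n^3 FLOP count."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warmup, so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(iterations):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iterations  # n^3 multiply-adds per matmul
    return flops / elapsed / 1e9

print(f"{matmul_gflops(512):.1f} GFLOPS")
```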
- Tensor Operations: Support for elementwise math, matrix multiplication (`matmul`), reductions, indexing, reshaping, stacking, and broadcasting.
- Layers: Includes standard neural network blocks: `Linear`, `MLP`, `CNN`, `GRU`, `Transformer`, normalization layers, activations, pooling, embeddings, and attention.
- Training Tools: Full suite of tools including optimizers (`SGD`), learning-rate schedulers, gradient clipping, and regularization helpers.
- Persistence: Ability to save and load checkpoint weights without needing the full `torch` dependency.
- Interfaces: CLI (`rasptorch chat`) and Streamlit UI (`rasptorch ui`).
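For intuition, the kind of pure NumPy reverse-mode autograd a CPU backend like this relies on can be sketched in a few lines. This is an illustrative toy, not rasptorch's actual `Tensor` implementation:

```python
import numpy as np

class Tensor:
    """Minimal reverse-mode autograd sketch (illustrative only)."""
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.parents = parents
        self.backward_fn = None

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, parents=(self, other))
        def backward_fn():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out.backward_fn = backward_fn
        return out

    def sum(self):
        out = Tensor(self.data.sum(), parents=(self,))
        def backward_fn():
            self.grad += out.grad  # scalar grad broadcasts over inputs
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then run each node's backward pass.
        topo, seen = [], set()
        def visit(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t.parents:
                    visit(p)
                topo.append(t)
        visit(self)
        self.grad = np.ones_like(self.data)
        for t in reversed(topo):
            if t.backward_fn:
                t.backward_fn()

x = Tensor([[1.0, 2.0], [3.0, 4.0]])
w = Tensor([[1.0], [1.0]])
loss = (x @ w).sum()
loss.backward()
print(w.grad)  # gradient of sum(x @ w) w.r.t. w: column sums of x
```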
A. Basic Install (CPU Only): To get the core library components running on the CPU:

```bash
pip install rasptorch
```

B. Development Install (Full Capability): For local development and access to all potential backends:

```bash
pip install -e ".[dev]"
```

C. GPU Mode Prerequisites: To use the GPU backend, install the platform prerequisites below first.

D. GPU Backend Dependencies: For systems with a proper GPU setup (e.g., Raspberry Pi 5), install the GPU backend dependencies using:

```bash
# Prerequisites for Vulkan backend (Raspberry Pi 4/5 and Linux)
sudo apt update
sudo apt install -y glslc

# For Windows, install the Vulkan SDK from LunarG:
./VulkanSDK-Installer.exe --accept-licenses --default-answer --confirm-command install
```
Then install rasptorch with GPU support after installing the prerequisites (on either platform):

```bash
pip install rasptorch[gpu]
```

You can also install only the backend(s) you need instead of the entire `gpu` extra. Examples:

```bash
# Vulkan-only (recommended for Raspberry Pi 4/5)
pip install rasptorch[vulkan]

# OpenCL-only
pip install rasptorch[opencl]

# CUDA-only (for NVIDIA systems with CUDA installed)
pip install rasptorch[cuda]

# Vulkan + OpenCL
pip install "rasptorch[vulkan,opencl]"
```
#### 2. Quick Run Examples
**Start the Interactive Shell:**
```bash
uv run rasptorch chat
```

**Launch the Web UI:**

```bash
uv run rasptorch ui
```

(This will usually open at http://localhost:8501)

**Viewing Help:** To see all available CLI subcommands:

```bash
uv run rasptorch --help
```

The `main.py` script controls the operational mode:
- `cpu`: Pure NumPy autograd execution on the CPU.
- `gpu`: Executes the training loop explicitly using the Vulkan backend kernels.
- `gpu-autograd`: An experimental mode for tracing gradients across the GPU pipeline.
Example Training Command:

```bash
uv run main.py --device gpu --epochs 50 --batch-size 32 --lr 0.01
```

rasptorch provides a built-in benchmark tool for comparing backend performance on matrix multiplication.
Quick Benchmark (Single Size):

```bash
# Benchmark with default settings (2048x2048 matmul, 100 iterations)
uv run rasptorch backend benchmark

# Benchmark with custom size and multiple backends
uv run rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --seed 42
```

Performance Results (Windows 11, NVIDIA GeForce RTX 3070 Ti):
Measured backend scaling with:

```bash
uv run rasptorch --json backend benchmark --backends numpy,opencl,vulkan,cuda --vulkan-kernel matmul_vec4_wide_tiled --vulkan-submit-every 8 --iterations 5 --warmup 1 --seed 42
```

| Backend | 256×256 (GFLOPS) | 512×512 (GFLOPS) | 1024×1024 (GFLOPS) | 2048×2048 (GFLOPS) | 4096×4096 (GFLOPS) |
|---|---|---|---|---|---|
| NumPy | 101.01 | 368.14 | 720.12 | 816.57 | 883.75 |
| OpenCL | 44.24 | 68.20 | 107.01 | 97.75 | 46.87 |
| Vulkan (`matmul_vec4_wide_tiled`) | 259.43 | 654.88 | 1043.37 | 809.62 | 534.92 |
| CUDA | 116.48 | 380.50 | 1063.82 | 2136.75 | 3990.81 |
Measured with:

```bash
uv run rasptorch --json backend benchmark --backends vulkan --vulkan-submit-every 8 --size 2048 --iterations 5 --warmup 1 --seed 42 --vulkan-kernel <kernel>
```

| Vulkan Kernel | GFLOPS | Status |
|---|---|---|
| `matmul` | 492.73 | ok |
| `matmul_tiled` | 497.81 | ok |
| `matmul_vec4` | 493.15 | ok |
| `matmul_vec4_tiled` | 484.03 | ok |
| `matmul_vec4_wide_tiled` | 778.27 | ok |
| `matmul_a_bt` | 150.90 | ok |
| `matmul_a_bt_tiled` | 173.23 | ok |
Vulkan Kernel Selection:
The `--vulkan-kernel auto` flag probes the available kernels:

- `matmul` - Basic baseline implementation
- `matmul_vec4` - SIMD-style vec4 operations
- `matmul_vec4_tiled` - vec4 plus shared-memory tiling
- `matmul_vec4_wide_tiled` - vec4 plus wider output tiling for better reuse
- `matmul_a_bt` - Matrix transpose optimization (for `A @ B.T`)
- `matmul_a_bt_tiled` - Tiled transpose optimization (fastest when applicable)
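The tiled variants cut global-memory traffic by computing the output in blocks and reusing each loaded tile many times. A plain NumPy sketch of the blocking idea behind them (the real kernels are SPIR-V compute shaders, where each tile would sit in shared/local memory):

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Block-tiled matmul: each C tile accumulates products of A-row
    and B-column tiles, so every loaded tile is reused `tile` times."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.rand(96, 80)
b = np.random.rand(80, 64)
assert np.allclose(tiled_matmul(a, b, tile=32), a @ b)
```

Slicing handles edge tiles automatically, which is why non-multiple sizes still work; a GPU kernel handles the same ragged edges with bounds checks.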
Advanced Tuning:
```bash
# Auto-tune both kernel AND submission batching strategy
uv run rasptorch --json backend benchmark --backends vulkan --size 2048 \
  --iterations 100 --warmup 20 \
  --vulkan-kernel auto \
  --vulkan-autotune-submit \
  --seed 42

# Manual kernel selection with custom batch submission
uv run rasptorch --json backend benchmark --backends vulkan \
  --vulkan-kernel matmul_a_bt_tiled \
  --vulkan-submit-every 4 \
  --size 2048 --iterations 100
```

Large Workload Guidance:
- For 4096x4096 matmul, use the default auto mode or pin `matmul_vec4_wide_tiled` directly.
- If a custom Vulkan kernel triggers `VkErrorDeviceLost`, reduce `--vulkan-submit-every` first.
- The benchmark path uses resident buffers, so the large-size slowdown reflects kernel quality rather than repeated host-device transfers.
Output Format:
Results are provided in JSON format (with the `--json` flag) and include:

- `status`: "ok" or "unavailable"
- `elapsed_seconds`: Total benchmark time
- `iterations_per_second`: Throughput metric
- `estimated_gflops`: Floating-point performance
- `checksum`: Verification result
- `kernel`: Selected kernel name (for auto mode)
- `submit_every`: Submission batch size (for Vulkan)
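Assuming the JSON payload is a list of per-backend records carrying the fields above (the `backend` key and the exact nesting here are illustrative guesses, not the documented schema), a downstream script could pick the fastest usable backend like so:

```python
import json

# Hypothetical result shape: one record per backend with the documented fields.
sample = json.loads("""
[
  {"backend": "numpy",  "status": "ok", "estimated_gflops": 816.57},
  {"backend": "vulkan", "status": "ok", "estimated_gflops": 809.62,
   "kernel": "matmul_vec4_wide_tiled", "submit_every": 8},
  {"backend": "cuda",   "status": "unavailable"}
]
""")

# Skip unavailable backends, then rank by measured throughput.
usable = [r for r in sample if r["status"] == "ok"]
best = max(usable, key=lambda r: r["estimated_gflops"])
print(best["backend"])  # the fastest backend in this sample
```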
Optimization Tips:
- Use `--vulkan-autotune-submit` for best results (probes kernel + batch combinations)
- If you see `VkErrorDeviceLost`, reduce `--vulkan-submit-every` (try 4 or 1)
- Larger problem sizes better amortize GPU setup overhead
- Command buffer batching (`--vulkan-submit-every`) balances latency and throughput
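The amortization point can be made concrete with a toy cost model: given a fixed per-run overhead t₀ and a peak rate R, effective throughput is 2n³ / (t₀ + 2n³/R), which approaches R as n grows. The numbers below are illustrative, not measurements:

```python
def effective_gflops(n: int, overhead_s: float = 1e-3, peak_gflops: float = 800.0) -> float:
    """Toy model: fixed launch/setup overhead plus compute time at peak rate."""
    flops = 2 * n**3
    compute_s = flops / (peak_gflops * 1e9)
    return flops / (overhead_s + compute_s) / 1e9

# Small sizes are overhead-dominated; large sizes approach the peak rate.
for n in (256, 1024, 4096):
    print(n, round(effective_gflops(n), 1))
```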
For detailed optimization guide, see VULKAN_OPTIMIZATION.md.
Basic tensor math is performed via:
```bash
# Create tensors
uv run rasptorch tensor random --shape 2,3,4
uv run rasptorch tensor ones --shape 5,10
```

The results show the low-level tensor capabilities.
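Since the CPU backend is NumPy-backed, the tensor semantics listed earlier (broadcasting, reductions, stacking) mirror NumPy's. A plain NumPy illustration of those semantics (not the rasptorch `tensor` CLI or its Tensor API):

```python
import numpy as np

a = np.random.rand(2, 3, 4)
bias = np.random.rand(4)               # broadcast across the leading dims
out = a + bias                         # elementwise math with broadcasting
pooled = out.sum(axis=-1)              # reduction -> shape (2, 3)
stacked = np.stack([pooled, pooled])   # stacking -> shape (2, 2, 3)
print(pooled.shape, stacked.shape)
```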
Models are defined using structured commands:
```bash
# Simple MLP
uv run rasptorch model mlp --layers "64,32,16,2"

# Complex CNN
uv run rasptorch model cnn --in-channels 3 --out-channels "32,64,128"
```

Managing the lifecycle:

```bash
uv run rasptorch model list
uv run rasptorch model save --model-id <id> --path model.pth
```

- Performance: The fastest paths keep the computation entirely on the GPU and minimize data transfer between host and GPU.
- Fallback: If GPU operations fail due to driver issues, the system gracefully falls back to the CPU NumPy path, but performance will suffer.
- Advanced Use: For a deep dive into custom kernel optimization, refer to the source code in the `rasptorch/gpu_demo.py` and `rasptorch/main.py` scripts.