Add CUDA QJL kernels and proper package structure by johndpope · Pull Request #3 · tonbistudio/turboquant-pytorch

johndpope · 2026-03-26T08:21:19Z

Summary

Integrate QJL CUDA kernels for fused quantization and attention score computation (addresses cuda version? #2)
Restructure flat files into turboquant/ package with setup.py for pip install -e .
Add cuda_backend.py with automatic CUDA/PyTorch dispatch and full PyTorch fallback
Add benchmark and validation scripts

CUDA Kernels Added

Kernel	Purpose
`qjl_quant_kernel.cu`	Fused random projection + sign quantization + outlier separation
`qjl_score_kernel.cu`	Fused attention score from 1-bit quantized keys
`qjl_gqa_score_kernel.cu`	Grouped Query Attention variant
`quantization.cu`	Quantized batched matmul for value reconstruction

Build

# PyTorch-only (no CUDA kernels needed)
pip install -e .

# With CUDA kernels
CUDA_HOME=/usr/local/cuda python setup.py build_ext --inplace

Validation Results

Tested on Qwen2.5-3B-Instruct (36 layers, 72 KV heads) with real KV cache data:

2K Context (2065 tokens)

Config	Cache Size	Compression	Cosine Sim	Top-1 Match	Top-5 Match
FP16	72.6 MB	1.0x	—	—	—
4-bit	19.0 MB	3.8x	0.9988	86.1%	95.8%
3-bit	14.5 MB	5.0x	0.9961	84.7%	94.4%
2-bit	9.9 MB	7.3x	0.9897	63.9%	83.3%

4K Context (4090 tokens)

Config	Cache Size	Compression	Cosine Sim	Top-1 Match	Top-5 Match
FP16	143.8 MB	1.0x	—	—	—
4-bit	37.6 MB	3.8x	0.9986	91.7%	94.4%
3-bit	28.6 MB	5.0x	0.9955	72.2%	90.3%
2-bit	19.7 MB	7.3x	0.9878	65.3%	83.3%

8K Context (8221 tokens)

Config	Cache Size	Compression	Cosine Sim	Top-1 Match	Top-5 Match
FP16	289.0 MB	1.0x	—	—	—
4-bit	75.6 MB	3.8x	0.9983	86.1%	95.8%
3-bit	57.6 MB	5.0x	0.9945	84.7%	93.1%
2-bit	39.5 MB	7.3x	0.9851	68.1%	87.5%

Synthetic Benchmark (Lloyd-Max codebook)

Bits	Measured MSE	Theoretical Bound	Ratio
1-bit	0.361	0.680	0.53x
2-bit	0.116	0.170	0.68x
3-bit	0.034	0.043	0.79x
4-bit	0.009	0.011	0.87x

All MSE values are well within the paper's theoretical upper bounds.

Needle-in-Haystack Retrieval

Perfect 9/9 exact match across all bit-widths (2, 3, 4) and context lengths (512, 2048, 8192).

Key Takeaway

3-bit is the practical sweet spot: 5x compression with 99.5% attention fidelity. At 128K context, this means ~3.6 GB KV cache instead of ~18 GB — fitting entirely on a single 24GB GPU.

Test plan

python -m turboquant.test_turboquant — all synthetic tests pass
python -m turboquant.validate — real model validation on Qwen2.5-3B
python -m turboquant.benchmark_cuda — CUDA kernel benchmarks
CUDA kernels build with setup.py build_ext --inplace

Tested on: RTX PRO 4000 Blackwell (24GB), RTX 5090 (32GB), CUDA 12.9, PyTorch 2.11

🤖 Generated with Claude Code

Integrate QJL CUDA kernels from amirzandieh/QJL for fused quantization and attention score computation. Restructure flat files into turboquant/ package with setup.py for installable distribution. New files: - turboquant/cuda_backend.py: QJL CUDA kernel wrappers with PyTorch fallback - turboquant/csrc/*.cu: CUDA kernels (quant, score, gqa_score, quantization) - turboquant/benchmark_cuda.py: PyTorch vs CUDA kernel benchmarks - setup.py: pip-installable package with optional CUDA build Validation results on Qwen2.5-3B-Instruct (8K context): - 3-bit: 5.0x compression, 0.9945 cosine sim, 289MB -> 57.6MB - 4-bit: 3.8x compression, 0.9983 cosine sim, 289MB -> 75.6MB Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cesarfrancots mentioned this pull request Mar 29, 2026

fix: package structure, memory-correct compression, codebook dedup #11

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CUDA QJL kernels and proper package structure#3

Add CUDA QJL kernels and proper package structure#3
johndpope wants to merge 1 commit intotonbistudio:masterfrom
johndpope:feat/cuda-qjl-kernels

johndpope commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

johndpope commented Mar 26, 2026

Summary

CUDA Kernels Added

Build

Validation Results

2K Context (2065 tokens)

4K Context (4090 tokens)

8K Context (8221 tokens)

Synthetic Benchmark (Lloyd-Max codebook)

Needle-in-Haystack Retrieval

Key Takeaway

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant