Add CUDA QJL kernels and proper package structure #3
Open
johndpope wants to merge 1 commit into tonbistudio:master from
Conversation
Integrate QJL CUDA kernels from amirzandieh/QJL for fused quantization and attention score computation. Restructure flat files into a turboquant/ package with setup.py for an installable distribution.

New files:
- turboquant/cuda_backend.py: QJL CUDA kernel wrappers with PyTorch fallback
- turboquant/csrc/*.cu: CUDA kernels (quant, score, gqa_score, quantization)
- turboquant/benchmark_cuda.py: PyTorch vs CUDA kernel benchmarks
- setup.py: pip-installable package with optional CUDA build

Validation results on Qwen2.5-3B-Instruct (8K context):
- 3-bit: 5.0x compression, 0.9945 cosine sim, 289 MB -> 57.6 MB
- 4-bit: 3.8x compression, 0.9983 cosine sim, 289 MB -> 75.6 MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
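The "CUDA kernel wrappers with PyTorch fallback" pattern described for `cuda_backend.py` can be sketched as follows. This is illustrative only: the module name `qjl_cuda` and the function names are hypothetical stand-ins, not this PR's actual API, and the fallback shown is plain uniform scalar quantization rather than the QJL scheme.

```python
# Sketch of a CUDA-with-PyTorch-fallback dispatch wrapper.
# Names (_load_extension, qjl_cuda, qjl_quantize) are hypothetical.

def _load_extension():
    """Try to import the compiled CUDA extension; return None if absent."""
    try:
        import qjl_cuda  # hypothetical compiled extension module
        return qjl_cuda
    except ImportError:
        return None

_EXT = _load_extension()

def qjl_quantize(keys, bits=3):
    """Quantize key vectors with the CUDA kernel when available."""
    if _EXT is not None:
        return _EXT.quantize(keys, bits)  # fused CUDA path
    # Reference fallback: per-vector uniform scalar quantization.
    levels = (1 << bits) - 1
    out = []
    for row in keys:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / levels if hi > lo else 1.0
        out.append([round((x - lo) / scale) for x in row])
    return out
```

The dispatch decision is made once at import time, so per-call overhead on the hot path is a single `is not None` check.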
Summary
- `turboquant/` package with `setup.py` for `pip install -e .`
- `cuda_backend.py` with automatic CUDA/PyTorch dispatch and full PyTorch fallback

CUDA Kernels Added
From amirzandieh/QJL (Apache-2.0):
- `qjl_quant_kernel.cu`
- `qjl_score_kernel.cu`
- `qjl_gqa_score_kernel.cu`
- `quantization.cu`

Build
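An optional-CUDA build can take roughly this shape (a sketch, not the PR's actual `setup.py`; the extension name `qjl_cuda` is hypothetical, while `CUDAExtension` and `BuildExtension` come from `torch.utils.cpp_extension`):

```python
# Sketch: build the CUDA extension only when torch reports a usable
# GPU toolchain, so a plain pip install still works CPU-only.
from setuptools import setup, find_packages

ext_modules, cmdclass = [], {}
try:
    import torch
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension
    if torch.cuda.is_available():
        ext_modules = [CUDAExtension(
            name="qjl_cuda",  # hypothetical extension module name
            sources=[
                "turboquant/csrc/qjl_quant_kernel.cu",
                "turboquant/csrc/qjl_score_kernel.cu",
                "turboquant/csrc/qjl_gqa_score_kernel.cu",
                "turboquant/csrc/quantization.cu",
            ],
        )]
        cmdclass = {"build_ext": BuildExtension}
except ImportError:
    pass  # torch absent: install the pure-Python package only

setup(
    name="turboquant",
    packages=find_packages(),
    ext_modules=ext_modules,
    cmdclass=cmdclass,
)
```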
Validation Results
Tested on Qwen2.5-3B-Instruct (36 layers, 72 KV heads) with real KV cache data:
2K Context (2065 tokens)
4K Context (4090 tokens)
8K Context (8221 tokens)
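The fidelity metric used at each context length, cosine similarity between original and round-tripped vectors, can be sketched in pure Python (a reference illustration; the actual `validate` module operates on real KV tensors, and the uniform quantizer here is a stand-in):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def quant_dequant(v, bits):
    """Round-trip a vector through uniform scalar quantization."""
    levels = (1 << bits) - 1
    lo, hi = min(v), max(v)
    scale = (hi - lo) / levels
    return [round((x - lo) / scale) * scale + lo for x in v]

v = [0.1, -0.4, 0.9, 0.3, -0.7]
sim = cosine_sim(v, quant_dequant(v, bits=3))  # close to 1.0 at 3 bits
```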
Synthetic Benchmark (Lloyd-Max codebook)
All MSE values are well within the paper's theoretical upper bounds.
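A Lloyd-Max codebook like the one benchmarked above is fit by alternating nearest-centroid assignment with centroid recomputation until the levels minimize MSE. A minimal scalar version (pure-Python sketch, not this repo's implementation):

```python
def lloyd_max(samples, bits, iters=50):
    """Fit 2**bits scalar codebook levels minimizing MSE (Lloyd-Max)."""
    k = 1 << bits
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced levels over the data range.
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assignment step: map each sample to its nearest level.
        buckets = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda i: abs(x - levels[i]))
            buckets[j].append(x)
        # Update step: move each level to its bucket's mean.
        levels = [sum(b) / len(b) if b else levels[i]
                  for i, b in enumerate(buckets)]
    return sorted(levels)

def quantize(x, levels):
    """Map a scalar to its nearest codebook level."""
    return min(levels, key=lambda c: abs(x - c))
```

Because the levels adapt to the data distribution, Lloyd-Max beats uniform quantization whenever the values are non-uniform, which is why its MSE sits under the uniform bound.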
Needle-in-Haystack Retrieval
Perfect 9/9 exact match across all bit-widths (2, 3, 4) and context lengths (512, 2048, 8192).
Key Takeaway
3-bit is the practical sweet spot: 5x compression with 99.5% attention fidelity. At 128K context, this means ~3.6 GB KV cache instead of ~18 GB — fitting entirely on a single 24GB GPU.
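The 8K-context figures reported earlier can be sanity-checked arithmetically; the ideal 16-bit to b-bit ratio is 16/b, and per-group scales and zero points account for the gap to the measured ratio:

```python
# Sanity-check the compression ratios from the 8K-context validation run.
fp16_mb = 289.0                 # FP16 KV cache size reported above
reported = {3: 57.6, 4: 75.6}   # bits -> compressed size in MB

ratios = {bits: fp16_mb / size for bits, size in reported.items()}
ideal = {bits: 16 / bits for bits in reported}  # metadata-free ratio
```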
Test plan
- `python -m turboquant.test_turboquant` — all synthetic tests pass
- `python -m turboquant.validate` — real model validation on Qwen2.5-3B
- `python -m turboquant.benchmark_cuda` — CUDA kernel benchmarks
- `setup.py build_ext --inplace`

Tested on: RTX PRO 4000 Blackwell (24GB), RTX 5090 (32GB), CUDA 12.9, PyTorch 2.11
🤖 Generated with Claude Code