TurboQuant

Near-optimal vector quantization for .NET 8+

Compress embeddings to 2-4 bits with <0.5% quality loss. Zero dependencies. AOT-ready.

Implements Algorithm 1 from TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh et al., ICLR 2026). Lloyd-Max quantization on the exact Beta(d/2, 1/2) distribution with randomized rotation.

768-dim float32 embedding  ──>  TurboQuant 4-bit  ──>  512 bytes
       3072 bytes                                       6x smaller
                                                        cosine > 0.995

Install

dotnet add package TurboQuant

30-second Example

using TurboQuant;

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

var packed = quantizer.Quantize(embedding);        // float[768] -> PackedVector
var restored = quantizer.Dequantize(packed);       // PackedVector -> float[768]
var score = quantizer.ApproxSimilarity(a, b);      // LUT-based, zero allocation

// Zero-alloc dequantize into existing buffer
Span<float> buffer = stackalloc float[768];
quantizer.Dequantize(packed, buffer);

// Serialize for storage
var bytes = packed.ToBytes();
var loaded = PackedVector.FromBytes(bytes);

Why TurboQuant?

	TurboQuant	Product Quantization	Scalar Quantization
Quality (4-bit)	cosine > 0.995	cosine ~ 0.96	cosine ~ 0.98
Quantization speed	~44 us/vec	~500 ms/vec	~1 us/vec
Similarity search	LUT, zero alloc	Asymmetric dist	Full decompress
Needs training data?	No	Yes (codebook training)	No
Online / streaming?	Yes	No (batch only)	Yes
Theory-backed MSE?	Yes (proven optimal)	No	No

Performance

Benchmarked on Apple Silicon (ARM64), .NET 10, Release mode:

Operation	dim=256	dim=768	Heap Alloc
`Quantize`	8 us	57 us	152 B
`Dequantize`	42 us	42 us	1 KB
`ApproxSimilarity`	0.2 us	0.8 us	0 B
`DotProduct` (SIMD)	20 ns	66 ns	0 B
`L2Norm` (SIMD)	13 ns	51 ns	0 B
`CosineSimilarity` (SIMD)	35 ns	126 ns	0 B

SIMD operations use Vector<T> for automatic ISA selection (NEON on ARM, SSE/AVX2/AVX-512 on x86) with dual accumulators to hide data-dependency latency.

ApproxSimilarity uses a precomputed 16x16 lookup table for 4-bit. No unpacking, no centroid lookup. Direct byte-level scan.

QuantizeBatch processes vectors sequentially with zero overhead beyond the per-vector allocation.

Compression Quality

Validated against the paper's theoretical bounds:

Bits	Ratio	D_mse	Cosine	Paper Upper Bound
4-bit	~6-8x	<= 0.0095	> 0.995	0.011
3-bit	~8-10x	<= 0.035	> 0.98	0.043
2-bit	~12-16x	<= 0.118	> 0.93	0.170

Ratios depend on whether the dimension is a power of 2 (no padding needed) or not.

Use Cases

Embedding Compression (OpenAI, Ollama, Azure AI)

var embedding = await embeddingModel.GenerateAsync(["your text"]);
var vector = embedding[0].Vector.ToArray();

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();
var packed = quantizer.Quantize(vector);

// Store packed vectors in your DB, cache, or memory
// Search directly on compressed data (zero allocation):
var score = quantizer.ApproxSimilarity(queryPacked, docPacked);

100K embeddings (768d): 292 MB -> 37 MB

RAG Pipeline: Compressed Vector Store

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();
var store = new Dictionary<string, PackedVector>();

// Index
foreach (var doc in documents)
{
    var emb = await generator.GenerateAsync([doc.Text]);
    store[doc.Id] = quantizer.Quantize(emb[0].Vector.ToArray());
}

// Search (LUT-based, zero allocation)
var queryPacked = quantizer.Quantize(queryEmbedding);
var topK = store
    .Select(kv => (kv.Key, Score: quantizer.ApproxSimilarity(queryPacked, kv.Value)))
    .OrderByDescending(x => x.Score)
    .Take(10);

LLM KV Cache: Extend Context Length

var (cache, window) = TurboQuantBuilder
    .Create(dim: 128)             // head dimension
    .WithKeyBits(4)               // keys: 4-bit (attention precision)
    .WithValueBits(2)             // values: 2-bit (more aggressive)
    .WithResidualWindow(tokens: 128)
    .BuildKVCache();

// Append during generation. Recent tokens stay in float32,
// older tokens are automatically compressed
window.Append(keyVector, valueVector);

// Retrieve for attention computation
var allKeys = window.GetAllKeys();     // recent: exact, older: decompressed
var allValues = window.GetAllValues();

32 layers x 32 heads x 4K tokens x 128d: 32 GB -> ~6 GB

Microsoft Agent Framework: Compressed RAG

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.ChatCompletion;
using TurboQuant;

// Compressed document store
var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();
var docs = new Dictionary<string, (PackedVector Vec, string Title, string Text, string Link)>();

// Index documents (once)
foreach (var doc in documents)
{
    var embedding = await embeddingModel.GenerateAsync(doc.Text);
    docs[doc.Id] = (quantizer.Quantize(embedding), doc.Title, doc.Text, doc.Link);
}

// Search adapter: quantize query, find top matches via ApproxSimilarity
Task<IEnumerable<TextSearchProvider.TextSearchResult>> SearchAdapter(
    string query, CancellationToken ct)
{
    var queryEmbedding = embeddingModel.Generate(query);
    var queryPacked = quantizer.Quantize(queryEmbedding);

    var results = docs.Values
        .Select(d => (d, Score: quantizer.ApproxSimilarity(queryPacked, d.Vec)))
        .OrderByDescending(x => x.Score)
        .Take(5)
        .Select(x => new TextSearchProvider.TextSearchResult
        {
            Text = x.d.Text,
            SourceName = x.d.Title,
            SourceLink = x.d.Link
        });

    return Task.FromResult(results);
}

// Create agent with compressed RAG
var agent = azureOpenAIClient
    .GetChatClient(deploymentName)
    .AsAIAgent(new ChatClientAgentOptions
    {
        ChatOptions = new() { Instructions = "Answer questions using the provided context." },
        AIContextProviders = [new TextSearchProvider(SearchAdapter, new()
        {
            SearchTime = TextSearchProviderOptions.TextSearchBehavior.BeforeAIInvoke
        })]
    });

100K documents (768d): 292 MB -> 37 MB in the search adapter, 66 ns per similarity comparison.

Semantic Kernel: Compressed Memory

public class CompressedMemoryStore : IMemoryStore
{
    private readonly TurboQuantMSE _quantizer;
    private readonly ConcurrentDictionary<string,
        ConcurrentDictionary<string, (PackedVector Vec, MemoryRecord Rec)>> _collections = new();

    public CompressedMemoryStore(int dim) =>
        _quantizer = TurboQuantBuilder.Create(dim).WithBits(4).BuildMSE();

    public Task<string> UpsertAsync(string collection, MemoryRecord record, CancellationToken ct)
    {
        var store = _collections.GetOrAdd(collection, _ => new());
        var packed = _quantizer.Quantize(record.Embedding.Vector.ToArray());
        store[record.Metadata.Id] = (packed, record);
        return Task.FromResult(record.Metadata.Id);
    }

    public Task<MemoryRecord?> GetNearestMatchAsync(
        string collection, ReadOnlyMemory<float> query, double minScore, CancellationToken ct)
    {
        if (!_collections.TryGetValue(collection, out var store))
            return Task.FromResult<MemoryRecord?>(null);

        var qPacked = _quantizer.Quantize(query.Span);
        return Task.FromResult(store.Values
            .Select(d => (d.Rec, Score: (double)_quantizer.ApproxSimilarity(qPacked, d.Vec)))
            .Where(x => x.Score >= minScore)
            .MaxBy(x => x.Score).Rec);
    }
}

ML.NET Pipeline

var quantizer = TurboQuantBuilder.Create(dim: 384).WithBits(4).BuildMSE();

var pipeline = mlContext.Transforms.CustomMapping<Input, Output>(
    (input, output) =>
    {
        var packed = quantizer.Quantize(input.Features);
        output.Compressed = quantizer.Dequantize(packed);
    },
    contractName: "TurboQuant");

ONNX Runtime: Post-Inference Compression

using var session = new InferenceSession("embedding_model.onnx");
var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

using var results = session.Run(inputs);
var outputTensor = results.First().AsEnumerable<float>().ToArray();

var compressed = new PackedVector[batchSize];
quantizer.QuantizeBatch(outputTensor, compressed);

Serialization

// Save
var bytes = packed.ToBytes();
File.WriteAllBytes("vector.bin", bytes);

// Load
var loaded = PackedVector.FromBytes(File.ReadAllBytes("vector.bin"));
var restored = quantizer.Dequantize(loaded);

Rotation Strategies

TurboQuant defaults to random orthogonal rotation (QR decomposition of a Gaussian matrix), which matches the paper exactly and works with any dimension without overhead.

For power-of-2 dimensions (128, 256, 512, 1024), the Hadamard rotation is an excellent alternative: identical quality, O(d log d) time, and O(d) memory instead of O(d^2).

For non-power-of-2 dimensions (384, 768, 1536), Hadamard pads internally to the next power of 2, which increases packed data size by up to 33%. A Trace.TraceWarning is emitted when this happens.

// Default: random orthogonal (paper-correct, any dimension)
var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

// Fast path for power-of-2 dimensions (recommended for KV cache with head dim 64/128)
var fast = TurboQuantBuilder.Create(dim: 1024).WithBits(4).WithHadamardRotation().BuildMSE();

	Random Orthogonal (default)	Hadamard
Time	O(d^2)	O(d log d)
Memory	O(d^2) per quantizer	O(d) per quantizer
Packed size	Exact (d coordinates)	Exact if power-of-2, padded otherwise
dim=768 quantize	~57 us	~1.3 us (but 33% more storage)
dim=1024 quantize	~100 us	~1.6 us (no padding)
D_mse	= theoretical	= theoretical (power-of-2), better (non-power-of-2)

Best practice: if your embedding model supports configurable dimensions (e.g. OpenAI, nomic), choose a power-of-2 dimension (512, 1024) to get the best of both strategies.

Note on seed correlation: the rotation matrix is generated from a random seed. If your test vectors are generated from the same seed (e.g. new Random(42) for both), the rotation and data become correlated, causing degraded quality. This does not happen in practice (embeddings come from a model, not from Random), but use distinct seeds in synthetic benchmarks.

How It Works

                    ┌──────────────┐
   float[768]  ──>  │  Normalize   │ store ||x||
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │   Rotate     │ random orthogonal (default) or Hadamard
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │  Quantize    │ per-coordinate Lloyd-Max on Beta(d/2, 1/2)
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │  Bit-Pack    │ 4-bit: 2 values/byte, 3-bit: 8 values/3 bytes
                    └──────┬───────┘
                           v
                    PackedVector (data + norm)

The rotation ensures each coordinate is approximately Beta-distributed regardless of input, enabling provably near-optimal scalar quantization. D_mse is within factor sqrt(3)pi/2 ~ 2.7x of the information-theoretic lower bound (Theorem 3 in the paper).

ApproxSimilarity computes cosine similarity directly on packed bytes via a precomputed 16x16 centroid dot-product lookup table (4-bit). No unpacking, no centroid lookup, zero heap allocation.

Project Layout

src/TurboQuant/
  Core/
    Rotation/       RandomRotation (default, paper-correct), HadamardRotation (fast, power-of-2)
    Codebook/       BetaCodebook (Lloyd-Max on exact Beta distribution, with LUT)
    Packing/        BitPacker (2/3/4-bit), PackedVector (serializable, immutable)
    Quantizers/     TurboQuantMSE, TurboQuantAsymmetric
    Simd/           DotProductSimd (Vector<T> auto-ISA: NEON/SSE/AVX2/AVX-512)
  Cache/            KVCache, ResidualWindow
  Diagnostics/      CompressionStats (thread-safe), QuantizationBenchmark
  TurboQuantBuilder.cs

Build & Test

dotnet build TurboQuantNet.slnx
dotnet test
dotnet run --project samples/TurboQuant.Demo
dotnet run -c Release --project benchmarks/TurboQuant.Benchmarks

License

MIT. Based on the paper by Zandieh, Daliri, Hadian & Mirrokni (Google Research / NYU / Google DeepMind).

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
benchmarks/TurboQuant.Benchmarks		benchmarks/TurboQuant.Benchmarks
samples/TurboQuant.Demo		samples/TurboQuant.Demo
src/TurboQuant		src/TurboQuant
tests/TurboQuant.Tests		tests/TurboQuant.Tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Directory.Build.props		Directory.Build.props
LICENSE		LICENSE
README.md		README.md
TurboQuantNet.slnx		TurboQuantNet.slnx
global.json		global.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TurboQuant

Install

30-second Example

Why TurboQuant?

Performance

Compression Quality

Use Cases

Embedding Compression (OpenAI, Ollama, Azure AI)

RAG Pipeline: Compressed Vector Store

LLM KV Cache: Extend Context Length

Microsoft Agent Framework: Compressed RAG

Semantic Kernel: Compressed Memory

ML.NET Pipeline

ONNX Runtime: Post-Inference Compression

Serialization

Rotation Strategies

How It Works

Project Layout

Build & Test

License

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TurboQuant

Install

30-second Example

Why TurboQuant?

Performance

Compression Quality

Use Cases

Embedding Compression (OpenAI, Ollama, Azure AI)

RAG Pipeline: Compressed Vector Store

LLM KV Cache: Extend Context Length

Microsoft Agent Framework: Compressed RAG

Semantic Kernel: Compressed Memory

ML.NET Pipeline

ONNX Runtime: Post-Inference Compression

Serialization

Rotation Strategies

How It Works

Project Layout

Build & Test

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages