feat: add RMSE-guided quantization, AIO GGUF bundling, and lazy-load flag #1573

Draft
shikaku2 wants to merge 2 commits into leejet:master from shikaku2:feat/rmse-aio-lazyload

shikaku2 commented May 28, 2026

Summary

Three related improvements to the --convert workflow and inference startup, focused on reducing disk footprint, RAM usage during conversion, and VRAM requirements at inference time.


1. RMSE-guided mixed-precision quantization (--rmse <threshold>)

Adds a --rmse <pct> flag to convert mode, where <pct> is the RMSE threshold in percent. The flag automatically selects a quantization type for each tensor by running a two-pass sweep.

How it works:

  • Pass 1: for each tensor, loads it as f32, tests candidate quant types (f16, Q8_0, Q6_K, Q5_1, Q5_0, Q4_K, Q4_0, IQ4_NL, Q3_K, Q2_K) from highest to lowest quality, and records the lowest-quality type that keeps RMSE within the threshold.
  • Pass 2: quantizes and writes each tensor at its assigned type.

Peak RAM during conversion = f32 size of the single largest tensor (not the full model). The two-pass design is streaming — no full model is held in memory at once.
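
The two-pass selection can be sketched in Python as follows. This is a minimal illustration, not the PR's C++ code: `fake_quantize` is a hypothetical uniform quantizer standing in for the real block quantizers, the candidate names are placeholders, the `write` callback is invented for the example, and normalizing RMSE by the tensor's RMS to obtain a percentage is an assumption, since the PR does not spell out the exact formula.

```python
import math

# Placeholder candidates as (name, bits), ordered from highest to lowest quality.
# These stand in for the real quant types; they are not ggml formats.
CANDIDATES = [("q8", 8), ("q6", 6), ("q5", 5), ("q4", 4), ("q3", 3), ("q2", 2)]


def fake_quantize(values, bits):
    """Round-trip values through a symmetric uniform grid (hypothetical quantizer)."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / ((1 << (bits - 1)) - 1)
    return [round(v / scale) * scale for v in values]


def relative_rmse_pct(original, restored):
    """RMSE as a percentage of the tensor's RMS (assumed normalization)."""
    n = len(original)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)) / n)
    rms = math.sqrt(sum(a * a for a in original) / n)
    return 0.0 if rms == 0.0 else 100.0 * err / rms


def pick_type(values, threshold_pct):
    """Return the lowest-quality candidate within the threshold, or None if none fits."""
    chosen = None
    for name, bits in CANDIDATES:
        if relative_rmse_pct(values, fake_quantize(values, bits)) <= threshold_pct:
            chosen = name
    return chosen


def convert(tensor_loaders, threshold_pct, write):
    """tensor_loaders maps a tensor name to a zero-argument loader returning f32 values."""
    # Pass 1: load one tensor at a time and record only its assigned type.
    plan = {name: pick_type(load(), threshold_pct) for name, load in tensor_loaders.items()}
    # Pass 2: reload each tensor, quantize it at the assigned type, and hand it to the writer.
    bits_of = dict(CANDIDATES)
    for name, load in tensor_loaders.items():
        values = load()
        kind = plan[name]
        write(name, kind, values if kind is None else fake_quantize(values, bits_of[kind]))
    return plan
```

Because each loader is called, used, and dropped before the next one runs, only one tensor's f32 data is alive at a time, which is the property the peak-RAM claim relies on.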

Results on SD3.5 Large:

The original model files are all F16/BF16 — no F32 in distribution:

| File | Format | Size |
|---|---|---|
| sd3.5_large.safetensors | F16 + BF16 | 16.5 GB |
| t5xxl_fp16.safetensors | F16 | 9.8 GB |
| clip_g.safetensors | F16 | 1.4 GB |
| clip_l.safetensors | F16 | 0.25 GB |
| Total (4 files) | | ~27.9 GB |

RMSE quantization results (all bundled into a single AIO GGUF):

| Target | Size | Reduction |
|---|---|---|
| F16 baseline (4 files) | ~27.9 GB | |
| 1% RMSE | ~14 GB | −50% |
| 3% RMSE | ~13 GB | −53% |
| 6% RMSE | ~12 GB | −57% |
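
The reduction figures follow from the sizes in the table. As a quick check, assuming the rounded sizes shown (which are approximate, so the percentages are too):

```python
BASELINE_GB = 27.9  # F16 baseline total from the table


def reduction_pct(size_gb, baseline_gb=BASELINE_GB):
    """Percentage size reduction relative to the baseline."""
    return 100.0 * (1.0 - size_gb / baseline_gb)


for label, size_gb in [("1% RMSE", 14), ("3% RMSE", 13), ("6% RMSE", 12)]:
    print(f"{label}: -{reduction_pct(size_gb):.0f}%")
```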

At 1% RMSE, most tensors land on Q4_K or Q5_K. RMSE is a tensor-level metric, not a perceptual one — see visual comparison below.


2. All-in-one GGUF bundling (--convert with multiple component flags)

--convert now accepts separate component files (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) and writes them all into a single output GGUF, including metadata that allows the loader to identify each component.

Before this change, distributing a quantized model required shipping 4–6 separate files and passing each as a CLI flag. With it, a single .gguf is self-contained and loadable with just -m.

# Before
./sd -m sd3.5_large.safetensors --clip_l clip_l.safetensors --clip_g clip_g.safetensors \
    --t5xxl t5xxl_fp16.safetensors ...

# After (convert once)
./sd --convert --diffusion-model sd3.5_large.safetensors \
    --clip_l clip_l.safetensors --clip_g clip_g.safetensors \
    --t5xxl t5xxl_fp16.safetensors -o sd3.5_large_aio.gguf

# Then run with a single file
./sd -m sd3.5_large_aio.gguf ...

This is convenience packaging — no quality or performance change.
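
As a rough picture of what bundling involves, the sketch below merges per-component tensor maps into one flat namespace and records which component each tensor belongs to. The prefix scheme and the `bundle`/`split` helpers are hypothetical; the PR does not describe the actual GGUF metadata keys it writes.

```python
def bundle(components):
    """Merge {component: {tensor_name: data}} into one flat map plus a manifest."""
    merged, manifest = {}, {}
    for component, tensors in components.items():
        for name, data in tensors.items():
            key = f"{component}.{name}"  # hypothetical prefix scheme
            if key in merged:
                raise ValueError(f"duplicate tensor name: {key}")
            merged[key] = data
            manifest[key] = component
    return merged, manifest


def split(merged, manifest):
    """Recover the per-component maps from a bundle, as a loader would."""
    components = {}
    for key, data in merged.items():
        component = manifest[key]
        local_name = key[len(component) + 1:]  # strip "<component>."
        components.setdefault(component, {})[local_name] = data
    return components
```

Keeping an explicit manifest, rather than parsing prefixes back out of tensor names, avoids ambiguity when a tensor name itself contains dots.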


3. Lazy-load / staged VRAM eviction (-ll / --lazy-load)

Adds a -ll/--lazy-load flag that enables mmap-backed model loading and staged RAM eviction across the inference pipeline.

Problem: Systems with limited VRAM cannot run the full pipeline when all components (text encoders + diffusion model + VAE) are loaded simultaneously.

How it works:

  • Model is loaded via mmap. Pages are only read from disk when accessed.
  • After each pipeline stage completes, madvise(MADV_DONTNEED) is called on that component's tensors, releasing physical pages without invalidating pointers.
  • Eviction is sequential: text encoders → diffusion model → VAE. Since these stages don't overlap, peak VRAM/RAM is the max of any single component rather than the sum.
  • Auto-enables VAE tiling when -ll is active, to avoid a single large allocation. (SD3.5 VAE decode at 1024×1024 would require a ~4.6 GiB VkBuffer which exceeds the Vulkan maxMemoryAllocationSize = 4 GiB hard limit on many GPUs.)
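
The mmap-plus-eviction pattern in the list above can be sketched in Python using the standard `mmap` module. This is an illustration of the technique, not the PR's implementation: the stage names, the `evict` and `run_stages` helpers, and the toy file layout are all assumptions, and the availability check only approximates the "skip eviction where unsupported" behavior described under known limitations.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def evict(mapping, offset, length):
    """Drop resident pages that lie fully inside [offset, offset + length).

    Returns False when madvise/MADV_DONTNEED is unavailable; eviction is then
    skipped and the mapping keeps working as a plain mmap.
    """
    if not hasattr(mapping, "madvise") or not hasattr(mmap, "MADV_DONTNEED"):
        return False
    start = -(-offset // PAGE) * PAGE         # round start up to a page boundary
    end = ((offset + length) // PAGE) * PAGE  # round end down to a page boundary
    if end > start:
        mapping.madvise(mmap.MADV_DONTNEED, start, end - start)
    return True


def run_stages(path, layout):
    """layout is a list of (stage_name, offset, length); returns per-stage byte sums."""
    results = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for name, offset, length in layout:
            results[name] = sum(m[offset:offset + length])  # reading faults pages in
            evict(m, offset, length)                        # stage done: release its pages
        # The mapping stays valid after eviction; reads re-fault from the file.
        name, offset, length = layout[0]
        assert sum(m[offset:offset + length]) == results[name]
    return results


def demo():
    """Build a toy three-component file and run the stages over it."""
    stages = [("text_encoders", b"\x01"), ("diffusion_model", b"\x02"), ("vae", b"\x03")]
    size = 4 * PAGE
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for _, fill in stages:
            tmp.write(fill * size)
        path = tmp.name
    try:
        layout = [(name, i * size, size) for i, (name, _) in enumerate(stages)]
        return run_stages(path, layout)
    finally:
        os.remove(path)
```

Rounding the range inward to page boundaries means a partial page shared with a neighboring component is never dropped early; for a read-only file mapping that would only cost a re-read, but it keeps the eviction strictly per component.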

Also applies to --convert: lazy-load + threading reduces peak RAM during quantization significantly — useful for generating quants on machines without large RAM.

This feature is architecture-agnostic (UNet, DiT, Flux, WAN, etc.) and works with both AIO GGUFs and separately-loaded component files.


Visual comparison

SD3.5 Large, Seed 42, 1024×1024, 20 steps. Baseline = F16 safetensors with -ll. All RMSE variants are AIO GGUF with -ll. Hardware: RX 9060 XT 16 GB, Vulkan.

"a cute cat"

[Images: F16 baseline | 1% RMSE | 3% RMSE | 6% RMSE]

"a vintage photograph of an old phonograph sitting on a table"

[Images: F16 baseline | 1% RMSE | 3% RMSE | 6% RMSE]

"a serene japanese garden with cherry blossoms at sunset"

[Images: F16 baseline | 1% RMSE | 3% RMSE | 6% RMSE]

Testing

  • SD3.5 Large at 1024×1024, 20 steps, Vulkan backend (RX 9060 XT 16GB)
  • 5 diverse prompts tested with -ll enabled; all succeeded
  • VAE tiling: ~3.5s (49 tiles) vs ~65s CPU fallback without tiling
  • RMSE conversion tested at 1%, 3%, 6% thresholds on SD3.5 Large
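
For a sense of where a tile count like 49 can come from, the sketch below counts overlapping tiles along each axis. The specific parameters (an 8× VAE downscale, 32-latent-pixel tiles, 50% overlap) are assumptions for illustration and are not stated in this PR; with them, a 1024×1024 image yields a 7×7 grid.

```python
import math


def tiles_per_axis(size, tile, overlap):
    """Number of tiles needed to cover `size` with the given tile width and overlap."""
    if size <= tile:
        return 1
    stride = max(1, int(tile * (1.0 - overlap)))
    return math.ceil((size - tile) / stride) + 1


latent_side = 1024 // 8  # assumed 8x downscale from image to latent space
per_axis = tiles_per_axis(latent_side, tile=32, overlap=0.5)
print(per_axis * per_axis)  # 49 under the assumed parameters
```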

Known limitations / future work

  • RMSE threshold tuning is model-dependent. 1% is a reasonable starting point, but a systematic perceptual quality evaluation would help establish better defaults
  • -ll eviction is currently Linux-only (madvise path); Windows/macOS gracefully skip eviction but still benefit from mmap loading
  • AIO bundling does not yet validate that bundled components are compatible with each other

shikaku2 and others added 2 commits May 27, 2026 20:17
…flag

- --rmse <pct>: streaming two-pass mixed-precision quantization; peak RAM
  = f32 size of single largest tensor, not the full model
- --convert with multiple component flags (--clip_l, --clip_g, --t5xxl,
  --diffusion-model, --llm, --vae) bundles into a single AIO GGUF
- -ll/--lazy-load: mmap-backed loading with staged madvise(MADV_DONTNEED)
  eviction after each pipeline stage; auto-enables VAE tiling to avoid
  4+ GiB single allocations that exceed Vulkan maxMemoryAllocationSize

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seed 42, 1024x1024, 20 steps. Baseline = F16 safetensors with -ll.
Variants: 1%/3%/6% RMSE AIO GGUF. Prompts: cat, phonograph, garden.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>