Add Qwen2.5-32B-BitNet support with per-linear RMSNorm #522
Open
vladimirmoushkov wants to merge 1 commit into microsoft:main
Conversation
Adds per-linear RMSNorm support to the Qwen2 forward pass in llama.cpp,
enabling inference of BitNet QAT-trained Qwen2.5-32B models.
The per-linear RMSNorm applies a separate RMSNorm before each quantized
linear projection (Q, K, V, O, gate, up, down), adding 448 small norm
weight tensors (7 per layer x 64 layers). These are loaded as optional
tensors from the GGUF with names like:
blk.{layer}.attn_q.rms_norm.weight
blk.{layer}.ffn_gate.rms_norm.weight
etc.
When the norm weights are absent (standard BitNet models), the forward
pass falls through to the existing code path — no regression.
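The mechanism described above can be sketched as follows. This is an illustrative stand-in, not the actual llama.cpp code: `rms_norm` and `maybe_norm` are hypothetical helper names, and the real patch operates on ggml graph tensors rather than `std::vector`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-linear RMSNorm: a small per-channel norm weight vector is applied
// to the activations immediately before each quantized projection.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& w,
                            float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] * scale * w[i];
    return out;
}

// The norm tensor is optional: when it was absent in the GGUF
// (standard BitNet models), fall through to the raw activations,
// matching the "no regression" behavior described above.
std::vector<float> maybe_norm(const std::vector<float>& x,
                              const std::vector<float>* norm_w) {
    return norm_w ? rms_norm(x, *norm_w) : x;
}
```

The same pattern repeats for all seven projections per layer (Q, K, V, O, gate, up, down), which is where the 7 x 64 = 448 extra tensors come from.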
The qwen2-per-linear-rmsnorm.patch contains the full diff against the
llama.cpp submodule (3rdparty/llama.cpp/src/llama.cpp).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds support for running QAT-trained Qwen2.5-32B-BitNet models that use
per-linear RMSNorm — a separate RMSNorm layer applied before each quantized
linear projection to stabilize ternary activations.
Changes
- Patch file for 3rdparty/llama.cpp/src/llama.cpp: adds optional attn_q.rms_norm, attn_k.rms_norm, attn_v.rms_norm, attn_output.rms_norm, ffn_gate.rms_norm, ffn_down.rms_norm, and ffn_up.rms_norm tensors, applied before the Q/K/V/O attention projections and the gate/up/down FFN projections when the norm weights are present
- Norm tensors are loaded as TENSOR_NOT_REQUIRED; when absent (standard models), the existing code path runs unchanged
Context
Qwen2.5-32B can be converted to ternary {-1, 0, +1} weights via
Quantization-Aware Training (QAT) with Straight-Through Estimator. The
per-linear RMSNorm stabilizes activations entering each ternary matmul,
which was key to reaching loss 1.93 at 32B scale. Without it, ternary
quantization at this scale produces divergent activations.
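For context, ternary weight quantization of the kind referred to above can be sketched with BitNet-b1.58-style absmean rounding; the exact QAT recipe used for this model may differ, so treat this as an assumption-laden illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

// Map full-precision weights to {-1, 0, +1} plus one per-tensor scale
// (the mean absolute value). During QAT, the Straight-Through Estimator
// bypasses the non-differentiable round/clip in the backward pass, so
// gradients flow to the latent full-precision weights as if this were
// the identity function.
std::pair<std::vector<int8_t>, float>
ternary_quantize(const std::vector<float>& w) {
    float mean_abs = 0.0f;
    for (float v : w) mean_abs += std::fabs(v);
    mean_abs /= w.size();
    const float scale = mean_abs > 0.0f ? mean_abs : 1.0f;
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float r = std::round(w[i] / scale);                    // nearest integer
        q[i] = (int8_t)std::fmax(-1.0f, std::fmin(1.0f, r));   // clip to {-1,0,+1}
    }
    return { q, scale };
}
```

The per-linear RMSNorm in this PR sits in front of each such ternary matmul, keeping the activation scale entering it well-conditioned.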
Results
Testing