fix: preserve fused 3D expert tensors for Qwen3.5 MoE in torch_dist→HF #1904
rouchenzi wants to merge 1 commit
Summary
The current `convert_torch_dist_to_hf.py` has two issues when converting Qwen3.5 MoE checkpoints:

1. Qwen3.5 MoE stores expert weights as fused 3D tensors (`[num_experts, hidden, intermediate]`), but the conversion splits them into per-expert 2D tensors, producing a checkpoint inconsistent with the original HF format: https://huggingface.co/Qwen/Qwen3.5-35B-A3B (the two layouts are illustrated in the sketch below).
2. With `--add-missing-from-origin-hf`, the script produces both the split per-expert 2D tensors AND the fused 3D tensors copied from origin, resulting in duplicate expert weights in two different formats.

This fix preserves the original 3D tensor layout for Qwen3.5 MoE without affecting other models. As a side benefit, loading the correct fused format is ~21x faster.
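For context, a minimal sketch of the two layouts (not code from this PR; sizes and key names are illustrative only):

```python
import torch

# Illustrative sizes only; the real Qwen3.5-35B-A3B dimensions are much larger.
num_experts, hidden, intermediate = 4, 8, 16

# Fused layout kept by the original HF checkpoint (and preserved by this fix):
# a single 3D tensor per expert projection.
fused = torch.randn(num_experts, hidden, intermediate)

# Split layout emitted by the unfixed conversion: one 2D tensor per expert,
# i.e. num_experts separate keys instead of one fused key.
split = {f"experts.{i}.weight": fused[i] for i in range(num_experts)}

assert all(t.shape == (hidden, intermediate) for t in split.values())
```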
Changes
- New `_FUSED_EXPERT_MODELS` registry (currently `["qwen3_5moe"]`)
- New `_use_fused_experts(model_name, key_name)` helper: returns True only for Qwen3.5 MoE non-MTP layers (sketched below)
- Thread `model_name` through `get_expert_param` → `get_layer_param` → `get_named_params` → `save_tensors`
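A rough sketch of how the registry and helper described above could look; the names come from this PR, but the bodies below are assumptions, not the actual diff:

```python
# Registry of model names whose expert weights must stay as fused 3D tensors.
_FUSED_EXPERT_MODELS = ["qwen3_5moe"]


def _use_fused_experts(model_name: str, key_name: str) -> bool:
    """Return True only for Qwen3.5 MoE expert weights outside MTP layers.

    Sketch only: the real non-MTP check lives in the PR diff; treating "mtp"
    in the key name as the marker is an assumed convention here.
    """
    if not any(m in model_name for m in _FUSED_EXPERT_MODELS):
        return False
    return "mtp" not in key_name
```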
Test

1. Fix validation
Round-trip conversion test on Qwen3.5-35B-A3B (HF → torch_dist → HF), comparing output keys against the original HF checkpoint (1811 keys):
Without fix:
- with `--add-missing-from-origin-hf`: 32531 keys (both fused 3D from origin AND split 2D from conversion)

With fix:
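One way to reproduce the key comparison is to diff the `weight_map` entries of the two checkpoints' safetensors index files. This is a sketch, not the test script from this PR: the directory names are placeholders, and a single-file checkpoint would need `safetensors.safe_open(...).keys()` instead of the index file.

```python
import json
from pathlib import Path


def checkpoint_keys(ckpt_dir: str) -> set[str]:
    """Collect tensor names from a sharded HF checkpoint's safetensors index."""
    index_path = Path(ckpt_dir) / "model.safetensors.index.json"
    return set(json.loads(index_path.read_text())["weight_map"])


original = checkpoint_keys("Qwen3.5-35B-A3B")        # reference HF checkpoint (1811 keys)
converted = checkpoint_keys("qwen3_5moe-converted")  # output of the round-trip conversion

print("missing from conversion:", len(original - converted))
print("unexpected extra keys:  ", len(converted - original))
```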
2. No impact on other models
The fix only triggers when `model_name` contains `"qwen3_5moe"`. Checked all models in the `megatron_to_hf` converter; none of their config names (`qwen3moeconfig`, `deepseekv3config`, `chatglmconfig`, `qwen3nextconfig`, `llamaconfig`, etc.) contain this substring.

3. Pre-commit checks pass