[dev] split fused moe experts to ensure quantization #2464
base: main
Changes from all commits
New example script added (diff `@@ -0,0 +1,24 @@`):

```python
from llmcompressor import model_free_ptq

MODEL_ID = "Qwen/Qwen3.5-35B-A3B"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W8A8-INT8"

# Apply W8A8 to the model
# Once quantized, the model is saved
# using compressed-tensors to the SAVE_DIR.
model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="W8A8",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate.*",
        "re:.*norm.*",
        "re:.*embed_tokens.*",
        "re:.*visual.*",
        "re:.*conv1d.*"
    ],
    max_workers=15,
    device="cuda:0",
)
```
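The `re:` entries in the `ignore` list are regular expressions matched against module names. A hedged sketch of how such patterns behave (the matching helper and module names here are illustrative, not llmcompressor's actual implementation, whose internals may differ):

```python
import re

patterns = [
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate.*",
    "re:.*norm.*",
]

def is_ignored(module_name: str, patterns: list[str]) -> bool:
    # Illustrative matcher: entries prefixed with "re:" are treated
    # as regexes applied to the module name.
    for p in patterns:
        if p.startswith("re:") and re.search(p[3:], module_name):
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate", patterns))        # True ($ anchors the match)
print(is_ignored("model.layers.0.mlp.gate_proj", patterns))   # False
print(is_ignored("model.layers.0.post_attention_layernorm.weight", patterns))  # True
```

Note that `"re:.*mlp.gate$"` is anchored with `$` so it skips `gate_proj` and `gate_up_proj` modules while still catching the MoE router gate.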
Changes to `process_file`:

```diff
@@ -77,6 +77,7 @@ def process_file(
     """
     assert not is_microscale_scheme(scheme), "Use `_process_file_microscale_scheme`"
     tensors = load_file(file_path)
+    tensors = split_fused_moe_experts(tensors)

     if converter is not None:
         converter.process(tensors)
```
New function appended after `process_file_microscale_scheme` (diff `@@ -194,3 +195,63 @@`):

```python
def split_fused_moe_experts(tensors: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """
    Find fused MoE experts (with gate_up_proj/down_proj) and split them
    from 3D tensors into individual 2D expert tensors.

    Args:
        tensors: Dictionary of loaded tensors from a safetensors file

    Returns:
        New dictionary with split expert weights
    """
    _tensors = {}

    for name, tensor in tensors.items():
        # Check if this is a MoE expert weight (3D tensor for experts)
        if tensor.ndim == 3 and ("experts.gate_up_proj" in name or "experts.down_proj" in name):
            # Get number of experts
            num_experts = tensor.shape[0]

            if "gate_up_proj" in name:
                # gate_up_proj is typically [num_experts, 2*intermediate, hidden]
                if tensor.shape[1] % 2 != 0:
                    print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
                    continue

                hidden_size = tensor.shape[1] // 2

                # Split into individual experts
                for expert_idx in range(num_experts):
                    expert_tensor = tensor[expert_idx]  # [2*intermediate, hidden]

                    # Split gate and up projections
                    gate_proj = expert_tensor[:hidden_size, :]
                    up_proj = expert_tensor[hidden_size:, :]

                    # Create new key names
                    base_key = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
                    _tensors[base_key + ".gate_proj.weight"] = gate_proj
                    _tensors[base_key + ".up_proj.weight"] = up_proj

                print(f"Split {name} into {num_experts} experts")

            elif "down_proj" in name:
                # down_proj is typically [num_experts, hidden, intermediate]
                # Split into individual experts
                for expert_idx in range(num_experts):
                    down_proj = tensor[expert_idx]  # [hidden, intermediate]

                    # Create new key name
                    new_key = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
                    _tensors[new_key] = down_proj

                print(f"Split {name} into {num_experts} experts")
        else:
            # Non-MoE or non-3D tensors, keep as is
            _tensors[name] = tensor

    return _tensors
```

Review comments:

- **Collaborator** (on `_tensors = {}`): rename as `fused_tensors`
- **Contributor** (on lines +222 to +224): There are two issues here. The …
- **Contributor** (on `hidden_size = tensor.shape[1] // 2`): The variable …
- **Collaborator** (on the gate/up slicing): might be a little more readable to just do the full indexing at the same time instead of doing it piecemeal; these are all views, so performance will be equivalent
- **Contributor** (on lines +236 to +238): The logic for generating new tensor names for the split …
- **Contributor** (on lines +249 to +250): Similar to the …
- **Reviewer** (on `_tensors`): The variable `_tensors` is unconventional for a local variable. A more descriptive name like `processed_tensors` or `new_tensors` would improve readability. The `_` prefix is usually reserved for internal/private variables by convention. This change would need to be applied throughout the function.
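As a sanity check for the splitting and key-renaming logic in `split_fused_moe_experts`, here is a standalone sketch. It reimplements the same logic with NumPy arrays in place of torch tensors (the function only relies on `.ndim`, `.shape`, and slicing, so the behavior carries over); the tensor names and sizes are illustrative:

```python
import numpy as np

def split_fused_moe_experts(tensors):
    # Minimal reimplementation of the PR's splitting logic,
    # duck-typed over anything with .ndim/.shape/slicing.
    out = {}
    for name, tensor in tensors.items():
        if tensor.ndim == 3 and ("experts.gate_up_proj" in name or "experts.down_proj" in name):
            num_experts = tensor.shape[0]
            if "gate_up_proj" in name:
                half = tensor.shape[1] // 2  # first half -> gate, second half -> up
                for i in range(num_experts):
                    base = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{i}")
                    out[base + ".gate_proj.weight"] = tensor[i, :half, :]
                    out[base + ".up_proj.weight"] = tensor[i, half:, :]
            else:
                for i in range(num_experts):
                    key = name.replace("mlp.experts.down_proj", f"mlp.experts.{i}") + ".down_proj.weight"
                    out[key] = tensor[i]
        else:
            out[name] = tensor
    return out

# Dummy fused weights: 4 experts, intermediate=8, hidden=16
fused = {
    "model.layers.0.mlp.experts.gate_up_proj": np.zeros((4, 2 * 8, 16)),
    "model.layers.0.mlp.experts.down_proj": np.zeros((4, 16, 8)),
    "model.layers.0.input_layernorm.weight": np.zeros((16,)),
}
split = split_fused_moe_experts(fused)
print(split["model.layers.0.mlp.experts.0.gate_proj.weight"].shape)  # (8, 16)
print(split["model.layers.0.mlp.experts.3.down_proj.weight"].shape)  # (16, 8)
print(len(split))  # 4 experts * 3 projections + 1 passthrough = 13
```

Each fused 3D tensor fans out into per-expert 2D keys (`mlp.experts.{i}.gate_proj.weight` etc.), while non-MoE tensors pass through untouched, which is what lets per-linear quantization see each expert as an ordinary 2D weight.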