
Conversation

@isharif168 (Contributor) commented Nov 24, 2025

  1. Add GptOssModel.load_per_expert_unfused_w4a8 helper to handle per-expert unfused MoE weights (gate_proj, up_proj, down_proj) in W4A8 checkpoints and map them into the fused FusedMoE layout (w13_* and w2_* parameters).
    • Handles .weight, .weight_scale, .bias, and .input_scale suffixes.
    • For biases, manually slices and writes into the appropriate columns of w13_bias (gate vs up) and w2_bias, supporting both 1D and 2D parameter layouts and using expert_id to pick the correct expert slice when the source tensor has an extra expert dimension (see the sketch after this list).
    • For weights/scales, delegates to a custom weight_loader when present, falling back to default_weight_loader otherwise, and surfaces whether the mapping was successfully handled.

  2. Extend _load_weights_other to:
    • Detect W4A8 (int4 weights, int8 activations) from self.config.quantization_config.config_groups["group_0"] and gate the new MoE path on is_w4a8.
    • Precompute expert_params_mapping via FusedMoE.make_expert_params_mapping(...) for the MoE gate/up/down projections.
    • For W4A8 models, first attempt load_per_expert_unfused_w4a8 for each weight tensor; if handled, mark the target param as loaded and skip the rest of the logic for that tensor. This allows loading checkpoints that store MoE weights per expert in an unfused form while still using FusedMoE fused storage internally.

  3. Harden the generic stacking/renaming path in _load_weights_other:
    • After substituting weight_name with param_name, skip if the resulting name is not in params_dict to avoid key errors on non-matching shards.
    • Ensure loaded_params.add(name) is only called when a parameter is actually found and loaded (both in the stacked-path and the final “other weights” path).

  4. Update GptOssForCausalLM metadata and mapping to match the new weight layout:
    • Expand packed_modules_mapping to include:
      • "qkv": ["q_proj", "k_proj", "v_proj"] for packed attention projections.
      • "gate_up_proj": ["gate_proj", "up_proj"] for packed MoE gate+up projections used by compressed-tensors.
    • Extend hf_to_vllm_mapper mappings:
      • Map .self_attn. → .attn. (existing behavior).
      • Map .qkv. → .qkv_proj. to align HF checkpoints that use a qkv naming convention with vLLM’s QKVParallelLinear.
      • Map .mlp.experts.experts. → .mlp.experts. to flatten HF’s expert naming into the fused MoE layout.
    • Add suffix mappings for MoE/compressed-tensors artifacts:
      • Gate+up and down projection blocks/scales (MXFP4 and other formats) to w13_* and w2_*.
      • Bias variants (.gate_up_proj_bias, .down_proj_bias) to w13_bias and w2_bias.
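
A minimal sketch of the bias handling described in point 1 (shapes, the helper name, and the gate-first/up-second column order are assumptions for illustration, not the exact implementation):

import torch

# Assumed fused layouts:
#   w13_bias: [num_experts, 2 * intermediate_size]  (gate columns, then up columns)
#   w2_bias:  [num_experts, hidden_size]
num_experts, intermediate_size = 4, 64
w13_bias = torch.zeros(num_experts, 2 * intermediate_size)

def write_gate_or_up_bias(loaded: torch.Tensor, expert_id: int, is_gate: bool) -> None:
    # If the checkpoint tensor carries an extra expert dimension, pick this expert's slice.
    if loaded.dim() == 2:
        loaded = loaded[expert_id]
    col = 0 if is_gate else intermediate_size
    w13_bias[expert_id, col:col + intermediate_size].copy_(loaded)

# One expert's unfused gate_proj.bias and up_proj.bias from the checkpoint (stand-in values).
write_gate_or_up_bias(torch.randn(intermediate_size), expert_id=0, is_gate=True)
write_gate_or_up_bias(torch.randn(intermediate_size), expert_id=0, is_gate=False)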

Purpose

To use dynamic_int4_moe quantization for the gpt_oss model

Test Plan

Tested with a GPT-OSS model quantized with llm-compressor

Test Result

The quantized GPT-OSS model loaded and generated correctly (see the sample output later in this thread)



@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for loading int4 weights for the GPT-OSS model, specifically for the compressed-tensors format. The changes include new methods to handle fused and unfused expert tensors, as well as specific logic for bias loading. While the implementation is comprehensive, I've identified a logical redundancy in the weight loading dispatch that could lead to inefficiencies and potential bugs. My review comment details this issue and suggests a fix to clarify the logic and improve correctness.

Comment on lines 560 to 562
if "mlp.experts." in name and (
".gate_proj" in name or ".up_proj" in name or ".down_proj" in name
):
Severity: high

The current logic for loading expert weights can redundantly attempt to load bias tensors. An expert bias is first processed by load_int4_bias_gptoss. If that method returns False, the execution falls through to this block, where load_unfused_expert_weight is called, which also contains logic to handle biases. This is inefficient and can lead to incorrect behavior if the generic bias loading logic in load_unfused_expert_weight is not appropriate for this model's int4 biases.

To avoid this redundancy and potential issue, the condition should be modified to explicitly exclude bias tensors from being handled by load_unfused_expert_weight.

Suggested change
if "mlp.experts." in name and (
".gate_proj" in name or ".up_proj" in name or ".down_proj" in name
):
if "mlp.experts." in name and not name.endswith(".bias") and (
".gate_proj" in name or ".up_proj" in name or ".down_proj" in name
):

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 514 to 518
use_ep = self.parallel_config.enable_expert_parallel
tp_size, tp_rank = FusedMoEParallelConfig.flatten_tp_across_dp(
tp_size=get_tensor_model_parallel_world_size(),
dp_size=get_dp_group().world_size,
dp_rank=get_dp_group().rank_in_group,


P1: Use existing TP flatten helper in int4 weight loader

When compressed-tensors W4A8 weights are loaded, load_weights_int4 calls FusedMoEParallelConfig.flatten_tp_across_dp, but that helper does not exist (only flatten_tp_across_dp_and_pcp is defined). Hitting the W4A8 path will therefore raise an AttributeError before any weights are processed, making int4 compressed-tensors checkpoints unloadable.


@isharif168 force-pushed the handle_int4_weight_loading branch 3 times, most recently from 7bae249 to 5844ef1 on November 24, 2025, 16:11
- int4 gate, down and up weights are separated
- Combine the gate and up into single w13 tensors
- Also load the bias tensors
- Create a separate utility for loading int4 weights

Signed-off-by: Sharif Inamdar <[email protected]>
@isharif168 force-pushed the handle_int4_weight_loading branch from 5844ef1 to f259d75 on November 24, 2025, 16:45
@heheda12345 (Collaborator): CC @yewentao256

group0 = (qc or {}).get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
@nikhil-arm (Contributor) commented Nov 26, 2025
We have an API to check this here:

Would it be a good idea to re-use this API?

@isharif168 (Author) replied:

Good suggestion, but we can't use that API since it's private and tied to CompressedTensorsConfig. Here we are just reading the flags from the config file and making the decision, to keep things simpler.
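
For context, a quantization_config fragment of the shape this detection reads might look like the following (field values are illustrative, not taken from a real checkpoint):

qc = {
    "config_groups": {
        "group_0": {
            "weights": {"num_bits": 4, "type": "int"},
            "input_activations": {"num_bits": 8, "type": "int"},
        }
    }
}
group0 = (qc or {}).get("config_groups", {}).get("group_0", {})
w = group0.get("weights") or {}
ia = group0.get("input_activations") or {}
is_w4a8 = (w.get("num_bits") == 4) and (ia.get("num_bits") == 8)
assert is_w4a8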

tp_rank_start = tp_rank * per_rank_intermediate_size
tp_rank_end = min((tp_rank + 1) * per_rank_intermediate_size, intermediate_size)

# W4A8 detection (int4 weights, int8 activations)
@fadara01 (Contributor) commented Nov 26, 2025

Generally speaking, can we apply the logic that this PR applies in process_weights_after_loading instead of needing to change the modeling file? If not, why not?

@isharif168 (Author) replied:
No, we cannot do this in process_weights_after_loading, because the weights first have to be loaded into the w13_weights tensors before they can be processed later. Here we receive the gate, up, and down tensors separately, then fuse them and load the correct weights.
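
As a minimal sketch of what this means (tensor names and shapes are assumed for illustration, not the exact vLLM parameters): the per-expert gate and up tensors have to land in the fused w13 storage at load time, so that any post-load processing can operate on the complete fused tensor.

import torch

num_experts, intermediate_size, hidden_size = 4, 64, 32
# Fused storage, roughly as FusedMoE allocates it: gate rows first, then up rows.
w13_weight = torch.empty(num_experts, 2 * intermediate_size, hidden_size)

for expert_id in range(num_experts):
    # Per-expert unfused tensors as the checkpoint stores them (random stand-ins here).
    gate = torch.randn(intermediate_size, hidden_size)  # ...experts.{expert_id}.gate_proj.weight
    up = torch.randn(intermediate_size, hidden_size)    # ...experts.{expert_id}.up_proj.weight
    w13_weight[expert_id, :intermediate_size] = gate
    w13_weight[expert_id, intermediate_size:] = up

# Only once every expert has been fused in place can a later step such as
# process_weights_after_loading repack or requantize the whole w13 tensor.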

@fadara01 (Contributor) commented Nov 26, 2025

Thanks for your PR @isharif168.
The description currently assumes the reviewer understands why the points mentioned above are needed; could you please elaborate on each of them?
This also seems to depend on other PRs in llm-compressor. Could you link them here and explain how they are related and why these changes are needed?

loaded_params.add(name)
return loaded_params

def load_per_expert_unfused_w4a8(
Contributor:

It would be nice for us not to specialize these functions for specific quantization schemes. Why can't w4a8 be an argument to this function, instead of baking it into the name/implementation?

@isharif168 (Author) replied:
I think it's good to have a separate function per weight format and then call that specific function from load_weights_other; since other combinations may need different weight-loading schemes, we can add those in the future if needed.

@isharif168 (Author), quoting the earlier comment from @fadara01:

Thanks for your PR @isharif168 The description currently assumes that the reviewer understands why the points you mention above are needed, could you please elaborate more on each of the points? This also seems to depend on other PRs in llm_compressor, could you link this here? and generally explain how they're related and why these changes are needed.

Yes, I will make it more descriptive.

@isharif168 changed the title from "Add support to load int4 weights for CPU" to "Support compressed-tensors W4A8 MoE checkpoints in GptOssModel weight loader for CPU" on Nov 26, 2025
@isharif168 (Author) commented:

Hi @fadara01 @nikhil-arm, could you please review this once again?

Thanks.

@fadara01 (Contributor) commented Dec 1, 2025

@jeejeelee / @mgoin could you please take a look at this?

@isharif168 (Author) commented Dec 4, 2025

Hi @jeejeelee @mgoin,

Could you please review this change?

Here is a summary of the change and its dependencies:
-> For int4 support, the experts first need to be converted to linear layers so that quantization becomes easier. This is done with the llm-compressor change vllm-project/llm-compressor#2091, after which the model is quantized to int4 weights.

-> This change takes the separate (gate/up/down) weights, combines the gate and up weights into w13_weights (the format vLLM expects), and then loads the weights and biases (more details in the description).
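
A small illustration of the renaming and routing described above (the checkpoint key and the parsing below are assumptions for illustration, not the exact implementation):

# Hypothetical checkpoint key for expert 3's unfused gate projection.
hf_name = "model.layers.0.mlp.experts.experts.3.gate_proj.weight"

# Mirror the .mlp.experts.experts. -> .mlp.experts. mapping from hf_to_vllm_mapper.
vllm_name = hf_name.replace(".mlp.experts.experts.", ".mlp.experts.")
# -> "model.layers.0.mlp.experts.3.gate_proj.weight"

# The per-expert loader can then recover the expert index and projection name,
# and write this shard into the gate half of the fused w13 weight for expert 3.
expert_id = int(vllm_name.split(".experts.")[1].split(".")[0])
assert expert_id == 3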

Here is the sample output with int4

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI:

  1. Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress.
  2. Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science.
  3. Supportive design

We should provide a short answer.assistantfinalHere are three reasons to use AI:

  1. Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs.
  2. Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills.
  3. Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life.
    finish_reason: stop
    num_tokens: 226

Thanks.


Labels

gpt-oss Related to GPT-OSS models

Projects

Status: To Triage
