[QUARK-402] Add Quark GLM4.7-MXFP4 support #223
Conversation
Force-pushed 8d605d9 to 03fff40
Pull request overview
This pull request adds support for Quark GLM4.7-MXFP4 quantization by implementing packed/merged module handling for layer-specific quantization exclusion. The changes enable proper handling of scenarios where users want to exclude specific component layers (e.g., gate_proj, up_proj) from quantization when they are packed into a single merged layer (e.g., gate_up_proj).
Changes:
- Added a `build_packed_components_mapping` utility function to create inverse mappings from packed parameter names to their component checkpoint weight names
- Extended the `should_ignore_layer` function to check whether any component of a packed module should be excluded from quantization
- Added a `prefix` parameter to the `ColumnParallelLinear`, `MergedColumnParallelLinear`, `QKVParallelLinear`, and `RowParallelLinear` classes to enable per-layer quantization config evaluation
- Added a `packed_components` field to `QuantizationConfig` to store the inverse mapping
- Implemented `build_inverse_mapping` in `ModelRunner` to populate `packed_components` before model instantiation
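The inverse-mapping idea above can be sketched as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the exact signatures of `build_packed_components_mapping` and `should_ignore_layer` in `atom/models/utils.py` may differ, and the sample `{"gate_up_proj": ["gate_proj", "up_proj"]}` mapping is an assumed vLLM-style convention.

```python
# Hypothetical sketch (not the PR's exact code) of the packed-components
# inverse mapping and the extended ignore check described above.

def build_packed_components_mapping(packed_modules_mapping):
    """Invert {packed_name: [component names]} into {component: packed_name}."""
    inverse = {}
    for packed_name, components in packed_modules_mapping.items():
        for component in components:
            inverse[component] = packed_name
    return inverse

def should_ignore_layer(prefix, ignored_layers, packed_components):
    """Exclude a layer if it, or any component packed into it, is ignored."""
    # Direct match on the full prefix, e.g. "model.layers.0.mlp.down_proj".
    if any(prefix.endswith(name) for name in ignored_layers):
        return True
    # If the last path segment is a packed module (e.g. "gate_up_proj"),
    # exclude it when any of its components (e.g. "gate_proj") is ignored.
    last = prefix.rsplit(".", 1)[-1]
    for component, packed_name in packed_components.items():
        if packed_name == last and any(
            layer.endswith(component) for layer in ignored_layers
        ):
            return True
    return False

mapping = build_packed_components_mapping(
    {"gate_up_proj": ["gate_proj", "up_proj"]}
)
print(mapping)  # {'gate_proj': 'gate_up_proj', 'up_proj': 'gate_up_proj'}
print(should_ignore_layer("model.layers.0.mlp.gate_up_proj",
                          ["gate_proj"], mapping))  # True
```

This makes the merged `gate_up_proj` layer fall back to unquantized execution whenever the checkpoint's exclusion list names either `gate_proj` or `up_proj`, which is the scenario the PR targets.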
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| atom/models/utils.py | Added build_packed_components_mapping function and extended should_ignore_layer to handle packed modules |
| atom/model_ops/linear.py | Added prefix parameter to ColumnParallelLinear, MergedColumnParallelLinear, QKVParallelLinear, and RowParallelLinear for layer-specific quantization handling |
| atom/model_engine/model_runner.py | Added build_inverse_mapping method to build packed components mapping before model initialization |
| atom/config.py | Added packed_components field to QuantizationConfig |
Hi, @thpereir, could you post the commands you used for testing?
To serve, I used:
To run lm-eval:
Force-pushed 03fff40 to fb84a80
@haoyangli0109 made more changes to fix the issues with
Force-pushed fb84a80 to 227ea42
Force-pushed 227ea42 to d176130
Force-pushed d176130 to 8de79f0
- TP4 weight loading crash (`moe.py` `_load_w13`/`_load_w2`): derived shard sizes from `loaded_weight.shape` instead of the padded `expert_data.shape` to handle MXFP4 padding (384 → 512).
- `num_sms()` returning `None` on ROCm (`triton_kernels/target_info.py`): added `or is_hip()` to the CUDA branch.
- Custom routing for grouped top-k + sigmoid (`fused_moe_triton.py`): added a `routing_from_topk()` bridge function, since `triton_kernels.routing.routing()` only supports softmax + basic top-k. Modified `Mxfp4MoEMethod.apply()` to use `FusedMoE.select_experts` for routing, with the Triton `matmul_ogs` for compute.
- Uninitialized bias causing NaN (`glm4_moe.py`): `FusedMoE` defaulted to `has_bias=True`, creating `torch.empty` bias tensors that were never loaded (GLM-4.7 has no expert biases). Fixed with `has_bias=getattr(config, "moe_ffn_bias", False)`.
- Fused SwiGLU activation mismatch (`fused_moe_triton.py`), the final fix:
  - `triton_kernels`' `swiglu_fn` expects an interleaved `[gate0, up0, gate1, up1, ...]` layout, but the w13 weights produce a concatenated `[gate | up]` layout.
  - It also uses the non-standard `s * sigmoid(1.702 * s) * (linear + 1)` instead of the standard `silu(gate) * up`.
  - Fix: bypassed the fused SwiGLU; run `matmul_ogs` without activation, then manually apply `F.silu(gate) * up` on the concatenated output.
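The manual-activation workaround for the fused SwiGLU mismatch can be sketched like this. This is an illustrative sketch, not the PR's exact code: the helper name `swiglu_concat` is made up here, and it assumes the MoE matmul output arrives as a concatenated `[gate | up]` tensor with no activation applied.

```python
# Sketch of the manual SwiGLU fix described above (names are illustrative).
# triton_kernels' fused swiglu_fn assumes interleaved [gate0, up0, ...] and
# applies x * sigmoid(1.702 * x) * (linear + 1); the w13 weights here yield a
# concatenated [gate | up] layout and the model wants standard SiLU, so the
# activation is applied manually on the un-activated matmul output instead.
import torch
import torch.nn.functional as F

def swiglu_concat(h: torch.Tensor) -> torch.Tensor:
    """Standard SwiGLU on a [..., 2 * d] tensor laid out as [gate | up]."""
    gate, up = h.chunk(2, dim=-1)   # split the concatenated halves
    return F.silu(gate) * up        # standard silu(gate) * up

h = torch.randn(4, 8)               # stand-in for matmul output, no activation
out = swiglu_concat(h)
assert out.shape == (4, 4)          # hidden size halves after gating
```

Running the matmul without a fused activation and applying `F.silu(gate) * up` afterward trades a small amount of extra memory traffic for correctness with the concatenated weight layout.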
Force-pushed 8de79f0 to 0e9e395
@valarLip I've made the requested changes and rebased the branch; this is ready for review.
Motivation
Technical Details
Test Plan
Test Result
Server:
lm-eval
GSM 8k accuracy
Submission Checklist