Fix QMoE CPU Operator #27360

Open
tianleiwu wants to merge 1 commit into main from tlwu/20260216/fix_qmoe_cpu
Conversation

@tianleiwu (Contributor)

This PR addresses several issues in the QMoE CPU implementation and improves the MLAS documentation.

Changes

1. QMoE CPU Operator Fixes

  • Corrected Bias Handling: Renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas and updated the logic to consistently track whether FC2 bias has been applied. This ensures that bias is not double-counted or missed when using DirectQ4Gemm.
  • SwiGLU Attribute Update: Switched from swiglu_interleaved to swiglu_fusion in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation standards.
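The bias-tracking fix above can be sketched as follows. This is a minimal Python illustration of the intended control flow, not the C++ implementation; `direct_q4_gemm` and `plain_gemm` are hypothetical stand-ins for the MLAS kernels, and the point is that the `fc2_bias_added_by_mlas` flag guarantees the bias is applied exactly once.

```python
def direct_q4_gemm(x, w_cols, bias):
    # Stand-in for MLAS DirectQ4Gemm: GEMM with the bias fused in.
    return [sum(xi * wi for xi, wi in zip(x, col)) + b
            for col, b in zip(w_cols, bias)]

def plain_gemm(x, w_cols):
    # Stand-in for the fallback GEMM path: no bias applied.
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w_cols]

def fc2(x, w_cols, bias, use_direct_q4):
    fc2_bias_added_by_mlas = False
    if use_direct_q4:
        y = direct_q4_gemm(x, w_cols, bias)
        fc2_bias_added_by_mlas = True
    else:
        y = plain_gemm(x, w_cols)
    # Add the bias exactly once: skip it when MLAS already fused it in.
    if bias is not None and not fc2_bias_added_by_mlas:
        y = [yi + b for yi, b in zip(y, bias)]
    return y
```

Both paths now produce the same output for the same inputs, which is exactly the property the rename and logic update are meant to guarantee.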

2. MLAS Documentation

  • Clarified Buffer Shapes: Added explicit documentation to MlasQ4GemmPackB to specify that the input FpData buffer expects a shape of [K, N]. This helps prevent layout-related errors in future integrations.
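The layout requirement can be illustrated with a short NumPy sketch (dimensions chosen arbitrarily for illustration): weights stored row-major as [N, K] must be transposed to [K, N] before being handed to a packing routine that expects the [K, N] layout, as the updated MlasQ4GemmPackB documentation specifies.

```python
import numpy as np

# Hypothetical dims for illustration: N output columns, K inner dimension.
N, K = 4, 8
w_nk = np.arange(N * K, dtype=np.float32).reshape(N, K)  # stored as [N, K]
w_kn = np.ascontiguousarray(w_nk.T)                      # [K, N] for packing
```

`ascontiguousarray` matters here: a plain `.T` is only a view, while a packing routine reading the buffer linearly needs the transposed data to be physically contiguous.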

3. Test Updates

  • PyTorch Parity Fixes: Refactored onnxruntime/test/python/transformers/test_qmoe_cpu.py to use swiglu_fusion and improved the test structure for better parity checks with PyTorch.
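As a rough sketch of what the test's activation computes, here is a generic SwiGLU in NumPy. The exact semantics of the swiglu_fusion attribute and any scaling or clamping in the real operator are assumptions here; this only illustrates the two common layouts of the fc1 output (interleaved gate/linear pairs versus concatenated halves) that a fusion flag typically selects between.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(fc1_out, interleaved):
    # fc1 produces 2*d features: a gate half and a linear ("up") half.
    d = fc1_out.shape[-1] // 2
    if interleaved:                      # [g0, u0, g1, u1, ...]
        gate, up = fc1_out[..., 0::2], fc1_out[..., 1::2]
    else:                                # [g0, ..., g_{d-1}, u0, ..., u_{d-1}]
        gate, up = fc1_out[..., :d], fc1_out[..., d:]
    return silu(gate) * up
```

Reordering the same values between the two layouts should leave the result unchanged, which is the kind of invariant the parity tests can check.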

Verification

  • Verified by running test_qmoe_cpu.py to ensure all QMoE parity tests pass on CPU.

Copilot AI left a comment

Pull request overview

This PR fixes issues in the QMoE CPU operator implementation, specifically correcting bias handling logic and updating attribute naming to match the actual C++ implementation. The changes also improve MLAS documentation for better clarity on input buffer layout requirements.

Changes:

  • Fixed FC2 bias handling in QMoE CPU operator by tracking when MLAS DirectQ4Gemm adds bias
  • Added transpose logic to convert weight matrices from [N, K] to [K, N] layout required by MlasQ4GemmPackB
  • Updated Python tests to use swiglu_fusion attribute instead of incorrect swiglu_interleaved attribute
  • Enhanced MLAS documentation to clarify that MlasQ4GemmPackB expects FpData with shape [K, N]
  • Added proper bias collection and passing in Python test infrastructure

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Files reviewed:

  • onnxruntime/core/mlas/inc/mlas_q4.h: Updated documentation for MlasQ4GemmPackB to clarify the FpData shape [K, N] and parameter meanings.
  • onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc: Added transpose logic for weight matrices, renamed fc2_bias_handled_by_q4_gemm to fc2_bias_added_by_mlas, and removed the unused fc1_used_direct_q4 flag.
  • onnxruntime/test/python/transformers/test_qmoe_cpu.py: Migrated from swiglu_interleaved to swiglu_fusion, added bias collection/passing logic, updated the swiglu function signature, and improved weight interleaving for swiglu_fusion=1.


