Qwen3-Next Gated FullAttention Implementation #2529
Conversation
Thanks for the great work! LGTM at high level.
Let's try to keep the Qwen3 decoder layer simple, calling the needed functions from embeddings, normalizations, attentions, etc.
A few minor comments.
Oh, I think you will need to squash those 26 commits into 1.
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This Pull Request introduces the Qwen3-Next Gated FullAttention implementation, including custom RMSNorm and partial Rotary Embedding. The changes are well-structured, with new components moved to appropriate layers and a comprehensive test added for verification.
🔍 General Feedback
- The refactoring of `Qwen3NextRMSNorm` and `Qwen3NextRMSNormGated` to `normalizations.py` is a good improvement for modularity.
- The addition of `test_full_attention_jax_vs_pytorch_attention` is crucial for ensuring correctness and alignment with the PyTorch reference.
- Some minor improvements in code clarity and consistency in variable naming and docstrings have been suggested.
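For context, a gated RMSNorm of the kind `Qwen3NextRMSNormGated` describes typically normalizes the hidden states and then modulates them with an activated gate. A minimal JAX sketch follows; the function name, the silu gating, and the scale placement are assumptions based on common gated-norm variants, not taken from this PR:

```python
import jax
import jax.numpy as jnp

def rms_norm_gated(x, gate, scale, eps=1e-6):
  # Standard RMS normalization over the feature axis, computed in float32.
  var = jnp.mean(jnp.square(x.astype(jnp.float32)), axis=-1, keepdims=True)
  normed = x * jax.lax.rsqrt(var + eps)
  # Gated variant: modulate the normalized activations with an activated gate
  # before applying the learned scale (silu is one common choice).
  return normed * jax.nn.silu(gate) * scale
```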
Force-pushed from 2cecb2e to f853c15
Thank you! LGTM.
Force-pushed from a8ef44a to 4b30745
🤖 Hi @Rohan-Bierneni, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This Pull Request introduces the Qwen3-Next Gated FullAttention implementation, including custom RMSNorm and partial Rotary Embedding. The changes are well-structured, integrating new components into existing layers and configurations. The addition of a comprehensive test case comparing JAX and PyTorch implementations is a significant improvement for verifying correctness.
🔍 General Feedback
- The refactoring of normalization classes into `normalizations.py` enhances modularity and reusability.
- The detailed docstrings for new classes and functions are highly beneficial for understanding the Qwen3-Next specific implementations.
- The validation logic in `pyconfig.py` for `partial_rotary_factor` ensures proper configuration usage.
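A plausible shape for that validation, as a sketch only (the function name and error messages are hypothetical; the actual check in `pyconfig.py` may differ):

```python
def validate_partial_rotary_factor(config):
  # partial_rotary_factor controls what fraction of head_dim receives RoPE;
  # Qwen3-Next uses 0.25. It must yield a valid, even rotary dimension.
  factor = config.partial_rotary_factor
  if not 0.0 < factor <= 1.0:
    raise ValueError(f"partial_rotary_factor must be in (0, 1], got {factor}")
  rotary_dim = int(config.head_dim * factor)
  if rotary_dim % 2 != 0:
    raise ValueError(f"rotary_dim ({rotary_dim}) must be even to pair channels for rotation")
```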
Awesome work! LGTM
Force-pushed from 4195b97 to 53ccaa6
- Ported some of the pytorch ref functions
- Added all test code and verified testcase passes
- Removed caching logic and debug statements
- Fixed testcase and jax gating logic
- Resolved scaling factor adjustment
- Remove debug statements
- move partial rope logic to embeddings.py
- Moved partial rope logic to embeddings.py
- remove old partial rope code
- Resolved comments from pr review
- Removed qwen3rmsnorm function from qwen3.py
- Removed initialization for using Attention()
- Qwen3NextFullAttention working with Attention() instead of attention_op()
- resolved some comments from pr related to Qwen3NextRMSNorm
- Cleaned up code and now works with Attention() integration
- Add pyconfig check for rotary_dim
- Change Qwen3NextRMSNorm to match base RMSNorm impl
- Fixed bug with running maxtext train command with qwen3 next
- Updated pytorch partial ROPE impl for unit test
- Fix indentation
- Fixed failing qwen3nextrmsnorm tests
- Update Qwen3NextRMSNormGated to also use scale for checkpointing
- Remove debug statements, now all tests pass for rebase
- Resolved gemini-code-review bot comments
- Fixed nit comments based on review
- Undo commented out code for jax 0.7.0 compatability
- Run linter
- Fixed pyink error in embeddings.py
- Use nnx.data to wrap rmsnorm in qwen3nextrmsnorm
- Add qwen3 next flash attention test
- Remove skip_jax_distributed_system flag
- Add sharding for 4 devices
- Update ici fsdp param
- Update tpu sharding params
- revert test code
- increase batch size
- Try with dot_product
- try with relaxed atol rtol
- Update with dot product & flash attention tests
- add condition rtol & atol
- Create new jax pyconfig based on attention_type
- convert to helper function so pytest doesn't pick it up
Force-pushed from f7551ea to ee4b38a
Thank you!
Manually adding the pull ready tag since all tests pass, there are 3 approvals, and all comments are resolved. There seems to be a bug with a skipped check causing the tag to not be added.
Description
This PR adds the Qwen3-Next Gated Full Attention implementation to the existing Qwen3-Next code in qwen3.py.
The current implementation in MaxText uses the standard Attention, but Qwen3-Next requires attention with two slight tweaks: an output gate and partial RoPE applied to 25% of head_dim. This PR adds that functionality by building a custom attention class for Qwen3-Next on top of AttentionOp for the core attention calculation.
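For illustration, here is a minimal JAX sketch of those two tweaks. All names here (`apply_partial_rope`, `gated_attention_output`) and the cos/sin layout are assumptions for exposition, not the identifiers used in qwen3.py:

```python
import jax.numpy as jnp
from jax import nn

def apply_partial_rope(x, cos, sin, partial_rotary_factor=0.25):
  # Apply rotary embeddings to only the first `rotary_dim` channels of each
  # head (25% of head_dim for Qwen3-Next); the remainder passes through.
  # cos/sin are assumed broadcastable to the rotary slice's shape.
  rotary_dim = int(x.shape[-1] * partial_rotary_factor)
  x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
  half = rotary_dim // 2
  x1, x2 = x_rot[..., :half], x_rot[..., half:]
  rotated = jnp.concatenate([-x2, x1], axis=-1)  # rotate_half convention
  x_rot = x_rot * cos + rotated * sin
  return jnp.concatenate([x_rot, x_pass], axis=-1)

def gated_attention_output(attn_out, gate):
  # Output gate: elementwise sigmoid gate on the attention output,
  # applied before the final output projection.
  return attn_out * nn.sigmoid(gate)
```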
FIXES: b/448407748
Tests
I have added a test case that compares the PyTorch reference to the JAX implementation, checking the output tensors after the gated full-attention layer.
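The comparison presumably reduces to an elementwise tolerance check between the two outputs; a sketch of the pattern (the helper name and tolerances are illustrative, not taken from the test file):

```python
import numpy as np

def assert_outputs_match(jax_out, torch_out, rtol=1e-3, atol=1e-3):
  # Compare the JAX layer output against the PyTorch reference tensor.
  np.testing.assert_allclose(
      np.asarray(jax_out), torch_out.detach().cpu().numpy(), rtol=rtol, atol=atol
  )
```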
This test passes after running this command:
`pytest -vvs tests/check_qwen3_next_vs_reference.py::TestQwen3Next::test_full_attention_jax_vs_pytorch`
Logs: https://paste.googleplex.com/5626786360721408
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- `gemini-review` label.