[tx] Implement expert parallelism #842
Conversation
Code Review
This pull request implements expert parallelism for Mixture-of-Experts (MoE) layers, which is a significant enhancement for model scaling. The changes are well-structured, introducing a clean expert_parallel_dispatch_combine utility that encapsulates the parallelism logic using jax.shard_map. The modifications to LoRAExpert and Qwen3Experts to support this are logical and correctly integrated. The necessary configuration and device mesh updates are also included. Overall, this is a solid implementation. I've provided a few suggestions to improve code readability and maintainability.
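For readers less familiar with the pattern the review describes, here is a minimal, self-contained sketch of the general idea: expert weights are sharded along an "ep" mesh axis while tokens stay replicated, so each device only runs its local slice of the experts. The function names and shapes below are hypothetical and are not the PR's actual expert_parallel_dispatch_combine.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

# Build a 1D mesh over all devices with an "ep" (expert-parallel) axis.
mesh = Mesh(jax.devices(), axis_names=("ep",))

def run_local_experts(tokens, expert_weights):
    # tokens: replicated (num_tokens, hidden)
    # expert_weights: local shard of shape (num_experts // ep_size, hidden, hidden)
    return jnp.einsum("th,ehf->etf", tokens, expert_weights)

expert_forward = jax.shard_map(
    run_local_experts,
    mesh=mesh,
    in_specs=(P(), P("ep")),  # tokens replicated, expert dim sharded on "ep"
    out_specs=P("ep"),        # per-expert outputs stay sharded along "ep"
)
```

In recent JAX releases this entry point is available as jax.shard_map; in older versions the same functionality lives under jax.experimental.shard_map.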
lora_A_reshaped = self.lora_A.value.reshape(num_flattened_groups, self.in_features, self.max_lora_rank)
lora_B_reshaped = self.lora_B.value.reshape(num_flattened_groups, self.max_lora_rank, self.out_features)
# Reshape LoRA weights in expert-first order (already local shards)
lora_A = self.lora_A.value.transpose((1, 0, 2, 3)).reshape(
I also tried dropping the transpose and just putting experts first in the weight tensors when initializing them -- the code is more complicated and also slower, e.g. 54s step time vs 40s with sl_loop.py and

uv run --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-30B-A3B --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 1, "expert_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

This is somewhat surprising and there might be more optimization potential here in the future, but for now it is best to keep things as simple as possible.
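As a toy illustration of the two layouts being compared (the shapes here are made up and much smaller than the real LoRA parameters):

```python
import jax.numpy as jnp

# Hypothetical small shapes: (num_adapters, num_experts, in_features, rank).
adapters, experts, in_features, rank = 2, 8, 16, 4

# Layout kept in the PR: experts are the second axis, so the forward pass
# transposes to expert-first before flattening the (expert, adapter) groups.
lora_A = jnp.zeros((adapters, experts, in_features, rank))
expert_first = lora_A.transpose((1, 0, 2, 3)).reshape(experts * adapters, in_features, rank)

# Alternative that was benchmarked: store the parameter in expert-first
# order up front, so the forward pass only needs a reshape.
lora_A_alt = jnp.zeros((experts, adapters, in_features, rank))
expert_first_alt = lora_A_alt.reshape(experts * adapters, in_features, rank)

assert expert_first.shape == expert_first_alt.shape == (experts * adapters, in_features, rank)
```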
/gemini review
Code Review
This pull request introduces expert parallelism for Mixture-of-Experts (MoE) models, a significant enhancement for training large-scale models. The implementation correctly uses jax.shard_map to distribute expert computations across devices, with a new shard_map_ep utility function and corresponding updates to model partitioning specs. The changes also account for LoRA weights in the expert-parallel setup. My review focuses on improving robustness and usability. I've suggested adding an assertion to prevent silent errors from incorrect expert distribution, refactoring duplicated test setup code into a pytest fixture for better maintainability, and enhancing the standalone training script to support the new parallelism feature.
def forward(experts, hidden_sorted, group_sizes, unsort_indices, adapter_sorted, routing_weights):
    # Calculate local offset for this shard
    ep_rank = jax.lax.axis_index("ep")
    experts_per_rank = num_experts // jax.lax.axis_size("ep")
This integer division assumes that num_experts is evenly divisible by the number of devices in the 'ep' mesh axis. If it's not, this could lead to an incorrect number of experts being assigned per rank, causing silent errors or incorrect model behavior. It would be much safer to add an assertion to validate this assumption, ideally during model initialization.
For example, you could add this check in Qwen3Experts.__init__:
ep_size = get_abstract_mesh().shape.get("ep", 1)
assert self.config.num_experts % ep_size == 0, f"Number of experts ({self.config.num_experts}) must be divisible by expert parallel size ({ep_size})."
skyrl-tx/tests/models/test_qwen3.py
Outdated
if not jax._src.xla_bridge.backends_are_initialized():
    jax.config.update("jax_num_cpu_devices", ep * tp)
This device initialization logic is duplicated in test_qwen3_moe_layer_lora (lines 117-118). To improve maintainability and reduce redundancy, consider extracting this setup into a shared pytest fixture. The fixture could automatically run for tests that require ep and tp parameters, making the test suite cleaner and easier to manage.
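One possible shape for such a fixture, sketched here with a made-up name and parametrization rather than the actual test code:

```python
import jax
import pytest

@pytest.fixture
def cpu_devices(request):
    # request.param is the total device count a test needs, e.g. ep * tp.
    if not jax._src.xla_bridge.backends_are_initialized():
        jax.config.update("jax_num_cpu_devices", request.param)
    return jax.devices()

# Tests opt in via indirect parametrization, e.g. 2 EP ranks x 4 TP ranks.
@pytest.mark.parametrize("cpu_devices", [8], indirect=True)
def test_moe_layer_runs(cpu_devices):
    assert len(cpu_devices) >= 1
```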
This PR implements expert parallelism with jax.shard_map. The tokens are already replicated on each rank from the previous layer, so we just filter the tokens that should be processed by each local expert using the group_offset feature implemented in #860.

Here are some performance results; they were generated on 8xH100 using

(or replacing expert_parallel_size with tensor_parallel_size for the TP case). First with the new PR and EP:

Then with the new PR and TP (much slower):

And last with TP on current main, to make sure there is no performance regression:
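For intuition on the token-filtering step mentioned in the description above, here is a simplified sketch of how each rank can determine which tokens belong to its local experts inside the shard_map region. This is only an illustration and does not reflect the actual group_offset implementation from #860.

```python
import jax
import jax.numpy as jnp

def local_expert_mask(expert_ids: jnp.ndarray, num_experts: int) -> jnp.ndarray:
    # Must be called inside shard_map/pmap with a named "ep" axis.
    ep_size = jax.lax.axis_size("ep")
    ep_rank = jax.lax.axis_index("ep")
    experts_per_rank = num_experts // ep_size  # assumes even divisibility
    group_offset = ep_rank * experts_per_rank
    # True for tokens routed to an expert owned by this rank.
    return (expert_ids >= group_offset) & (expert_ids < group_offset + experts_per_rank)
```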