skyzh
diff --git a/batch-main.py b/batch-main.py
Lines changed: 14 additions & 2 deletions
diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md
Lines changed: 2 additions & 0 deletions
diff --git a/book/src/week2-02-quantized-matmul.md b/book/src/week2-02-quantized-matmul.md
Lines changed: 3 additions & 1 deletion
diff --git a/book/src/week3-03-moe.md b/book/src/week3-03-moe.md
Lines changed: 294 additions & 0 deletions
@@ -35,9 +35,11 @@
 random.shuffle(prompts)
 
 parser.add_argument("--solution", type=str, default="tiny_llm")
+parser.add_argument("--loader", type=str, choices=["week2", "week3"], default="week2")
 parser.add_argument("--device", type=str, default="gpu")
 parser.add_argument("--batch-size", type=int, default=5)
 parser.add_argument("--prefill-step", type=int, default=128)
+parser.add_argument("--max-seq-len", type=int, default=512)
 parser.add_argument("--enable-flash-attn", action="store_true")
 parser.add_argument("--enable-thinking", action="store_true")
 args = parser.parse_args()
@@ -57,11 +59,20 @@
 mlx_model, tokenizer = load(args.model)
 
 with mx.stream(mx.gpu if args.device == "gpu" else mx.cpu):
+    dispatch_kwargs = {}
+    if args.loader == "week2":
+        dispatch_kwargs["enable_flash_attn"] = args.enable_flash_attn
+    elif args.enable_flash_attn:
+        print("--enable-flash-attn is only used by the week2 loader; ignoring it")
+
     print(
-        f"Using week2 loader with flash_attn={args.enable_flash_attn} thinking={args.enable_thinking} for {args.model}"
+        f"Using {args.loader} loader with thinking={args.enable_thinking} for {args.model}"
     )
     tiny_llm_model = models.dispatch_model(
-        args.model, mlx_model, week=2, enable_flash_attn=args.enable_flash_attn
+        args.model,
+        mlx_model,
+        week=int(args.loader.removeprefix("week")),
+        **dispatch_kwargs,
     )
     encoded_prompts = []
     for idx, prompt in enumerate(prompts):
@@ -81,6 +92,7 @@
         tiny_llm_model,
         tokenizer,
         encoded_prompts,
+        max_seq_len=args.max_seq_len,
         batch_size=args.batch_size,
         prefill_step=args.prefill_step,
     )
 
@@ -21,6 +21,8 @@
 - [Week 3: Serving]()
     - [Paged Attention, Part 1](./week3-01-paged-attention-part1.md)
     - [Paged Attention, Part 2](./week3-02-paged-attention-part2.md)
+    - [Mixture of Experts](./week3-03-moe.md)
+    - [Extended: Profiling](./week3-04-profiling.md)
 
 ---
 
 
@@ -321,7 +321,9 @@ src/tiny_llm/qwen3_week2.py
 
 Integrate your quantized matmul into the Week 2 Qwen3 model so that inference runs on quantized weights end-to-end.
 
-Change the weight type from `mx.array` to `QuantizedWeights` for all linear layers in attention (`wq/wk/wv/wo`) and MLP (`w_gate/w_up/w_down`). Replace every `linear(x, w)` call with `quantized_linear(x, w)`. In the model loading code, use `QuantizedWeights.from_mlx_layer(...)` to extract quantized weight information from each MLX linear layer, instead of calling `mx.dequantize` to get a full 16-bit matrix. Make sure the Week 1 loader still dequantizes (since Week 1 layers expect plain `mx.array`), while the Week 2 loader does **not** dequantize.
+Change the weight type from `mx.array` to `QuantizedWeights` for all linear layers in attention (`wq/wk/wv/wo`) and MLP (`w_gate/w_up/w_down`). Replace every `linear(x, w)` call with `quantized_linear(x, w)`. In the model loading code, use `QuantizedWeights.from_mlx_layer(...)` to extract quantized weight information from each MLX linear layer, instead of calling `mx.dequantize` to get a full 16-bit matrix. Make sure the Week 1 loader still dequantizes these projection weights (since Week 1 layers expect plain `mx.array`), while the Week 2 loader keeps them quantized.
+
+The input embedding is the main exception. `embed_tokens(input_ids)` is a row lookup, not a matrix multiplication, so it is not the operator implemented by `quantized_matmul`. For Week 2, first keep the input embedding on the existing `Embedding` path and focus quantized matmul on projection layers. If the model has a separate `lm_head`, that head is a normal linear projection and should use `quantized_linear`. If output weights are tied, `embedding.as_linear(h)` is the projection side of the embedding table; a later optimization can keep that table quantized, use `mx.quantized_matmul` for `as_linear`, and dequantize only the selected rows during lookup.
 
 Qwen3 MLX quantized layers may use **float16** or **bfloat16** for the tensors involved in dequantization. Your kernel should accept `scales`, `biases`, and activations in either dtype, require them to match, and return the same dtype. If you see `nan` or garbage output, a dtype mismatch is the most likely cause.
 
 
@@ -0,0 +1,294 @@
+# Week 3 Day 3: Mixture of Experts
+
+In this chapter, we will implement the **Mixture of Experts** (**MoE**)
+feed-forward block for the Qwen3 family.
+
+So far, every transformer block in tiny-llm has used the same dense Qwen3 MLP:
+
+```plain
+x -> gate_proj
+x -> up_proj
+SiLU(gate_proj(x)) * up_proj(x) -> down_proj
+```
+
+That is a SwiGLU MLP. Every token visits the same weights.
+
+MoE changes only the feed-forward half of the transformer block. Instead of one
+dense MLP, the model owns many expert MLPs. A small router chooses which experts
+each token should use:
+
+```plain
+token hidden state -> router -> top-k experts -> weighted expert outputs
+```
+
+The attention path does not change, and neither does the KV cache. All of the
+sparse work happens inside the MLP half of the block.
+
+**Readings**
+
+- [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538)
+- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
+- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
+
+## Dense MLP vs MoE MLP
+
+The dense Qwen3 MLP from Week 1 has one set of weights:
+
+```plain
+w_gate: hidden_dim, dim
+w_up:   hidden_dim, dim
+w_down: dim, hidden_dim
+```
+
+A Qwen3-MoE sparse block has a bank of those weights:
+
+```plain
+expert_gate: num_experts, moe_hidden_dim, dim
+expert_up:   num_experts, moe_hidden_dim, dim
+expert_down: num_experts, dim, moe_hidden_dim
+```
+
+The router produces one score per expert:
+
+```plain
+router_logits: B, L, num_experts
+router_probs:  softmax(router_logits)
+```
+
+Then the model picks `num_experts_per_tok` experts for each token:
+
+```plain
+expert_ids:    B, L, num_experts_per_tok
+expert_scores: B, L, num_experts_per_tok
+```
+
+For each token, only those selected experts run. Their outputs are weighted and
+summed:
+
+```plain
+output[token] = sum(score_i * expert_i(token))
+```
+
+That is the central MoE idea: the model can contain many parameters, but each
+token activates only a small subset of them.
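To make the "many parameters, few active" point concrete, here is a small arithmetic sketch. The sizes are made-up illustrative values, not the real Qwen3-MoE configuration, and the count covers only the expert weights (it ignores the router and attention).

```python
# Illustrative sizes only; these are NOT the real Qwen3-MoE configuration.
dim = 16                 # model hidden size (D)
moe_hidden_dim = 32      # per-expert intermediate size
num_experts = 8          # E
num_experts_per_tok = 2  # k

# One SwiGLU expert owns three matrices: gate, up, and down.
params_per_expert = 3 * dim * moe_hidden_dim

# Parameters stored in the expert bank vs. parameters one token actually touches.
total_expert_params = num_experts * params_per_expert
active_expert_params = num_experts_per_tok * params_per_expert

print(total_expert_params, active_expert_params)
print(active_expert_params / total_expert_params)
```

With these toy numbers each token touches a quarter of the expert parameters, which is simply `num_experts_per_tok / num_experts`.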
+
+## Qwen3-MoE Shape
+
+Qwen3-MoE keeps the same attention structure as Qwen3, including QK norm, GQA,
+RoPE, and the same KV cache interface. It replaces some dense MLP layers with a
+sparse MoE block.
+
+The useful pieces are:
+
+- `gate`: a router linear layer from hidden size to `num_experts`
+- `switch_mlp`: many SwiGLU experts with `moe_intermediate_size`
+- `num_experts_per_tok`: how many experts a token uses
+- `norm_topk_prob`: whether selected expert scores are renormalized
+- `decoder_sparse_step` and `mlp_only_layers`: which layers are sparse vs dense
+
+There is no shared expert in the Qwen3-MoE block we are following. The sparse
+feed-forward output is just the weighted top-k expert mixture.
+
+## The MLX Primitive
+
+MLX does not give us a single high-level MoE block in `mlx.nn`. The relevant
+primitive for this chapter is `mx.gather_qmm`: it performs quantized matrix
+multiplication while selecting a different matrix for each row.
+
+For MoE, that means:
+
+```plain
+token rows:  N, D
+expert ids:  N
+weights:     E, O, D packed as 4-bit QuantizedWeights
+output:      N, O
+```
+
+The row with `expert_ids[i] = e` should multiply by `weights[e]`.
+
+When the expert ids are sorted, pass `sorted_indices=True`. Keep the inverse
+order from the sort so the result can be restored to the original token order.
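The per-row semantics can be written down as a slow reference, which may be handy when checking a grouped implementation on tiny inputs. This is a pure-Python sketch over plain, unquantized lists; `gather_matmul_reference` is a hypothetical helper name and it does not call `mx.gather_qmm`.

```python
def gather_matmul_reference(rows, weights, expert_ids):
    """rows: N x D, weights: E x O x D, expert_ids: length N. Returns N x O."""
    out = []
    for row, e in zip(rows, expert_ids):
        # weights[e] is O x D, so each output entry is a dot product
        # between the token row and one row of the selected expert matrix.
        out.append([sum(w * x for w, x in zip(w_row, row)) for w_row in weights[e]])
    return out


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # expert 0: O=3, D=2
    [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]],  # expert 1: O=3, D=2
]
out = gather_matmul_reference(rows, weights, expert_ids=[1, 0, 1])
print(out)
```

Rows 0 and 2 use expert 1 and row 1 uses expert 0, so the output keeps the original row order regardless of which expert each row selected.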
+
+## Router Step
+
+The router is just a quantized linear layer:
+
+```python
+router_logits = quantized_linear(x, w_router)
+router_probs = softmax(router_logits, axis=-1)
+```
+
+For a batch of tokens:
+
+```plain
+x:             B, L, D
+router_logits: B, L, E
+router_probs:  B, L, E
+```
+
+where `E = num_experts`.
+
+Qwen3-MoE then uses top-k selection:
+
+```python
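+# k = num_experts_per_tok; negate so the largest probabilities come first.
+expert_ids = mx.argpartition(-router_probs, kth=k - 1, axis=-1)[..., :k]
+expert_scores = mx.take_along_axis(router_probs, expert_ids, axis=-1)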
+```
+
+If `norm_topk_prob` is true, renormalize `expert_scores` so the selected scores
+sum to 1 for each token.
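The routing math for a single token can be sketched in pure Python. This is a reference under simplifying assumptions: plain lists instead of MLX arrays, one token at a time, and a hypothetical helper name `route_topk_reference`.

```python
import math


def route_topk_reference(logits, k, norm_topk_prob=True):
    """Route one token: softmax over experts, keep the k largest, optionally renormalize."""
    # Numerically stable softmax over the expert axis.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities (ties go to the lower index).
    expert_ids = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    expert_scores = [probs[i] for i in expert_ids]

    if norm_topk_prob:
        # Rescale so the kept scores sum to 1 for this token.
        kept = sum(expert_scores)
        expert_scores = [s / kept for s in expert_scores]
    return probs, expert_ids, expert_scores


probs, expert_ids, expert_scores = route_topk_reference([2.0, 0.5, 1.0, -1.0], k=2)
print(expert_ids, expert_scores)
```

Unlike this sketch, a partition-based selection does not promise any particular order among the selected experts. That is fine for MoE because the weighted sum does not depend on the order.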
+
+## Expert Step
+
+Each expert is the same kind of SwiGLU MLP we already know:
+
+```plain
+expert(x) = down_proj(SiLU(gate_proj(x)) * up_proj(x))
+```
+
+The implementation should build token-expert jobs, group them by expert, and run
+the expert projections with `mx.gather_qmm`:
+
+```plain
+selected expert ids -> expanded token-expert rows
+expanded rows -> sort/group by expert id
+grouped expert rows -> grouped gate/up projection
+SiLU(gate) * up -> grouped down projection
+restore original token/top-k order -> weighted sum
+```
+
+The reorder is part of the model implementation. It keeps all token rows for the
+same expert contiguous so the expert bank can be applied with grouped matrix
+multiplication.
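The expand, group, run, restore, and combine steps above can be sketched end to end on toy data. This is a pure-Python illustration, not the MLX implementation: tokens are single floats, the experts are simple stand-in functions instead of SwiGLU MLPs, and `moe_reference` is a hypothetical name.

```python
def moe_reference(tokens, expert_ids, expert_scores, experts):
    """tokens: list of floats; expert_ids / expert_scores: per-token lists of length k;
    experts: list of callables standing in for SwiGLU expert MLPs."""
    k = len(expert_ids[0])

    # 1. Expand into token-expert jobs: one (token index, expert id) pair per selection.
    jobs = [(t, e) for t, ids in enumerate(expert_ids) for e in ids]

    # 2. Sort job positions by expert id so each expert's rows are contiguous.
    order = sorted(range(len(jobs)), key=lambda j: jobs[j][1])

    # 3. Run the experts over the grouped rows.
    grouped_out = [experts[jobs[j][1]](tokens[jobs[j][0]]) for j in order]

    # 4. Restore the original job order with the inverse permutation.
    restored = [0.0] * len(jobs)
    for sorted_pos, j in enumerate(order):
        restored[j] = grouped_out[sorted_pos]

    # 5. Weighted sum over the k experts chosen for each token.
    return [
        sum(expert_scores[t][i] * restored[t * k + i] for i in range(k))
        for t in range(len(tokens))
    ]


experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
out = moe_reference(
    tokens=[1.0, 2.0],
    expert_ids=[[2, 0], [1, 2]],
    expert_scores=[[0.25, 0.75], [0.5, 0.5]],
    experts=experts,
)
print(out)
```

Step 4 is the part that is easy to forget: the scores are stored in the original token/top-k order, so the expert outputs have to be put back in that order before the weighted sum.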
+
+## Task 1: Grouped Expert Linear
+
+```
+src/tiny_llm/moe.py
+```
+
+Implement `grouped_expert_linear`. This is the MLX-shaped core of MoE.
+
+The function accepts:
+
+```plain
+x:           ..., D
+w_experts:   QuantizedWeights for num_experts, output_dim, D
+expert_ids:  ...
+```
+
+It returns:
+
+```plain
+out:         ..., output_dim
+```
+
+The implementation should:
+
+```plain
+1. flatten token rows and expert ids,
+2. sort rows by expert id,
+3. call mx.gather_qmm with sorted_indices=True,
+4. restore the original order.
+```
+
+For the grouped matmul, the call should look like:
+
+```python
+out = mx.gather_qmm(
+    mx.expand_dims(grouped_rows, -2),
+    w_experts.weight,
+    w_experts.scales,
+    w_experts.biases,
+    lhs_indices=mx.arange(grouped_rows.shape[0]),
+    rhs_indices=grouped_expert_ids,
+    transpose=True,
+    group_size=w_experts.group_size,
+    bits=w_experts.bits,
+    mode=w_experts.mode,
+    sorted_indices=True,
+).squeeze(-2)
+```
+
+This task mirrors `QuantizedSwitchLinear` in `mlx-lm`: each
+token row uses a different packed expert matrix, and the expert ids choose the
+right matrix.
+
+## Task 2: Router Top-k
+
+```
+src/tiny_llm/moe.py
+```
+
+Implement `route_topk`. It accepts hidden states and router weights, then
+returns:
+
+- router probabilities
+- selected expert ids
+- selected expert scores
+
+Use `quantized_linear` and `softmax`. Use `mx.argpartition` to select the top
+`num_experts_per_tok` experts, then `mx.take_along_axis` to gather their scores.
+
+Keep `norm_topk_prob` as an argument because Qwen3-MoE stores this behavior in
+the model config.
+
+## Task 3: Qwen3 Sparse MoE Block
+
+```
+src/tiny_llm/moe.py
+```
+
+Implement `Moe` by composing Task 1 and Task 2:
+
+```plain
+hidden states -> route_topk
+hidden states + expert ids -> grouped gate projection
+hidden states + expert ids -> grouped up projection
+SiLU(gate) * up -> grouped down projection
+weighted sum over num_experts_per_tok
+```
+
+This completes the Qwen3-MoE sparse feed-forward block. There is no shared expert
+branch in this block.
+
+## Task 4: Integrate Qwen3-MoE Layers
+
+```
+src/tiny_llm/qwen3_week3.py
+src/tiny_llm/models.py
+```
+
+Add a Qwen3-MoE loader path that reuses the Week 3 Qwen3 attention and paged KV
+cache behavior, but swaps selected block MLPs for `Moe`.
+
+The model wrapper should:
+
+- keep Qwen3 attention unchanged,
+- use regular `Qwen3MLP` for `mlp_only_layers`,
+- use `Moe` for sparse layers selected by
+  `decoder_sparse_step`,
+- load router and expert weights as `QuantizedWeights` from the Qwen3-MoE MLX
+  model,
+- preserve the same decode call shape:
+
+```python
+logits = model(tokens, offset, cache)
+```
+
+No scheduler API change in `src/tiny_llm/batch.py` is required for correctness.
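One way to decide which layers get `Moe` is sketched below. The exact rule is an assumption here (a layer is sparse when it is not listed in `mlp_only_layers` and `(layer_idx + 1)` is a multiple of `decoder_sparse_step`), so confirm it against the reference model you are following. `uses_moe` is a hypothetical helper and the config values are made up.

```python
def uses_moe(layer_idx, num_experts, decoder_sparse_step, mlp_only_layers):
    """Assumed selection rule; verify against the reference implementation."""
    # Layers listed in mlp_only_layers always keep the dense MLP.
    if layer_idx in mlp_only_layers:
        return False
    # Otherwise every decoder_sparse_step-th layer (1-based) is sparse.
    return num_experts > 0 and (layer_idx + 1) % decoder_sparse_step == 0


# Hypothetical config values for illustration only.
kinds = [
    "moe" if uses_moe(i, num_experts=8, decoder_sparse_step=2, mlp_only_layers=[3]) else "dense"
    for i in range(6)
]
print(kinds)
```

With `decoder_sparse_step=1` and an empty `mlp_only_layers`, this rule makes every layer sparse.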
+
+Run this task through the normal generation entrypoints instead of adding a
+separate unit test. For example:
+
+```bash
+hf download Qwen/Qwen3-30B-A3B-MLX-4bit
+
+pdm run main --solution tiny_llm --loader week3 --model qwen3-30b-a3b \
+  --prompt "Give me a short introduction to mixture of experts."
+
+pdm run batch-main --solution tiny_llm --loader week3 --model qwen3-30b-a3b \
+  --batch-size 2 --prefill-step 16
+```
+
+{{#include copyright.md}}