Grug Demo, small scale feature maxed MoE #3237
Conversation
🤖 Grug variant diff report
@claude thoughts?
Claude finished @dlwh's task in 1m 55s. Review of PR #3237: Grug Demo, small scale feature maxed MoE
Overall this looks like a solid grug variant following the copy-first workflow. The architecture innovations are well-documented in the docstrings, the results are linked, and the grugmuon change to support 3D expert weights is clean. A few observations:
dlwh left a comment:

generally lg but the sharding stuff isn't quite there!
Keeping the experts sharded over the E axis should be relatively easy. I agree we need to replicate for the inner loop
```python
from jax.sharding import PartitionSpec as P, reshard
```

```python
# Replicate then vmap the Newton-Schulz core over the batch/expert dim
x = reshard(x, P(None, None, None))
```
imho better to not replicate but just keep it vmapped over expert
probably best to do something like "replicate the last two dims but keep the first dim sharded"
```python
assert X.ndim == 2
X = reshard(X, P(None, None))
return _newtonschulz_core(X, steps=steps, eps=eps, coefficient_type=coefficient_type)
```
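The suggestion above (keep the expert dim batched/sharded and vmap the Newton-Schulz core over it, rather than fully replicating) can be sketched as follows. This is a hypothetical numpy stand-in: `newtonschulz_core` here is a plain cubic Newton-Schulz orthogonalizer, not the PR's `_newtonschulz_core` (whose coefficients and `coefficient_type` options aren't shown), and in the real code the per-expert loop would be `jax.vmap` over 3D expert weights kept sharded as `P("expert", None, None)`.

```python
import numpy as np

def newtonschulz_core(X, steps=30, eps=1e-7):
    # Hypothetical stand-in for _newtonschulz_core: cubic Newton-Schulz
    # orthogonalization of one 2D matrix. Normalizing by the Frobenius
    # norm puts all singular values in (0, 1]; the iteration then drives
    # them toward 1, so X converges to an orthogonal matrix.
    X = X / (np.linalg.norm(X) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def orthogonalize_experts(W, steps=30, eps=1e-7):
    # The vmap-over-experts pattern from the review: the leading E (expert)
    # dim stays batched (and, on a device mesh, sharded); only the trailing
    # two dims are seen by the Newton-Schulz core. In JAX this would be
    # jax.vmap(newtonschulz_core) applied to W of shape (E, d, d).
    return np.stack([newtonschulz_core(W[e], steps, eps) for e in range(W.shape[0])])
```

The point of the pattern is that nothing in the inner iteration mixes experts, so there is no need to replicate the expert axis; each shard can run its own matrices' iterations independently.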
hmm tests are failing because they keep changing: 3 times in 2 days (Marin header, resource spec, v4-8 vmem). Thinking it will be simpler to hold off on chasing tests until manual review is done, then I'll do one round to sync with the latest tests and merge asap.

sorry yeah, lot of moving targets. i can help merge

This pull request has been inactive for 23 days and is marked as stale.

This pull request has been automatically closed due to inactivity.
Demoing the grug workflow with a feature-maxed MoE small-scale variant.
eval/paloma/c4_en/bpb: 1.1136 @ 5000 steps, ~9.14e17 model FLOPs.
MoE that picks 2 of 16 routed experts.
https://wandb.ai/marin-community/dial_moe/runs/updated_max_02_embed.
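The "picks 2 of 16" router described above corresponds to standard top-k gating; a minimal numpy sketch is below. The function name and the choice to renormalize softmax weights over only the selected experts are assumptions for illustration, not the PR's actual router (which may, e.g., take the softmax before selection or add load-balancing terms).

```python
import numpy as np

def route_top2(logits, k=2):
    # Hypothetical top-2-of-16 routing sketch: for each token, select the
    # k largest expert logits, then renormalize their softmax weights so
    # the chosen experts' gates sum to 1.
    top_idx = np.argsort(logits, axis=-1)[..., -k:]            # (tokens, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)  # (tokens, k)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=-1, keepdims=True)          # rows sum to 1
    return top_idx, gates
```

Each token's output is then the gate-weighted sum of the two selected experts' outputs, so only 2/16 of the expert FLOPs are spent per token.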