Summary
Wire full tensor-parallel (TP) support through the miner and validator so that runs work by setting `torchtitan.tp_degree > 1` in hparams. The hparams surface and Titan parallelization helpers are already present; what remains is the plumbing across init, the gradient pipeline, and checkpoints.
Scope
- Model init / mesh wiring
- Miner & Validator: build Titan LLaMA via our factory and parallelize with TP using existing helpers. Validate degrees/world-size via the factory checks.
- Keep validator mesh consistent with miner for evaluation parity.
- Gradient pipeline (DTensor-safe)
- Ensure `prepare_gradient_dict(...)` owner/rendezvous logic correctly handles TP-sharded DTensors during encode→compress.
- In `outer_step(...)`, keep the per-param flow: reconstruct the dense grad on the source rank, then broadcast/`distribute_tensor` into DTensor grads on all ranks. Verify placements under TP.
- Checkpointing / catch-up
- Confirm DCP save/load and catch-up apply cleanly with TP sharding (Titan distributed state dicts + pointer publishing).
Acceptance criteria
- Setting `torchtitan.tp_degree > 1` runs the miner and validator without placement/shape errors and completes windows.
- Gradients compress/apply under TP (no DTensor/mesh asserts in `prepare_gradient_dict` or `outer_step`).
- Checkpoints save, restore, and catch up on TP meshes.
Notes
- Double-check `owned_params`/ownership vs. TP partitioning to avoid double work or gaps.
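The "no double work, no gaps" property above can be pinned down with a sketch; this round-robin `owned_params` is a hypothetical stand-in for the repo's helper, shown only to make the invariant concrete:

```python
# Hedged sketch: assign each param to exactly one global rank. Under TP the
# owning rank would additionally need to gather its param's shards before
# encode/compress; that step is omitted here.
from typing import Iterable, List


def owned_params(param_names: Iterable[str], rank: int, world_size: int) -> List[str]:
    """Deterministic round-robin ownership: every param has exactly one owner
    (no gaps), and no param has two (no double work)."""
    return [
        name
        for i, name in enumerate(sorted(param_names))
        if i % world_size == rank
    ]
```

Whatever scheme the repo uses, the check worth automating is that the union of `owned_params` across ranks equals the full param set and the per-rank lists are disjoint.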