Enable Tensor Parallelism (TP) #599

@joellidin

Description

Summary

Wire full TP support through the miner and validator so that runs work by setting torchtitan.tp_degree > 1 in hparams. The hparams surface and Titan parallelization are already present; what remains is the plumbing across model init, the gradient pipeline, and checkpointing.
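
As a point of reference for where that plumbing starts, here is a minimal sketch of the degree check, assuming only that hparams carry torchtitan.tp_degree as described above; the function name and the flat (dp, tp) split are illustrative, not the actual code:

```python
# Hypothetical helper: validate tp_degree against the launched world size and
# derive the remaining data-parallel degree. Names here are illustrative only.
import torch.distributed as dist

def derive_parallel_degrees(tp_degree: int, world_size: int | None = None) -> tuple[int, int]:
    """Return (dp_degree, tp_degree); tp_degree must evenly divide the world size."""
    if world_size is None:
        world_size = dist.get_world_size()
    if tp_degree < 1 or world_size % tp_degree != 0:
        raise ValueError(f"tp_degree={tp_degree} must divide world_size={world_size}")
    return world_size // tp_degree, tp_degree
```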

Scope

  1. Model init / mesh wiring (see the mesh sketch after this list)
  • Miner & validator: build the Titan LLaMA via our factory and parallelize it with TP using the existing helpers. Validate parallel degrees against the world size via the factory checks.
  • Keep the validator mesh consistent with the miner for evaluation parity.
  2. Gradient pipeline (DTensor-safe; see the gradient sketch after this list)
  • Ensure the owner/rendezvous logic in prepare_gradient_dict(...) correctly handles TP-sharded DTensors during encode→compress.
  • In outer_step(...), keep the per-parameter flow: reconstruct the dense grad on the source rank, then broadcast/distribute_tensor into DTensor grads on all ranks. Verify placements under TP.
  3. Checkpointing / catch-up (see the DCP sketch after this list)
  • Confirm that DCP save/load and catch-up apply cleanly with TP sharding (Titan distributed state dicts + pointer publishing).
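
A minimal sketch of the mesh wiring in item 1, using stock PyTorch TP APIs on a toy module; in the repo this would go through the Titan model factory and the existing parallelization helpers, so the module and plan below are placeholders:

```python
# Sketch only: 2-D (dp, tp) mesh plus Megatron-style column/row sharding of a
# toy MLP. The real path builds the Titan LLaMA via the factory instead.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def build_and_parallelize(dp_degree: int, tp_degree: int) -> nn.Module:
    # Outer data-parallel dim, inner tensor-parallel dim; the validator should
    # use the same layout as the miner for evaluation parity.
    mesh = init_device_mesh("cuda", (dp_degree, tp_degree), mesh_dim_names=("dp", "tp"))

    # Stand-in for a Titan LLaMA MLP block.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

    # First linear column-wise, second row-wise across the TP sub-mesh.
    parallelize_module(model, mesh["tp"], {"0": ColwiseParallel(), "2": RowwiseParallel()})
    return model
```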
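
For item 2, a hedged sketch of the per-parameter apply path: materialize the dense grad on the source rank, broadcast it, then re-shard it into a DTensor whose placements match the TP-sharded parameter. Only outer_step and prepare_gradient_dict are named by this issue; the helper below is hypothetical:

```python
# Sketch of installing a dense grad onto a possibly TP-sharded parameter.
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, distribute_tensor  # torch >= 2.4; older: torch.distributed._tensor

def apply_dense_grad(param: torch.nn.Parameter, dense_grad: torch.Tensor | None, src_rank: int = 0) -> None:
    if dist.get_rank() != src_rank:
        # Non-source ranks allocate a buffer of the full (logical) shape;
        # DTensor parameters report their global shape here.
        dense_grad = torch.empty(param.shape, dtype=param.dtype, device="cuda")
    dist.broadcast(dense_grad, src=src_rank)

    if isinstance(param.data, DTensor):
        # Re-shard with the parameter's own mesh/placements so .grad matches
        # the TP sharding and the optimizer step stays DTensor-safe.
        param.grad = distribute_tensor(dense_grad, param.data.device_mesh, param.data.placements)
    else:
        param.grad = dense_grad
```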
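
For item 3, a minimal sketch of the DCP round trip for a TP-parallelized model. DCP state dicts are DTensor-aware, so each rank writes and reads only its shards; the catch-up path and pointer publishing from the issue are not shown, and the function names are placeholders:

```python
# Sketch of sharded checkpoint save/load with torch.distributed.checkpoint.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict, set_model_state_dict

def save_checkpoint(model, path: str) -> None:
    state_dict = {"model": get_model_state_dict(model)}  # DTensor-aware, sharded per rank
    dcp.save(state_dict, checkpoint_id=path)

def load_checkpoint(model, path: str) -> None:
    state_dict = {"model": get_model_state_dict(model)}  # template with matching sharding
    dcp.load(state_dict, checkpoint_id=path)             # fills the template in place
    set_model_state_dict(model, state_dict["model"])
```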

Acceptance criteria

  • With torchtitan.tp_degree > 1 set, the miner and validator run without placement/shape errors and complete windows.
  • Gradients compress and apply under TP (no DTensor/mesh asserts in prepare_gradient_dict or outer_step).
  • Checkpoints save, restore, and catch up on TP meshes.

Notes

  • Double-check owned_params/ownership vs. TP partitioning to avoid double work or gaps.
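
To make the concern concrete, a hedged sketch of the invariant to check: ownership is per logical parameter, so TP shards of the same tensor must not each claim it separately. The owned_params semantics here are assumed, not taken from the repo:

```python
# Sketch: every logical parameter owned by exactly one rank (no double work, no gaps).
from collections import Counter

def validate_ownership(all_param_names: set[str], owned_per_rank: list[set[str]]) -> None:
    counts = Counter(name for owned in owned_per_rank for name in owned)
    doubled = sorted(n for n, c in counts.items() if c > 1)
    missing = sorted(all_param_names - counts.keys())
    if doubled:
        raise ValueError(f"owned by more than one rank (double work): {doubled}")
    if missing:
        raise ValueError(f"owned by no rank (gaps): {missing}")
```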
