
Create a test in torchtitan to prevent backward-incompatible changes to the torchtitan/experiments/forge folder #627

@JenniferWang

Description


We've seen many backward-incompatible changes in the titan repo that cause breakage in torchtitan/experiments/forge.

For example, the current titan nightly build is broken for forge (caused by pytorch/torchtitan@ff07852):

```
monarch._src.actor.actor_mesh.ActorError: Actor call TitanTrainer.setup failed.
 Traceback of where the remote call failed (most recent call last):
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 1167, in handle
    result = await the_method(*args, **kwargs)
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/forge/actors/trainer/titan.py", line 129, in setup
    self.engine = ForgeEngine(ForgeJobConfig(**engine_config))
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
    return f(*args, **kwargs)
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/torchtitan/experiments/forge/engine.py", line 104, in __init__
    dist_utils.set_determinism(
TypeError: set_determinism() missing 1 required positional argument: 'distinct_seed_mesh_dims'
```
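The traceback shows a common failure mode: a required positional parameter was added to a function that downstream code already calls. A minimal sketch, using hypothetical simplified stand-ins (only the parameter name `distinct_seed_mesh_dims` comes from the traceback; the function names and other parameters are made up for illustration), shows why old call sites break and how giving the new parameter a default keeps them working:

```python
# Hypothetical, simplified stand-ins; the real torchtitan set_determinism
# signature has other parameters and is not reproduced here.


def set_determinism_before(mesh, seed=None):
    """Old API shape that a downstream caller was written against."""
    return ("before", mesh, seed)


def set_determinism_breaking(mesh, seed, distinct_seed_mesh_dims):
    """New *required* parameter: existing call sites now raise TypeError."""
    return ("breaking", mesh, seed, distinct_seed_mesh_dims)


def set_determinism_compatible(mesh, seed=None, distinct_seed_mesh_dims=None):
    """New parameter with a default: existing call sites keep working."""
    # Normalize the sentinel default inside the function instead of using a
    # mutable default argument.
    dims = [] if distinct_seed_mesh_dims is None else list(distinct_seed_mesh_dims)
    return ("compatible", mesh, seed, dims)


def call_like_forge(fn):
    # Mimics a downstream caller that still passes only the old arguments.
    return fn("mesh", seed=42)


if __name__ == "__main__":
    print(call_like_forge(set_determinism_before))
    print(call_like_forge(set_determinism_compatible))
    try:
        call_like_forge(set_determinism_breaking)
    except TypeError as exc:
        print("TypeError:", exc)
```

Defaulting the new parameter (here to `None`, then normalizing inside the function) is one common way to evolve an API without breaking callers; whether that fits the real `set_determinism` is an assumption.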

To keep the nightly build healthy, we need to add a test in titan that catches these changes before they land.
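One possible shape for such a test, sketched under assumptions: keep a list of the call shapes that forge uses and check, with `inspect.signature(...).bind(...)`, that they still bind against titan's current signatures. `Signature.bind` raises `TypeError` for missing or unexpected arguments without invoking the function. The `set_determinism` stand-in, its parameters, and the `FORGE_CALL_SITES` registry below are hypothetical; a real test would import the actual titan functions instead of defining a stand-in.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical stand-in for a titan function that forge depends on. A real
# test would import it from torchtitan rather than define it here.
def set_determinism(mesh, device, seed=None, deterministic=False,
                    distinct_seed_mesh_dims=None):
    return None


# (callable, positional args, keyword args). The values are placeholders:
# only the *shape* of each call matters for binding.
ForgeCallSite = Tuple[Callable[..., Any], tuple, Dict[str, Any]]

# Hypothetical registry of the call shapes torchtitan/experiments/forge uses.
FORGE_CALL_SITES: List[ForgeCallSite] = [
    (set_determinism, ("mesh", "cuda"), {"seed": 0, "deterministic": False}),
]


def incompatible_call_sites(call_sites: List[ForgeCallSite]) -> List[str]:
    """Describe every call site whose arguments no longer bind."""
    failures = []
    for fn, args, kwargs in call_sites:
        try:
            # Checks the arguments against the signature; fn is not called.
            inspect.signature(fn).bind(*args, **kwargs)
        except TypeError as exc:
            failures.append(f"{fn.__qualname__}: {exc}")
    return failures


def test_forge_call_sites_still_bind():
    failures = incompatible_call_sites(FORGE_CALL_SITES)
    assert not failures, (
        "Backward-incompatible change for forge:\n" + "\n".join(failures)
    )
```

This runs under pytest. A binding check only catches signature changes; behavioral changes would still need an end-to-end run of the forge engine in titan's CI.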
