We've seen many backward-incompatible changes in the titan repo that cause breakage in torchtitan/experiments/forge.
For example, the current titan nightly build is broken for forge (caused by pytorch/torchtitan@ff07852):
monarch._src.actor.actor_mesh.ActorError: Actor call TitanTrainer.setup failed.
Traceback of where the remote call failed (most recent call last):
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 1167, in handle
    result = await the_method(*args, **kwargs)
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/forge/actors/trainer/titan.py", line 129, in setup
    self.engine = ForgeEngine(ForgeJobConfig(**engine_config))
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
    return f(*args, **kwargs)
  File "/home/jiyue/.fbpkg_conda_envs/forge-19456bd/lib/python3.10/site-packages/torchtitan/experiments/forge/engine.py", line 104, in __init__
    dist_utils.set_determinism(
TypeError: set_determinism() missing 1 required positional argument: 'distinct_seed_mesh_dims'
To keep the nightly build healthy, we need to add a test in titan that catches these changes before they land.
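
One possible shape for such a test, as a rough sketch rather than the actual implementation: check that the arguments a downstream caller passes still bind to the function's signature. The helper `call_binds` and the two `set_determinism_*` stand-ins below are hypothetical and use simplified parameter lists; only the parameter name `distinct_seed_mesh_dims` comes from the traceback above. A real test would presumably import the titan function and mirror the call made in torchtitan/experiments/forge/engine.py.

```python
import inspect


def call_binds(func, *args, **kwargs):
    """Return True if func(*args, **kwargs) would satisfy func's signature.

    Nothing is executed; the arguments are only matched against the signature.
    """
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Hypothetical stand-ins with simplified parameters (assumption: the real
# set_determinism takes different and additional arguments).
def set_determinism_before(seed=None, deterministic=False):
    pass


def set_determinism_after(seed, deterministic, distinct_seed_mesh_dims):
    pass


# How a downstream caller (forge, in this issue) invokes the function.
forge_call_kwargs = {"seed": 0, "deterministic": False}

ok_before = call_binds(set_determinism_before, **forge_call_kwargs)
ok_after = call_binds(set_determinism_after, **forge_call_kwargs)

print("call binds before the change:", ok_before)
print("call binds after the change:", ok_after)
```

A check like this would fail in titan CI as soon as a new required argument is added without updating the forge call site, instead of surfacing later in the nightly build.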