Conversation

@yanboliang commented Dec 24, 2025

Run the following command:

NGPU=4 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=2 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes transformer_block_bucketing

Errors:

 traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/train.py", line 669, in train
      self.train_step(data_iterator)
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/train.py", line 568, in train_step
      loss = self.forward_backward_step(input_dict, labels)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/train.py", line 543, in forward_backward_step
      pred = model_parts[0](inputs, **extra_inputs, **extra_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1780, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1791, in _call_impl
      return forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/experiments/compiler_toolkit/graph_utils.py", line 197, in forward
      dt_args, dt_kwargs = self.parallelize_inputs(self.parallel_dims, args, kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/experiments/compiler_toolkit/common_utils.py", line 36, in parallelize_inputs
      dt_args = tree_map(to_dtensor, args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/dist-packages/torch/utils/_pytree.py", line 1539, in tree_map
      return treespec.unflatten(map(func, *flat_args))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/dist-packages/torch/utils/_pytree.py", line 1280, in unflatten
      leaves = list(leaves)
               ^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/experiments/compiler_toolkit/common_utils.py", line 32, in to_dtensor
      tensor, parallel_dims.get_mesh("tp"), [Replicate()]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/mlx_devbox/users/yanbo.liang/playground/torchtitan/torchtitan/distributed/parallel_dims.py", line 272, in get_mesh
      raise ValueError(
  ValueError: Mesh 'tp' is not available. Ensure the corresponding parallelism dimension is enabled (size > 1).

This fails because when TP is not enabled, the inputs are regular tensors and there is no "tp" mesh to replicate them on. cc @SherlockNoMad @yiming0416
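The shape of a possible guard can be sketched as follows. This is a minimal, self-contained illustration, not the actual torchtitan code: `ParallelDims`, `has_mesh`, and the tuple standing in for a real `DTensor` are all stand-ins so the example runs without `torch.distributed`; the real `to_dtensor` in `compiler_toolkit/common_utils.py` would instead call something like `DTensor.from_local(tensor, mesh, [Replicate()])` only when the "tp" mesh exists.

```python
# Hypothetical sketch: convert an input to a DTensor only when the "tp"
# mesh is available; otherwise pass the regular tensor through unchanged.
# Stub classes are used so this runs without torch.distributed.


class ParallelDims:
    """Stand-in for torchtitan's ParallelDims: tracks which meshes exist."""

    def __init__(self, enabled):
        self._meshes = set(enabled)

    def has_mesh(self, name):
        # Hypothetical helper; the real class raises from get_mesh instead.
        return name in self._meshes

    def get_mesh(self, name):
        if name not in self._meshes:
            raise ValueError(
                f"Mesh {name!r} is not available. Ensure the corresponding "
                "parallelism dimension is enabled (size > 1)."
            )
        return name  # the real method returns a DeviceMesh


def to_dtensor_guarded(tensor, parallel_dims):
    """Replicate onto the tp mesh if it exists; otherwise return tensor as-is."""
    if not parallel_dims.has_mesh("tp"):
        return tensor  # TP disabled: inputs stay regular tensors
    mesh = parallel_dims.get_mesh("tp")
    # Placeholder for DTensor.from_local(tensor, mesh, [Replicate()]):
    return ("DTensor", tensor, mesh)


# TP disabled (the failing configuration above): tensor passes through.
assert to_dtensor_guarded("x", ParallelDims({"dp_shard", "ep"})) == "x"

# TP enabled: conversion happens.
assert to_dtensor_guarded("y", ParallelDims({"tp"}))[0] == "DTensor"
```

An alternative design would be to keep `get_mesh` strict and have callers check mesh availability first, which is what the guard above assumes.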

@meta-cla bot commented Dec 24, 2025

Hi @yanboliang!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla bot commented Dec 24, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 24, 2025
