Open
Description
Since the torchtitan commit was synced to 5fb7cc2e3bbb9b9dc0ab7af34ed5cc58b5f32021, training with torchtitan using the quick-start method no longer works: prepare_experiment.py fails while parsing the experiment config.
Steps to Reproduce
git clone --recurse-submodules https://github.com/AMD-AGI/Primus
cd Primus
export DOCKER_IMAGE=rocm/primus:v25.9_gfx950
EXP=examples/torchtitan/configs/llama3.1_70B-BF16-pretrain.yaml bash examples/run_local_pretrain.sh
Error message
Traceback (most recent call last):
  File "/home/kaarnold/Primus/examples/scripts/prepare_experiment.py", line 101, in <module>
    main()
  File "/home/kaarnold/Primus/examples/scripts/prepare_experiment.py", line 50, in main
    config = PrimusParser().parse(args)
  File "/home/kaarnold/Primus/primus/core/launcher/parser.py", line 196, in parse
    self.parse_modules()
  File "/home/kaarnold/Primus/primus/core/launcher/parser.py", line 291, in parse_modules
    self.parse_trainer_module(module_name)
  File "/home/kaarnold/Primus/primus/core/launcher/parser.py", line 281, in parse_trainer_module
    yaml_utils.override_namespace(module_config, module.overrides)
  File "/home/kaarnold/Primus/primus/core/utils/yaml_utils.py", line 167, in override_namespace
    override_namespace(get_value_by_key(original_ns, key), new_value)
  File "/home/kaarnold/Primus/primus/core/utils/yaml_utils.py", line 164, in override_namespace
    raise Exception(f"Override namespace failed: can't find key({key}) in namespace {original_ns}")
Exception: Override namespace failed: can't find key(batch_size) in namespace namespace(dataset='c4', dataset_path=None, deterministic=False, enable_cpu_offload=False, gc_debug=False, gc_freq=50, global_batch_size=-1, local_batch_size=8, max_norm=1.0, mixed_precision_param='bfloat16', mixed_precision_reduce='float32', seed=None, seq_len=2048, steps=10000)
[NODE-0(smci355-ccs-aus-m14-09)] [ERROR] /home/kaarnold/Primus/examples/scripts/prepare_experiment.py failed, aborting.
The torchtitan commit update renamed batch_size to local_batch_size, so the batch_size override in the run config no longer matches any key in the namespace shown in the error above. However, if the run config examples/torchtitan/configs/llama3.1_70B-BF16-pretrain.yaml is updated to use local_batch_size, further errors occur later on that did not occur before the torchtitan commit was updated.
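To illustrate why the stale key triggers this exception, here is a minimal sketch of a strict namespace override. It is a hypothetical stand-in inferred only from the traceback, not the actual code in primus/core/utils/yaml_utils.py. The default values are taken from the namespace printed in the error message, and the override value of 4 is made up for the example.

```python
from types import SimpleNamespace


def override_namespace(original_ns, overrides_ns):
    """Recursively copy values from overrides_ns onto original_ns.

    Hypothetical stand-in for the Primus helper of the same name,
    reconstructed only from the traceback; the real function may differ.
    """
    for key, new_value in vars(overrides_ns).items():
        # Strict behaviour: an override may only target a key that already exists.
        if not hasattr(original_ns, key):
            raise Exception(
                f"Override namespace failed: can't find key({key}) in namespace {original_ns}"
            )
        if isinstance(new_value, SimpleNamespace):
            override_namespace(getattr(original_ns, key), new_value)
        else:
            setattr(original_ns, key, new_value)


# Defaults as they appear after the torchtitan update: only local_batch_size exists.
training_defaults = SimpleNamespace(local_batch_size=8, global_batch_size=-1, seq_len=2048)

# An override written against the old key name is rejected
# (the value 4 is an assumption for illustration).
try:
    override_namespace(training_defaults, SimpleNamespace(batch_size=4))
    old_key_error = None
except Exception as exc:
    old_key_error = str(exc)

# The renamed key is accepted and replaces the default.
override_namespace(training_defaults, SimpleNamespace(local_batch_size=4))
```

Under these assumptions, renaming the key in the run config only clears this first failure; it says nothing about the later errors mentioned above.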