Description
Environment
composer_collect_env
Collecting system information...
---------------------------------
System Environment Report
Created: 2025-02-02 12:32:57 CST
---------------------------------
PyTorch information
-------------------
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.5
Libc version: N/A
Python version: 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1
Versions of relevant libraries:
[pip3] numpy==2.1.3
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.5.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==1.6.0
[pip3] torchvision==0.20.1
[conda] Could not collect
Composer information
--------------------
Composer Version: 0.28.0
Composer Commit Hash: None
CPU Model: Apple M1
CPU Count: 8
Number of Nodes: 1
GPU Model: N/A
GPUs per Node: 0
GPU Count: 1
CUDA Device Count: 0
To reproduce
Steps to reproduce the behavior:
- Install from source on a machine without a GPU.
- Follow the quickstart: step 1 is OK, but training is not:
composer train/train.py \
train/yamls/pretrain/mpt-125m.yaml \
variables.data_local=my-copy-c4 \
train_loader.dataset.split=train_small \
eval_loader.dataset.split=val_small \
max_duration=10ba \
eval_interval=0 \
save_folder=mpt-125m \
model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32
2025-02-02 11:16:09,562: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py:351: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
warnings.warn(
/Users/devworks/github.com/uv-llm/llmfoundry/utils/config_utils.py:525: UserWarning: Using `cfg.model.init_device='meta'` is only valid when using FSDP! Reverting to `cfg.model.init_device='cpu'`.
warnings.warn(
2025-02-02 11:16:09,567: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,935: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
2025-02-02 11:16:09,936: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: Instantiating an MPTForCausalLM model from /Users/devworks/github.com/uv-llm/llmfoundry/models/mpt/modeling_mpt.py
2025-02-02 11:16:10,611: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.
2025-02-02 11:16:12,196: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: MPTModel(
(wte): SharedEmbedding(50368, 768)
(wpe): Embedding(2048, 768)
(emb_drop): Dropout(p=0.0, inplace=False)
(blocks): ModuleList(
(0-11): 12 x MPTBlock(
(norm_1): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): MultiheadAttention(
(Wqkv): Linear(in_features=768, out_features=2304, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(norm_2): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
(ffn): MPTMLP(
(up_proj): Linear(in_features=768, out_features=3072, bias=True)
(down_proj): Linear(in_features=3072, out_features=768, bias=True)
)
(resid_attn_dropout): Dropout(p=0.0, inplace=False)
(resid_ffn_dropout): Dropout(p=0.0, inplace=False)
)
)
(norm_f): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2025-02-02 11:16:12,197: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
2025-02-02 11:16:12,298: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
2025-02-02 11:16:12,299: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,301: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Run name: 1738516572-ginger-mushroom
/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2025-02-02 11:16:12,380: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
variables:
data_local: my-copy-c4
data_remote: null
max_seq_len: 2048
global_seed: 17
run_name: null
max_seq_len: 2048
run_name: null
model:
name: mpt_causal_lm
init_device: meta
d_model: 768
n_heads: 12
n_layers: 12
expansion_ratio: 4
max_seq_len: 2048
vocab_size: 50368
attn_config:
attn_impl: torch
loss_fn: torch_crossentropy
tokenizer:
name: EleutherAI/gpt-neox-20b
kwargs:
model_max_length: 2048
train_loader:
name: text
dataset:
local: my-copy-c4
remote: null
split: train_small
shuffle: true
max_seq_len: 2048
shuffle_seed: 17
drop_last: true
num_workers: 8
eval_loader:
name: text
dataset:
local: my-copy-c4
remote: null
split: val_small
shuffle: false
max_seq_len: 2048
shuffle_seed: 17
drop_last: false
num_workers: 8
scheduler:
name: cosine_with_warmup
t_warmup: 100ba
alpha_f: 0.1
optimizer:
name: decoupled_adamw
lr: 0.0006
betas:
- 0.9
- 0.95
eps: 1.0e-08
weight_decay: 0.0
algorithms:
gradient_clipping:
clipping_type: norm
clipping_threshold: 1.0
max_duration: 10ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256
seed: 17
device_eval_batch_size: 16
device_train_microbatch_size: 16
precision: fp32
fsdp_config: null
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
speed_monitor:
window_size: 10
lr_monitor: {}
memory_monitor: {}
runtime_estimator: {}
save_folder: mpt-125m
n_gpus: 1
device_train_batch_size: 256
device_train_grad_accum: 16
merge: true
tp_config: null
n_params: 125311488
n_active_params: 125311488
n_trainable_params: 125311488
2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.FP32
******************************
Config:
algorithms:
gradient_clipping:
clipping_threshold: 1.0
clipping_type: norm
callbacks:
lr_monitor: {}
memory_monitor: {}
runtime_estimator: {}
speed_monitor:
window_size: 10
composer_commit_hash: None
composer_version: 0.28.0
console_log_interval: 1ba
device_eval_batch_size: 16
device_train_batch_size: 256
device_train_grad_accum: 16
device_train_microbatch_size: 16
enabled_algorithms/GradientClipping: true
eval_first: false
eval_interval: 0
eval_loader:
dataset:
local: my-copy-c4
max_seq_len: 2048
remote: null
shuffle: false
shuffle_seed: 17
split: val_small
drop_last: false
name: text
num_workers: 8
eval_subset_num_batches: -1
fsdp_config: null
global_train_batch_size: 256
log_to_console: true
max_duration: 10ba
max_seq_len: 2048
merge: true
model:
attn_config:
attn_impl: torch
d_model: 768
expansion_ratio: 4
init_device: meta
loss_fn: torch_crossentropy
max_seq_len: 2048
n_heads: 12
n_layers: 12
name: mpt_causal_lm
vocab_size: 50368
n_active_params: 125311488
n_gpus: 1
n_params: 125311488
n_trainable_params: 125311488
node_name: unknown because NODENAME environment variable not set
num_cpus_per_node: 1
num_nodes: 1
optimizer:
betas:
- 0.9
- 0.95
eps: 1.0e-08
lr: 0.0006
name: decoupled_adamw
weight_decay: 0.0
precision: fp32
progress_bar: false
rank_zero_seed: 17
run_name: null
save_folder: mpt-125m
scheduler:
alpha_f: 0.1
name: cosine_with_warmup
t_warmup: 100ba
seed: 17
time/remaining_estimate_unit: hours
tokenizer:
kwargs:
model_max_length: 2048
name: EleutherAI/gpt-neox-20b
tp_config: null
train_loader:
dataset:
local: my-copy-c4
max_seq_len: 2048
remote: null
shuffle: true
shuffle_seed: 17
split: train_small
drop_last: true
name: text
num_workers: 8
variables:
data_local: my-copy-c4
data_remote: null
global_seed: 17
max_seq_len: 2048
run_name: null
******************************
2025-02-02 11:16:12,497: rank0[2908][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
[rank0]: Traceback (most recent call last):
[rank0]: File "/Users/devworks/github.com/uv-llm/scripts/train/train.py", line 9, in <module>
[rank0]: train_from_yaml(yaml_path, args_list)
[rank0]: File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]: return train(yaml_cfg)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 643, in train
[rank0]: trainer.fit()
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2297, in fit
[rank0]: self._train_loop()
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2447, in _train_loop
[rank0]: self._spin_dataloaders_to_cur_epoch()
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2381, in _spin_dataloaders_to_cur_epoch
[rank0]: for _ in dataloader:
[rank0]: ^^^^^^^^^^
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 479, in __iter__
[rank0]: self._iterator = self._get_iterator()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 415, in _get_iterator
[rank0]: return _MultiProcessingDataLoaderIter(self)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1138, in __init__
[rank0]: w.start()
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/process.py", line 121, in start
[rank0]: self._popen = self._Popen(self)
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 224, in _Popen
[rank0]: return _default_context.get_context().Process._Popen(process_obj)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
[rank0]: return Popen(process_obj)
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]: super().__init__(process_obj)
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]: self._launch(process_obj)
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
[rank0]: reduction.dump(process_obj, fp)
[rank0]: File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
[rank0]: ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'
2025-02-02 11:16:12,512: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Engine closed.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 2908) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 2908) exited with code 1
Expected behavior
The training step of the quickstart should complete successfully.
Additional context
The overrides added to the composer training command (model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32) are there to allow it to run on the M1 CPU.
Not sure how to go about this error, as everything above it seems OK?
[rank0]: ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
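For what it is worth, here is a minimal sketch of what the error seems to point at. The traceback goes through popen_spawn_posix, and macOS uses the spawn start method by default, so each DataLoader worker is started by pickling the process object. The '<locals>' in the name suggests that get_tokens_per_batch_func returns a nested function, and nested functions cannot be pickled by the standard pickle module. The names make_token_counter, count_tokens, and is_picklable below are hypothetical stand-ins for illustration, not llm-foundry code.

```python
import pickle


def make_token_counter(pad_token_id):
    """Hypothetical factory that returns a nested function."""

    def count_tokens(batch):
        # Nested function: its qualified name contains '<locals>',
        # so pickle cannot look it up by name when serializing it.
        return sum(1 for token in batch if token != pad_token_id)

    return count_tokens


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (AttributeError, TypeError, pickle.PicklingError):
        # The exception type for local objects differs between
        # Python versions, so several are caught here.
        return False
    return True


if __name__ == "__main__":
    counter = make_token_counter(pad_token_id=0)
    print(counter.__qualname__)   # make_token_counter.<locals>.count_tokens
    print(is_picklable(counter))  # False
    print(is_picklable(len))      # True
```

If this is indeed the cause, anything that makes the worker start without pickling that nested function (for example, a callable defined at module level) would presumably avoid the crash, but that is an assumption and not something verified against the llm-foundry code.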