quickstart train on cpu with m1? #1718


## Environment

```
composer_collect_env
Collecting system information...
---------------------------------
System Environment Report        
Created: 2025-02-02 12:32:57 CST
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.5
Libc version: N/A

Python version: 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] numpy==2.1.3
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.20.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.5.1
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==1.6.0
[pip3] torchvision==0.20.1
[conda] Could not collect


Composer information
--------------------
Composer Version: 0.28.0
Composer Commit Hash: None
CPU Model: Apple M1
CPU Count: 8
Number of Nodes: 1
GPU Model: N/A
GPUs per Node: 0
GPU Count: 1
CUDA Device Count: 0
```
## To reproduce

Steps to reproduce the behavior:
1. Install from source without GPU.
2. Follow the [quickstart](https://github.com/mosaicml/llm-foundry?tab=readme-ov-file#quickstart): step 1 is OK, training is not.
3. Run the training command, which fails with the output below:
```
composer train/train.py \
  train/yamls/pretrain/mpt-125m.yaml \
  variables.data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m \
  model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32



2025-02-02 11:16:09,562: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Initializing dist with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Testing barrier with device...
2025-02-02 11:16:09,566: rank0[2908][MainThread]: DEBUG: llmfoundry.command_utils.train: Barrier test passed with device.
/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py:351: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
/Users/devworks/github.com/uv-llm/llmfoundry/utils/config_utils.py:525: UserWarning: Using `cfg.model.init_device='meta'` is only valid when using FSDP! Reverting to `cfg.model.init_device='cpu'`.
  warnings.warn(
2025-02-02 11:16:09,567: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building tokenizer...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building train loader...
2025-02-02 11:16:09,922: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building eval loader...
2025-02-02 11:16:09,933: rank0[2908][MainThread]: INFO: streaming.base.dataset: Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64.
2025-02-02 11:16:09,935: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Initializing model...
MPTForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
2025-02-02 11:16:09,936: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: Instantiating an MPTForCausalLM model from /Users/devworks/github.com/uv-llm/llmfoundry/models/mpt/modeling_mpt.py
2025-02-02 11:16:10,611: rank0[2908][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.
2025-02-02 11:16:12,196: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: MPTModel(
  (wte): SharedEmbedding(50368, 768)
  (wpe): Embedding(2048, 768)
  (emb_drop): Dropout(p=0.0, inplace=False)
  (blocks): ModuleList(
    (0-11): 12 x MPTBlock(
      (norm_1): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): MultiheadAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm_2): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): MPTMLP(
        (up_proj): Linear(in_features=768, out_features=3072, bias=True)
        (down_proj): Linear(in_features=3072, out_features=768, bias=True)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_ffn_dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (norm_f): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2025-02-02 11:16:12,197: rank0[2908][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
2025-02-02 11:16:12,298: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Building trainer...
2025-02-02 11:16:12,299: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,301: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Run name: 1738516572-ginger-mushroom
/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/callbacks/memory_monitor.py:137: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2025-02-02 11:16:12,380: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2025-02-02 11:16:12,381: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Logging config
variables:
  data_local: my-copy-c4
  data_remote: null
  max_seq_len: 2048
  global_seed: 17
  run_name: null
max_seq_len: 2048
run_name: null
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 768
  n_heads: 12
  n_layers: 12
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50368
  attn_config:
    attn_impl: torch
  loss_fn: torch_crossentropy
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: 2048
train_loader:
  name: text
  dataset:
    local: my-copy-c4
    remote: null
    split: train_small
    shuffle: true
    max_seq_len: 2048
    shuffle_seed: 17
  drop_last: true
  num_workers: 8
eval_loader:
  name: text
  dataset:
    local: my-copy-c4
    remote: null
    split: val_small
    shuffle: false
    max_seq_len: 2048
    shuffle_seed: 17
  drop_last: false
  num_workers: 8
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0006
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 10ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256
seed: 17
device_eval_batch_size: 16
device_train_microbatch_size: 16
precision: fp32
fsdp_config: null
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_folder: mpt-125m
n_gpus: 1
device_train_batch_size: 256
device_train_grad_accum: 16
merge: true
tp_config: null
n_params: 125311488
n_active_params: 125311488
n_trainable_params: 125311488

2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: llmfoundry.command_utils.train: Starting training...
2025-02-02 11:16:12,495: rank0[2908][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.FP32
******************************
Config:
algorithms:
  gradient_clipping:
    clipping_threshold: 1.0
    clipping_type: norm
callbacks:
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
  speed_monitor:
    window_size: 10
composer_commit_hash: None
composer_version: 0.28.0
console_log_interval: 1ba
device_eval_batch_size: 16
device_train_batch_size: 256
device_train_grad_accum: 16
device_train_microbatch_size: 16
enabled_algorithms/GradientClipping: true
eval_first: false
eval_interval: 0
eval_loader:
  dataset:
    local: my-copy-c4
    max_seq_len: 2048
    remote: null
    shuffle: false
    shuffle_seed: 17
    split: val_small
  drop_last: false
  name: text
  num_workers: 8
eval_subset_num_batches: -1
fsdp_config: null
global_train_batch_size: 256
log_to_console: true
max_duration: 10ba
max_seq_len: 2048
merge: true
model:
  attn_config:
    attn_impl: torch
  d_model: 768
  expansion_ratio: 4
  init_device: meta
  loss_fn: torch_crossentropy
  max_seq_len: 2048
  n_heads: 12
  n_layers: 12
  name: mpt_causal_lm
  vocab_size: 50368
n_active_params: 125311488
n_gpus: 1
n_params: 125311488
n_trainable_params: 125311488
node_name: unknown because NODENAME environment variable not set
num_cpus_per_node: 1
num_nodes: 1
optimizer:
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  lr: 0.0006
  name: decoupled_adamw
  weight_decay: 0.0
precision: fp32
progress_bar: false
rank_zero_seed: 17
run_name: null
save_folder: mpt-125m
scheduler:
  alpha_f: 0.1
  name: cosine_with_warmup
  t_warmup: 100ba
seed: 17
time/remaining_estimate_unit: hours
tokenizer:
  kwargs:
    model_max_length: 2048
  name: EleutherAI/gpt-neox-20b
tp_config: null
train_loader:
  dataset:
    local: my-copy-c4
    max_seq_len: 2048
    remote: null
    shuffle: true
    shuffle_seed: 17
    split: train_small
  drop_last: true
  name: text
  num_workers: 8
variables:
  data_local: my-copy-c4
  data_remote: null
  global_seed: 17
  max_seq_len: 2048
  run_name: null

******************************
2025-02-02 11:16:12,497: rank0[2908][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
[rank0]: Traceback (most recent call last):
[rank0]:   File "/Users/devworks/github.com/uv-llm/scripts/train/train.py", line 9, in <module>
[rank0]:     train_from_yaml(yaml_path, args_list)
[rank0]:   File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 662, in train_from_yaml
[rank0]:     return train(yaml_cfg)
[rank0]:            ^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/llmfoundry/command_utils/train.py", line 643, in train
[rank0]:     trainer.fit()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2297, in fit
[rank0]:     self._train_loop()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2447, in _train_loop
[rank0]:     self._spin_dataloaders_to_cur_epoch()
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2381, in _spin_dataloaders_to_cur_epoch
[rank0]:     for _ in dataloader:
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 479, in __iter__
[rank0]:     self._iterator = self._get_iterator()
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 415, in _get_iterator
[rank0]:     return _MultiProcessingDataLoaderIter(self)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/github.com/uv-llm/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1138, in __init__
[rank0]:     w.start()
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/process.py", line 121, in start
[rank0]:     self._popen = self._Popen(self)
[rank0]:                   ^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 224, in _Popen
[rank0]:     return _default_context.get_context().Process._Popen(process_obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
[rank0]:     return Popen(process_obj)
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]:     super().__init__(process_obj)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]:     self._launch(process_obj)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
[rank0]:     reduction.dump(process_obj, fp)
[rank0]:   File "/Users/devworks/.local/share/mise/installs/python/3.12.5/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
[rank0]:     ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'
2025-02-02 11:16:12,512: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Post-closing callback CheckpointSaver
2025-02-02 11:16:12,513: rank0[2908][MainThread]: DEBUG: composer.core.engine: Engine closed.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 2908) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 2908) exited with code 1
```

## Expected behavior

The training step of the quickstart should complete successfully.

## Additional context

The overrides added to the composer training command, `model.attn_config.attn_impl=torch model.loss_fn=torch_crossentropy precision=fp32`, are there to allow it to run on the M1 CPU.

I'm not sure how to approach this error, since everything before it in the log looks OK:

```
[rank0]:     ForkingPickler(file, protocol).dump(obj)
[rank0]: AttributeError: Can't get local object 'get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch'

ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
```
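
One possible reading of the traceback: the failure goes through `popen_spawn_posix.py`, so the DataLoader workers (`num_workers: 8` in the logged config) are started with the `spawn` method, which is the default on macOS and pickles the objects handed to each worker. A function defined inside another function cannot be pickled by reference, which would match the `get_tokens_per_batch_func.<locals>.get_num_tokens_in_batch` message. The sketch below reproduces only that pickling behaviour in isolation; `make_token_counter`, `TokenCounter`, and `can_pickle` are hypothetical names made up for illustration and are not llm-foundry code.

```python
import pickle


def make_token_counter(pad_id):
    # Hypothetical factory returning a nested function, similar in shape to
    # `get_tokens_per_batch_func` from the traceback (not the real code).
    def count_tokens(batch):
        return sum(1 for tok in batch if tok != pad_id)

    return count_tokens


class TokenCounter:
    # Hypothetical alternative: a callable class defined at module level,
    # which pickle can reference by its importable name.
    def __init__(self, pad_id):
        self.pad_id = pad_id

    def __call__(self, batch):
        return sum(1 for tok in batch if tok != self.pad_id)


def can_pickle(obj):
    # Return True if pickle can serialize obj, False if it refuses.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


local_fn = make_token_counter(pad_id=0)
counter = TokenCounter(pad_id=0)

# Both callables count the same tokens; only their picklability differs.
print(local_fn([0, 5, 7, 0]), counter([0, 5, 7, 0]))
print("nested function picklable:", can_pickle(local_fn))
print("module-level callable picklable:", can_pickle(counter))
```

If that reading is right, adding `train_loader.num_workers=0 eval_loader.num_workers=0` to the command might sidestep the error, since PyTorch then loads data in the main process and starts no workers. This is an untested guess, not a confirmed fix.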

