Getting UnionMatchError when trying to run official examples in SmolLM #371

Hi,

I'm trying to run SmolLMv1 pretraining by following the instructions in the SmolLM repository with the current version of the nanotron library, but I'm running into an error.

What I've done:

I've followed all the instructions provided in the [README](https://github.com/huggingface/smollm/tree/main/text/pretraining) under the text/pretraining folder, including:

- Setting up my training environment as described
- Tokenizing the data according to the documentation (using tools/preprocess_data.py)

The issue

When I attempt to start training using the command provided in the README, I receive a UnionMatchError:

```
$ CUDA_DEVICES_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file ~/smollm/text/pretraining/smollm1/config_smollm1_135M_replicate.yaml
```

I've created a configuration file named config_smollm1_135M_replicate.yaml which is identical to the repository's [config_smollm1_135M.yaml](https://github.com/huggingface/smollm/blob/main/text/pretraining/smollm1/config_smollm1_135M.yaml), with one intentional modification:

The dataset_folder paths have been updated to point to my locally tokenized datasets:

```yaml
dataset_folder: # paths to tokenized datasets
  - ./data/tokenized/fineweb-edu-dedup
  - ./data/tokenized/cosmopedia-v2
  - ./data/tokenized/python-edu
  - ./data/tokenized/open-web-math
  - ./data/tokenized/stackoverflow-clean
```
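To double-check that the dataset paths really are the only difference between the two files, the configs can be compared with a short script. This is a minimal sketch using Python's standard `difflib`; the inline YAML strings and the `/hypothetical/original` path are placeholders rather than the repository's actual contents, and in practice the text would be read from the two config files.

```python
import difflib


def changed_lines(original_text: str, modified_text: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(original_text.splitlines(), modified_text.splitlines())
    # ndiff prefixes every line with a two-character code; keep removals and additions only.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Placeholder config text (hypothetical, for illustration only).
original = """\
data:
  dataset:
    dataset_folder:
      - /hypothetical/original/fineweb-edu-dedup
"""
modified = original.replace("/hypothetical/original", "./data/tokenized")

diff = changed_lines(original, modified)
print("\n".join(diff))
```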
Here is the error I'm getting:

```
File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 69, in from_dict
  data = _build_value_for_union(union=type_, data=data, config=config)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 143, in _build_value_for_union
  raise UnionMatchError(field_type=union, value=data)
dacite.exceptions.UnionMatchError: can not match type "dict" to any type of "data_stages.data.dataset" union: typing.Union[nanotron.config.config.PretrainDatasetsArgs, nanotron.config.config.NanosetDatasetsArgs, nanotron.config.config.SFTDatasetsArgs, NoneType]
```

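The error suggests that the `data_stages.data.dataset` section of my config, which is parsed as a dict, does not match any of the dataset argument types nanotron accepts (`PretrainDatasetsArgs`, `NanosetDatasetsArgs`, `SFTDatasetsArgs`). Could you please provide guidance on how to resolve this configuration issue?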
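One way to narrow this down is to compare the keys in the `dataset` section against each candidate type separately, since the union error does not say which key caused each candidate to be rejected. The following is a minimal, standard-library-only sketch of that idea. The two dataclasses and their fields are hypothetical stand-ins, not nanotron's real definitions, and the check assumes the loader rejects unknown keys (dacite only does so in strict mode, which is an assumption here). It compares key names only, not value types.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


# Hypothetical stand-ins for the union members; the real nanotron classes differ.
@dataclass
class HypotheticalPretrainArgs:
    hf_dataset: str
    text_column: Optional[str] = None


@dataclass
class HypotheticalNanosetArgs:
    dataset_folder: list
    dataset_weights: Optional[list] = None


def explain_union_mismatch(data: dict, candidates: list[type]) -> dict[str, str]:
    """For each candidate dataclass, report which keys prevent `data` from fitting."""
    report = {}
    for cls in candidates:
        known = {f.name for f in fields(cls)}
        # A field is required when it has neither a default nor a default factory.
        required = {
            f.name
            for f in fields(cls)
            if f.default is MISSING and f.default_factory is MISSING
        }
        unexpected = sorted(set(data) - known)
        missing = sorted(required - set(data))
        if unexpected or missing:
            report[cls.__name__] = (
                f"unexpected keys: {unexpected}; missing keys: {missing}"
            )
        else:
            report[cls.__name__] = "ok"
    return report


# Example: a dataset section carrying one key the candidate type does not know.
dataset_section = {
    "dataset_folder": ["./data/tokenized/fineweb-edu-dedup"],
    "hypothetical_new_key": 1,
}
report = explain_union_mismatch(
    dataset_section, [HypotheticalPretrainArgs, HypotheticalNanosetArgs]
)
for name, reason in report.items():
    print(f"{name}: {reason}")
```

Running the same comparison against the actual field lists in `nanotron.config.config` for the installed version should show whether the YAML contains a key that the installed nanotron does not recognize, or lacks one that it requires.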
