Description
Hi,
I'm trying to run SmolLMv1 pretraining following the instructions in the SmolLM repository, with the current version of the nanotron library, but I'm encountering errors during this process.
What I've done:
I've followed all the instructions provided in the README under the text/pretraining folder, including:
Setting up my training environment as described
Tokenizing the data according to the documentation (using tools/preprocess_data.py)
The issue
When I start training with the command provided in the README, it fails with a dacite UnionMatchError. This is the command I run:
$ CUDA_DEVICES_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file ~/smollm/text/pretraining/smollm1/config_smollm1_135M_replicate.yaml
I've created a configuration file named config_smollm1_135M_replicate.yaml which is identical to the repository's config_smollm1_135M.yaml, with one intentional modification:
The dataset_folder paths have been updated to point to my locally tokenized datasets:
dataset_folder: # paths to tokenized datasets
- ./data/tokenized/fineweb-edu-dedup
- ./data/tokenized/cosmopedia-v2
- ./data/tokenized/python-edu
- ./data/tokenized/open-web-math
- ./data/tokenized/stackoverflow-clean
Here is the error I'm getting:
File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 69, in from_dict
data = _build_value_for_union(union=type_, data=data, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 143, in _build_value_for_union
raise UnionMatchError(field_type=union, value=data)
dacite.exceptions.UnionMatchError: can not match type "dict" to any type of "data_stages.data.dataset" union: typing.Union[nanotron.config.config.PretrainDatasetsArgs, nanotron.config.config.NanosetDatasetsArgs, nanotron.config.config.SFTDatasetsArgs, NoneType]
The error indicates that the dict under data_stages.data.dataset in my config does not match any of the dataset argument types in the union (PretrainDatasetsArgs, NanosetDatasetsArgs, SFTDatasetsArgs). Could you please provide guidance on how to resolve this configuration issue?
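To narrow this down, a minimal sketch like the following could compare the keys of the dataset section against the fields of each candidate dataclass. The two dataclasses and the extra key below are hypothetical stand-ins, not nanotron's actual definitions; in practice one would pass the real classes named in the traceback. It also assumes that unknown keys count as a mismatch, which may not reflect how dacite is configured in nanotron.

```python
from dataclasses import MISSING, dataclass, fields
from typing import List, Optional


@dataclass
class FolderDatasetArgs:  # hypothetical stand-in, NOT nanotron's real class
    dataset_folder: List[str]
    dataset_weights: Optional[List[float]] = None


@dataclass
class HubDatasetArgs:  # hypothetical stand-in, NOT nanotron's real class
    hf_dataset_or_datasets: str
    text_column_name: Optional[str] = None


def explain_mismatch(data: dict, candidates: list) -> dict:
    """For each candidate dataclass, list keys it does not know and required fields the dict lacks."""
    report = {}
    for cls in candidates:
        known = {f.name for f in fields(cls)}
        # A field is required when it has neither a default nor a default factory.
        required = {
            f.name
            for f in fields(cls)
            if f.default is MISSING and f.default_factory is MISSING
        }
        report[cls.__name__] = {
            "unexpected": sorted(set(data) - known),
            "missing": sorted(required - set(data)),
        }
    return report


# "some_extra_key" is a made-up key used only to trigger a mismatch.
section = {
    "dataset_folder": ["./data/tokenized/cosmopedia-v2"],
    "some_extra_key": None,
}
report = explain_mismatch(section, [FolderDatasetArgs, HubDatasetArgs])
for name, diff in report.items():
    print(name, diff)
```

A candidate with empty "unexpected" and "missing" lists has the right set of keys; if none of them do, the lists show which keys in the YAML differ from what the installed nanotron version expects. This only checks key names, not value types.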