Getting UnionMatchError when trying to run official examples in SmolLM #371

@rekcu

Hi,

I'm trying to run SmolLMv1 pretraining following the instructions in the SmolLM repository, with the current version of the nanotron library, but I'm encountering errors during this process.

What I've done:

I've followed all the instructions provided in the README under the text/pretraining folder, including:

  • Setting up my training environment as described
  • Tokenizing the data according to the documentation (using tools/preprocess_data.py)

The issue

When I start training with the command provided in the README, I get a UnionMatchError. This is the command I run:

$ CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file ~/smollm/text/pretraining/smollm1/config_smollm1_135M_replicate.yaml

I've created a configuration file named config_smollm1_135M_replicate.yaml which is identical to the repository's config_smollm1_135M.yaml, with one intentional modification:

The dataset_folder paths have been updated to point to my locally tokenized datasets:

dataset_folder: # paths to tokenized datasets
  - ./data/tokenized/fineweb-edu-dedup
  - ./data/tokenized/cosmopedia-v2
  - ./data/tokenized/python-edu
  - ./data/tokenized/open-web-math
  - ./data/tokenized/stackoverflow-clean

Here is the error I'm getting:

File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 69, in from_dict
  data = _build_value_for_union(union=type_, data=data, config=config)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "nanotron/lib/python3.11/site-packages/dacite/core.py", line 143, in _build_value_for_union
  raise UnionMatchError(field_type=union, value=data)
dacite.exceptions.UnionMatchError: can not match type "dict" to any type of "data_stages.data.dataset" union: typing.Union[nanotron.config.config.PretrainDatasetsArgs, nanotron.config.config.NanosetDatasetsArgs, nanotron.config.config.SFTDatasetsArgs, NoneType]

The error suggests there's a type mismatch in the configuration when trying to match data types for the dataset configuration. Could you please provide guidance on how to resolve this configuration issue?
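
For anyone hitting the same message: dacite raises UnionMatchError when the dict fails to build into every member of the union, so one way to narrow it down is to compare the keys of the `dataset` section with the fields each candidate class declares. Below is a minimal sketch of that comparison using only the standard library. The two dataclasses and the key `some_renamed_option` are hypothetical stand-ins, not nanotron's real definitions; to use this for real, import the actual classes named in the traceback. It only checks key names, so a wrong value type would need a separate look, and unexpected keys only matter if dacite is configured in strict mode.

```python
from dataclasses import MISSING, dataclass, fields
from typing import List, Optional


# Hypothetical stand-ins for the union members; the real nanotron
# classes declare their own fields and should be imported instead.
@dataclass
class StandInFolderArgs:
    dataset_folder: List[str]
    dataset_weights: Optional[List[float]] = None


@dataclass
class StandInHubArgs:
    hf_dataset: str
    text_column_name: str = "text"


def key_mismatches(data: dict, candidates: list) -> dict:
    """For each candidate dataclass, list the required fields missing from
    `data` and the keys in `data` that the dataclass does not declare."""
    report = {}
    for cls in candidates:
        declared = {f.name for f in fields(cls)}
        # A field is required when it has neither a default nor a factory.
        required = {
            f.name
            for f in fields(cls)
            if f.default is MISSING and f.default_factory is MISSING
        }
        report[cls.__name__] = {
            "missing": sorted(required - set(data)),
            "unexpected": sorted(set(data) - declared),
        }
    return report


# Example section with one key that neither stand-in declares.
section = {
    "dataset_folder": ["./data/tokenized/python-edu"],
    "some_renamed_option": 1,  # hypothetical key, for illustration only
}
report = key_mismatches(section, [StandInFolderArgs, StandInHubArgs])
for name, diff in report.items():
    print(name, diff)
```

A candidate whose "missing" and "unexpected" lists are both empty would match on key names, so any remaining failure is more likely a value type problem or a difference between the config file and the installed library version.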
