@wenzhaoabc

Description

This PR fixes an issue in the dataset loading logic: when args.dataset_config is a dictionary, it is passed to datasets.load_dataset as a single positional argument instead of being unpacked into keyword arguments.

The datasets.load_dataset function expects configuration parameters (like name, split, data_files, etc.) to be passed as keyword arguments. When these arguments are grouped into a dictionary, it must be unpacked with ** to be passed correctly.
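The difference between the two call styles can be sketched with a stand-in function. Note that `fake_load_dataset` is a hypothetical placeholder written for this illustration, not the real `datasets.load_dataset`; it only mimics the leading `path` and `name` parameters plus a `split` keyword so the argument binding is visible without downloading anything.

```python
def fake_load_dataset(path, name=None, split=None, **kwargs):
    """Hypothetical stand-in that reports how its arguments were bound."""
    return {"path": path, "name": name, "split": split, **kwargs}


config = {"name": "en", "split": "train"}

# Positional: the whole dict is bound to the second parameter, `name`.
positional = fake_load_dataset("my_dataset", config)

# Unpacked: each key is bound to the keyword parameter of the same name.
unpacked = fake_load_dataset("my_dataset", **config)

print(positional["name"])  # the dict itself, not "en"
print(unpacked["name"], unpacked["split"])
```

In the positional call, `split` silently keeps its default, which is why the configuration appears to be ignored rather than raising an obvious error at the call site.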

This change modifies the function call to use **args.dataset_config, ensuring that dictionary-based configurations are properly applied.

Change Details

if args.dataset_name and not args.dataset_mixture:
    logger.info(f"Loading dataset: {args.dataset_name}")
-   return datasets.load_dataset(args.dataset_name, args.dataset_config)
+   return datasets.load_dataset(args.dataset_name, **args.dataset_config)
elif args.dataset_mixture:
    logger.info(f"Creating dataset mixture with {len(args.dataset_mixture.datasets)} datasets")
    seed = args.dataset_mixture.seed

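This ensures that if args.dataset_config is, for example, {'name': 'en', 'split': 'train'}, the call becomes load_dataset(dataset_name, name='en', split='train'). Previously the call was load_dataset(dataset_name, {'name': 'en', 'split': 'train'}), which binds the whole dictionary to the second positional parameter (name) rather than applying its keys as separate arguments.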
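One caveat worth considering: `**` only works on a mapping, so the new call would raise a TypeError if args.dataset_config were ever None or a plain string config name. Whether that can happen depends on how the argument is declared, which this PR does not show. A minimal sketch of a defensive normalization step, assuming those three shapes are possible; `build_load_kwargs` is a hypothetical helper and not part of this repository or of the `datasets` library:

```python
from collections.abc import Mapping


def build_load_kwargs(dataset_config):
    """Hypothetical helper: normalize a config value into keyword arguments."""
    if dataset_config is None:
        # No configuration given: pass no extra keyword arguments.
        return {}
    if isinstance(dataset_config, str):
        # A bare string is treated as the config name (assumption).
        return {"name": dataset_config}
    if isinstance(dataset_config, Mapping):
        # Copy so the caller's mapping is never mutated downstream.
        return dict(dataset_config)
    raise TypeError(
        f"Unsupported dataset_config type: {type(dataset_config).__name__}"
    )


print(build_load_kwargs(None))
print(build_load_kwargs("en"))
print(build_load_kwargs({"name": "en", "split": "train"}))
```

With such a helper, the call site would read `datasets.load_dataset(args.dataset_name, **build_load_kwargs(args.dataset_config))`.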
