A command-line tool to interact with Hugging Face datasets and migrate them to Couchbase, with support for streaming data.
pip install -r requirements.txt
python setup.py install
The CLI provides the following commands:
Lists all available configurations for a dataset.
hf_to_cb_dataset_migrator list-configs --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --revision: Version of the dataset script to load
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --dynamic-modules-path: Path to dynamic modules
- --data-files: Path(s) to source data file(s)
- --token: Authentication token for private datasets
- --json-output: Output the configurations in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Lists all available splits for a dataset.
hf_to_cb_dataset_migrator list-splits --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --name: Configuration name of the dataset
- --data-files: Path(s) to source data file(s)
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --revision: Version of the dataset script to load
- --token: Authentication token for private datasets
- --json-output: Output the splits in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Lists all fields (columns) in a dataset.
hf_to_cb_dataset_migrator list-fields --path dataset
Flags:
- --path: Path or name of the dataset (required)
- --name: Name of the dataset configuration
- --data-files: Paths to source data files
- --download-config: Specific download configuration parameters
- --revision: Version of the dataset script to load
- --token: Hugging Face token for private datasets
- --split: Which split of the data to load
- --json-output: Output the fields in JSON format
- --debug: Enable debug output
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
Migrates data from Hugging Face to Couchbase.
hf_to_cb_dataset_migrator migrate \
--path dataset \
--id-fields id_field \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope \
--cb-collection my_collection
Flags:
- --path: Path or name of the dataset (required)
- --id-fields: Comma-separated list of field names to use as document ID (required)
- --cb-url: Couchbase cluster URL (required)
- --cb-username: Couchbase username (required)
- --cb-password: Couchbase password (required)
- --cb-bucket: Couchbase bucket name (required)
- --cb-scope: Couchbase scope name (required)
- --cb-collection: Couchbase collection name
- --name: Configuration name of the dataset
- --data-files: Path(s) to source data file(s)
- --split: Which split of the data to load
- --cache-dir: Cache directory for datasets
- --download-config: Specific download configuration parameters
- --download-mode: Download mode (reuse_dataset_if_exists or force_redownload)
- --verification-mode: Verification mode (no_checks, basic_checks, or all_checks)
- --keep-in-memory: Keep dataset in memory
- --save-infos: Save dataset information
- --revision: Version of the dataset script to load
- --token: Authentication token for private datasets
- --no-streaming: Disable streaming mode
- --num-proc: Number of processes to use
- --storage-options: Storage options for remote filesystems
- --trust-remote-code: Allow loading arbitrary code from the dataset repository
- --cb-batch-size: Number of documents to insert per batch (default: 1000)
- --debug: Enable debug output
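The --id-fields and --cb-batch-size flags suggest that each row receives a document ID derived from the chosen fields and that rows are written in fixed-size batches. The following is a minimal sketch of that idea, not the tool's actual implementation: the `_` separator, the helper names `make_doc_id` and `batched`, and the sample rows are all assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def make_doc_id(row: Dict[str, Any], id_fields: List[str], sep: str = "_") -> str:
    """Join the values of the chosen ID fields into one document key.

    The separator is an assumption; the real tool may build keys differently.
    """
    missing = [f for f in id_fields if f not in row]
    if missing:
        raise KeyError(f"row is missing ID field(s): {missing}")
    return sep.join(str(row[f]) for f in id_fields)


def batched(rows: Iterable[Dict[str, Any]], batch_size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size rows without materializing the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample rows standing in for a streamed dataset.
rows = ({"field1": "a", "field2": i, "text": f"row {i}"} for i in range(5))
for batch in batched(rows, batch_size=2):
    # Map each document key to its row, as a bulk write would need.
    docs = {make_doc_id(r, ["field1", "field2"]): r for r in batch}
    print(sorted(docs))
```

Pulling rows through a generator keeps memory use bounded by the batch size, which fits the streaming mode the tool advertises.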
- List configurations for a public dataset:
hf_to_cb_dataset_migrator list-configs --path dataset
- List configurations for a private dataset:
hf_to_cb_dataset_migrator list-configs --path my-dataset --token YOUR_HF_TOKEN
- List splits for a dataset with a specific configuration:
hf_to_cb_dataset_migrator list-splits --path dataset --name config-name
- List fields in JSON format:
hf_to_cb_dataset_migrator list-fields --path dataset --json-output
- List fields for a specific split:
hf_to_cb_dataset_migrator list-fields --path dataset --split train
- List fields with download configuration:
hf_to_cb_dataset_migrator list-fields \
--path dataset \
--download-config '{"force_download": true}' \
--trust-remote-code
- Migrate a dataset with multiple ID fields:
hf_to_cb_dataset_migrator migrate \
--path dataset \
--id-fields field1,field2 \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope \
--cb-collection my_collection
- Migrate a specific split with streaming enabled:
hf_to_cb_dataset_migrator migrate \
--path dataset \
--split train \
--id-fields id_field \
--cb-url couchbase://localhost \
--cb-username user \
--cb-password pass \
--cb-bucket my_bucket \
--cb-scope my_scope
The CLI will exit with a non-zero status code if an error occurs during execution. Error messages will be displayed on stderr.
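Because failures are signalled through the exit status and stderr, a calling script can detect them without parsing normal output. The sketch below uses the standard `subprocess` module; `run_cli` is a hypothetical helper, and the demonstration runs a stand-in Python command rather than the real CLI.

```python
import subprocess
import sys
from typing import List, Tuple


def run_cli(argv: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in command that fails the way described above:
# non-zero exit status, message written to stderr.
fake_failure = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom\\n'); sys.exit(2)",
]
code, out, err = run_cli(fake_failure)
if code != 0:
    print(f"command failed with status {code}: {err.strip()}")

# With the tool installed, a real call might look like:
# run_cli(["hf_to_cb_dataset_migrator", "list-splits", "--path", "dataset"])
```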
- Use the --debug flag with any command to enable debug-level logging
- JSON output options are available for machine-readable output
- Progress information is displayed during migration
- For private Hugging Face datasets, use the --token option
- Couchbase credentials are required for migration operations
- Credentials can be provided via command-line options
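Since credentials are passed as command-line options, a wrapper script can read them from the environment and assemble the command itself, so they need not be typed into the shell. This is a sketch under stated assumptions: the variable names CB_URL, CB_USERNAME, and CB_PASSWORD and the helper `build_migrate_argv` are hypothetical and not part of the tool; only the flags come from the migrate command documented above.

```python
import os
from typing import List, Mapping


def build_migrate_argv(
    path: str,
    id_fields: List[str],
    bucket: str,
    scope: str,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Assemble a migrate command line, taking Couchbase credentials from env."""
    required = ("CB_URL", "CB_USERNAME", "CB_PASSWORD")  # hypothetical names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variable(s): {', '.join(missing)}")
    return [
        "hf_to_cb_dataset_migrator", "migrate",
        "--path", path,
        "--id-fields", ",".join(id_fields),
        "--cb-url", env["CB_URL"],
        "--cb-username", env["CB_USERNAME"],
        "--cb-password", env["CB_PASSWORD"],
        "--cb-bucket", bucket,
        "--cb-scope", scope,
    ]


# Example with an explicit mapping instead of the real environment.
demo_env = {
    "CB_URL": "couchbase://localhost",
    "CB_USERNAME": "user",
    "CB_PASSWORD": "pass",
}
argv = build_migrate_argv("dataset", ["field1", "field2"], "my_bucket", "my_scope", demo_env)
print(" ".join(argv))
```

The resulting list can be handed directly to `subprocess.run`, which avoids shell quoting issues with passwords that contain special characters.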
We truly appreciate your interest in this project!
This project is community-maintained, which means it's not officially supported by our support team.
If you need help, have found a bug, or want to contribute improvements, the best place to do that is right here — by opening a GitHub issue.
Our support portal is unable to assist with requests related to this project, so we kindly ask that all inquiries stay within GitHub.
Your collaboration helps us all move forward together — thank you!