Closed
Description
Hello, I have been following the LLM Pretraining README to download the preprocessed dataset for use with Llama 3.1 405B. However, the commands it provides fail with "directory not found" errors.
$ rclone copy mlc-training:mlcommons-training-wg-public/common/datasets/c4/mixtral_8x22b_preprocessed $PREPROCESSED_PATH -P
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred: 0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors: 1 (retrying may help)
Elapsed time: 0.4s
2025/09/15 16:28:19 Failed to copy: directory not found
This error occurs even when using the provided access key ID, secret access key, and endpoint from the README.
$ rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
Remote config
--------------------
[mlc-training]
provider=Cloudflare = access_key_id=76ea42eadb867e854061a1806220ee1e
secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
--------------------
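One thing that stands out in the config printed above is that `provider=Cloudflare` appears to the left of ` = `, as if two `key=value` arguments were merged into a single key and value. Whether this is related to the error is only an assumption on my part, and the README does not mention it. A minimal sketch for flagging such lines, using a hypothetical helper named `find_merged_keys` that reads a config dump from stdin (the value `EXAMPLE` is a placeholder):

```shell
# Hypothetical helper (not part of rclone or the README): print config
# lines whose key, i.e. the text before " = ", itself contains "=".
# Such lines suggest two key=value arguments were stored as one pair.
find_merged_keys() {
    awk -F' = ' 'NF >= 2 && index($1, "=") > 0 { print }'
}

# Example input shaped like the section shown above.
find_merged_keys <<'EOF'
[mlc-training]
provider=Cloudflare = access_key_id=EXAMPLE
type = s3
EOF
```

The same check can be run against the stored remote with `rclone config show mlc-training | find_merged_keys`. If any lines are flagged, the remote may not have the intended provider, credentials, or endpoint set.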
Is there a newer or correct way to download the preprocessed dataset?