Llama 3.1 405B Preprocessed Dataset Downloading Issues #834

@harubaru

Hello, I have been following the LLM Pretraining README to download the preprocessed dataset for Llama 3.1 405B. However, the commands it provides fail with "directory not found" errors:

$ rclone copy mlc-training:mlcommons-training-wg-public/common/datasets/c4/mixtral_8x22b_preprocessed $PREPROCESSED_PATH -P
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 1/3 failed with 1 errors and: directory not found
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 2/3 failed with 1 errors and: directory not found
2025-09-15 16:28:19 ERROR : : error reading source directory: directory not found
2025-09-15 16:28:19 ERROR : Attempt 3/3 failed with 1 errors and: directory not found
Transferred:             0 / 0 Bytes, -, 0 Bytes/s, ETA -
Errors:                 1 (retrying may help)
Elapsed time:         0.4s
2025/09/15 16:28:19 Failed to copy: directory not found

This error occurs even when using the provided access key ID, secret access key, and endpoint from the README.

$ rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
Remote config
--------------------
[mlc-training]
provider=Cloudflare = access_key_id=76ea42eadb867e854061a1806220ee1e
secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 = endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
--------------------
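
One detail in the output above may be worth checking: the stored lines read `provider=Cloudflare = access_key_id=...`, which could mean this rclone build read the `key=value` arguments as alternating key and value tokens rather than splitting each one on `=`. That is only a guess, not a confirmed cause. The sketch below assumes `rclone config show` prints one `key = value` line per option; `find_malformed_keys` is a hypothetical helper, not part of rclone.

```shell
# Hypothetical helper (not an rclone command): reads an rclone remote
# section on stdin and prints every option name that itself contains "=",
# which suggests the key=value arguments were not split as intended.
find_malformed_keys() {
  # Fields are separated by " = "; [=] avoids awk's "/=" operator ambiguity.
  awk -F' = ' 'NF >= 2 && $1 ~ /[=]/ { print $1 }'
}

# Assumed usage against the remote configured above (requires rclone):
#   rclone config show mlc-training | find_malformed_keys
```

If the helper prints anything, the remote likely has no usable `provider`, `access_key_id`, `secret_access_key`, or `endpoint` option, and recreating it with separate key and value arguments may be worth trying.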

Is there an updated or corrected way to download the preprocessed dataset?
