@@ -19,7 +19,7 @@ The Shared Sharded dataset is based on the [mlfoundations/dclm-baseline-1.0-parq
 For the fastest training, our optimized version includes:
 
 - Pretokenized numpy arrays in .npy files
-- Array slicing provided via .bin files
+- Sample ID arrays provided via .npy files
 
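Since both the token shards and the sample ID files are plain numpy arrays, a downloaded shard pair can be spot-checked locally. This is only an illustrative sketch: the file names are taken from the copy commands further down, and the printed dtype/shape are whatever the files contain, not documented guarantees.

```bash
# Hypothetical spot check on a downloaded shard pair (requires Python + numpy);
# mmap_mode avoids loading the full array into memory
python -c "import numpy as np; a = np.load('tokenized/train_000000.npy', mmap_mode='r'); print(a.dtype, a.shape)"
python -c "import numpy as np; s = np.load('tokenized/sample_ids_000000.npy', mmap_mode='r'); print(s.dtype, s.shape)"
```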
 ## System Requirements
 
@@ -43,9 +43,9 @@ Append the following env keys:
 
 ```bash
 R2_DATASET_ACCOUNT_ID=8af7f92a8a0661cf7f1ac0420c932980
-R2_DATASET_BUCKET_NAME=gemma-migration
-R2_DATASET_READ_ACCESS_KEY_ID=a733fac6c32a549e0d48f9f7cf67d758
-R2_DATASET_READ_SECRET_ACCESS_KEY=f50cab456587f015ad21c48c3e23c7ff0e6f1ad5a22c814c3a50d1a4b7c76bb9
+R2_DATASET_BUCKET_NAME=mixed-dataset-migration
+R2_DATASET_READ_ACCESS_KEY_ID=e70cd26850f697479bbb5fd9413713f4
+R2_DATASET_READ_SECRET_ACCESS_KEY=11e3364d6ef70e44d671863fb6de32d474aa6220fa2c9c3df45c5e012ebfbda3
 DATASET_BINS_PATH="tokenized/"
 ```
 
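Before wiring these keys into training, it can be worth confirming they actually grant read access. One way to do that (an optional check, assuming the AWS CLI is installed; not part of the documented setup) is to list the tokenized/ prefix through R2's S3-compatible endpoint:

```bash
# Optional sanity check: list the source prefix with the read-only keys
# via the S3-compatible endpoint (region "auto" is what R2 expects)
AWS_ACCESS_KEY_ID=e70cd26850f697479bbb5fd9413713f4 \
AWS_SECRET_ACCESS_KEY=11e3364d6ef70e44d671863fb6de32d474aa6220fa2c9c3df45c5e012ebfbda3 \
AWS_DEFAULT_REGION=auto \
aws s3 ls s3://mixed-dataset-migration/tokenized/ \
  --endpoint-url https://8af7f92a8a0661cf7f1ac0420c932980.r2.cloudflarestorage.com
```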
@@ -92,16 +92,16 @@ Use the CloudFlare migration tool for the easiest setup. Here are the key-value
 
 - Bucket Information
   `Source bucket provider`: `S3-Compatible Storage`
-  `Bucket name`: `gemma-migration`
-  `S3-compatible endpoint URL`: `https://8af7f92a8a0661cf7f1ac0420c932980.r2.cloudflarestorage.com/gemma-migration`
+  `Bucket name`: `mixed-dataset-migration`
+  `S3-compatible endpoint URL`: `https://8af7f92a8a0661cf7f1ac0420c932980.r2.cloudflarestorage.com/mixed-dataset-migration`
 - Required Credentials
-  `Access Key ID`: `a733fac6c32a549e0d48f9f7cf67d758`
-  `Secret Access Key`: `f50cab456587f015ad21c48c3e23c7ff0e6f1ad5a22c814c3a50d1a4b7c76bb9`
+  `Access Key ID`: `e70cd26850f697479bbb5fd9413713f4`
+  `Secret Access Key`: `11e3364d6ef70e44d671863fb6de32d474aa6220fa2c9c3df45c5e012ebfbda3`
 
 #### Page 2
 
 - Select destination R2 bucket
-  `Bucket name`: `gemma-migration`
+  `Bucket name`: `mixed-dataset-migration`
   `Access Key ID`: your_write_id
   `Access Key`: your_secret_write_id
   `Overwrite files?`: `Yes, overwrite (recommended)`
@@ -122,8 +122,8 @@ curl https://rclone.org/install.sh | sudo bash
 # Configure source (read-only)
 rclone config create r2-source s3 \
   provider=Cloudflare \
-  access_key_id=a733fac6c32a549e0d48f9f7cf67d758 \
-  secret_access_key=f50cab456587f015ad21c48c3e23c7ff0e6f1ad5a22c814c3a50d1a4b7c76bb9 \
+  access_key_id=e70cd26850f697479bbb5fd9413713f4 \
+  secret_access_key=11e3364d6ef70e44d671863fb6de32d474aa6220fa2c9c3df45c5e012ebfbda3 \
   endpoint=https://8af7f92a8a0661cf7f1ac0420c932980.r2.cloudflarestorage.com \
   acl=private
 
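Before kicking off a full copy, an optional preflight (not part of the original instructions) is to confirm the source remote resolves and to see how much data will move:

```bash
# Optional preflight: list the source bucket and report total size/object count
rclone lsd r2-source:mixed-dataset-migration
rclone size r2-source:mixed-dataset-migration/tokenized/
```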
@@ -139,7 +139,7 @@ rclone config create r2-dest s3 \
 ##### Copy all shards (Full Migration)
 ```bash
 # Copy entire tokenized directory (all shards and sample IDs)
-rclone copy r2-source:gemma-migration/tokenized/ r2-dest:<your-bucket-name>/tokenized/ \
+rclone copy r2-source:mixed-dataset-migration/tokenized/ r2-dest:<your-bucket-name>/tokenized/ \
   --transfers 32 \
   --checkers 16 \
   --progress
@@ -149,10 +149,10 @@ rclone copy r2-source:gemma-migration/tokenized/ r2-dest:<your-bucket-name>/toke
 If you want to test with just the first two shards:
 ```bash
 # Copy first two training shards and their sample IDs
-rclone copy r2-source:gemma-migration/tokenized/train_000000.npy r2-dest:<your-bucket-name>/tokenized/ --progress
-rclone copy r2-source:gemma-migration/tokenized/train_000001.npy r2-dest:<your-bucket-name>/tokenized/ --progress
-rclone copy r2-source:gemma-migration/tokenized/sample_ids_000000.bin r2-dest:<your-bucket-name>/tokenized/ --progress
-rclone copy r2-source:gemma-migration/tokenized/sample_ids_000001.bin r2-dest:<your-bucket-name>/tokenized/ --progress
+rclone copy r2-source:mixed-dataset-migration/tokenized/train_000000.npy r2-dest:<your-bucket-name>/tokenized/ --progress
+rclone copy r2-source:mixed-dataset-migration/tokenized/train_000001.npy r2-dest:<your-bucket-name>/tokenized/ --progress
+rclone copy r2-source:mixed-dataset-migration/tokenized/sample_ids_000000.npy r2-dest:<your-bucket-name>/tokenized/ --progress
+rclone copy r2-source:mixed-dataset-migration/tokenized/sample_ids_000001.npy r2-dest:<your-bucket-name>/tokenized/ --progress
 ```
 
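Whichever copy you ran, an optional one-way check (substitute your destination bucket name) can confirm every source object made it across:

```bash
# Optional: verify that everything under tokenized/ exists identically at the destination
rclone check r2-source:mixed-dataset-migration/tokenized/ r2-dest:<your-bucket-name>/tokenized/ --one-way
```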
 After migration, update your environment variables to point to your bucket: