Closed
Description
I’m preparing the dataset for the MLPerf Training v5.0 - DLRM v2 benchmark and have encountered repeated errors when processing the Criteo 1 TB Click Logs dataset hosted on Hugging Face.
After downloading all 24 daily archives, I invoke the preprocessing script as documented:
bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
./criteo_1tb/raw_input_dataset_dir \
./criteo_1tb/temp_intermediate_files_dir \
./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
For every file (day_0.gz through day_23.gz), the script halts with:
gzip: day_.gz: unexpected end of file
This suggests each .gz archive may be truncated or otherwise corrupted on download. So far I have tried:
- Multiple download methods (Hugging Face CLI, wget, browser) on different machines and networks, all with the same result.
- Re-downloading each daily .gz archive individually and re-running the script, which yields the same error.
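To separate a bad download from a problem in the preprocessing script, each archive can be checked on its own before the script runs. The following is a minimal sketch, not part of the MLPerf scripts: the helper names `gzip_is_intact` and `find_broken_archives` are hypothetical, and the `day_*.gz` pattern is assumed from the file names mentioned above. It streams every archive through Python's standard `gzip` module and reports the ones that cannot be read to the end.

```python
import gzip
import zlib
from pathlib import Path


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses to the end.

    Hypothetical helper for illustration. A truncated archive raises
    EOFError while reading; a damaged header or checksum raises an
    OSError subclass or zlib.error.
    """
    try:
        # An empty file reads back as empty without raising, so reject it here.
        if Path(path).stat().st_size == 0:
            return False
        with gzip.open(path, "rb") as fh:
            # Read and discard in chunks so memory use stays flat.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # OSError also covers a missing or unreadable file.
        return False
    return True


def find_broken_archives(raw_dir, pattern="day_*.gz"):
    """Return the names of archives in `raw_dir` that fail the check."""
    return [
        p.name
        for p in sorted(Path(raw_dir).glob(pattern))
        if not gzip_is_intact(p)
    ]


if __name__ == "__main__":
    # Directory taken from the command above; adjust to the local layout.
    for name in find_broken_archives("./criteo_1tb/raw_input_dataset_dir"):
        print(f"broken: {name}")
```

Each archive is decompressed in full, so a complete pass over all 24 files will take a while. If every archive is still reported as broken after a fresh download, the files are likely already damaged before the preprocessing script touches them.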