
Possible Data Corruption in Criteo 1 TB Click Logs Dataset #801

@IFOnlyC

Description


I’m preparing the dataset for the MLPerf Training v5.0 - DLRM v2 benchmark and have encountered repeated errors when processing the Criteo 1 TB Click Logs dataset hosted on Hugging Face.

After downloading all 24 daily archives, I invoke the preprocessing script as documented:

bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir

For every file (day_0.gz through day_23.gz), the script halts with:

gzip: day_.gz: unexpected end of file
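
Each archive can also be tested directly with gzip's built-in integrity check, independent of the preprocessing script (a minimal sketch; the paths follow the invocation above):

for day in $(seq 0 23); do
    # gzip -t decompresses to /dev/null and reports truncation or CRC errors
    gzip -t "./criteo_1tb/raw_input_dataset_dir/day_${day}.gz" \
        || echo "day_${day}.gz failed integrity test"
done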

This suggests each .gz archive may be truncated or otherwise corrupted on download. To rule out a problem on my end, I have already tried the following (a size-comparison check is sketched after this list):

  • Multiple download methods (Hugging Face CLI, wget, browser) on different machines and networks, with the same result each time.
  • Re-downloading each individual day's .gz archive and re-running the script, which yields the same error for every file.
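
One way to distinguish a bad download from a bad file on the host is to compare each local archive's byte size against the Content-Length the server reports. A rough sketch is below; DATASET_BASE_URL is a placeholder (not taken from this report) for the dataset's actual download URL, and stat -c %s assumes GNU coreutils:

DATASET_BASE_URL="<dataset download URL>"   # placeholder, fill in the real Hugging Face URL
for day in $(seq 0 23); do
    # size of the file on disk
    local_size=$(stat -c %s "./criteo_1tb/raw_input_dataset_dir/day_${day}.gz")
    # Content-Length of the final response after following redirects
    remote_size=$(curl -sIL "${DATASET_BASE_URL}/day_${day}.gz" \
        | grep -i '^content-length' | tail -1 | tr -d '\r' | awk '{print $2}')
    echo "day_${day}.gz local=${local_size} remote=${remote_size}"
done

If the sizes already disagree, the downloads are being cut short; if they match and gzip still reports an unexpected end of file, the hosted archives themselves are likely damaged.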
