
Possible Data Corruption in Criteo 1 TB Click Logs Dataset #801

@IFOnlyC

Description


I’m preparing the dataset for the MLPerf Training v5.0 - DLRM v2 benchmark and have encountered repeated errors when processing the Criteo 1 TB Click Logs dataset hosted on Hugging Face.

After downloading all 24 daily archives, I invoke the preprocessing script as documented:

bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir

For every file (day_0.gz through day_23.gz), the script halts with:

gzip: day_.gz: unexpected end of file
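
Each archive can also be tested directly with gzip's built-in integrity check, independent of the preprocessing script (a minimal sketch; the paths follow the invocation above):

for day in $(seq 0 23); do
    # gzip -t decompresses to /dev/null and reports truncation or CRC errors
    gzip -t "./criteo_1tb/raw_input_dataset_dir/day_${day}.gz" \
        || echo "day_${day}.gz failed integrity test"
done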

This suggests each .gz archive may be truncated or otherwise corrupted on download. To rule out a problem on my end, I have already tried the following (a size-comparison check is sketched after this list):

  • Multiple download methods (Hugging Face CLI, wget, browser) on different machines and networks, with the same result each time.
  • Re-downloading each individual day's .gz archive and re-running the script, which yields the same error for every file.
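
One way to distinguish a bad download from a bad file on the host is to compare each local archive's byte size against the Content-Length the server reports. A rough sketch is below; DATASET_BASE_URL is a placeholder (not taken from this report) for the dataset's actual download URL, and stat -c %s assumes GNU coreutils:

DATASET_BASE_URL="<dataset download URL>"   # placeholder, fill in the real Hugging Face URL
for day in $(seq 0 23); do
    # size of the file on disk
    local_size=$(stat -c %s "./criteo_1tb/raw_input_dataset_dir/day_${day}.gz")
    # Content-Length of the final response after following redirects
    remote_size=$(curl -sIL "${DATASET_BASE_URL}/day_${day}.gz" \
        | grep -i '^content-length' | tail -1 | tr -d '\r' | awk '{print $2}')
    echo "day_${day}.gz local=${local_size} remote=${remote_size}"
done

If the sizes already disagree, the downloads are being cut short; if they match and gzip still reports an unexpected end of file, the hosted archives themselves are likely damaged.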
