Corrupted files when downloading with azcopy #331

@mwidmoser

I was trying to download the MSTuring-1B dataset using the create_dataset.py script, which runs Azure's azcopy tool. The download is quick, and the resulting base1b.fbin file has the expected size of 4 × 100 × 10^9 + 8 bytes; however, its contents are corrupted. I noticed this by slicing the file (keeping only the first 1M vectors and adjusting the header) and diffing the result against the 1M slice, which can also be downloaded with create_dataset.py (that download uses urlopen, not azcopy).
Everything is fine when I download the entire dataset using other tools such as wget.
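
For anyone who wants to repeat the comparison without writing a sliced copy of the 400 GB file, here is a minimal sketch. It assumes the .fbin layout is an 8-byte header (uint32 point count, then uint32 dimension, little-endian) followed by float32 vectors, which matches the file size given above. The function names are my own, not part of create_dataset.py.

```python
import struct

HEADER_BYTES = 8  # assumed layout: uint32 point count + uint32 dimension


def read_header(path):
    """Return (num_points, dim) from the first 8 bytes of an .fbin file."""
    with open(path, "rb") as f:
        return struct.unpack("<II", f.read(HEADER_BYTES))


def first_mismatch(full_path, slice_path, chunk_bytes=1 << 20):
    """Compare the slice's payload with the leading payload of the full file.

    Returns the payload byte offset of the first difference, or None if the
    slice is byte-identical to a prefix of the full file's payload.
    """
    n_full, d_full = read_header(full_path)
    n_slice, d_slice = read_header(slice_path)
    if d_full != d_slice or n_slice > n_full:
        raise ValueError("headers are incompatible")

    remaining = n_slice * d_slice * 4  # float32 payload size of the slice
    offset = 0
    with open(full_path, "rb") as a, open(slice_path, "rb") as b:
        a.seek(HEADER_BYTES)
        b.seek(HEADER_BYTES)
        while remaining > 0:
            want = min(chunk_bytes, remaining)
            x, y = a.read(want), b.read(want)
            if x != y or len(x) < want:
                # Locate the first differing byte inside this chunk.
                for i, (p, q) in enumerate(zip(x, y)):
                    if p != q:
                        return offset + i
                # No differing byte found: at least one file is truncated.
                return offset + min(len(x), len(y))
            offset += want
            remaining -= want
    return None
```

Calling `first_mismatch("base1b.fbin", "<1M slice file>")` should return `None` for an intact download; any integer result is the payload offset where the two files start to diverge.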

In fact, there seems to be a known issue when using azcopy with a static website endpoint (cf. Azure/azure-storage-azcopy#2836 in the azure-storage-azcopy GitHub repository).

According to that issue, a blob endpoint (https://.blob.core.windows.net) should be used instead of a static website endpoint (https://.web.core.windows.net). Is it possible to adjust the links? Moreover, adding a checksum check to verify data integrity would be helpful (it took quite a while to find out that there is an issue with the data). I suppose data corruption could also occur when downloading other large datasets with azcopy (e.g., MSSPACEV-1B), but I did not check.
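
To make the checksum suggestion concrete, here is a rough sketch of a post-download check. It assumes a reference SHA-256 digest would be published alongside each dataset, which as far as I can tell is not the case today, so `expected_sha256` is a hypothetical input. The helper names are my own and are not part of create_dataset.py.

```python
import hashlib


def file_sha256(path, chunk_bytes=1 << 20):
    """Stream the file through SHA-256 so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file's digest differs from the expected one.

    expected_sha256 is a hypothetical published reference value.
    """
    actual = file_sha256(path)
    if actual.lower() != expected_sha256.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return actual
```

Running `verify_download` right after the azcopy step would turn silent corruption into an immediate error, independent of which download tool was used.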

I noticed that on the website (https://big-ann-benchmarks.com/neurips21.html), the link to the MSTuring dataset is a blob endpoint link, but unfortunately, the blob is no longer publicly available.
