Description
I was trying to download the MSTuring-1B dataset using the `create_dataset.py` script, which runs Azure's `azcopy` tool. The download is quick, and the `base1b.fbin` file has the expected size of 4 x 100 x 10^9 + 8 bytes; however, the file is corrupted. I noticed this by slicing the dataset (taking only the first 1M vectors and adjusting the header) and diff-ing the slice against the official 1M subset, which can also be downloaded with the `create_dataset.py` script (but via `urlopen`, not `azcopy`).
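For reference, the slicing step was essentially the following (a minimal sketch, assuming the standard `.fbin` layout: a uint32 point count, a uint32 dimension count, then row-major float32 data; file names are illustrative):

```python
import struct

DIMS = 100            # MSTuring vectors are 100-dimensional float32
N_SLICE = 1_000_000   # keep only the first 1M vectors

with open("base1b.fbin", "rb") as src, open("base1b_first1m.fbin", "wb") as dst:
    n_points, n_dims = struct.unpack("<II", src.read(8))
    assert n_dims == DIMS, f"unexpected dimension count: {n_dims}"
    # Write an adjusted header for the slice, then copy the raw vector data.
    dst.write(struct.pack("<II", N_SLICE, n_dims))
    remaining = N_SLICE * n_dims * 4
    while remaining > 0:
        chunk = src.read(min(1 << 20, remaining))
        if not chunk:
            raise IOError("unexpected end of file while copying vectors")
        dst.write(chunk)
        remaining -= len(chunk)
```

Diff-ing the resulting slice against the 1M file fetched via `urlopen` (e.g., with `cmp`) shows that the `azcopy` download differs.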
Everything is fine when I download the entire dataset using other tools such as `wget`.
In fact, there seems to be a known issue when using `azcopy` with a static website endpoint (cf. the issue in the azure-storage-azcopy GitHub repository: Azure/azure-storage-azcopy#2836).
According to that issue, a blob endpoint (https://&lt;account&gt;.blob.core.windows.net) should be used instead of a static website endpoint (https://&lt;account&gt;.web.core.windows.net). Is it possible to adjust the links? I suppose the same corruption could also occur when downloading other large datasets with `azcopy` (e.g., MSSPACEV-1B), but I did not check. Moreover, including a checksum check to verify data integrity would be helpful; it took quite a while to figure out that the data itself was at fault.
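Such a verification step could be as simple as the sketch below; the expected hash is a placeholder, since checksums would first have to be published alongside the datasets:

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so a ~400 GB download never has to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Placeholder -- an official checksum would have to be published with the dataset.
EXPECTED_MD5 = "<published checksum>"

actual = file_md5("base1b.fbin")
if actual != EXPECTED_MD5:
    raise RuntimeError(f"base1b.fbin checksum mismatch: got {actual}")
```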
I noticed that on the website (https://big-ann-benchmarks.com/neurips21.html), the link to the MSTuring dataset does use a blob endpoint, but unfortunately, that blob is no longer publicly available.