Update example scripts relying on http_get to download from the Hugging Face Hub instead #3620

@tomaarsen

Hello!

Feature Request overview

  • Many example scripts download data with http_get, even though the same data can be loaded more simply with datasets

Details

Many example scripts and some tests rely on http_get to download files such as https://sbert.net/datasets/stsbenchmark.tsv.gz, https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz, askubuntu, TREC, etc., even though this data is often also easily accessible on Hugging Face. We should be able to simplify many of these scripts considerably with datasets (and perhaps also Dataset.map, Dataset.filter, etc.).

For example

sts_dataset_path = "datasets/stsbenchmark.tsv.gz"
if not os.path.exists(sts_dataset_path):
    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)

Instead, we can follow the steps I already took in 548e463 to update these:

# 2. Load the STSB dataset: https://huggingface.co/datasets/sentence-transformers/stsb
train_dataset = load_dataset("sentence-transformers/stsb", split="train")
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
test_dataset = load_dataset("sentence-transformers/stsb", split="test")
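
Where a script currently filters rows or rescales labels by hand after downloading a file, the Dataset.map/Dataset.filter route mentioned above might look roughly like the sketch below. The helper names (normalize_score, is_train, build_train_dataset), the column names ("split", "score", "sentence1", "sentence2") and the 0-5 score range are assumptions made for illustration, not the schema of any particular Hub dataset.

```python
# Sketch only: the column names and the 0-5 score range are assumed for
# illustration; check the actual dataset card before relying on them.


def normalize_score(row: dict) -> dict:
    """Rescale an assumed 0-5 similarity score to the 0-1 range."""
    return {"score": float(row["score"]) / 5.0}


def is_train(row: dict) -> bool:
    """Keep only rows that the assumed 'split' column marks as training data."""
    return row["split"] == "train"


def build_train_dataset(columns: dict):
    """Build an in-memory Dataset from column lists, then filter and rescale it."""
    # Imported lazily so the two helpers above stay dependency-free.
    from datasets import Dataset

    dataset = Dataset.from_dict(columns)
    dataset = dataset.filter(is_train)        # drop non-training rows
    dataset = dataset.map(normalize_score)    # overwrite "score" with the rescaled value
    return dataset.remove_columns("split")    # the split marker is no longer needed
```

With a small hypothetical input such as `{"split": ["train", "dev"], "score": [5.0, 2.5], "sentence1": ["a", "b"], "sentence2": ["c", "d"]}`, build_train_dataset would be expected to keep only the first row. The same two helpers could be passed to the result of load_dataset instead of an in-memory Dataset.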

  • Tom Aarsen
