Update example scripts relying on `http_get` to download from the Hugging Face Hub instead #3620

Hello!

## Feature Request overview
* Many example scripts download their data with `http_get`, even though the same data can be loaded more simply with `datasets`.

## Details
Many example scripts and some tests rely on `http_get` to download files such as `https://sbert.net/datasets/stsbenchmark.tsv.gz`, `https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz`, askubuntu, TREC, etc., even though this data is often also readily available on the Hugging Face Hub. We should be able to simplify many of these scripts considerably with `datasets` (and perhaps also `Dataset.map`, `Dataset.filter`, etc.).

For example, this test still downloads its data with `http_get`: https://github.com/huggingface/sentence-transformers/blob/5bd3e612d5de1cf41180b14e6eecd11a7d497be4/tests/test_train_stsb.py#L35-L37

Here we can follow the steps I already took in https://github.com/huggingface/sentence-transformers/pull/2622/commits/548e4637614b2a431f7fea284987ed4cb375c026

to update these scripts, as was done here: https://github.com/huggingface/sentence-transformers/blob/5bd3e612d5de1cf41180b14e6eecd11a7d497be4/examples/sentence_transformer/training/sts/training_stsbenchmark.py#L40-L43

- Tom Aarsen

	# 2. Load the STSB dataset: https://huggingface.co/datasets/sentence-transformers/stsb
	train_dataset = load_dataset("sentence-transformers/stsb", split="train")
	eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
	test_dataset = load_dataset("sentence-transformers/stsb", split="test")
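
A minimal sketch of how a test or example script might wrap these three calls in one place. `load_stsb_splits` and its `loader` parameter are hypothetical names, not part of `sentence-transformers`. The `loader` hook is only there so the helper can be exercised without network access; by default it falls back to `datasets.load_dataset`.

```python
from typing import Any, Callable, Dict, Optional

STSB_REPO = "sentence-transformers/stsb"  # dataset id taken from the snippet above
STSB_SPLITS = ("train", "validation", "test")


def load_stsb_splits(loader: Optional[Callable[..., Any]] = None) -> Dict[str, Any]:
    """Return a dict mapping each split name to its loaded dataset.

    `loader` is a hypothetical injection point (assumption): any callable
    accepting (repo_id, split=...) works, which makes the helper easy to stub.
    """
    if loader is None:
        # Imported lazily so this module can be imported without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset
    return {split: loader(STSB_REPO, split=split) for split in STSB_SPLITS}
```

Calling `load_stsb_splits()` with no arguments would download (or reuse the cached copy of) all three splits from the Hub, replacing the manual `os.path.exists` check and `http_get` call.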


	# Current approach: download the gzipped TSV with http_get if it is not present locally
	sts_dataset_path = "datasets/stsbenchmark.tsv.gz"
	if not os.path.exists(sts_dataset_path):
	    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)
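
Once the data is loaded through `datasets`, per-row processing that scripts currently do by hand could be passed to `Dataset.map` and `Dataset.filter`. The sketch below assumes rows with a `score` column on a 0 to 5 scale and a `split` column; that layout is an assumption about the legacy TSV, not something confirmed in this issue. The function names and sample rows are made up for illustration. The functions are applied to plain dicts so the example runs without `datasets` installed.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def normalize_score(row: Row) -> Row:
    """Scale an assumed 0-5 similarity score to 0-1 (usable with Dataset.map)."""
    return {"score": float(row["score"]) / 5.0}


def is_dev_row(row: Row) -> bool:
    """Keep only rows marked as the dev split (usable with Dataset.filter)."""
    return row["split"] == "dev"


# Made-up rows in the assumed legacy layout.
sample_rows: List[Row] = [
    {"split": "train", "score": 5.0, "sentence1": "A man plays guitar.", "sentence2": "A man is playing a guitar."},
    {"split": "dev", "score": 2.5, "sentence1": "A dog runs.", "sentence2": "A cat sleeps."},
]

# With a datasets.Dataset `ds` holding such rows, the equivalent would be:
#     ds.filter(is_dev_row).map(normalize_score)
dev_rows = [{**row, **normalize_score(row)} for row in sample_rows if is_dev_row(row)]
print(dev_rows)
```

Keeping the row functions free of any I/O means the same logic can be unit-tested on dicts and reused unchanged on a Hub dataset.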
