slow dataset loading #1559

Closed as not planned
@assaftibm

Description

Loading a dataset from the Hugging Face Hub through Unitxt seems to be much slower than loading it directly with the datasets package.

Compare this:

import os
from time import time
from uuid import uuid4

from datasets import load_dataset

# Use a unique cache directory so the dataset is downloaded from scratch.
path = os.path.join("cache", str(uuid4()))

t0 = time()
ds = load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
t1 = time()

print(t1 - t0)  # elapsed seconds

print(len(ds))

To:

from time import time

from unitxt import load_dataset

t0 = time()
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
t1 = time()

print(t1 - t0)  # elapsed seconds

print(len(ds))

The Unitxt version takes about 5 times longer.

In both cases, a fresh copy of the dataset is downloaded.
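For anyone trying to reproduce the comparison, here is a minimal sketch of a timing helper that could wrap both calls so they are measured the same way. The names `time_call` and `slowdown` are hypothetical helpers, not part of Unitxt or datasets, and the workloads in the demo are offline stand-ins rather than the real loaders.

```python
from time import perf_counter


def time_call(fn, *args, **kwargs):
    """Run fn once and return (elapsed_seconds, result)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return perf_counter() - start, result


def slowdown(baseline_seconds, candidate_seconds):
    """Return how many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


if __name__ == "__main__":
    # Stand-in workloads so the sketch runs offline; replace them with the
    # two load_dataset calls from the snippets above to reproduce the report.
    fast_seconds, _ = time_call(sum, range(10_000))
    slow_seconds, _ = time_call(sum, range(1_000_000))
    print(f"slowdown: {slowdown(fast_seconds, slow_seconds):.1f}x")
```

Passing each loader and its arguments to `time_call` keeps the measurement code identical for both libraries, so any difference comes from the loaders themselves.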
