slow dataset loading #1559

Open
@assaftibm

Description

Loading a dataset from the Hugging Face Hub through Unitxt seems to be much slower than loading it directly with the datasets package.

Compare this:

from datasets import load_dataset
from time import time
import os
from uuid import uuid4

# Use a unique cache directory so the dataset is downloaded from scratch.
path = os.path.join("cache", str(uuid4()))

t0 = time()
ds = load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
t1 = time()

print(t1-t0)

print(len(ds))

To:

from time import time

from unitxt import load_dataset


t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()

print(t1-t0)

print(len(ds))

The Unitxt version takes about 5 times longer.

In both cases, a fresh copy is downloaded.
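To make the comparison easier to repeat, here is a minimal timing sketch. The helper names `time_loader` and `slowdown` are hypothetical (not part of Unitxt or datasets); they only wrap the same wall-clock measurement used above, run it a few times, and report the ratio of the medians.

```python
from statistics import median
from time import perf_counter
from typing import Any, Callable, Dict, List


def time_loader(loader: Callable[[], Any], repeats: int = 3) -> Dict[str, Any]:
    """Call `loader` `repeats` times and collect wall-clock durations in seconds.

    `loader` is any zero-argument callable, e.g. a lambda wrapping a
    load_dataset call. Hypothetical helper, not a Unitxt or datasets API.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations: List[float] = []
    for _ in range(repeats):
        start = perf_counter()
        loader()
        durations.append(perf_counter() - start)
    return {"runs": durations, "median": median(durations), "min": min(durations)}


def slowdown(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> float:
    """Ratio of median durations; a value above 1 means `candidate` is slower."""
    return candidate["median"] / baseline["median"]
```

The two snippets above could then be passed in as `lambda: datasets.load_dataset("PrimeQA/clapnq_passages", cache_dir=path)` and `lambda: unitxt.load_dataset('card=cards.rag.documents.clap_nq.en')`. Note that repeated runs only include the download if each call really starts from an empty cache, so for a cold-cache measurement `repeats=1` with a fresh cache directory is probably the safer choice.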
