It seems that loading a dataset from the Hugging Face Hub through Unitxt is much slower than loading it directly with the `datasets` package. Compare this:
```python
from time import time
import os
from uuid import uuid4

from datasets import load_dataset

# use a fresh cache dir so a new copy is downloaded every run
path = os.path.join("cache", str(uuid4()))

t0 = time()
ds = load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
t1 = time()
print(t1 - t0)
print(len(ds))
```
To:
```python
from time import time

from unitxt import load_dataset

t0 = time()
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
t1 = time()
print(t1 - t0)
print(len(ds))
```
The Unitxt version takes roughly 5× as long, even though both runs download a fresh copy of the data (no warm cache).
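For reproducing this, a slightly more robust way to time each call is `time.perf_counter` (monotonic, higher resolution than `time.time`). Below is a minimal, self-contained timing helper; the `timed(...)` name is my own, and the commented usage lines stand in for the two `load_dataset` calls above:

```python
from time import perf_counter
from contextlib import contextmanager

@contextmanager
def timed(label):
    # perf_counter is monotonic, so the delta is unaffected by clock changes
    start = perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {perf_counter() - start:.3f}s")

# Usage (wrap each loader from the snippets above):
# with timed("datasets"):
#     ds = load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
# with timed("unitxt"):
#     ds = load_dataset("card=cards.rag.documents.clap_nq.en")
```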