Description
❓ Questions & Help
I am experiencing a severe performance degradation in the `Loader` when adding an `EmbeddingOperator` transform that loads data from a pretrained embeddings numpy array.
I have been following the method described in this tutorial notebook.
Without the `transforms` argument the entire dataset is consumed in 6 seconds, while with the lookup from the pretrained embeddings array it takes almost 40 minutes!
My "validation.parquet" is a small NVTabular dataset with 16 partitions, totalling almost 200 MB.
Specifically, with the `transforms` enabled I am seeing very low CPU and GPU utilization, as well as close to zero GPU memory consumption. Neither the CPU nor the GPU is utilized above 6%.
It seems very strange to me that simply reading `batch_size` rows from a numpy array takes that much time, even accounting for moving them to the GPU.
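For comparison, a standalone gather of 4096 rows from an array of this size should take on the order of microseconds on the host. Here is a rough micro-benchmark (my own sketch, using the same shapes as the example below) suggesting the raw lookup cannot explain the slowdown:

```python
import time

import numpy as np

pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)
ids = np.random.randint(0, 1_000_000, size=4096)

n_iters = 1_000
start = time.perf_counter()
for _ in range(n_iters):
    _ = pretrained_array[ids]  # gather 4096 rows, as one batch lookup would
elapsed = time.perf_counter() - start
print(f"~{elapsed / n_iters * 1e6:.1f} µs per 4096-row lookup")
```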
Details
Here is a minimal working example to reproduce this degradation.
```python
from __future__ import annotations

from pathlib import Path

import numpy as np
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm


def test_pretrained_loader():
    data_path = Path("validation.parquet")
    X = Dataset(data_path, engine="parquet")

    # Dummy pretrained embedding table: 1M rows of 2-dim embeddings.
    pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

    loader = Loader(
        X,
        batch_size=4096,
        shuffle=True,
        transforms=[
            EmbeddingOperator(
                pretrained_array,
                lookup_key="recruitment_id",
                embedding_name="embeddings",
            )
        ],
        device="gpu",
    )

    # Simply consume the batches; no model involved.
    for batch in tqdm(loader, desc="Iterating batches..."):
        pass


if __name__ == "__main__":
    test_pretrained_loader()
```
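As an isolation test (my own sketch, not the library's intended usage; it reuses `X`, `pretrained_array`, `Loader` and `tqdm` from the example above and assumes `recruitment_id` is returned as a dense integer tensor), the transform can be disabled and the gather done manually on each batch, to check whether the lookup itself accounts for the time:

```python
import tensorflow as tf

# Move the embedding table into a TF constant once (~8 MB), outside the loop.
embedding_table = tf.constant(pretrained_array)

# Same dataset and batch size, but no transforms.
plain_loader = Loader(X, batch_size=4096, shuffle=True, device="gpu")

for features, _ in tqdm(plain_loader, desc="Manual lookup..."):
    # Assumes the loader yields (features, target) tuples and that
    # "recruitment_id" is a dense integer tensor of shape (batch_size, 1).
    ids = tf.reshape(tf.cast(features["recruitment_id"], tf.int32), [-1])
    embeddings = tf.gather(embedding_table, ids)  # (batch_size, 2) embeddings
```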
Question
Is this behaviour intended? What are possible bottlenecks for this? Is something like data prefetching or asynchronous loading applicable here?