[QST] Data loader with EmbeddingOperator using pretrained embeddings is very slow #1244

@CarloNicolini

Description

❓ Questions & Help

I am seeing a large performance degradation in the Loader when I add an EmbeddingOperator transform that looks up rows from a pretrained-embeddings NumPy array.
I have been following the approach shown in this tutorial notebook.

Without the transforms argument the entire dataset is consumed in 6 seconds, whereas with the pretrained-embeddings lookup enabled it takes almost 40 minutes!
My "validation.parquet" is a small NVTabular dataset with 16 partitions, totalling almost 200 MB.
With the transform enabled I also see very low CPU and GPU utilization and close to zero GPU memory consumption: neither the CPU nor the GPU goes above 6%.
It seems very strange to me that simply gathering batch_size rows from a NumPy array takes this much time, even accounting for the transfer to the GPU.
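
For scale, the raw gather itself can be timed in isolation with nothing but NumPy. This is a minimal sketch; the array shape, batch size and iteration count simply mirror the example below:

import time

import numpy as np

# Stand-ins mirroring the reproduction example below.
table = np.zeros((1_000_000, 2), dtype=np.float32)
ids = np.random.randint(0, 1_000_000, size=4096)

start = time.perf_counter()
for _ in range(1_000):
    _ = table[ids]  # the per-batch host-side gather the lookup boils down to
elapsed = time.perf_counter() - start
print(f"{elapsed / 1_000 * 1e6:.1f} µs per 4096-row gather")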

Details

Here is a minimal working example to reproduce this degradation.

from __future__ import annotations

from pathlib import Path

import numpy as np
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm


def test_pretrained_loader():
    # A small NVTabular parquet dataset (16 partitions, ~200 MB).
    data_path = Path("validation.parquet")
    X = Dataset(data_path, engine="parquet")
    # Dummy pretrained-embeddings table: one 2-dim vector per id.
    pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

    loader = Loader(
        X,
        batch_size=4096,
        shuffle=True,
        transforms=[
            # Replaces each "recruitment_id" with the corresponding row of
            # pretrained_array, emitted under the "embeddings" key.
            EmbeddingOperator(
                pretrained_array,
                lookup_key="recruitment_id",
                embedding_name="embeddings",
            )
        ],
        device="gpu",
    )

    # Consume every batch; no per-batch work is done beyond the loading itself.
    for batch in tqdm(loader, desc="Iterating batches..."):
        pass


if __name__ == "__main__":
    test_pretrained_loader()

Question

Is this behaviour intended? What are the possible bottlenecks here? Would something like data prefetching or asynchronous loading help in this case?
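
One workaround I have been considering, in case the per-batch host-side lookup plus host-to-device copy is the bottleneck, is to drop the EmbeddingOperator, upload the table to the GPU once, and do the gather with plain TensorFlow inside the batch loop. This is only a sketch reusing pretrained_array, X, Loader and tqdm from the example above, and it assumes the loader yields (features, target) pairs with "recruitment_id" arriving as a dense integer tensor:

import tensorflow as tf

# Upload the embedding table to the GPU once, instead of gathering on the
# host for every batch.
with tf.device("/GPU:0"):
    table = tf.constant(pretrained_array)

plain_loader = Loader(X, batch_size=4096, shuffle=True, device="gpu")
for features, target in tqdm(plain_loader, desc="Iterating batches..."):
    # Assumed layout: "recruitment_id" is a dense (batch_size, 1) int tensor.
    ids = tf.reshape(features["recruitment_id"], [-1])
    embeddings = tf.nn.embedding_lookup(table, ids)  # GPU-side gather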
