[QST] Data loader with EmbeddingOperator using pretrained embeddings is very slow #1244

@CarloNicolini

Description

❓ Questions & Help

I am seeing a large performance degradation in the Loader when I add an EmbeddingOperator transform that looks up rows from a pretrained-embeddings NumPy array.
I have been following the approach shown in this tutorial notebook.

Without the transforms argument the entire dataset is consumed in 6 seconds, whereas with the pretrained-embeddings lookup enabled it takes almost 40 minutes!
My "validation.parquet" is a small NVTabular dataset with 16 partitions, totalling almost 200 MB.
With the transform enabled I also see very low CPU and GPU utilization and close to zero GPU memory consumption: neither the CPU nor the GPU goes above 6%.
It seems very strange to me that simply gathering batch_size rows from a NumPy array takes this much time, even accounting for the transfer to the GPU.
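
For scale, the raw gather itself can be timed in isolation with nothing but NumPy. This is a minimal sketch; the array shape, batch size and iteration count simply mirror the example below:

import time

import numpy as np

# Stand-ins mirroring the reproduction example below.
table = np.zeros((1_000_000, 2), dtype=np.float32)
ids = np.random.randint(0, 1_000_000, size=4096)

start = time.perf_counter()
for _ in range(1_000):
    _ = table[ids]  # the per-batch host-side gather the lookup boils down to
elapsed = time.perf_counter() - start
print(f"{elapsed / 1_000 * 1e6:.1f} µs per 4096-row gather")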

Details

Here is a minimal working example to reproduce this degradation.

from __future__ import annotations

from pathlib import Path

import numpy as np
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io.dataset import Dataset
from merlin.loader.tensorflow import Loader
from tqdm.auto import tqdm


def test_pretrained_loader():
    # A small NVTabular parquet dataset (16 partitions, ~200 MB).
    data_path = Path("validation.parquet")
    X = Dataset(data_path, engine="parquet")
    # Dummy pretrained-embeddings table: one 2-dim vector per id.
    pretrained_array = np.zeros((1_000_000, 2), dtype=np.float32)

    loader = Loader(
        X,
        batch_size=4096,
        shuffle=True,
        transforms=[
            # Replaces each "recruitment_id" with the corresponding row of
            # pretrained_array, emitted under the "embeddings" key.
            EmbeddingOperator(
                pretrained_array,
                lookup_key="recruitment_id",
                embedding_name="embeddings",
            )
        ],
        device="gpu",
    )

    # Consume every batch; no per-batch work is done beyond the loading itself.
    for batch in tqdm(loader, desc="Iterating batches..."):
        pass


if __name__ == "__main__":
    test_pretrained_loader()

Question

Is this behaviour intended? What are the possible bottlenecks here? Would something like data prefetching or asynchronous loading help in this case?
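
One workaround I have been considering, in case the per-batch host-side lookup plus host-to-device copy is the bottleneck, is to drop the EmbeddingOperator, upload the table to the GPU once, and do the gather with plain TensorFlow inside the batch loop. This is only a sketch reusing pretrained_array, X, Loader and tqdm from the example above, and it assumes the loader yields (features, target) pairs with "recruitment_id" arriving as a dense integer tensor:

import tensorflow as tf

# Upload the embedding table to the GPU once, instead of gathering on the
# host for every batch.
with tf.device("/GPU:0"):
    table = tf.constant(pretrained_array)

plain_loader = Loader(X, batch_size=4096, shuffle=True, device="gpu")
for features, target in tqdm(plain_loader, desc="Iterating batches..."):
    # Assumed layout: "recruitment_id" is a dense (batch_size, 1) int tensor.
    ids = tf.reshape(features["recruitment_id"], [-1])
    embeddings = tf.nn.embedding_lookup(table, ids)  # GPU-side gather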
