Description
Describe the issue:
The dataloader accumulates GPU memory across batches unless gc.collect() is called manually after each batch (or, e.g., after every 5th batch). In the example below, manually triggering garbage collection reduces peak GPU memory usage by around 7 GiB (11 GiB vs. 18 GiB). Is there a way to free GPU memory more reliably after each batch?
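For context, the workaround currently looks like this in a training loop (a minimal sketch; train_step and the every-5th-batch interval are placeholders of mine, not part of the library):

import gc

GC_EVERY = 5  # arbitrary interval; collecting every batch works too, at some overhead

for i, (batch, _) in enumerate(loader):
    train_step(batch)  # placeholder for the actual per-batch work
    if i % GC_EVERY == 0:
        gc.collect()   # explicitly break the cycles that keep device buffers alive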
Minimal Complete Verifiable Example:
Create example data:
import pandas as pd
import numpy as np

n_samples = 20480
df = pd.DataFrame({
    'x': [np.random.uniform(size=(19357,)).astype('f4') for _ in range(n_samples)],
    'y': np.random.choice(range(100), size=n_samples).astype('i8'),
})
df.to_parquet('test.parquet', row_group_size=1024, engine='pyarrow')
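Each batch is sizeable, so a handful of un-collected batches quickly accounts for the observed gap (a quick back-of-envelope check using the sizes above):

# One 'x' batch: 4096 rows x 19357 float32 values ~ 302 MiB on device,
# so roughly 20-25 lingering batches would explain the ~7 GiB difference.
batch_bytes = 4096 * 19357 * 4
print(batch_bytes / 1024**2)  # ~302.5 MiB per batch for 'x' alone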
Check memory usage:
import merlin.io
from merlin.dataloader.torch import Loader
from merlin.schema import ColumnSchema, Schema
import gc
from pynvml import nvmlInit, nvmlDeviceGetMemoryInfo, nvmlDeviceGetHandleByIndex

nvmlInit()  # required before any other pynvml call

dataset = merlin.io.Dataset(
    'test.parquet',
    engine='parquet',
    part_size='180MB',
    schema=Schema([
        ColumnSchema(
            'x', dtype='float32',
            is_list=True, is_ragged=False,
            properties={'value_count': {'max': 19357}}
        ),
        ColumnSchema('y', dtype='int64')
    ])
)
print(dataset.partition_lens[:10]) # --> [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048]
def benchmark(dataset, batch_size=4096, n_samples=1_000_000, call_gc=False):
    handle = nvmlDeviceGetHandleByIndex(0)
    max_memory = nvmlDeviceGetMemoryInfo(handle).used
    num_iter = n_samples // batch_size
    loader = Loader(dataset, batch_size=batch_size, shuffle=True, drop_last=True).epochs(100)
    for i, (batch, _) in enumerate(loader):
        x, y = batch['x'], batch['y']
        # track peak device memory across the run
        max_memory = max(max_memory, nvmlDeviceGetMemoryInfo(handle).used)
        if call_gc:
            gc.collect()
        if i == num_iter:
            break
    loader.stop()
    gc.collect()
    return max_memory
Without manual garbage collection:

max_mem = benchmark(dataset, batch_size=4096, n_samples=300_000, call_gc=False)
print('Max GPU memory usage:', max_mem // 1024**2, 'MiB')  # --> Gives: Max GPU memory usage: 18435 MiB
With manual garbage collection:

max_mem = benchmark(dataset, batch_size=4096, n_samples=300_000, call_gc=True)
print('Max GPU memory usage:', max_mem // 1024**2, 'MiB')  # --> Gives: Max GPU memory usage: 11305 MiB
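Since an explicit gc.collect() is what releases the memory, the batches are presumably kept alive by reference cycles rather than plain reference counts. A diagnostic sketch using only the standard gc module (nothing Merlin-specific) to see which object types are only reclaimed by the cycle collector:

import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # keep everything the collector frees in gc.garbage
# ... pull one batch from the loader, then drop all references to it ...
n = gc.collect()
print(n, 'objects were only reachable through cycles')
print({type(o).__name__ for o in gc.garbage})  # look for cudf/torch buffer types here
gc.set_debug(0)
gc.garbage.clear()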
Environment:
OS: Rocky Linux 8.7
Python: 3.10.9
merlin-core: 0.10.0
merlin-dataloader: 0.0.4
cudf-cu11: 23.02
rmm-cu11: 23.02
dask-cudf: 23.02
I installed both cudf + merlin via pip:
python -m pip install cudf-cu11==23.02 rmm-cu11==23.02 dask-cudf-cu11==23.02 --extra-index-url https://pypi.nvidia.com/
python -m pip install merlin-dataloader
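For completeness, the installed versions can be confirmed programmatically (a small sketch using the standard importlib.metadata; the package names are the pip distribution names listed above):

from importlib.metadata import version

for pkg in ('merlin-core', 'merlin-dataloader', 'cudf-cu11', 'rmm-cu11', 'dask-cudf-cu11'):
    print(pkg, version(pkg))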