
GPU memory does not get freed up properly after each batch #108

Open
@felix0097

Description


Describe the issue:

The dataloader accumulates GPU memory across batches unless gc.collect() is called manually after each batch (or after e.g. every 5th batch). See the example below: manually calling garbage collection saves around 7 GiB of peak GPU memory usage (11 GiB vs. 18 GiB). Is there a way to free up GPU memory more reliably after each batch?
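For reference, the every-Nth-batch variant of the workaround looks like this (a minimal sketch; loader stands for a Loader as constructed further below, and N=5 is an arbitrary choice):

import gc

for i, (batch, _) in enumerate(loader):
    ...  # training step
    if i % 5 == 0:  # collect every 5th batch instead of every batch
        gc.collect()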

Minimal Complete Verifiable Example:

Create example data:

import pandas as pd
import numpy as np

n_samples = 20480

df = pd.DataFrame({
    'x': [np.random.uniform(size=(19357, )).astype('f4') for _ in range(n_samples)],
    'y': np.random.choice(range(100), size=n_samples).astype('i8')
})

df.to_parquet('test.parquet', row_group_size=1024, engine='pyarrow')
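To verify the on-disk layout, the row groups can be inspected with pyarrow (assuming the file above was written successfully):

import pyarrow.parquet as pq

pf = pq.ParquetFile('test.parquet')
print(pf.metadata.num_row_groups)         # 20480 rows / 1024 per group = 20 row groups
print(pf.metadata.row_group(0).num_rows)  # 1024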

Check memory usage:

import merlin.io
from merlin.dataloader.torch import Loader
from merlin.schema import ColumnSchema, Schema

import gc
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()  # NVML must be initialized before any device query


dataset = merlin.io.Dataset(
    'test.parquet', 
    engine='parquet', 
    part_size='180MB',
    schema=Schema([
        ColumnSchema(
            'x', dtype='float32', 
            is_list=True, is_ragged=False, 
            properties={'value_count': {'max': 19357}}
        ),
        ColumnSchema('y', dtype='int64')
    ])
)
print(dataset.partition_lens[:10])  # --> [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048]
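The 2048-row partitions follow from the sizes above; as a back-of-envelope check (my own arithmetic, not from the library):

bytes_per_row = 19357 * 4              # one float32 list per sample ≈ 75.6 KiB
print(2 * 1024 * bytes_per_row / 1e6)  # two row groups ≈ 158.6 MB, fits in part_size='180MB'
print(3 * 1024 * bytes_per_row / 1e6)  # three row groups ≈ 237.9 MB, does not fit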


def benchmark(dataset, batch_size=4096, n_samples=1_000_000, call_gc=False):
    handle = nvmlDeviceGetHandleByIndex(0)
    max_memory = nvmlDeviceGetMemoryInfo(handle).used

    num_iter = n_samples // batch_size
    loader = Loader(dataset, batch_size=batch_size, shuffle=True, drop_last=True).epochs(100)

    for i, (batch, _) in enumerate(loader):
        x, y = batch['x'], batch['y']
        # track the peak device memory reported by NVML
        max_memory = max(max_memory, nvmlDeviceGetMemoryInfo(handle).used)
        if call_gc:
            gc.collect()
        if i == num_iter:
            break

    loader.stop()
    gc.collect()

    return max_memory

Without manually calling garbage collection:

max_mem = benchmark(dataset, batch_size=4096, n_samples=300_000, call_gc=False)
print('Max GPU memory usage:', max_mem // 1024**2, 'MiB')  # --> Gives: Max GPU memory usage: 18435 MiB

With manually calling garbage collection:

max_mem = benchmark(dataset, batch_size=4096, n_samples=300_000, call_gc=True)
print('Max GPU memory usage:', max_mem // 1024**2, 'MiB')  # --> Gives: Max GPU memory usage: 11305 MiB
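For what it's worth, one mitigation I have not validated in this setup would be to hand allocation to an RMM pool, so freed device blocks are reused instead of going back to the driver (rmm.reinitialize is the standard entry point; the pool size here is an arbitrary example):

import rmm

# untested assumption: a pooled allocator may reduce the observed growth by
# reusing freed device blocks; the 8 GiB initial pool size is an arbitrary choice
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**33)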

Environment:

OS: Rocky Linux 8.7
Python: 3.10.9
merlin-core: 0.10.0
merlin-dataloader: 0.0.4
cudf-cu11: 23.02
rmm-cu11: 23.02
dask-cudf-cu11: 23.02

I installed both cudf + merlin via pip:
python -m pip install cudf-cu11==23.02 rmm-cu11==23.02 dask-cudf-cu11==23.02 --extra-index-url https://pypi.nvidia.com/
python -m pip install merlin-dataloader

Labels: P1, bug