
[BUG] (feature/fvdb) multi-card training error and memory leak #2030

Open
@xiaoc57

Description


Issues encountered during experiments with fvdb

I encountered two issues during my experiments using the latest fvdb environment:

  1. Incorrect Function Call Causing Multi-GPU Training Failure

    The current implementation in TorchDeviceBuffer::create is incompatible with recent changes to NanoVDB's buffer creation. Specifically, NanoVDB buffer creation has been updated from:

auto buffer = BufferT::create(mData.size, &pool, false); // only allocate buffer on the device 

to:

auto buffer = BufferT::create(mData.size, &pool, device, mStream); // only allocate buffer on the device

However, the corresponding TorchDeviceBuffer::create method signature remains:

TorchDeviceBuffer::create(uint64_t size, const TorchDeviceBuffer *proto, bool host, void *stream)

Due to this signature mismatch, the call fails and multi-GPU training cannot proceed (a minimal sketch of the adjusted signature shape is included after this list).

  2. Potential GPU Memory Leak During Training

    I observed a gradual increase in GPU memory usage during training, eventually leading to out-of-memory errors. However, it is hard to confirm definitively that this leak comes from the fvdb framework, because the memory increase per iteration is very small (a small diagnostic sketch for tracking this is included below).
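
For the first issue, the sketch below only illustrates the signature shape that I believe now needs to match the updated NanoVDB call site. The `ToyDeviceBuffer` type is a stand-in written purely for illustration (it is not the real `TorchDeviceBuffer`), and it assumes that the new third argument is an integer CUDA device index and the fourth an opaque stream handle; the actual fix would adapt `TorchDeviceBuffer::create` in fvdb accordingly.

```cpp
// Illustrative stand-in only, not the real fvdb TorchDeviceBuffer.
#include <cstdint>
#include <cstdio>

struct ToyDeviceBuffer {
    uint64_t size   = 0;
    int      device = -1;   // hypothetical convention: a negative id means host

    // Old-style signature (what fvdb currently provides):
    //   static ToyDeviceBuffer create(uint64_t size, const ToyDeviceBuffer* proto,
    //                                 bool host, void* stream);
    // New-style signature matching the updated NanoVDB call site:
    static ToyDeviceBuffer create(uint64_t size, const ToyDeviceBuffer* /*proto*/,
                                  int device, void* /*stream*/) {
        ToyDeviceBuffer buf;
        buf.size   = size;
        buf.device = device;
        return buf;
    }
};

int main() {
    ToyDeviceBuffer pool;
    void* stream = nullptr;
    // Mirrors the updated NanoVDB call: BufferT::create(mData.size, &pool, device, mStream)
    auto buffer = ToyDeviceBuffer::create(1024, &pool, /*device=*/0, stream);
    std::printf("allocated %llu bytes on device %d\n",
                static_cast<unsigned long long>(buffer.size), buffer.device);
    return 0;
}
```

The point is simply that the third parameter is now a device index rather than the old `bool host` flag, so the existing four-argument overload no longer matches the call `BufferT::create(mData.size, &pool, device, mStream)`.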

Both issues were observed with the latest fvdb version and its associated environment. Based on initial observations, the XCube implementation built on fvdb does not seem to exhibit these problems.
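
To help narrow down the second issue, one hypothetical diagnostic (not part of fvdb) is to log device memory usage once per training iteration with the standard CUDA runtime call `cudaMemGetInfo` and watch the long-term trend. Note that with PyTorch's caching allocator the absolute numbers include cached blocks, so only a sustained increase across many iterations is meaningful, not single-iteration deltas.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (illustration only): report how much memory is currently
// in use on the active CUDA device. Calling this once per training iteration
// and plotting the values makes a slow, unbounded growth easier to spot.
void logDeviceMemory(int iteration) {
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) {
        std::fprintf(stderr, "iter %d: cudaMemGetInfo failed\n", iteration);
        return;
    }
    const double usedMiB = (totalBytes - freeBytes) / (1024.0 * 1024.0);
    std::printf("iter %d: %.1f MiB in use on the current device\n", iteration, usedMiB);
}
```

If the reported usage keeps climbing even after the allocator cache has warmed up, that would point more strongly at a genuine leak rather than normal caching behavior.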
