Issues encountered during experiments with fvdb
I encountered two issues during my experiments using the latest fvdb environment:
- Incorrect Function Call Causing Multi-GPU Training Failure

The current implementation of `TorchDeviceBuffer::create` is incompatible with recent changes to NanoVDB's buffer creation. Specifically, the NanoVDB buffer-creation call has been updated from:
```cpp
auto buffer = BufferT::create(mData.size, &pool, false); // only allocate buffer on the device
```

to:

```cpp
auto buffer = BufferT::create(mData.size, &pool, device, mStream); // allocate the buffer on the given device and stream
```
However, the corresponding `TorchDeviceBuffer::create` method signature remains:

```cpp
TorchDeviceBuffer::create(uint64_t size, const TorchDeviceBuffer *proto, bool host, void *stream)
```
Due to this mismatch, the function call fails, which prevents multi-GPU training from running. A possible way to reconcile the two signatures is sketched below.
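For illustration only, here is a minimal sketch (not fvdb's actual code) of how `TorchDeviceBuffer::create` could expose the newer `(device, stream)` signature that NanoVDB now calls while keeping the old `bool host` overload for existing call sites. The member names, the convention that a negative device index means a host allocation, and the `std::malloc` stand-in allocation are assumptions; a real implementation would allocate through PyTorch's CUDA caching allocator.

```cpp
#include <cstdint>
#include <cstdlib>

// Hedged sketch of a TorchDeviceBuffer that supports both the old and the
// new NanoVDB factory signatures. Names and semantics are illustrative only.
class TorchDeviceBuffer {
    uint64_t mSize   = 0;
    int      mDevice = -1;       // negative index == host allocation in this sketch
    void*    mData   = nullptr;

public:
    // New-style factory matching BufferT::create(size, proto, device, stream).
    static TorchDeviceBuffer create(uint64_t size,
                                    const TorchDeviceBuffer* proto,
                                    int device,
                                    void* stream = nullptr)
    {
        TorchDeviceBuffer buf;
        buf.mSize   = size;
        buf.mDevice = device;    // device < 0 means "allocate on host" here
        (void)proto;             // a real implementation would inherit settings from proto
        // Stand-in allocation; real code would use the CUDA caching allocator
        // and honor `stream` for asynchronous allocation.
        buf.mData = std::malloc(size);
        (void)stream;
        return buf;
    }

    // Old-style overload kept so existing (size, proto, host, stream) call
    // sites continue to compile; host==true maps to a host allocation.
    static TorchDeviceBuffer create(uint64_t size,
                                    const TorchDeviceBuffer* proto,
                                    bool host,
                                    void* stream = nullptr)
    {
        return create(size, proto, host ? -1 : 0, stream);
    }
};
```

Whether fvdb should keep the old overload or migrate all call sites to the new signature is a design choice for the maintainers; the sketch only illustrates the mismatch.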
- Potential GPU Memory Leak During Training
I observed a gradual increase in GPU memory usage during training, which eventually leads to out-of-memory errors. However, it is difficult to confirm whether the leak is caused by the fvdb framework itself, because the per-iteration memory growth is very small. A simple per-iteration tracker (sketched below) may help narrow down where the growth occurs.
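To help attribute the growth, here is a minimal sketch of a per-iteration tracker using only the CUDA runtime API (`cudaMemGetInfo`); the name `GpuMemTracker`, the reporting threshold, and the output format are arbitrary choices, not anything from fvdb. Note that PyTorch's caching allocator holds on to freed blocks, so a drop in free device memory is not by itself proof of a leak; comparing against `torch.cuda.memory_allocated()` on the Python side helps distinguish caching from genuinely growing allocations.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Tracks free device memory across iterations and reports drops larger than
// a threshold, to help localize slow GPU memory growth during training.
struct GpuMemTracker {
    size_t lastFree = 0;

    // Call once per training iteration; prints only when free memory has
    // dropped by more than `thresholdBytes` since the previous call.
    void check(int iteration, size_t thresholdBytes = size_t(1) << 20) {
        size_t freeBytes = 0, totalBytes = 0;
        if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return;
        if (lastFree != 0 && lastFree > freeBytes &&
            lastFree - freeBytes > thresholdBytes) {
            std::printf("[iter %d] free GPU memory dropped by %.2f MiB "
                        "(%.2f / %.2f GiB free)\n",
                        iteration,
                        double(lastFree - freeBytes) / double(1 << 20),
                        double(freeBytes)  / double(size_t(1) << 30),
                        double(totalBytes) / double(size_t(1) << 30));
        }
        lastFree = freeBytes;
    }
};
```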
Both issues were observed with the latest fvdb version and its associated environment. Based on initial observations, the XCube implementation built on fvdb does not appear to exhibit these problems.