
[BUG] (feature/fvdb) multi-card training error and memory leak #2030

Open
@xiaoc57

Description


Issues encountered during experiments with fvdb

I encountered two issues during my experiments using the latest fvdb environment:

  1. Incorrect Function Call Causing Multi-GPU Training Failure

    The current implementation in TorchDeviceBuffer::create is incompatible with recent changes to NanoVDB's buffer creation. Specifically, NanoVDB buffer creation has been updated from:

auto buffer = BufferT::create(mData.size, &pool, false); // only allocate buffer on the device 

to:

auto buffer = BufferT::create(mData.size, &pool, device, mStream); // only allocate buffer on the device

However, the corresponding TorchDeviceBuffer::create method signature remains:

TorchDeviceBuffer::create(uint64_t size, const TorchDeviceBuffer *proto, bool host, void *stream)

Due to this signature mismatch, the call fails and multi-GPU training cannot proceed (a minimal sketch of the adjusted signature shape is included after this list).

  2. Potential GPU Memory Leak During Training

    I observed a gradual increase in GPU memory usage during training, eventually leading to out-of-memory errors. However, it is hard to confirm definitively that this leak comes from the fvdb framework, because the memory increase per iteration is very small (a small diagnostic sketch for tracking this is included below).
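
For the first issue, the sketch below only illustrates the signature shape that I believe now needs to match the updated NanoVDB call site. The `ToyDeviceBuffer` type is a stand-in written purely for illustration (it is not the real `TorchDeviceBuffer`), and it assumes that the new third argument is an integer CUDA device index and the fourth an opaque stream handle; the actual fix would adapt `TorchDeviceBuffer::create` in fvdb accordingly.

```cpp
// Illustrative stand-in only, not the real fvdb TorchDeviceBuffer.
#include <cstdint>
#include <cstdio>

struct ToyDeviceBuffer {
    uint64_t size   = 0;
    int      device = -1;   // hypothetical convention: a negative id means host

    // Old-style signature (what fvdb currently provides):
    //   static ToyDeviceBuffer create(uint64_t size, const ToyDeviceBuffer* proto,
    //                                 bool host, void* stream);
    // New-style signature matching the updated NanoVDB call site:
    static ToyDeviceBuffer create(uint64_t size, const ToyDeviceBuffer* /*proto*/,
                                  int device, void* /*stream*/) {
        ToyDeviceBuffer buf;
        buf.size   = size;
        buf.device = device;
        return buf;
    }
};

int main() {
    ToyDeviceBuffer pool;
    void* stream = nullptr;
    // Mirrors the updated NanoVDB call: BufferT::create(mData.size, &pool, device, mStream)
    auto buffer = ToyDeviceBuffer::create(1024, &pool, /*device=*/0, stream);
    std::printf("allocated %llu bytes on device %d\n",
                static_cast<unsigned long long>(buffer.size), buffer.device);
    return 0;
}
```

The point is simply that the third parameter is now a device index rather than the old `bool host` flag, so the existing four-argument overload no longer matches the call `BufferT::create(mData.size, &pool, device, mStream)`.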

Both issues were observed with the latest fvdb version and its associated environment. Based on initial observations, the XCube implementation built on fvdb does not seem to exhibit these problems.
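
To help narrow down the second issue, one hypothetical diagnostic (not part of fvdb) is to log device memory usage once per training iteration with the standard CUDA runtime call `cudaMemGetInfo` and watch the long-term trend. Note that with PyTorch's caching allocator the absolute numbers include cached blocks, so only a sustained increase across many iterations is meaningful, not single-iteration deltas.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (illustration only): report how much memory is currently
// in use on the active CUDA device. Calling this once per training iteration
// and plotting the values makes a slow, unbounded growth easier to spot.
void logDeviceMemory(int iteration) {
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) {
        std::fprintf(stderr, "iter %d: cudaMemGetInfo failed\n", iteration);
        return;
    }
    const double usedMiB = (totalBytes - freeBytes) / (1024.0 * 1024.0);
    std::printf("iter %d: %.1f MiB in use on the current device\n", iteration, usedMiB);
}
```

If the reported usage keeps climbing even after the allocator cache has warmed up, that would point more strongly at a genuine leak rather than normal caching behavior.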
