Added
- CSV-defined benchmarks
- Memory access latency benchmarks
- Sample Dockerfile to build nvloom
Changed
- Retry mechanism for CUDA multicast allocations was removed
Fixed
- Freeing MNNVL memory was not guarded by enough MPI barriers, which could lead to race conditions in extremely rare edge cases
- The benchmarking algorithm would sometimes record the "end event" twice. This had no impact on benchmark results.