Releases · NVIDIA/nvloom · GitHub

06 Nov 01:53

v.1.3.0 Latest

Added

CSV-defined benchmarks
Memory access latency benchmarks
Sample Dockerfile to build nvloom

Changed

Retry mechanism for CUDA multicast allocations was removed

Fixed

Freeing MNNVL memory was not guarded by enough MPI barriers, which could lead to race conditions in extremely rare edge cases
The benchmarking algorithm would sometimes record the "end event" twice. This had no impact on benchmark results.
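
The barrier fix above can be illustrated with a small thread-based analogy. This is only a sketch and not nvloom's code: `threading.Barrier` stands in for an MPI barrier, a Python list stands in for the shared MNNVL allocation, and `run_workers` is a hypothetical name. The point is that the buffer is released only after every participant has passed the barrier, so no participant can still be reading it.

```python
import threading


def run_workers(num_workers: int) -> list:
    """Each worker reads a shared buffer; rank 0 releases it afterwards."""
    shared = {"buffer": [1, 2, 3]}           # stand-in for a shared allocation
    barrier = threading.Barrier(num_workers)  # stand-in for an MPI barrier
    results = [None] * num_workers

    def worker(rank: int) -> None:
        # Every worker uses the shared buffer first.
        results[rank] = sum(shared["buffer"])
        # No worker proceeds until all workers have finished reading.
        barrier.wait()
        # Only now is it safe for one worker to release the buffer.
        if rank == 0:
            shared["buffer"] = None

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the `barrier.wait()` call, rank 0 could release the buffer while another worker is still about to read it, which is the kind of rare race the entry describes.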

21 Jul 18:58

v1.2.0

Added

Multicast reduction benchmarks
Option to specify iteration count (-i/--iterations)
Option to repeat a testcase for a specified number of iterations (-c/--repeat)
Option to repeat a testcase for a specified number of seconds (-d/--duration)
CUDA Stream Ordered Memory Allocator was added as a new allocator option (-a cudapool)
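
The difference between repeating a testcase a fixed number of times (`-c/--repeat`) and repeating it for a number of seconds (`-d/--duration`) can be sketched as follows. This is an illustrative sketch, not nvloom's implementation: `repeat_testcase`, `run_once`, and the `clock` parameter are hypothetical names, and the choice to always run at least once in duration mode is an assumption.

```python
import time
from typing import Callable, Optional


def repeat_testcase(
    run_once: Callable[[], None],
    repeat: Optional[int] = None,
    duration_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run a testcase by count or by wall-clock budget; return the number of runs."""
    # Exactly one of the two modes must be selected.
    if (repeat is None) == (duration_s is None):
        raise ValueError("specify exactly one of repeat or duration_s")

    runs = 0
    if repeat is not None:
        # Count-based mode: run a fixed number of times.
        for _ in range(repeat):
            run_once()
            runs += 1
        return runs

    # Duration-based mode: keep running until the deadline has passed.
    # Assumption: the testcase runs at least once, even for a tiny budget.
    deadline = clock() + duration_s
    while True:
        run_once()
        runs += 1
        if clock() >= deadline:
            break
    return runs
```

For example, `repeat_testcase(run, repeat=10)` performs exactly ten runs, while `repeat_testcase(run, duration_s=30.0)` performs however many runs fit into roughly thirty seconds.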

Changed

Caching multicast allocations is now much faster, thanks to a multicast-specific memory pool

22 May 20:15

v1.1.0

Added

Heatmap plotter
Support for CUDA Error Log Management
Retry mechanism for CUDA multicast allocations
nvloom_cli argument to set the number of samples in gpu-to-rack testcases
nvloom_cli now prints its version, the git commit it was built from, and the specified buffer size
nvloom_cli now prints units when reporting results
Native compilation for sm_103 on CUDA 12.9 toolkits

Changed

Expanded README.md
Rack-to-rack testcases are now available in both unidir and bidir variants, and the bidir rack-to-rack testcases are symmetry-optimized.

Fixed

Bug where requesting allocations over 4 GiB would fail with CUDA_OUT_OF_MEMORY

18 Mar 16:05

v1.0.0

Initial release
