Summary
NVSHMEM v3.5.19-1 has been released with breaking API changes and significant behavioral changes. The current pin is v3.4.5-0 in docker/Dockerfile.cuda line 42.
Breaking API Changes
1. Tile API Renaming (HIGH)
The v3.5.19 release renames tile collective operations:
- tile_allreduce → tile_reduce
- tile_reduce → tile_rooted_reduce
Impact: Code using the old function names will fail to compile.
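If the rename is purely mechanical (a hedged assumption: the release notes list only the name changes, and the real symbols may carry an nvshmemx_ prefix or template parameters), a first-pass source migration can be sketched with GNU sed. The ordering is the interesting part: tile_reduce must be renamed before tile_allreduce, otherwise the tile_reduce calls produced by the first rename would themselves be renamed in the second pass.

```shell
# rename_tile_apis: rewrite v3.4.x tile collective names to v3.5.x names.
# Order matters: tile_reduce -> tile_rooted_reduce must run first; otherwise
# the tile_reduce produced by the tile_allreduce rename would be renamed again.
# Uses GNU sed \b word boundaries so e.g. tile_allreduce is not partially matched.
rename_tile_apis() {
  sed -e 's/\btile_reduce\b/tile_rooted_reduce/g' \
      -e 's/\btile_allreduce\b/tile_reduce/g'
}

# Example: preview the rewrite for one file before editing in place
# rename_tile_apis < kernel.cu
```

Treat this as a starting point for review, not a blind in-place rewrite; argument lists may also have changed between releases.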
2. Static Library Removal (HIGH)
- Removed libnvshmem.a (the static-only library)
- Must now link the separate libnvshmem_host and libnvshmem_device libraries
Impact: Build scripts and linking configuration must be updated.
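As an illustration of the link-line change (a sketch assuming a plain nvcc/gcc link step; the exact flag spelling and library search paths depend on your build system):

```make
# Before, against v3.4.5-0: single static archive
#   LDLIBS += -lnvshmem
# After, against v3.5.19-1: host and device libraries are linked separately
LDLIBS += -lnvshmem_host -lnvshmem_device
```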
3. Configuration Defaults Changed (MEDIUM)
- NVSHMEM_MAX_MEMORY_PER_GPU default increased from 128 GiB to 256 GiB
- Default number of QPs used changed from 4 to 8 for full bandwidth with data direct NICs
Impact: Memory footprint will double by default; may affect multi-GPU nodes with limited memory.
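To keep the pre-3.5 footprint, the per-GPU cap can be pinned explicitly rather than inherited from the new default. The byte-valued setting below is a sketch; check the NVSHMEM environment-variable documentation for the units and suffixes the variable actually accepts.

```shell
# Pin the per-GPU memory cap at the old 128 GiB default instead of the new
# 256 GiB default. Value computed in bytes; the unit is an assumption here.
export NVSHMEM_MAX_MEMORY_PER_GPU=$((128 * 1024 * 1024 * 1024))
echo "$NVSHMEM_MAX_MEMORY_PER_GPU"
```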
New Features
Positive Changes
- qpair-specific API calls (nvshmemx_qp_*) for RMA operations on specific queue pairs
- Tile-granular RMA routines (tile_put, tile_get, tile_broadcast)
- LLVM bitcode library support for IBGDA
- Option to pass a CUDA device to nvshmemx_init_attr
- EFA transport improvements with multiple bug fixes for AWS environments
- CUTLASS support for tile API calls
- Fixed a race condition in barrier that caused hangs on unordered networks
Hardware Support
- CUDA Toolkit: 12.4, 12.9, 13.0, 13.1
- GPUs: Ampere A100, Hopper, Blackwell
- CPUs: x86 and NVIDIA Grace
Files Affected
Per docs/upstream-versions.md:
- docker/Dockerfile.cuda line 42 (NVSHMEM tag/version)
Recommended Actions
- Audit build scripts for libnvshmem.a references:

```
grep -r "libnvshmem\.a" docker/ --include="Dockerfile*" --include="*.sh"
```
- Check for tile API usage in the vLLM codebase (pinned at d7de043d):

```
# Check if vLLM uses NVSHMEM tile APIs
grep -r "tile_allreduce\|tile_reduce\|tile_rooted_reduce" --include="*.cu" --include="*.cpp" --include="*.h"
```
- Update the Dockerfile:

```
# Line 42 in docker/Dockerfile.cuda
ARG NVSHMEM_VERSION=v3.5.19-1
```
- Review memory requirements: the 256 GiB default per GPU may be excessive; consider setting NVSHMEM_MAX_MEMORY_PER_GPU explicitly
- Test the container build and run the E2E tests
- Verify the EFA transport if running on AWS (this release includes multiple EFA improvements)
Impact
HIGH - Breaking API changes (tile function renames) and removal of static library will break builds that use these APIs. Memory default doubling may cause OOM on memory-constrained nodes.
Upstream References
- Release: https://github.com/NVIDIA/nvshmem/releases/tag/v3.5.19-1
- Full Changelog: NVIDIA/nvshmem@v3.4.5-0...v3.5.19-1
Generated by Upstream Dependency Monitor