[Upstream Breaking Change] NVSHMEM v3.4.5-0 → v3.5.19-1

## Summary

NVSHMEM **v3.5.19-1** has been released with **breaking API changes** and changed configuration defaults. This repository currently pins **v3.4.5-0** at line 42 of `docker/Dockerfile.cuda`.

## Breaking API Changes

### 1. **Tile API Renaming** (HIGH)

The v3.5.19 release renames tile collective operations:
- `tile_allreduce` → `tile_reduce`
- `tile_reduce` → `tile_rooted_reduce`

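**Impact**: Code calling `tile_allreduce` will fail to compile. The old `tile_reduce` is subtler: the name still exists in v3.5.19 but now denotes the former `tile_allreduce`, so a stale call may change meaning rather than fail outright. Every call site of either name needs review.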
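
Because the new name of one function is the old name of the other, the order of a mechanical rename matters. The following is a minimal sketch of a stream filter that applies the two renames listed above in a safe order. The function name `rename_tile_calls` is hypothetical, and the sketch matches plain substrings, so any unrelated identifier that merely contains `tile_reduce` would also be rewritten. Review the resulting diff by hand.

```shell
#!/usr/bin/env bash
# Sketch: rewrite pre-3.5 tile collective names to the v3.5.19 names.
# Reads source text on stdin and writes the rewritten text to stdout.
rename_tile_calls() {
  # Rule 1 moves the old rooted `tile_reduce` out of the way first.
  # Rule 2 then maps `tile_allreduce` onto the freed-up name.
  # Reversing the rules would turn every call into `tile_rooted_reduce`.
  sed -e 's/tile_reduce/tile_rooted_reduce/g' \
      -e 's/tile_allreduce/tile_reduce/g'
}
```

A possible invocation on a hypothetical file would be `rename_tile_calls < kernel.cu > kernel.cu.new`, followed by a manual diff.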

### 2. **Static Library Removal** (HIGH)

- Removed `libnvshmem.a` static-only version
- **Must now link** to separate `libnvshmem_host` and `libnvshmem_device` libraries

**Impact**: Build scripts and linking configuration must be updated.
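
As an illustration of the linking change, the sketch below rewrites a bare `-lnvshmem` linker flag into the two flags implied by the library names above. It assumes the existing build links with `-lnvshmem`, which this issue does not confirm. The function name `split_nvshmem_link_flag` is hypothetical, and builds that reference `libnvshmem.a` by path are not covered.

```shell
#!/usr/bin/env bash
# Sketch: replace a bare `-lnvshmem` flag with the split host/device flags.
# Reads a link command (or Makefile fragment) on stdin, writes to stdout.
split_nvshmem_link_flag() {
  # Match `-lnvshmem` only when followed by whitespace or end of line,
  # so existing `-lnvshmem_host` / `-lnvshmem_device` flags are untouched.
  sed -E 's/-lnvshmem([[:space:]]|$)/-lnvshmem_host -lnvshmem_device\1/g'
}
```

The filter is idempotent, so running it a second time over already-converted input leaves the flags unchanged.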

### 3. **Configuration Defaults Changed** (MEDIUM)

- `NVSHMEM_MAX_MEMORY_PER_GPU` increased from **128 GiB → 256 GiB** (default)
- Default number of used QPs changed from **4 → 8** for full bandwidth with data direct NICs

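**Impact**: The default per-GPU memory limit doubles, which may affect multi-GPU nodes with limited memory. Whether actual usage grows depends on how the limit is consumed at runtime, so this should be verified on a constrained node before rollout.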
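
One way to keep the previous behavior is to set the variable explicitly instead of relying on the new default. The sketch below exports the limit as a byte count, defaulting to the old 128 GiB value. The helper name `pin_nvshmem_max_memory` is hypothetical, and it is an assumption that NVSHMEM accepts a plain byte count for this variable. Check the NVSHMEM environment variable documentation for the accepted formats.

```shell
#!/usr/bin/env bash
# Sketch: pin NVSHMEM_MAX_MEMORY_PER_GPU to an explicit size in GiB.
# Assumption: the variable accepts a plain integer number of bytes.
pin_nvshmem_max_memory() {
  local gib="${1:-128}"   # default to the pre-3.5.19 value of 128 GiB
  export NVSHMEM_MAX_MEMORY_PER_GPU=$(( gib * 1024 * 1024 * 1024 ))
}
```

Calling `pin_nvshmem_max_memory 64` in a container entrypoint, for example, would export a 64 GiB limit before the workload starts.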

## New Features

### Positive Changes

- **qpair-specific API calls** (`nvshmemx_qp_*`) for RMA operations on specific queue pairs
- **Tile-granular RMA routines** (`tile_put`, `tile_get`, `tile_broadcast`)
- **LLVM bitcode library** support for IBGDA
- **Option to pass CUDA device** to `nvshmemx_init_attr`
- **EFA transport improvements** with multiple bug fixes for AWS environments
- **CUTLASS support** for tile API calls
- **Fixed race condition** in barrier causing hangs on unordered networks

### Hardware Support

- CUDA Toolkit: 12.4, 12.9, 13.0, 13.1
- GPUs: Ampere A100, Hopper, Blackwell
- CPUs: x86 and NVIDIA Grace

## Files Affected

Per `docs/upstream-versions.md`:
- **docker/Dockerfile.cuda** line 42 (`NVSHMEM` tag/version)

## Recommended Actions

1. **Audit build scripts** for `libnvshmem.a` references:
   ````bash
   grep -r "libnvshmem\.a" docker/ --include="Dockerfile*" --include="*.sh"
   ````

2. **Check for tile API usage** in the vLLM codebase (pinned at d7de043d):
   ````bash
   # Run from the root of the vLLM checkout: list any NVSHMEM tile collective calls
   grep -rn "tile_allreduce\|tile_reduce\|tile_rooted_reduce" --include="*.cu" --include="*.cuh" --include="*.cpp" --include="*.h" .
   ````

3. **Update Dockerfile**:
   ````dockerfile
   # Line 42 in docker/Dockerfile.cuda
   ARG NVSHMEM_VERSION=v3.5.19-1
   ````

4. **Review memory requirements**: 256 GiB default per GPU may be excessive; consider setting `NVSHMEM_MAX_MEMORY_PER_GPU` explicitly

5. **Test container build** and E2E tests

6. **Verify EFA transport** if running on AWS (multiple improvements in this release)
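
Before starting the container build in step 5, it can help to confirm that the pin from step 3 actually changed. The sketch below prints the version recorded in the Dockerfile, assuming the pin is written as an `ARG NVSHMEM_VERSION=...` line as in step 3. The helper name `nvshmem_pin` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the NVSHMEM version pinned in a Dockerfile.
# Assumes the pin is a line of the form `ARG NVSHMEM_VERSION=<tag>`.
nvshmem_pin() {
  local dockerfile="${1:-docker/Dockerfile.cuda}"
  # Strip the `ARG NVSHMEM_VERSION=` prefix and print only the first match.
  sed -n 's/^ARG NVSHMEM_VERSION=//p' "$dockerfile" | head -n 1
}
```

A CI step could then compare the result against the expected tag, for example `[ "$(nvshmem_pin)" = "v3.5.19-1" ] || echo "NVSHMEM pin not updated" >&2`.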

## Impact

**HIGH** - Breaking API changes (tile function renames) and removal of static library will break builds that use these APIs. Memory default doubling may cause OOM on memory-constrained nodes.

## Upstream References

- Release: https://github.com/NVIDIA/nvshmem/releases/tag/v3.5.19-1
- Full Changelog: https://github.com/NVIDIA/nvshmem/compare/v3.4.5-0...v3.5.19-1

---

> Generated by [Upstream Dependency Monitor](https://github.com/llm-d/llm-d-infra/actions/runs/22335860249)



