
[Upstream Breaking Change] NVSHMEM v3.4.5-0 → v3.5.19-1 #57

@github-actions

Summary

NVSHMEM v3.5.19-1 has been released with breaking API changes and significant behavioral changes. The current pin is v3.4.5-0 in docker/Dockerfile.cuda line 42.

Breaking API Changes

1. Tile API Renaming (HIGH)

The v3.5.19 release renames tile collective operations:

  • tile_allreduce → tile_reduce
  • tile_reduce → tile_rooted_reduce

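Impact: Code using the old function names will fail to compile or, because tile_reduce is reused as a new name, may now refer to a different operation. Review existing tile_reduce call sites as well as tile_allreduce ones.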
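
The two renames overlap: tile_reduce is both an old name and a new one, so a mechanical migration has to apply them in a fixed order. A minimal sketch, assuming call sites use the identifiers exactly as spelled in this issue (the real NVSHMEM symbols may carry prefixes or suffixes, and `rename_tile_calls` is a hypothetical helper, not part of NVSHMEM):

```shell
# Hypothetical helper: rewrite old tile collective names to the new ones.
# Reads source text on stdin, writes the migrated text to stdout.
rename_tile_calls() {
  # Order matters: move the old tile_reduce out of the way first,
  # then reuse the freed name for the old tile_allreduce.
  sed -e 's/tile_reduce/tile_rooted_reduce/g' \
      -e 's/tile_allreduce/tile_reduce/g'
}

# Example: preview the migration of one line without touching any file.
printf 'tile_allreduce(a); tile_reduce(b);\n' | rename_tile_calls
```

The rewrite is not idempotent: a second pass would turn the new tile_reduce into tile_rooted_reduce, so it should be applied once, on a clean checkout, and the resulting diff reviewed by hand.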

2. Static Library Removal (HIGH)

  • The static-only libnvshmem.a library has been removed
  • Builds must now link against the separate libnvshmem_host and libnvshmem_device libraries

Impact: Build scripts and linking configuration must be updated.
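
One way a build script could cope with both layouts during the transition is to probe the library directory and pick linker flags accordingly. This is only a sketch: `has_lib` and `nvshmem_link_flags` are hypothetical helpers, and the file extensions of the split libraries are not stated in this issue, so the probe matches any extension.

```shell
# Hypothetical helper: true if any file named <dir>/<name>.* exists
# (the extension, .a or .so, is deliberately not assumed).
has_lib() {
  ls "$1"/"$2".* >/dev/null 2>&1
}

# Hypothetical helper: print linker flags for whichever NVSHMEM layout
# is present in the given library directory.
nvshmem_link_flags() {
  libdir=$1
  if has_lib "$libdir" libnvshmem_host && has_lib "$libdir" libnvshmem_device; then
    # Split libraries: the only option once libnvshmem.a is gone.
    echo "-L$libdir -lnvshmem_host -lnvshmem_device"
  elif [ -e "$libdir/libnvshmem.a" ]; then
    # Legacy static-only library (older pins such as v3.4.5-0).
    echo "-L$libdir -lnvshmem"
  else
    echo "no NVSHMEM libraries found in $libdir" >&2
    return 1
  fi
}
```

A build step might then call `nvshmem_link_flags "$NVSHMEM_HOME/lib"`, where `NVSHMEM_HOME` stands for wherever the container installs NVSHMEM (an assumed variable name).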

3. Configuration Defaults Changed (MEDIUM)

  • NVSHMEM_MAX_MEMORY_PER_GPU increased from 128 GiB → 256 GiB (default)
  • Default number of used QPs changed from 4 → 8 for full bandwidth with data direct NICs

Impact: The default per-GPU memory limit doubles, which may affect multi-GPU nodes with limited memory unless the value is set explicitly.
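
To keep the previous behavior across the upgrade, the limit could be pinned in the container environment rather than left to the new default. The sketch below assumes NVSHMEM_MAX_MEMORY_PER_GPU accepts a plain byte count (check the NVSHMEM environment variable documentation before relying on this); `gib_to_bytes` is a hypothetical helper.

```shell
# Hypothetical helper: convert a size in GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Pin the limit to the previous 128 GiB default.
# Assumption: the variable accepts a plain byte count.
NVSHMEM_MAX_MEMORY_PER_GPU=$(gib_to_bytes 128)
export NVSHMEM_MAX_MEMORY_PER_GPU
echo "NVSHMEM_MAX_MEMORY_PER_GPU=$NVSHMEM_MAX_MEMORY_PER_GPU"
```

Setting the value explicitly also documents the intended limit in one place, so a later default change upstream does not silently alter it.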

New Features

Positive Changes

  • qpair-specific API calls (nvshmemx_qp_*) for RMA operations on specific queue pairs
  • Tile-granular RMA routines (tile_put, tile_get, tile_broadcast)
  • LLVM bitcode library support for IBGDA
  • Option to pass CUDA device to nvshmemx_init_attr
  • EFA transport improvements with multiple bug fixes for AWS environments
  • CUTLASS support for tile API calls
  • Fix for a race condition in barrier that caused hangs on unordered networks

Hardware Support

  • CUDA Toolkit: 12.4, 12.9, 13.0, 13.1
  • GPUs: Ampere A100, Hopper, Blackwell
  • CPUs: x86 and NVIDIA Grace

Files Affected

Per docs/upstream-versions.md:

  • docker/Dockerfile.cuda line 42 (NVSHMEM tag/version)

Recommended Actions

  1. Audit build scripts for libnvshmem.a references:

    grep -r "libnvshmem\.a" docker/ --include="Dockerfile*" --include="*.sh"
  2. Check for tile API usage in vLLM codebase (pinned at d7de043d):

    # Check if vLLM uses NVSHMEM tile APIs (run from the root of the vLLM checkout)
    grep -rnE "tile_(allreduce|reduce|rooted_reduce)" . --include="*.cu" --include="*.cpp" --include="*.h"
  3. Update Dockerfile:

    # Line 42 in docker/Dockerfile.cuda
    ARG NVSHMEM_VERSION=v3.5.19-1
  4. Review memory requirements: 256 GiB default per GPU may be excessive; consider setting NVSHMEM_MAX_MEMORY_PER_GPU explicitly

  5. Test container build and E2E tests

  6. Verify EFA transport if running on AWS (multiple improvements in this release)
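
After step 3, a small check in CI could confirm that the Dockerfile really carries the intended tag. This is a sketch under the assumption that the pin is a single line of the form `ARG NVSHMEM_VERSION=<tag>`, as shown above; `nvshmem_pin` and `check_nvshmem_pin` are hypothetical helpers.

```shell
# Hypothetical helper: print the NVSHMEM version pinned in a Dockerfile.
nvshmem_pin() {
  sed -n 's/^ARG NVSHMEM_VERSION=//p' "$1"
}

# Hypothetical helper: succeed only if the pin matches the expected tag.
check_nvshmem_pin() {
  actual=$(nvshmem_pin "$1")
  if [ "$actual" = "$2" ]; then
    echo "ok: NVSHMEM pinned at $actual"
  else
    echo "mismatch: expected $2, found ${actual:-<none>}" >&2
    return 1
  fi
}
```

For example, `check_nvshmem_pin docker/Dockerfile.cuda v3.5.19-1` would exit non-zero until the pin has been updated.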

Impact

HIGH - Breaking API changes (tile function renames) and removal of static library will break builds that use these APIs. Memory default doubling may cause OOM on memory-constrained nodes.

Upstream References


Generated by Upstream Dependency Monitor
