Releases · openucx/ucc

21 Oct 15:46

wfaderhold21

v1.6.0-rc2

9093b01

v1.6.0-rc2 Pre-release

What's Changed

Build and Test

Check for CX7 in wait_on_data gtest {PR #1127}

Tools

Add CUDA managed memory type to ucc_perftest {PR #1199}

13 Oct 17:13

wfaderhold21

v1.6.0-rc1

6575f83

v1.6.0-rc1 Pre-release

New Features and Enhancements

Core

Added UCC_DEBUGGER_WAIT environment variable {PR #1130}

CL/HIER

Fixed -Wlto-type-mismatch warning {PR #1179}

TL/CUDA

Fixed printing of device PCI id {PR #1053}
Added NVLS improvements and bfloat16 data type support {PR #1162}
Added NVLS barrier {PR #1180}
Added Alltoall(v) copy engine {PR #1138}

TL/UCP

Removed a debug print statement {PR #1177}
Added knomial allgather with mapped buffers {PR #1176}
Added node local id config {PR #1189}
Enabled knomial allgatherv {PR #1188}
Added congestion avoidant onesided Alltoall {PR #1096}

Build and Test

Added check to see if target exists in CMake {PR #1173}
Fixed build with GCC 14 {PR #1190}
Added gtest and mpi test for ucc_mem_map and ucc_mem_unmap {PR #1165}

Tools

Updated perftest to print BusBW {PR #1186}
Added support for onesided alltoall in perftest {PR #1194}

10 Sep 14:15

janjust

v1.5.1

16ec7ab

v1.5.1 Latest

What's Changed

CL/HIER

Fix -Wlto-type-mismatch warning {PR #1179}

Build and Test

Adjusted gfx targets for ROCm {PR #1183}

Documentation

v1.5.x: update NEWS {PR #1184}

Full Changelog: v1.5.0...v1.5.1

07 Aug 19:17

janjust

v1.5.0

430e241

v1.5.0

New Features and Enhancements

Core

Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
Enhanced error logs in context creation {PR #1135}
Enhanced error log in collective init {PR #1104}
Added ucc net devices config {PR #1141}
EC/CUDA: Link with stdc++ {PR #1168}

CL/HIER

Added flag for nonroot info {PR #1123}
Removed per node leader, fix double free {PR #1126}

TL/UCP

Fixed allreduce knomial data consistency {PR #1145}
Fixed ag oneshot {PR #1134}
Added Allgather linear implementation {PR #1122}
Fall back if memh not passed {PR #1136}

TL/MLX5

Added HCA-assisted copy & CUDA scratch design {PR #1154}
Added logging for mcast FORCE/TRY modes {PR #1156}
Fixed segfault in multicast team creation {PR #1150}
Recover from ipoib issue in mcast init {PR #1140}
Added configuration to set IB QP SL {PR #1057}
Added ctx global status check {PR #1113}
Added cuda support for zcopy mcast {PR #1118}
Added reliability-init improvements {PR #1163}

TL/CUDA

Added NVLink SHARP (NVLS) Allreduce {PR #1148}
Added Topology Cache {PR #1137}
Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}

EC/ROCM

Include stdbool.h for new versions of ROCM {PR #1146}

TOPO

Node ldr ordered by team {PR #1129}

Build and Test

Fixed coverity issues {PR #1152}
Updated cuda arch {PR #1143}
Changed to CUDA 12.9 {PR #1155}
Added buffers for onesided tests {PR #1100}
Added perftest generator {PR #1147}
Added missing progress calls in UCC_PERFTEST {PR #1151}
Updated versions in CI {PR #1115}
Bumped version to v1.5 {PR #1121}

Documentation

Updated component image 1.4.4 {PR #1153}

Tools

Added perftest generator {PR #1147}
Added missing progress calls in UCC_PERFTEST {PR #1151}

17 Jul 06:18

janjust

v1.5.0-rc1

91a5549

v1.5.0-rc1 Pre-release

New Features and Enhancements

Core

Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
Enhanced error logs in context creation {PR #1135}
Enhanced error log in collective init {PR #1104}
Added ucc net devices config {PR #1141}

CL/HIER

Added flag for nonroot info {PR #1123}
Removed per node leader, fix double free {PR #1126}

TL/UCP

Fixed allreduce knomial data consistency {PR #1145}
Fixed ag oneshot {PR #1134}
Added Allgather linear implementation {PR #1122}
Fall back if memh not passed {PR #1136}

TL/MLX5

Added HCA-assisted copy & CUDA scratch design {PR #1154}
Added logging for mcast FORCE/TRY modes {PR #1156}
Fixed segfault in multicast team creation {PR #1150}
Recover from ipoib issue in mcast init {PR #1140}
Added configuration to set IB QP SL {PR #1057}
Added ctx global status check {PR #1113}
Added cuda support for zcopy mcast {PR #1118}

TL/CUDA

Added NVLink SHARP (NVLS) Allreduce {PR #1148}
Added Topology Cache {PR #1137}
Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}

EC/ROCM

Include stdbool.h for new versions of ROCM {PR #1146}

TOPO

Node ldr ordered by team {PR #1129}

Build and Test

Fixed coverity issues {PR #1152}
Updated cuda arch {PR #1143}
Changed to CUDA 12.9 {PR #1155}
Added buffers for onesided tests {PR #1100}
Added perftest generator {PR #1147}
Added missing progress calls in UCC_PERFTEST {PR #1151}
Updated versions in CI {PR #1115}
Bumped version to v1.5 {PR #1121}

Documentation

Updated component image 1.4.4 {PR #1153}

Tools

Added perftest generator {PR #1147}
Added missing progress calls in UCC_PERFTEST {PR #1151}

09 May 07:55

Sergei-Lebedev

v1.4.4

2c77074

1.4.4

New Features and Enhancements

Core

Implemented asymmetric memory support {PR #1000}
Enhanced error handling and resource cleanup {PR #960, #951}
Improved service team handling {PR #1046}
Fixed triggered post for zero size collectives {PR #960}

CL/HIER

Added allgatherv support {PR #1111}
Implemented node subgroup unpacking {PR #1103}
Added reduce to supported collectives {PR #997}
Fixed integer overflow in alltoall {PR #944}

TL/UCP

Split single and multithreaded send/receive operations {PR #1109}
Added knomial allgather with CUDA memory support {PR #1095}
Implemented reduce SRG knomial algorithm {PR #1058}
Added radix selection to knomial operations {PR #1072}
Added sliding window allreduce implementation {PR #958}
Added knomial allgatherv support {PR #1008}
Added sparbit algorithm for allgather {PR #940}
Extended broadcast active set support for size > 2 {PR #926}
Added knomial algorithm for reduce-scatter {PR #970}

TL/MLX5

Added multicast-based zero-copy broadcast {PR #1087}
Implemented mcast multi-group support {PR #1060}
Added non-blocking CUDA memory copy support {PR #1040}
Added device memory multicast broadcast {PR #989}
Enhanced mcast allgather staging-based algorithm {PR #994}
Improved one-sided mcast reliability initialization {PR #980}
Various performance optimizations in alltoall {PR #1067}
Fixed fences in all-to-all WQEs {PR #1069}
Added context option to disable all-to-all operations {PR #1062}
Improved error handling and device checks {PR #1102}
Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

Added support for allgather operation {PR #1081}
Enabled reduce-scatter with SAT support {PR #1084}
Added SHARP multi-channel support {PR #1049}
Fixed service team OOB handling {PR #1001}
Improved internal OOB usage {PR #986}

CUDA

Added linear broadcast implementation {PR #948}
Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
Enhanced error handling for CUDA context operations {PR #1025}
Fixed context cleanup in CUDA operations {PR #954}

Build and Test

Added support for specific GPU architectures with ROCM {PR #987}
Added UCC pkg-config support {PR #1036}
Fixed build compatibility with NVC compiler {PR #1052}
Enhanced config parser functionality {PR #1092}
Enhanced ASAN/LSAN memory leak detection {PR #1074}
Added error checking and exit handling in gtests {PR #1083}

Documentation

Updated README with UCC publication information {PR #1028}
Added DOCA_UROM documentation {PR #999}
Fixed Doxygen documentation issues {PR #1038}
Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

Implemented new DOCA UROM plugin {PR #978}
Added support for offloading collective operations to DPUs
Implemented allreduce collective

15 Apr 08:06

Sergei-Lebedev

v1.4.4-rc1

d733d17

v1.4.4-rc1 Pre-release

New Features and Enhancements

Core

Implemented asymmetric memory support {PR #1000}
Enhanced error handling and resource cleanup {PR #960, #951}
Improved service team handling {PR #1046}
Fixed triggered post for zero size collectives {PR #960}

CL/HIER

Added allgatherv support {PR #1111}
Implemented node subgroup unpacking {PR #1103}
Added reduce to supported collectives {PR #997}
Fixed integer overflow in alltoall {PR #944}

TL/UCP

Split single and multithreaded send/receive operations {PR #1109}
Added knomial allgather with CUDA memory support {PR #1095}
Implemented reduce SRG knomial algorithm {PR #1058}
Added radix selection to knomial operations {PR #1072}
Added sliding window allreduce implementation {PR #958}
Added knomial allgatherv support {PR #1008}
Added sparbit algorithm for allgather {PR #940}
Extended broadcast active set support for size > 2 {PR #926}
Added knomial algorithm for reduce-scatter {PR #970}

TL/MLX5

Added multicast-based zero-copy broadcast {PR #1087}
Implemented mcast multi-group support {PR #1060}
Added non-blocking CUDA memory copy support {PR #1040}
Added device memory multicast broadcast {PR #989}
Enhanced mcast allgather staging-based algorithm {PR #994}
Improved one-sided mcast reliability initialization {PR #980}
Various performance optimizations in alltoall {PR #1067}
Fixed fences in all-to-all WQEs {PR #1069}
Added context option to disable all-to-all operations {PR #1062}
Improved error handling and device checks {PR #1102}
Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

Added support for allgather operation {PR #1081}
Enabled reduce-scatter with SAT support {PR #1084}
Added SHARP multi-channel support {PR #1049}
Fixed service team OOB handling {PR #1001}
Improved internal OOB usage {PR #986}

CUDA

Added linear broadcast implementation {PR #948}
Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
Enhanced error handling for CUDA context operations {PR #1025}
Fixed context cleanup in CUDA operations {PR #954}

Build and Test

Added support for specific GPU architectures with ROCM {PR #987}
Added UCC pkg-config support {PR #1036}
Fixed build compatibility with NVC compiler {PR #1052}
Enhanced config parser functionality {PR #1092}
Enhanced ASAN/LSAN memory leak detection {PR #1074}
Added error checking and exit handling in gtests {PR #1083}

Documentation

Updated README with UCC publication information {PR #1028}
Added DOCA_UROM documentation {PR #999}
Fixed Doxygen documentation issues {PR #1038}
Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

Implemented new DOCA UROM plugin {PR #978}
Added support for offloading collective operations to DPUs
Implemented allreduce collective

18 Apr 18:10

manjugv

v1.3.0

1522ccf

1.3.0 (April 18th, 2024)

New Features and Enhancements

CL/HIER

Disable onesided alltoallv {PR #875}

TL/CUDA

Initialize remote CUDA scratch to NULL {PR #911}

TL/UCP

Enable hybrid alltoallv {PR #781}
Avoid copy in knomial scatter {PR #771}
Enable reorder ranks to reduce_scatter, Knomial Allreduce, Ring Allgather/v {PR #819}
Remove memcpy in last SRA step {PR #743}
Fix sparse pack in hybrid a2av {PR #825}
Fix recycle in hybrid a2av {PR #827}
Reorder ranks for SRA {PR #834}
Use ring allgather when reordering needed {PR #879}
Use pipelining in SRA allreduce for CUDA {PR #873}
Poll for onesided alltoall completion {PR #876}
Add support for non-host buffers in bruck alltoall {PR #852}
Added Neighbor Exchange Allgather {PR #822}

TL/SHARP

Enable bcast for any predefined dt {PR #774}
Don't print team create error {PR #777}
Check datasize supported {PR #776}
Fix sharp context cleanup {PR #843}

API

Remove duplicate get_version_string {PR #933}

TL/NCCL

Make team init non-blocking {PR #772}
Add CUDA managed to score {PR #793}
Make ncclGroupEnd nb {PR #798}
Lazy init nccl comm {PR #851}

TL/MLX5

Share ib_ctx and pd {PR #749}
Rcache {PR #753}
Device memory and topo init {PR #780}
Adding mcast interface {PR #784}
A2A part 1 -- coll init {PR #790}
A2A part 2 -- full collective {PR #802}
Revisit team and ctx init {PR #815}
Fix context create hang {PR #887}
Add librdmacm linkage {PR #910}

CORE

Fix score update when only score given {PR #779}
Coverity fixes {PR #809}
Additional Coverity fixes {PR #813}
Fix error handling for ctx create epilog {PR #818}
Skip zero size collectives {PR #787}

DOCS

Updating NEWS for v1.2 {PR #791}
Updating NEWS for v1.3 {PR #937}

BUILD and TEST

Updated build system to enable UCC with ROCm 6.x {PR #906 and #917}
Check op and dt compatibility {PR #773}
Fix barrier test {PR #799}
Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}

04 Mar 20:19

manjugv

v.1.3.0-rc1

484f69a

v1.3.0-rc1 Pre-release

1.3.0 (TBD)

New Features and Enhancements

CL/HIER

Disable onesided alltoallv {PR #875}

TL/CUDA

Initialize remote CUDA scratch to NULL {PR #911}

TL/UCP

Enable hybrid alltoallv {PR #781}
Avoid copy in knomial scatter {PR #771}
Enable reorder ranks to reduce_scatter, Knomial Allreduce, Ring Allgather/v {PR #819}
Remove memcpy in last SRA step {PR #743}
Fix sparse pack in hybrid a2av {PR #825}
Fix recycle in hybrid a2av {PR #827}
Reorder ranks for SRA {PR #834}
Use ring allgather when reordering needed {PR #879}
Use pipelining in SRA allreduce for CUDA {PR #873}
Poll for onesided alltoall completion {PR #876}
Add support for non-host buffers in bruck alltoall {PR #852}
Added Neighbor Exchange Allgather {PR #822}

TL/SHARP

Enable bcast for any predefined dt {PR #774}
Don't print team create error {PR #777}
Check datasize supported {PR #776}
Fix sharp context cleanup {PR #843}

API

Remove duplicate get_version_string {PR #933}

TL/NCCL

Make team init non-blocking {PR #772}
Add CUDA managed to score {PR #793}
Make ncclGroupEnd nb {PR #798}
Lazy init nccl comm {PR #851}

TL/MLX5

Share ib_ctx and pd {PR #749}
Rcache {PR #753}
Device memory and topo init {PR #780}
Adding mcast interface {PR #784}
A2A part 1 -- coll init {PR #790}
A2A part 2 -- full collective {PR #802}
Revisit team and ctx init {PR #815}
Fix context create hang {PR #887}
Add librdmacm linkage {PR #910}

CORE

Fix score update when only score given {PR #779}
Coverity fixes {PR #809}
Additional Coverity fixes {PR #813}
Fix error handling for ctx create epilog {PR #818}
Skip zero size collectives {PR #787}

DOCS

Updating NEWS for v1.2 {PR #791}

TEST

Check op and dt compatibility {PR #773}
Fix barrier test {PR #799}
Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}

13 Jun 13:27

manjugv

v1.2.0

20fc186

UCC v1.2.0

This release includes numerous updates, bug fixes, and improvements across various components. The following is a summary of the changes based on the commit messages:

New Features and Enhancements

CL/HIER

Fixed single proc on node issue in alltoall (#658)
Implemented allreduce rab pipelined (#608)
Added bcast 2step algorithm (#620)
Fixed allreduce rab pipeline (#759)

TL/CUDA

Support for CUDA 12
Fixed cache unmap issue (#642)
Implemented reduce scatter linear (#669)
Added algorithm selection based on topology (#688)
Fixed linear algorithms (#751)
Fixed pipelining in linear rs (#770)

TL/UCP

Added special service worker (#560)
Added scatterv (#663)
Added gatherv (#664)
Fixed running with npolls 0 (#695)
Added knomial allgather (#729)
Fixed bug for triggered colls (#757)
Added bruck alltoall (#756)
Added SLOAV alltoallv (#687)
Large message broadcast optimizations (#738)
Ranks reordering in ring allgather for better locality (#69)

TL/SHARP

Fixed memory type check in allreduce (#662)
Added support for sharpv3 dt (#661)
Fixed assert check (#686)
Implemented SHARP OOB fixes (#746)
Fixed local rank when NODE SBGP not enabled (#760)
Prevented sharp team with team max ppn > 1 (#761)

CORE

Fixed memory type score update (#650)
Fixed ucc parser build (#666)
Implemented ucc_pipeline_params (#675)
Changed log level of config_modify (#667)
Fixed timeout handle for triggered post (#679)

DOCS

Added User Guide (#720)
