
TL/UCP: add allreduce ring algorithm #1258

Open
wfaderhold21 wants to merge 1 commit into openucx:master from wfaderhold21:topic/allreduce-ring

Conversation

@wfaderhold21 (Collaborator)

What

This is a reproduction of PR #1082 with some changes. The original algorithm implemented allreduce as a ring-based reduce-scatter followed by an allgather. This version does the same, but has been converted to a schedule-based approach that reuses the existing reduce-scatter and allgather ring algorithms, which improves performance over the original approach.

Allreduce Performance Comparison

Configuration: Thor cluster, 16 Nodes, 32 PPN (512 processes total)

| Size | UCC Ring (μs) | UCC Default (μs) | OMPI Tuned (μs) | Ring vs Default | Ring vs OMPI |
|------|---------------|------------------|-----------------|-----------------|--------------|
| 4B | 20.09 | 17.41 | 30.04 | 0.87x | 1.49x |
| 8B | 14.33 | 12.18 | 12.34 | 0.85x | 0.86x |
| 16B | 14.96 | 12.96 | 6.93 | 0.87x | 0.46x |
| 32B | 11.88 | 12.11 | 5.46 | 1.02x | 0.46x |
| 64B | 12.60 | 12.12 | 7.06 | 0.96x | 0.56x |
| 128B | 18.31 | 18.88 | 5.73 | 1.03x | 0.31x |
| 256B | 19.06 | 18.43 | 8.41 | 0.97x | 0.44x |
| 512B | 27.81 | 27.37 | 7.48 | 0.98x | 0.27x |
| 1KB | 35.48 | 35.84 | 19.34 | 1.01x | 0.55x |
| 2KB | 401.68 | 51.39 | 22.06 | 0.13x | 0.05x |
| 4KB | 971.32 | 36.01 | 28.00 | 0.04x | 0.03x |
| 8KB | 940.93 | 44.18 | 43.39 | 0.05x | 0.05x |
| 16KB | 979.73 | 56.61 | 103.19 | 0.06x | 0.11x |
| 32KB | 1120.04 | 71.34 | 118.56 | 0.06x | 0.11x |
| 64KB | 1220.42 | 97.15 | 108.31 | 0.08x | 0.09x |
| 128KB | 1489.83 | 138.17 | 149.70 | 0.09x | 0.10x |
| 256KB | 3164.63 | 218.73 | 231.86 | 0.07x | 0.07x |
| 512KB | 3519.75 | 367.45 | 378.86 | 0.10x | 0.11x |
| 1MB | 4429.24 | 708.90 | 737.39 | 0.16x | 0.17x |
| 2MB | 5701.08 | 2113.65 | 2231.48 | 0.37x | 0.39x |
| 4MB | 7557.97 | 6639.26 | 6621.17 | 0.88x | 0.88x |
| 8MB | 12714.40 | 16802.76 | 13050.44 | 1.32x | 1.03x |
| 16MB | 19043.60 | 36955.76 | 25359.61 | 1.94x | 1.33x |
| 32MB | 32388.59 | 73831.34 | 51227.33 | 2.28x | 1.58x |
| 64MB | 60050.21 | 149227.26 | 135101.16 | 2.49x | 2.25x |
| 128MB | 139286.78 | 300387.02 | 278375.47 | 2.16x | 2.00x |
| 256MB | 287431.78 | 597718.12 | 572264.84 | 2.08x | 1.99x |

See the attached graphs for a visual representation of the table above.
allreduce_comparison.pdf
allreduce_comparison_large.pdf


greptile-apps bot commented Jan 22, 2026

Greptile Summary

This PR adds a ring-based allreduce algorithm to TL/UCP by composing the existing reduce_scatter_ring and allgather_ring primitives via a UCC schedule. The implementation is clean and follows the general schedule-based pattern seen in allreduce_sra_knomial and bcast_sag_knomial, with correct count handling (passing total count with IN_PLACE to reduce_scatter_ring) and a proper memcpy in allreduce_ring_start to handle the non-inplace allreduce case. Performance data shows a clear advantage for large messages (4 MB+) with a 2-2.5× speedup over the default algorithm.

Key findings:

  • Error handling inconsistency — allreduce_ring_init uses err_ag/err_rs fallthrough labels that manually call task->finalize() before ucc_tl_ucp_put_schedule. This deviates from every other similar implementation in the TL/UCP layer (bcast_sag_knomial, reduce_scatter_ring, sra_knomial), which rely solely on put_schedule in error paths. Once a task has been added to the schedule via ucc_schedule_add_task, manually finalizing it while the schedule still holds a reference in schedule->tasks[] is architecturally inconsistent with UCC's schedule ownership model and could become problematic if the mpool or scheduler behaviour changes.
  • ring_persistent test missing inplace coverage — the persistent test only exercises TEST_NO_INPLACE, leaving the ucc_mc_memcpy branch in allreduce_ring_start untested across repeated executions.
  • The new test counts ({15, 65535, 123570} for n_procs=15 and multiples of LCM(3,7,13)=273 for edge cases) are all correctly divisible by their respective team sizes.

Confidence Score: 3/5

  • The algorithm logic and count handling are correct, but an error-path inconsistency in the init function and missing inplace persistence test coverage warrant attention before merging.
  • The core schedule composition (reduce_scatter_ring → allgather_ring), count semantics, and start/finalize callbacks are sound. The performance regression for small/medium messages is inherent to the algorithm's count divisibility constraint and is documented. Two concerns lower confidence: (1) the error-handling pattern in allreduce_ring_init deviates from the established TL/UCP codebase convention in a way that could cause subtle ownership issues if internals change, and (2) the ring_persistent test does not cover the inplace code path in allreduce_ring_start.
  • src/components/tl/ucp/allreduce/allreduce_ring.c (error handling in init), test/gtest/coll/test_allreduce.cc (ring_persistent inplace coverage)

Important Files Changed

| Filename | Overview |
|----------|----------|
| src/components/tl/ucp/allreduce/allreduce_ring.c | New ring allreduce implementation using a schedule that chains reduce_scatter_ring + allgather_ring. Error handling in the init path deviates from the established codebase pattern and may cause issues when tasks have already been added to the schedule before a failure. |
| src/components/tl/ucp/allreduce/allreduce.h | Adds the UCC_TL_UCP_ALLREDUCE_ALG_RING enum value (placed before LAST) and the function declaration for ucc_tl_ucp_allreduce_ring_init. Changes look correct. |
| src/components/tl/ucp/allreduce/allreduce.c | Registers the new ring algorithm info entry with correct id, name, and description using designated-initializer syntax. |
| src/components/tl/ucp/tl_ucp_coll.c | Adds a RING case to the alg_id_to_init switch, wiring UCC_TL_UCP_ALLREDUCE_ALG_RING to ucc_tl_ucp_allreduce_ring_init. Straightforward and correct. |
| src/components/tl/ucp/Makefile.am | Adds allreduce_ring.c to the build system. Correct. |
| test/gtest/coll/test_allreduce.cc | Adds ring algorithm tests (ring, ring_edge_cases, ring_persistent) and moves rab/rab_pipelined further down. Counts in the new ring tests are divisible by their respective team sizes. ring_persistent only covers the non-inplace code path, missing coverage of the inplace branch in allreduce_ring_start. |

Last reviewed commit: 096bae9

greptile-apps bot left a comment


3 files reviewed, 3 comments


```cpp
}

// Test with various data sizes: small, medium, large
for (auto count : {8, 65536, 123567}) {
```

test counts not divisible by team size (n_procs=15)

The ring algorithm requires count % tsize == 0 (enforced at allreduce_ring.c:101-105). With n_procs=15:

  • count 8: 8 % 15 = 8 (fails)
  • count 65536: 65536 % 15 = 1 (fails)
  • count 123567: 123567 % 15 = 12 (fails)

Use counts divisible by 15, e.g. {15, 65520, 123570}

Suggested change:

```diff
-for (auto count : {8, 65536, 123567}) {
+for (auto count : {15, 65520, 123570}) {
```

```cpp
UccTeam_h team = job.create_team(team_size);
UccCollCtxVec ctxs;

for (auto count : {0, 1, 3, 17}) {
```

most test counts not divisible by team sizes

The ring algorithm requires count % tsize == 0. Most combinations will fail:

  • team_size=3: only count 0 and 3 work (1, 17 fail)
  • team_size=7: only count 0 works (1, 3, 17 fail)
  • team_size=13: only count 0 works (1, 3, 17 fail)

Use counts that are multiples of all team sizes (e.g., LCM(3,7,13)=273):

Suggested change:

```diff
-for (auto count : {0, 1, 3, 17}) {
+for (auto count : {0, 273, 546, 819}) {
```

Comment on lines +222 to +231
```c
struct {
    void *scratch;
    ucc_mc_buffer_header_t *scratch_mc_header;
    ucc_ee_executor_task_t *etask;
    ucc_ee_executor_t *executor;
    ucc_tl_ucp_copy_task_t *copy_task;
    int phase;
    int step;
    size_t chunk_size;
    int p2p_posted;
```

unused struct - schedule-based implementation doesn't use task state

The schedule-based ring allreduce (allreduce_ring.c) uses ucc_schedule_t and doesn't access task->allreduce_ring. This struct appears to be leftover from the original non-schedule implementation (PR #1082).

Suggested change (remove the leftover struct):

```diff
-struct {
-    void *scratch;
-    ucc_mc_buffer_header_t *scratch_mc_header;
-    ucc_ee_executor_task_t *etask;
-    ucc_ee_executor_t *executor;
-    ucc_tl_ucp_copy_task_t *copy_task;
-    int phase;
-    int step;
-    size_t chunk_size;
-    int p2p_posted;
```

@Sergei-Lebedev left a comment


looks good to me, but the requirement of count%tsize is too strict imho. We can modify ring reduce scatter and ring allgather so that they can handle remainder correctly similar to knomial algorithms. This can be improved in next PR though

Pls fix greptile comments

```diff
@@ -0,0 +1,196 @@
+/**
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

2026

greptile-apps bot left a comment

2 files reviewed, no comments

greptile-apps bot left a comment

2 files reviewed, 1 comment

Comment on lines +124 to +126
```c
 * For reduce_scatter_ring:
 * - dst.info.count should be the per-rank output count (count/tsize)
 * - The algorithm internally computes total = dst.info.count * tsize
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is misleading: it states count should be per-rank (count/tsize), but the code passes the total count at line 132. This is actually correct because the IN_PLACE flag is set at line 134, and reduce_scatter_ring expects the total count when in-place. Update the comment to clarify this.

greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines +95 to +107
```c
    if (UCC_TL_UCP_TEAM_LIB(tl_team)->cfg.reduce_avg_pre_op &&
        coll_args->args.op == UCC_OP_AVG) {
        return UCC_ERR_NOT_SUPPORTED;
    }

    /* Check that count is divisible by team size for ring algorithm */
    if (count % tsize != 0) {
        tl_debug(team->context->lib,
                 "ring requires count (%zu) divisible by team size (%u)",
                 count, tsize);
        return UCC_ERR_NOT_SUPPORTED;
    }
```


Wrong count for ring

ucc_tl_ucp_allreduce_ring_init enforces count % tsize == 0 (where count is dst.info.count, i.e., total elements per rank), but then passes that same total count as rs_args.args.dst.info.count while also forcing UCC_COLL_ARGS_FLAG_IN_PLACE. In reduce_scatter_ring the in-place path interprets dst.info.count as per-rank block count and internally computes total count as dst.info.count * size (reduce_scatter/reduce_scatter_ring.c:95-114, 213-231, 364-366), so this will run with an effective total of count*tsize and produce incorrect offsets/results for normal allreduce inputs.

Also note that count=0 will currently be rejected by the count % tsize check even though reduce_scatter_ring supports it (it just results in no work), so edge-case tests can’t pass as written.

Comment on lines +426 to +434
```cpp
for (auto count : {0, 273, 546, 819}) {
    SET_MEM_TYPE(UCC_MEMORY_TYPE_HOST);
    this->set_inplace(TEST_NO_INPLACE);
    this->data_init(team_size, TypeParam::dt, count, ctxs, false);
    UccReq req(team, ctxs);

    req.start();
    req.wait();
    EXPECT_EQ(true, this->data_validate(ctxs));
```

Zero-count ring test fails

ring_edge_cases includes count=0, but the ring allreduce init currently returns UCC_ERR_NOT_SUPPORTED when count % team_size != 0 (and 0 % team_size is 0, but earlier validation paths in data_init/coll selection can still make this fail depending on how count=0 is handled). More importantly, with the current allreduce_ring wiring, even if init succeeded, the reduce-scatter/allgather composition assumes a meaningful block size; it’s safer to avoid count=0 here unless the algorithm explicitly supports it end-to-end.

greptile-apps bot left a comment

6 files reviewed, no comments

Commits:

  • TEST: add tests for allreduce ring algorithm
  • TL/UCP: convert allreduce ring to schedule rs+ag
  • TEST: update and reorder gtest allreduce tests
@wfaderhold21 force-pushed the topic/allreduce-ring branch from e1229b6 to 096bae9 on March 10, 2026 22:17
Comment on lines +154 to +185
```c
    /* Add reduce-scatter to schedule and subscribe to schedule start */
    UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, rs_task), err_ag, status);
    UCC_CHECK_GOTO(ucc_event_manager_subscribe(&schedule->super,
                                               UCC_EVENT_SCHEDULE_STARTED,
                                               rs_task,
                                               ucc_task_start_handler),
                   err_ag, status);

    /* Add allgather to schedule and subscribe to reduce-scatter completion */
    UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, ag_task), err_ag, status);
    UCC_CHECK_GOTO(ucc_event_manager_subscribe(rs_task, UCC_EVENT_COMPLETED,
                                               ag_task,
                                               ucc_task_start_handler),
                   err_ag, status);

    schedule->super.post     = ucc_tl_ucp_allreduce_ring_start;
    schedule->super.finalize = ucc_tl_ucp_allreduce_ring_finalize;

    *task_h = &schedule->super;
    return UCC_OK;

err_ag:
    ag_task->finalize(ag_task);
err_rs:
    rs_task->finalize(rs_task);
err_schedule:
    ucc_tl_ucp_put_schedule(schedule);
out:
```

Error handling inconsistent with codebase — potential double-finalize of scheduled tasks

Every other schedule-based TL/UCP algorithm (bcast_sag_knomial_init, reduce_scatter_ring_init, allreduce_sra_knomial_frag_init) handles errors in a single out: label that only calls ucc_tl_ucp_put_schedule(schedule). This PR introduces err_ag/err_rs labels that manually call task->finalize() before calling put_schedule.

The problem is that ucc_schedule_finalize (called in the normal allreduce_ring_finalize path) iterates schedule->tasks[] and calls each task's finalize. If an error occurs after ucc_schedule_add_task(schedule, rs_task) succeeds (e.g., the ucc_event_manager_subscribe for rs_task fails), we jump to err_ag:, which calls:

  1. ag_task->finalize(ag_task) — correct (ag_task not yet added)
  2. rs_task->finalize(rs_task) — rs_task was already added to schedule->tasks[]
  3. ucc_tl_ucp_put_schedule(schedule) — returns schedule to pool

ucc_tl_ucp_put_schedule is ucc_mpool_put, which does not call ucc_schedule_finalize, so there is no double-finalize in the error path itself. However, schedule->tasks[0] now holds a dangling pointer to the freed rs_task. If the mpool reuses the schedule object (via ucc_schedule_init, which correctly resets n_tasks = 0), the dangling pointer is never accessed. This is technically safe, but fragile.

More importantly, after ucc_schedule_add_task(schedule, rs_task) is called, rs_task->schedule = schedule is set and an internal UCC_EVENT_COMPLETED_SCHEDULE subscription is added to rs_task's event manager. Manually calling rs_task->finalize while the schedule still references it is architecturally inconsistent with how UCC schedules take ownership of their tasks.

The established pattern is:

```c
err_schedule:
    ucc_tl_ucp_put_schedule(schedule);
out:
    return status;
```

Tasks that were not added to the schedule (only rs_task and ag_task before any ucc_schedule_add_task call) should be finalized before put_schedule. Once tasks are added, they are owned by the schedule and should not be manually finalized in the error path.
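Under that ownership model, the corrected error path could look like the following sketch (illustrative only; the err_free_both/err_free_ag label names are made up, and the calls mirror the diff excerpt above):

```c
/* Sketch: finalize a task manually only while the schedule does NOT yet
 * own it; once ucc_schedule_add_task succeeds for a task, its cleanup
 * is left to the schedule and the error path only returns the schedule. */
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, rs_task), err_free_both,
               status);
UCC_CHECK_GOTO(ucc_event_manager_subscribe(&schedule->super,
                                           UCC_EVENT_SCHEDULE_STARTED,
                                           rs_task, ucc_task_start_handler),
               err_free_ag, status);  /* rs_task now owned by schedule */
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, ag_task), err_free_ag,
               status);
UCC_CHECK_GOTO(ucc_event_manager_subscribe(rs_task, UCC_EVENT_COMPLETED,
                                           ag_task, ucc_task_start_handler),
               err_schedule, status); /* both tasks owned by schedule */
...

err_free_both:
    rs_task->finalize(rs_task);  /* not yet added to the schedule */
err_free_ag:
    ag_task->finalize(ag_task);  /* not yet added to the schedule */
err_schedule:
    ucc_tl_ucp_put_schedule(schedule);
out:
    return status;
```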

Comment on lines +430 to +452
```cpp
                UccReq req(team, ctxs);

                req.start();
                req.wait();
                EXPECT_EQ(true, this->data_validate(ctxs));
                this->data_fini(ctxs);
            }
        }
    }

TYPED_TEST(test_allreduce_alg, ring_persistent) {
    // Test persistent operation - results should be consistent across multiple calls
    int n_procs = 8;
    ucc_job_env_t env = {{"UCC_CL_BASIC_TUNE", "inf"},
                         {"UCC_TL_UCP_TUNE", "allreduce:0-inf:@ring"}};
    UccJob job(n_procs, UccJob::UCC_JOB_CTX_GLOBAL, env);
    UccTeam_h team = job.create_team(n_procs);
    int repeat = 3;
    UccCollCtxVec ctxs;
    std::vector<ucc_memory_type_t> mt = {UCC_MEMORY_TYPE_HOST};

    if (UCC_OK == ucc_mc_available(UCC_MEMORY_TYPE_CUDA)) { //add cuda_managed for cl hier?
        mt.push_back(UCC_MEMORY_TYPE_CUDA);
    }

    for (auto count : {65536, 123567}) {
        for (auto inplace : {TEST_NO_INPLACE, TEST_INPLACE}) {
            for (auto m : mt) {
                SET_MEM_TYPE(m);
                this->set_inplace(inplace);
                this->data_init(n_procs, TypeParam::dt, count, ctxs, true);
                UccReq req(team, ctxs);

                for (auto i = 0; i < repeat; i++) {
                    req.start();
                    req.wait();
                    EXPECT_EQ(true, this->data_validate(ctxs));
                    this->reset(ctxs);
                }
                this->data_fini(ctxs);
            }
        }
    SET_MEM_TYPE(UCC_MEMORY_TYPE_HOST);
    this->set_inplace(TEST_NO_INPLACE);
    // Use a larger buffer for persistent test
    size_t count = 1024;
```

ring_persistent does not cover the inplace code path

allreduce_ring_start has two distinct branches: when UCC_IS_INPLACE is true it skips the ucc_mc_memcpy, and when false it performs the copy. The ring_persistent test hard-codes this->set_inplace(TEST_NO_INPLACE), so the inplace branch is never exercised here.

While the basic ring test does cover TEST_INPLACE, ring_persistent specifically validates correct behaviour across multiple iterations of the same request — a scenario where a stale inplace buffer or a mis-handled inplace flag could silently produce wrong results only on later iterations. Consider adding a second loop or a separate ring_persistent_inplace variant:

```cpp
// also test inplace persistent
this->set_inplace(TEST_INPLACE);
this->data_init(n_procs, TypeParam::dt, count, ctxs, true);
UccReq req_ip(team, ctxs);
for (int i = 0; i < 5; i++) {
    req_ip.start();
    req_ip.wait();
    EXPECT_EQ(true, this->data_validate(ctxs));
    this->reset(ctxs);
}
this->data_fini(ctxs);
```

