AZP: UCXX integration - tests + builds #11473
    # Azure wrapper around rapidsai/ci-conda: chmod /opt/conda so the non-root UID Azure runs
    # steps as can use conda/python (rapidsai owns it as root); + adds gdb for stack capture.

    ARG BASE_IMAGE=rapidsai/ci-conda:26.06-latest
Since you've started, RAPIDS 26.06 was released and all ToT development is now happening for 26.08, including the images we depend on. I suggest targeting 26.08 throughout this PR as well.
    # Upstream ucxx header uses usleep() but omits <unistd.h>; undeclared on
    # newer gcc. Affects all C++ phases.
The fix rapidsai/ucxx#674 has been merged on main. If you switch to building main this should not be necessary anymore.
Will there be a new tag soon?
Not too soon, this is why I'd prefer to target main, but Yossi has a preference for stability at this time (understandable) so we may have to wait. The next tag should occur around July 16. Maybe instead of relying on specific tags we can test and target specific commits instead, such that we can do controlled upgrades? I fear keeping an older tag may diverge from RAPIDS CI updates, as you have seen there are several aspects that need to work in tandem (CI images, CI scripts in the project such as ucxx/ci, etc.).
Thanks for the tip! I went ahead and pinned it to the latest main SHA for now. That lets me drop both patches. Going forward, we can update the SHA or switch to a tag in a controlled manner as RAPIDS advances.
ac9f7b1
    # Upstream ucxx examples header uses usleep() but omits <unistd.h>;
    # undeclared on newer gcc. Same patch as build_ucxx.sh.
UCXX tests run in rapidsai/ci-conda and ci-wheel base images. Thin wrappers open /opt/conda and /pyenv so the Azure-injected step user can use them, and add gdb so ucxx's timeout_with_stack.py can capture stacks on hangs.
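As a rough illustration of what "opening" a root-owned tree for an arbitrary step user can look like, here is a minimal sketch. The function name `open_tree_for_any_uid` is hypothetical, and the actual wrapper images may do this differently (for example directly in a Dockerfile RUN step).

```shell
# Sketch only: open a root-owned tool tree to whatever UID the CI agent injects.
# The function name is hypothetical; the real wrapper may do this differently.
open_tree_for_any_uid() {
    tree="$1"
    if [ ! -d "$tree" ]; then
        echo "not a directory: $tree" >&2
        return 1
    fi
    # a+rwX: read/write for everyone; execute/search only on directories and
    # on files that already carry an execute bit for someone.
    chmod -R a+rwX "$tree"
}

# In an image build step this would be called on the paths named above, e.g.:
#   open_tree_for_any_uid /opt/conda
#   open_tree_for_any_uid /pyenv
```

The capital `X` keeps plain data files non-executable, which is usually preferable to a blanket `a+rwx`.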
Pull rapidsai/ucxx as a pipeline resource and add two stages gated on Static_check: UCXX_build (conda + wheel packages, docs, devcontainer, checks) then UCXX_tests (conda C++/Python on the CPU + GPU matrix). Covers x86_64 + aarch64, CUDA 12 + 13; GPU tests on amd64/cuda13. distributed-ucxx excluded (not upstreamed).
build_ucxx.sh and test_ucxx.sh wrap UCXX's ci/*.sh entrypoints for the Azure agents: stage rapids download shims, set the wheel toolchain, run the conda/wheel build, C++ gtest and Python test phases. CPU slices disable CUDA-only gtests; GPU slices force the host CUDA driver so cuInit matches the MPS daemon. test_client_shutdown is skipped (flaky teardown under MPS contention).
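One way a CPU slice can disable CUDA-only gtests is through googletest's `--gtest_filter` flag, where patterns after a leading `-` are excluded and `:` separates patterns. The sketch below assumes that approach; the function name and the suite patterns are hypothetical placeholders, not the ones used in test_ucxx.sh.

```shell
# Sketch only: choose a googletest filter per slice. The patterns below are
# hypothetical placeholders for CUDA-only suites, not names from the PR.
CUDA_ONLY_PATTERNS='*Cuda*:*RMM*'

gtest_filter_for_slice() {
    case "$1" in
        # Leading '-' turns the colon-separated patterns into exclusions.
        cpu) printf '%s\n' "--gtest_filter=-${CUDA_ONLY_PATTERNS}" ;;
        # GPU slices run everything.
        gpu) printf '%s\n' "--gtest_filter=*" ;;
        *)   echo "unknown slice: $1" >&2; return 1 ;;
    esac
}
```

The result would be passed to the test binary unchanged, for example `./some_gtest_binary "$(gtest_filter_for_slice cpu)"`, where the binary name is again only illustrative.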
Each UCX PR must test a fixed UCXX revision; refs/heads/main drifts, so a green run says nothing durable. Pin to a tag and bump it deliberately as new UCXX releases are validated.
RAPIDS 26.06 shipped; ToT and the base images we wrap moved to 26.08.
Pin the rapidsai/ucxx resource to a specific main commit (33deb0b) rather than v0.51.00a. Alpha tags are cut at code-freeze and don't pick up ongoing main work, and an old tag drifts from RAPIDS CI updates (images, ci/ scripts) that must move in tandem. A pinned commit stays immutable/reproducible while letting us do controlled bumps. This commit already includes rapidsai/ucxx#674, so drop the local <unistd.h> patch in build_ucxx.sh + test_ucxx.sh.
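The "fixed revision, never a moving branch" rule could be enforced with a small guard like the sketch below. The function name `is_pinned_ref` is hypothetical, and how the pin is actually expressed in the pipeline YAML is not shown in this thread.

```shell
# Sketch only: treat a ref as "pinned" if it is a tag ref or a hex commit id,
# and reject moving branch refs. The function name is hypothetical.
is_pinned_ref() {
    case "$1" in
        refs/heads/*) return 1 ;;   # branches move, so a green run is not durable
        refs/tags/*)  return 0 ;;   # tags are treated as fixed here
    esac
    # 7 to 40 lowercase hex digits: an abbreviated or full commit id.
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{7,40}$'
}
```

With this check, `33deb0b` and `refs/tags/v0.51.00a` are accepted while `refs/heads/main` is rejected.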
What?
Add UCXX_build + UCXX_tests stages to the UCX PR pipeline. Each UCX PR builds UCXX conda packages, libucxx/ucxx wheels, and docs from rapidsai/ucxx, and runs the C++ (CPU+GPU), Python (GPU), and wheel tests.
Why?
Move UCXX CI from RAPIDS GitHub Actions onto UCX's Azure pipeline (mirrors upstream pr.yaml).
How?
Two runner scripts in buildlib/tools/ (build_ucxx.sh, test_ucxx.sh) + container images wrapping rapidsai/ci-conda and rapidsai/ci-wheel. Matrix: CUDA 12 + 13 × x86_64 + aarch64.
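For reference, the matrix above expands to four build slices. A minimal sketch of that expansion follows; the `cudaNN-arch` naming is a hypothetical convention, not taken from the pipeline.

```shell
# Sketch only: enumerate the CUDA x architecture matrix named above.
# The "cudaNN-arch" slice naming is a hypothetical convention.
list_matrix_slices() {
    for cuda in 12 13; do
        for arch in x86_64 aarch64; do
            printf 'cuda%s-%s\n' "$cuda" "$arch"
        done
    done
}
```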