Commit eba8d59

Add cu130 variants for all ray images (#63801)
Adds first-class cu130 (CUDA 13.0) support to Ray's image build pipeline and release test suite. Previously cu130 existed only as a one-off compiled_graph experiment pinned to py3.12; this generalizes it into proper cu130 base images plus a reusable cu130 GPU test dependency layer spanning Python 3.10–3.13.

Image builds

- ci/raydepsets/configs/rayimg.depsets.yaml
  - Added cu130 build-arg sets and made CUDA explicit in the build-arg key rather than implied: regular keys (py310…py314) carry no CUDA_CODE, and CUDA builds use explicit *_cu128 / *_cu130 keys. This prevents a "regular" Python build from silently pulling a CUDA index.
  - New ray_img_cu130_* depset that relaxes cupy-cuda12x → cupy-cuda13x for the cu130 image chain.
  - New ray-gpu-cu130-base_extra_testdeps_* depset for the cu130 GPU base-extra-testdeps image.
  - Left the ray-ml depset as a single, non-cuda-coded, py3.10-only lock (ray-ml only ships a CUDA 12.x image), keeping the lock that byod.Dockerfile actually consumes.
- .buildkite/{base,build,_images,linux_aarch64,release/build}.rayci.yml: wire cu13.0.0-cudnn into the base/release image matrices.
- New compiled locks: python/deplocks/ray_img/ray_img_cu130_py3.{10–14}.lock and ray-gpu-cu130-base_extra_testdeps_py3.{10–14}.lock.
- ray-images.json / ci/ray_ci/test_ray_docker_container.py updated for cu130.

GPU release tests

- ci/raydepsets/configs/release_gpu_cu130.depsets.yaml (new): a shared gpu_cu130_py3.{10–13}.lock torch layer that expands the gpu-cu130 base image with a CUDA 13.x torch build, constrained to the base image lock so versions (e.g. cupy-cuda13x) stay consistent. Installed via python_depset (BYOD); this replaces the old post_build_script + ray-ml-image approach.
- Removed the legacy release_compiled_graph_gpu_cu130.depsets.yaml, byod_compiled_graph_gpu_cu130.sh, and requirements_compiled_graph_gpu_cu130.in.
- New requirements_gpu_cu130.in / requirements_byod_gpu_cu130.in.
- New workload jobs_check_cuda_version.py asserting the runtime torch CUDA version is 13.0.
- release/release_tests.yaml: cu130 tests across Python 3.10–3.13:
  - hello_world_cu130_py{3.10–3.13}
  - jobs_check_cuda_version_cu130_py{3.10–3.13}
  - jobs_check_cuda_available.py3{10–13}_cu130 (new variations)
  - compiled_graphs_GPU_cu130_py{3.10–3.13} and compiled_graphs_GPU_multinode_cu130_py{3.10–3.13} (converted from a single py3.12 test to a full Python matrix on the shared cu130 torch layer)

Testing

Running the new cu130 GPU release tests (20 total, Python 3.10–3.13): hello_world_cu130, jobs_check_cuda_version_cu130, jobs_check_cuda_available.*_cu130, compiled_graphs_GPU_cu130, compiled_graphs_GPU_multinode_cu130.

Release tests: https://buildkite.com/ray-project/release/builds/96047/canvas?sid=019eab1c-8fd9-4539-a08d-44f060784f05&open=false

---------

Signed-off-by: sai.miduthuri <sai.miduthuri@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
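
The commit message mentions a workload that asserts the runtime torch CUDA version is 13.0, but the file itself is not part of this excerpt. The following is only a minimal sketch of what such a check could look like: the names `check_cuda_version`, `EXPECTED_CUDA`, and `main` are hypothetical, and the only real API assumed is PyTorch's `torch.version.cuda` attribute, which reports the CUDA version torch was built against (or `None` on CPU-only builds).

```python
"""Hypothetical sketch of a CUDA-version check; not the file added by this commit."""

EXPECTED_CUDA = "13.0"  # version named in the commit message


def check_cuda_version(actual, expected=EXPECTED_CUDA):
    """Return `actual` if it matches `expected`, otherwise raise AssertionError."""
    assert actual == expected, f"expected CUDA {expected}, got {actual!r}"
    return actual


def main():
    # Imported lazily so the helper above can be exercised without torch installed.
    import torch

    print("torch CUDA version:", check_cuda_version(torch.version.cuda))
```

A real workload would call `main()` on a cu130 image; on a CUDA 12.x or CPU-only torch build the assertion fails, which is the signal a release test like this would presumably rely on.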
1 parent 05ee25f commit eba8d59

26 files changed

Lines changed: 36863 additions & 1838 deletions

.buildkite/_images.rayci.yml

Lines changed: 1 addition & 0 deletions
@@ -234,6 +234,7 @@ steps:
         - "12.6.3"
         - "12.8.1"
         - "12.9.1"
+        - "13.0.0"
     env:
       PYTHON_VERSION: "{{array.python}}"
       CUDA_VERSION: "{{array.cuda}}"

.buildkite/release/build.rayci.yml

Lines changed: 3 additions & 12 deletions
@@ -26,10 +26,7 @@ steps:
         - "3.13"
       cuda:
         - "12.3.2-cudnn9"
-      adjustments:
-        - with:
-            python: "3.12"
-            cuda: "13.0.0-cudnn"
+        - "13.0.0-cudnn"
     env:
       PYTHON_VERSION: "{{array.python}}"
       CUDA_VERSION: "{{array.cuda}}"
@@ -95,17 +92,14 @@ steps:
     array:
       gpu:
         - "cu12.3.2-cudnn9"
+        - "cu13.0.0-cudnn"
       python:
         # This list should be kept in sync with the list of supported Python in
         # release test suite
         - "3.10"
         - "3.11"
         - "3.12"
         - "3.13"
-      adjustments:
-        - with:
-            python: "3.12"
-            gpu: "cu13.0.0-cudnn"
     env:
       PYTHON_VERSION: "{{array.python}}"
       GPU: "{{array.gpu}}"
@@ -156,15 +150,12 @@ steps:
     array:
       gpu:
         - cu12.3.2-cudnn9
+        - cu13.0.0-cudnn
       python:
         - "3.10"
         - "3.11"
         - "3.12"
         - "3.13"
-      adjustments:
-        - with:
-            python: "3.12"
-            gpu: cu13.0.0-cudnn
 
   - name: ray-llm-anyscale-cuda-build
     label: "wanda: ray-llm-anyscale py{{array.python}} {{array.gpu}}"

ci/docker/ray-image.Dockerfile

Lines changed: 18 additions & 0 deletions
@@ -17,6 +17,7 @@ FROM ${RAY_WHEEL_IMAGE} AS wheel-source
 FROM ${BASE_IMAGE}
 
 ARG IMAGE_TYPE=ray
+ARG PLATFORM=cpu
 ARG RAY_COMMIT=unknown-commit
 ARG RAY_VERSION=3.0.0.dev0
 
@@ -47,10 +48,27 @@ else
     RAY_EXTRAS="all"
 fi
 
+# TODO(cu130): ray[all]'s cgraph extra hard-pins cupy-cuda12x, so this install
+# always pulls the CUDA-12 build even on cu130 images (no PEP 508 marker exists
+# to select cupy by CUDA version). Until the cgraph extra can resolve cupy per
+# CUDA runtime (or cupy ships a unified package), we patch it up with the
+# uninstall/reinstall swap below. Drop that swap once this install can pick the
+# right cupy directly.
 $HOME/anaconda3/bin/pip --no-cache-dir install \
     -c /home/ray/requirements_compiled.txt \
     "${WHEEL_FILE}[${RAY_EXTRAS}]"
 
+# ray[all]'s cgraph extra hard-pins cupy-cuda12x (a CUDA-12 build), but cu130
+# images ship a CUDA-13 runtime where that build is broken. Swap it for the
+# matching CUDA-13 build. cupy-cuda12x and cupy-cuda13x both own the top-level
+# `cupy` package and cannot coexist, so this is an uninstall-then-install.
+# Scoped to IMAGE_TYPE=ray (covers ray + ray-extra); ray-llm flows through this
+# same Dockerfile but manages cupy via its own llm locks, so leave it untouched.
+if [[ "${IMAGE_TYPE}" == "ray" && "${PLATFORM}" == cu13* ]]; then
+    $HOME/anaconda3/bin/pip --no-cache-dir uninstall -y cupy-cuda12x
+    $HOME/anaconda3/bin/pip --no-cache-dir install "cupy-cuda13x==13.6.0"
+fi
+
 $HOME/anaconda3/bin/pip freeze > /home/ray/pip-freeze.txt
 
 echo "Ray version: $($HOME/anaconda3/bin/python -c 'import ray; print(ray.__version__)')"
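
The guard around the cupy swap in the Dockerfile hunk above is a two-part condition: an exact match on `IMAGE_TYPE` and a shell glob match on `PLATFORM`. As a rough illustration only, the same predicate can be written in Python with `fnmatch.fnmatchcase`; the function name `needs_cupy_swap` is hypothetical, and the example platform strings are assumed to follow the `cu13.0.0-cudnn` style used elsewhere in this commit.

```python
from fnmatch import fnmatchcase


def needs_cupy_swap(image_type: str, platform: str) -> bool:
    """Mirror of `[[ "$IMAGE_TYPE" == "ray" && "$PLATFORM" == cu13* ]]`.

    Hypothetical helper: the Dockerfile does this check in bash, not Python.
    """
    # Exact match on the image type, case-sensitive glob match on the platform.
    return image_type == "ray" and fnmatchcase(platform, "cu13*")


if __name__ == "__main__":
    for image_type, platform in [
        ("ray", "cu13.0.0-cudnn"),
        ("ray", "cu12.9.1-cudnn"),
        ("ray", "cpu"),
        ("ray-llm", "cu13.0.0-cudnn"),
    ]:
        print(image_type, platform, needs_cupy_swap(image_type, platform))
```

Only the first combination triggers the swap: CUDA 12.x and CPU platforms keep whatever the `ray[all]` install produced, and `ray-llm` is excluded by the image-type check even on a CUDA 13 platform.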

ci/ray_ci/test_ray_docker_container.py

Lines changed: 3 additions & 0 deletions
@@ -469,6 +469,9 @@ def test_get_platform_tag(self) -> None:
         container = RayDockerContainer(v, "cu12.9.1-cudnn", "ray")
         assert container._get_platform_tag() == "-cu129"
 
+        container = RayDockerContainer(v, "cu13.0.0-cudnn", "ray")
+        assert container._get_platform_tag() == "-cu130"
+
     def test_should_upload(self) -> None:
         v = DEFAULT_PYTHON_TAG_VERSION
         test_cases = [
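
The assertions in the test hunk above pin down the mapping from a platform string to an image tag suffix (`cu12.9.1-cudnn` to `-cu129`, `cu13.0.0-cudnn` to `-cu130`). The implementation of `_get_platform_tag` is not shown in this diff, so the following is only a sketch of one rule consistent with those two cases: keep the CUDA major and minor version and drop the patch level and suffix. The function name `platform_tag` is hypothetical, and non-CUDA platforms are deliberately rejected here because their handling is not visible in this excerpt.

```python
def platform_tag(platform: str) -> str:
    """Derive a tag suffix such as "-cu130" from "cu13.0.0-cudnn".

    Hypothetical stand-in for RayDockerContainer._get_platform_tag.
    """
    if not platform.startswith("cu"):
        # How the real method treats other platforms is not shown in the diff.
        raise ValueError(f"not a CUDA platform: {platform!r}")
    # "cu13.0.0-cudnn" -> "13.0.0"
    version = platform[len("cu"):].split("-", 1)[0]
    major, minor = version.split(".")[:2]
    return f"-cu{major}{minor}"
```

Under this rule the new `13.0.0` matrix entry needs no special case, which matches the fact that the commit only adds a test assertion for it.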

ci/raydepsets/configs/rayimg.depsets.yaml

Lines changed: 90 additions & 5 deletions
@@ -1,21 +1,49 @@
 build_arg_sets:
+  # Regular (non-GPU) build args: no CUDA_CODE, so depsets that template
+  # ${CUDA_CODE} cannot accidentally pull a CUDA index for a CPU Python build.
   py310:
     PYTHON_VERSION: "3.10"
     PYTHON_SHORT: "310"
   py311:
     PYTHON_VERSION: "3.11"
     PYTHON_SHORT: "311"
-    CUDA_CODE: cu128
   py312:
     PYTHON_VERSION: "3.12"
     PYTHON_SHORT: "312"
-    CUDA_CODE: cu130
   py313:
     PYTHON_VERSION: "3.13"
     PYTHON_SHORT: "313"
   py314:
     PYTHON_VERSION: "3.14"
     PYTHON_SHORT: "314"
+  py310_cu128:
+    PYTHON_VERSION: "3.10"
+    PYTHON_SHORT: "310"
+    CUDA_CODE: cu128
+  py311_cu128:
+    PYTHON_VERSION: "3.11"
+    PYTHON_SHORT: "311"
+    CUDA_CODE: cu128
+  py310_cu130:
+    PYTHON_VERSION: "3.10"
+    PYTHON_SHORT: "310"
+    CUDA_CODE: cu130
+  py311_cu130:
+    PYTHON_VERSION: "3.11"
+    PYTHON_SHORT: "311"
+    CUDA_CODE: cu130
+  py312_cu130:
+    PYTHON_VERSION: "3.12"
+    PYTHON_SHORT: "312"
+    CUDA_CODE: cu130
+  py313_cu130:
+    PYTHON_VERSION: "3.13"
+    PYTHON_SHORT: "313"
+    CUDA_CODE: cu130
+  py314_cu130:
+    PYTHON_VERSION: "3.14"
+    PYTHON_SHORT: "314"
+    CUDA_CODE: cu130
 
 depsets:
   - name: ray_img_depset_${PYTHON_SHORT}
@@ -39,6 +67,29 @@ depsets:
       - ci/raydepsets/pre_hooks/build-placeholder-wheel.sh
       - ci/raydepsets/pre_hooks/remove-compiled-headers.sh ${PYTHON_VERSION}
 
+  # cu130 variant of the core ray image deps: ray[all] pins cupy-cuda12x (a
+  # CUDA 12.x build) unconditionally, which is broken on the cu130 CUDA-13
+  # runtime. Relax it out here so the cu130 gpu base layer can pin the matching
+  # cupy-cuda13x build instead.
+  # TODO(cu130): this relax exists only because the cgraph extra can't select
+  # cupy by CUDA version (no PEP 508 marker for it) — same root cause as the
+  # cupy swap in ci/docker/ray-image.Dockerfile. Drop this relax (and the gpu
+  # base's requirements_byod_gpu_cu130.in) once cupy resolves correctly per CUDA
+  # runtime, so the cu130 chain no longer needs a special-cased core image lock.
+  - name: ray_img_cu130_${PYTHON_SHORT}
+    operation: relax
+    source_depset: ray_img_depset_${PYTHON_SHORT}
+    packages:
+      - cupy-cuda12x
+    output: python/deplocks/ray_img/ray_img_cu130_py${PYTHON_SHORT}.lock
+    # py3.10-3.13 only: cupy-cuda13x==13.6.0 has no cp314 wheel, and no cu130
+    # gpu release test / published image targets py3.14.
+    build_arg_sets:
+      - py310_cu130
+      - py311_cu130
+      - py312_cu130
+      - py313_cu130
+
   - name: ray_base_extra_testdeps_${PYTHON_SHORT}
     operation: expand
     requirements:
@@ -76,8 +127,8 @@ depsets:
       - --python-version=${PYTHON_VERSION}
       - --python-platform=linux
     build_arg_sets:
-      - py311
-      - py312
+      - py311_cu128
+      - py312_cu130
 
   - name: ray_base_extra_testdeps_gpu_${PYTHON_SHORT}
     operation: expand
@@ -96,8 +147,42 @@ depsets:
       - --python-version=${PYTHON_VERSION}
       - --python-platform=linux
     build_arg_sets:
-      - py310
+      - py310_cu128
+
+  - name: ray_base_extra_testdeps_gpu_${CUDA_CODE}_${PYTHON_SHORT}
+    operation: expand
+    requirements:
+      # cu130 gpu base: the cupy-cuda12x-relaxed core image + the matching
+      # cupy-cuda13x==13.6.0 build (requirements_byod_gpu_cu130.in) + base
+      # layers. torch is layered per-test via python_depset (which expands this
+      # base, so it inherits cupy-cuda13x). The published rayproject/ray cu130
+      # image performs the same 12x->13x swap at Docker build time, so all
+      # layers agree on cupy-cuda13x==13.6.0.
+      - release/ray_release/byod/requirements_byod_gpu_cu130.in
+      - docker/base-deps/requirements.in
+      - docker/base-extra/requirements.in
+    constraints:
+      - /tmp/ray-deps/requirements_compiled_py${PYTHON_VERSION}.txt
+    depsets:
+      - ray_img_cu130_${PYTHON_SHORT}
+    output: python/deplocks/base_extra_testdeps/ray-gpu-${CUDA_CODE}-base_extra_testdeps_py${PYTHON_VERSION}.lock
+    append_flags:
+      - --index https://download.pytorch.org/whl/${CUDA_CODE}
+      - --unsafe-package ray
+      - --python-version=${PYTHON_VERSION}
+      - --python-platform=linux
+    # py3.10-3.13 only: cupy-cuda13x==13.6.0 has no cp314 wheel, and no cu130
+    # gpu release test / published image targets py3.14.
+    build_arg_sets:
+      - py310_cu130
+      - py311_cu130
+      - py312_cu130
+      - py313_cu130
 
+  # ray-ml only ships a CUDA 12.x (cu128) image and is Python 3.10 only
+  # (requirements_ml_byod_<ver>.in exists only for 3.10). The single lock below
+  # is consumed by the ray-ml base-extra-testdeps image build (byod.Dockerfile,
+  # IMAGE_TYPE=ray-ml), which reads the non-cuda-coded filename.
   - name: ray_ml_base_extra_testdeps_cuda_${PYTHON_SHORT}
     operation: expand
     requirements:
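
The build-arg change in rayimg.depsets.yaml rests on one idea: a depset template that mentions `${CUDA_CODE}` should only be renderable with a build-arg set that defines it. The sketch below illustrates that idea with Python's `string.Template`, whose `substitute` raises `KeyError` on an undefined placeholder. It is not the raydepsets implementation, and how raydepsets itself treats an undefined variable is not shown in this diff; `BUILD_ARG_SETS` and `render` are hypothetical names, with values copied from the config above.

```python
from string import Template

# Two of the build-arg sets from the config: one CPU-style, one CUDA-coded.
BUILD_ARG_SETS = {
    "py312": {"PYTHON_VERSION": "3.12", "PYTHON_SHORT": "312"},
    "py312_cu130": {
        "PYTHON_VERSION": "3.12",
        "PYTHON_SHORT": "312",
        "CUDA_CODE": "cu130",
    },
}


def render(template: str, build_arg_set: str) -> str:
    """Expand ${VAR} placeholders strictly; a missing variable raises KeyError."""
    return Template(template).substitute(BUILD_ARG_SETS[build_arg_set])


if __name__ == "__main__":
    name = "ray_base_extra_testdeps_gpu_${CUDA_CODE}_${PYTHON_SHORT}"
    print(render(name, "py312_cu130"))
    try:
        render(name, "py312")
    except KeyError as missing:
        print("undefined build arg:", missing)
```

With strict substitution, pairing a CUDA-templated depset with a plain `py312` set fails loudly instead of producing a lock against an unintended index, which is the failure mode the commit message describes as a "regular" Python build silently pulling a CUDA index.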

ci/raydepsets/configs/release_compiled_graph_gpu_cu130.depsets.yaml

Lines changed: 0 additions & 20 deletions
This file was deleted.

ci/raydepsets/configs/release_gpu_cu130.depsets.yaml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+build_arg_sets:
+  py310:
+    PYTHON_VERSION: "3.10"
+    PYTHON_SHORT: "310"
+  py311:
+    PYTHON_VERSION: "3.11"
+    PYTHON_SHORT: "311"
+  py312:
+    PYTHON_VERSION: "3.12"
+    PYTHON_SHORT: "312"
+  py313:
+    PYTHON_VERSION: "3.13"
+    PYTHON_SHORT: "313"
+
+depsets:
+  # Shared torch layer for cu130 GPU release tests. Expands the gpu-cu130 base
+  # image deps (which carry cupy-cuda13x==13.6.0 via the relax in
+  # rayimg.depsets.yaml) with a CUDA 13.x torch build (torch is not in ray[all],
+  # so the core ray image lacks it). Consumed via `python_depset` by
+  # compiled_graphs_GPU_cu130 and jobs_check_cuda_available (cu130 variants).
+  # Because the base lock already pins cupy-cuda13x==13.6.0, this full-closure
+  # install is idempotent with the published image's Docker-build cupy swap — no
+  # post_build_script needed.
+  - name: gpu_cu130_py${PYTHON_SHORT}
+    operation: expand
+    depsets:
+      - ray_base_extra_testdeps_gpu_cu130_${PYTHON_SHORT}
+    requirements:
+      - release/ray_release/byod/requirements_gpu_cu130.in
+    # Constrain to the gpu-cu130 base image lock so this torch layer stays a
+    # consistent superset of the image it is installed onto (e.g. cupy-cuda13x
+    # matches the base instead of floating to latest).
+    constraints:
+      - python/deplocks/base_extra_testdeps/ray-gpu-cu130-base_extra_testdeps_py${PYTHON_VERSION}.lock
+    output: release/ray_release/byod/gpu_cu130_py${PYTHON_VERSION}.lock
+    append_flags:
+      - --index https://download.pytorch.org/whl/cu130
+      - --python-version=${PYTHON_VERSION}
+      - --unsafe-package ray
+      - --python-platform=linux
+    build_arg_sets:
+      - py310
+      - py311
+      - py312
+      - py313
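
The comments in the depset above state the intent that the torch layer be a consistent superset of the base image lock. One way to sanity-check that property on two compiled lock files is sketched below. This is an illustrative helper, not part of raydepsets: the function names are hypothetical, the parser only understands simple `name==version` pins, and every package in the demo other than `cupy-cuda13x==13.6.0` is a placeholder.

```python
def parse_pins(lock_text: str) -> dict:
    """Collect `name==version` pins, ignoring comments, flags and hash lines."""
    pins = {}
    for raw in lock_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line or line.startswith("-"):
            continue  # blank, comment-only, `--hash=...` or `--index ...` line
        name, rest = line.split("==", 1)
        fields = rest.split()
        if fields:
            # Drop a trailing continuation backslash or environment marker.
            pins[name.strip().lower()] = fields[0]
    return pins


def conflicting_pins(base: dict, layer: dict) -> dict:
    """Packages pinned in both locks but at different versions."""
    return {n: (v, layer[n]) for n, v in base.items() if n in layer and layer[n] != v}


def missing_from_layer(base: dict, layer: dict) -> list:
    """Packages pinned in the base lock that the layer lock does not pin."""
    return sorted(set(base) - set(layer))


if __name__ == "__main__":
    # Placeholder lock contents; only the cupy pin comes from the commit.
    base = parse_pins("cupy-cuda13x==13.6.0 \\\n    --hash=sha256:abc\nexample-base==1.0\n")
    layer = parse_pins("cupy-cuda13x==13.6.0\nexample-base==1.0\nexample-extra==2.0\n")
    print(conflicting_pins(base, layer), missing_from_layer(base, layer))
```

A layer that is a consistent superset yields no conflicts and no missing packages; a floated `cupy-cuda13x` version would show up in `conflicting_pins`.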
