[release][train] Adding py3.13 ray-ml image with torchft-nightly #63587
elliot-barn wants to merge 9 commits into
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Code Review
This pull request introduces support for Python 3.13 across the build and release infrastructure, including updates to Buildkite configurations, dependency lock files, and BYOD requirements. It also adds a new suite of nightly training ingest benchmarks for Python 3.13. Feedback was provided regarding a potential typo in a configuration flag, redundant dependency declarations in the new requirements file, and inconsistent argument formatting in the release test definitions.
I am having trouble creating individual review comments. My feedback follows.
release/release_tests.yaml (1967)
anyscale_sdk_2026: true appears to be a typo. This flag is typically anyscale_sdk_v2: true in Ray release tests. Please verify if this is the intended key.
anyscale_sdk_v2: true
release/ray_release/byod/requirements_ml_byod_3.13.in (43-44)
Both torchft==0.1.1 and torchft-nightly are listed. Since the pull request aims to include the nightly version, the stable version is redundant and may cause installation conflicts. It should be removed.
torchft-nightly
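The redundancy flagged above can be caught mechanically. The following is a minimal sketch, assuming a plain requirements file with one requirement per line and assuming that nightly builds follow a `<name>-nightly` naming convention. `requirement_names` and `find_nightly_conflicts` are hypothetical helpers and are not part of Ray's release tooling.

```python
import re

# Hypothetical helpers for illustration; not part of Ray's release tooling.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_names(text):
    """Return the lowercased package names declared in requirements-style text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME_RE.match(line)
        if match:
            names.append(match.group(1).lower())
    return names


def find_nightly_conflicts(text):
    """List packages declared both as `name` and as `name-nightly` (assumed convention)."""
    names = set(requirement_names(text))
    return sorted(name for name in names if name + "-nightly" in names)


sample = "torch==2.9.0\ntorchft==0.1.1\ntorchft-nightly\n"
conflicts = find_nightly_conflicts(sample)
```

Run against the two lines quoted in the review, the helper reports `torchft` as declared twice; with only `torchft-nightly` present it reports nothing.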
release/release_tests.yaml (2038)
The arguments --skip_train_step True and --skip_validation_at_epoch_end True use a space-separated format, which is inconsistent with the --arg=value format used in all other variations of this test (e.g., lines 1989, 2058). Using the consistent format improves maintainability and avoids potential parsing issues.
script: RAY_TRAIN_V2_ENABLED=1 python train_benchmark.py --task=image_classification --dataloader_type=ray_data --num_workers=16 --skip_train_step=True --skip_validation_at_epoch_end=True --image_classification_data_format=s3_url
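Whether the space-separated form actually misparses depends on how train_benchmark.py defines its flags, which this thread does not show. As a sketch under the assumption that the script uses `argparse` with a string-to-boolean converter (the `str_to_bool` and `build_parser` names are hypothetical), both spellings parse to the same values, in which case the suggested change matters for consistency rather than correctness.

```python
import argparse


def str_to_bool(value):
    """Convert common textual booleans to bool (hypothetical converter)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser with the two flags from the review (assumed definitions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip_train_step", type=str_to_bool, default=False)
    parser.add_argument(
        "--skip_validation_at_epoch_end", type=str_to_bool, default=False
    )
    return parser


# Space-separated form, as on the flagged line.
spaced = build_parser().parse_args(
    ["--skip_train_step", "True", "--skip_validation_at_epoch_end", "True"]
)
# --arg=value form, as in the suggested replacement.
equals = build_parser().parse_args(
    ["--skip_train_step=True", "--skip_validation_at_epoch_end=True"]
)
```

A parser built differently, for example with `action="store_true"`, would reject a trailing `True` token, so the real flag definitions are worth checking before relying on either spelling.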
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Add a self-contained raydepsets depset (release_ml_torchft_tests.depsets.yaml) that compiles the Ray ML release-test dependencies with torchft-nightly layered on top, producing release/ray_release/byod/ml_torchft_py3.13.lock for py3.13 / cu128. It is installed onto the core Ray CUDA image via byod_ml_torchft.sh, so torchft release tests no longer depend on the published py3.13 ray-ml image (which fails to build due to dask/nixl py3.13 gaps).

Decouple from the in-progress published py3.13 ray-ml image work by reverting the buildkite image/release steps, ray-images.json, the gpu BYOD py3.13 allowance, and the ml-base-extra-testdeps py3.13 depset + locks, and by removing torchft from the shared requirements_ml_byod_*.in files. torchft now lives only in the dedicated requirements_ml_torchft.in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…/github.com/ray-project/ray into elliot-barn-add-torchft-to-ml-release-image
…ft.txt

Align the py3.13 torchft release-image depset with master after the torch 2.9.0 upgrade (#63361):

- Bump requirements_ml_byod_3.13.in to torch==2.9.0 and drop the stale triton==3.3.0 pin (torch 2.9.0 pulls triton==3.5.0 transitively), matching the py3.13 constraint and ML requirement files.
- Source torchft from the canonical python/requirements/ml/py313/torchft.txt (torchft-nightly==2026.5.15, torch-2.9.0-compatible) instead of a separate requirements_ml_torchft.in, so there is a single torchft pin.
- Regenerate ml_torchft_py3.13.lock -> torch==2.9.0+cu128 / torchaudio 2.11.0+cu128 / triton 3.5.0; verified idempotent so raydepsets --check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
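The pins named in this commit message can be spot-checked against a lock file's text. This is a minimal sketch, assuming the lock uses `name==version` lines in the usual pip-compile/uv style, optionally followed by `--hash` continuation lines. `parse_pins` and `mismatched_pins` are hypothetical helpers, and `sample_lock` is an illustrative excerpt, not the contents of ml_torchft_py3.13.lock.

```python
def parse_pins(lock_text):
    """Map package name -> pinned version for every `name==version` line."""
    pins = {}
    for raw in lock_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and option lines such as --hash=... or --index-url.
        if not line or line.startswith(("#", "-")):
            continue
        line = line.rstrip("\\").strip()  # drop a trailing line continuation
        requirement = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" not in requirement:
            continue
        name, version = requirement.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def mismatched_pins(lock_text, expected):
    """Return {name: actual_version_or_None} for pins that differ from `expected`."""
    pins = parse_pins(lock_text)
    return {
        name: pins.get(name)
        for name, wanted in expected.items()
        if pins.get(name) != wanted
    }


# Illustrative excerpt only; the hash value is a placeholder.
sample_lock = """\
# illustrative excerpt, not the real lock file
torch==2.9.0+cu128 \\
    --hash=sha256:0000
triton==3.5.0
torchft-nightly==2026.5.15
"""

expected = {
    "torch": "2.9.0+cu128",
    "triton": "3.5.0",
    "torchft-nightly": "2026.5.15",
}
problems = mismatched_pins(sample_lock, expected)
```

An empty result means every expected pin is present at the expected version. This complements rather than replaces `raydepsets --check`, which verifies that regeneration is idempotent.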
Add a minimal reference release test showing how to run a release test on the
torchft Ray ML image variant. It uses the core Ray CUDA image (py3.13) with the
torchft dependency lock installed on top:
    cluster:
      anyscale_sdk_2026: true
      byod:
        type: cu123
        post_build_script: byod_ml_torchft.sh
        python_depset: ml_torchft_py3.13.lock
The workload imports torch (2.9.0) + torchft and runs a short Ray Train v2 +
torchft linear training loop to prove the image works end to end. Validated
against the release schema (//release:test_config).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Setting byod.python_depset is sufficient: the BYOD image build automatically copies the lock in and runs `uv pip install --system --no-deps -r python_depset.lock` (release/ray_release/byod/build_context.py). The custom byod_ml_torchft.sh ran the identical command, so it installed the deps a second time for no reason.

Remove byod_ml_torchft.sh and the post_build_script reference from the torchft_hello_world reference test; rely on python_depset alone. Validated with //release:test_config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
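The redundancy described in this commit can be illustrated with a small check. This is a sketch only: `depset_install_cmd` and `redundant_lines` are hypothetical helpers that rebuild the install command quoted in the commit message and flag post-build script lines that repeat it. The real logic lives in release/ray_release/byod/build_context.py and is not reproduced here, and the sample script content is assumed rather than taken from byod_ml_torchft.sh.

```python
import shlex


def depset_install_cmd(lock_path="python_depset.lock"):
    """Rebuild the install command quoted in the commit message as an argv list."""
    return ["uv", "pip", "install", "--system", "--no-deps", "-r", lock_path]


def redundant_lines(script_text, lock_path="python_depset.lock"):
    """Return script lines that repeat the automatic depset install exactly."""
    target = depset_install_cmd(lock_path)
    hits = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):  # skip blanks, comments, shebang
            continue
        try:
            tokens = shlex.split(stripped)
        except ValueError:  # unbalanced quotes; not a simple command line
            continue
        if tokens == target:
            hits.append(stripped)
    return hits


# Assumed script content, for illustration only.
sample_script = (
    "#!/bin/bash\n"
    "set -euo pipefail\n"
    "uv pip install --system --no-deps -r python_depset.lock\n"
)
duplicates = redundant_lines(sample_script)
```

Comparing tokenized commands rather than raw strings keeps the check insensitive to extra whitespace, though it would still miss an equivalent command written with reordered flags.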
Creates a ray-ml py3.13 release test image with torchft-nightly.
Adds a Python 3.13 variation of training_ingest_benchmark-task=image_classification for full_training.jpeg and full_training.s3_url.
Release test run: https://buildkite.com/ray-project/release/builds/93976