Commit b7dc00e

[llm] Upgrade vLLM to 0.21.0 (#63396)
## Description

Two changes unrelated to dependencies:

1. Disable the FlashInfer sampler by default: #63396 (comment)
2. Patch to protect the NIXL EP import path in vLLM: #63396 (comment)

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
1 parent 52a378b commit b7dc00e

16 files changed

Lines changed: 286 additions & 161 deletions

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ exclude: |
 release/release_logs/|
 rllib/offline/tests/data|
 thirdparty/patches/|
+python/requirements/llm/patches/|
 src/ray/thirdparty/|
 doc/external/|
 doc/source/

.vale/styles/Google/Acronyms.yml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ exceptions:
 - MPS
 - NET
 - NFS
+- NIXL
 - NOTE
 - NVDA
 - OSS

ci/docker/llm.build.Dockerfile

Lines changed: 34 additions & 2 deletions
@@ -67,7 +67,39 @@ SKIP_PYTHON_PACKAGES=1 ./ci/env/install-dependencies.sh
 PYTHON_CODE="$(python -c "import sys; v=sys.version_info; print(f'py{v.major}{v.minor}')")"
 pip install --no-deps -r python/deplocks/llm/rayllm_test_${PYTHON_CODE}_${RAY_CUDA_CODE}.lock

+# Temporarily apply the fixes from https://github.com/vllm-project/vllm/pull/39873
+# until the pinned vLLM release includes them.
+VLLM_IMPORT_UTILS_PATCH="$(pwd)/python/requirements/llm/patches/vllm-trial-import-patch"
+VLLM_SITE_PACKAGES="$(python - <<'PY'
+import site
+import sysconfig
+from pathlib import Path
+
+candidate_dirs = [
+    Path(sysconfig.get_paths()["purelib"]),
+    Path(sysconfig.get_paths()["platlib"]),
+    *(Path(path) for path in site.getsitepackages()),
+]
+
+for base_dir in dict.fromkeys(candidate_dirs):
+    import_utils = base_dir / "vllm" / "utils" / "import_utils.py"
+    if import_utils.exists():
+        print(base_dir)
+        break
+else:
+    raise SystemExit("vLLM import_utils.py not found")
+PY
+)"
+(
+cd "${VLLM_SITE_PACKAGES}"
+git apply "${VLLM_IMPORT_UTILS_PATCH}"
+)
+
 EOF

-# Use the revamped ray executor backend in vLLM
-ENV VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1
+
+# vLLM 0.21.0 selects the FlashInfer top-k/top-p sampler during engine initialization
+# instead of the previous PyTorch-native/Triton sampling path. The FlashInfer sampler
+# adds a large one-time engine initialization cost. To avoid performance
+# surprises, we disable the FlashInfer sampler by default.
+ENV VLLM_USE_FLASHINFER_SAMPLER=0
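
The site-packages lookup embedded in the Dockerfile heredoc can be read as a small standalone function. The following is a minimal sketch of that lookup, not the code shipped in the image: `find_package_root` and `default_candidate_dirs` are hypothetical names, and the sketch assumes, as the heredoc does, that the first candidate directory containing the marker file is the one to patch.

```python
import site
import sysconfig
from pathlib import Path
from typing import Iterable, List, Optional


def default_candidate_dirs() -> List[Path]:
    """Collect site-packages candidates in the same order as the heredoc."""
    paths = sysconfig.get_paths()
    return [
        Path(paths["purelib"]),
        Path(paths["platlib"]),
        *(Path(p) for p in site.getsitepackages()),
    ]


def find_package_root(candidate_dirs: Iterable[Path], marker: Path) -> Optional[Path]:
    """Return the first candidate directory that contains `marker`, else None."""
    # dict.fromkeys drops duplicate directories while preserving order.
    for base_dir in dict.fromkeys(candidate_dirs):
        if (base_dir / marker).exists():
            return base_dir
    return None


if __name__ == "__main__":
    # Marker path taken from the Dockerfile; prints None when vLLM is absent.
    marker = Path("vllm") / "utils" / "import_utils.py"
    print(find_package_root(default_candidate_dirs(), marker))
```

Returning `None` instead of raising `SystemExit` is a choice made for this sketch so the function can be reused; the heredoc fails the build when the file is not found.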

ci/docker/llm.build.wanda.yaml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ srcs:
 - ci/env/install-dependencies.sh
 - ci/env/install-llvm-binaries.sh
 - ci/suppress_output
+- python/requirements/llm/patches/vllm-trial-import-patch
 - python/deplocks/llm/rayllm_test_py312_cpu.lock
 - python/deplocks/llm/rayllm_test_py312_cu130.lock
 tags:

ci/raydepsets/configs/rayllm.depsets.yaml

Lines changed: 8 additions & 0 deletions
@@ -13,6 +13,14 @@ build_arg_sets:
 append_flags:
 - --python-version=${PYTHON_VERSION_STR}
 - --unsafe-package ray
+# Omit the nixl-cu12 binary wheel from the compiled lockfiles. nixl-cu12
+# and nixl-cu13 1.x wheels both install a top-level nixl_ep/ package with
+# an identically named nixl_ep_cpp.so but different libcudart
+# requirements; if both wheels are present the cu12 binary wins the file
+# race and breaks vLLM's eager `import nixl_ep` on the cu130 image. The
+# nixl meta-package (pure Python) is still required so that examples like
+# dp_pd_example can `import nixl` to gate optional features.
+- --unsafe-package nixl-cu12
 - --python-platform=x86_64-manylinux_2_31
 - --index https://download.pytorch.org/whl/${CUDA_CODE}
 build_arg_sets:
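
The file race described in the comment above (two wheels installing the same path under `nixl_ep/`) can be checked for in an existing environment. The following is a minimal sketch, assuming each installed distribution lists its files in its metadata; `find_shared_files` and `installed_files` are hypothetical helper names, and the file paths used in the usage note are illustrative rather than taken from the actual wheels.

```python
from collections import defaultdict
from importlib import metadata
from typing import Dict, Iterable, List


def find_shared_files(files_by_dist: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each file path claimed by more than one distribution to its owners."""
    owners = defaultdict(list)
    for dist_name, files in files_by_dist.items():
        for path in set(files):
            owners[path].append(dist_name)
    return {path: sorted(names) for path, names in owners.items() if len(names) > 1}


def installed_files(prefix: str) -> Dict[str, List[str]]:
    """Collect recorded file paths starting with `prefix` per installed distribution."""
    result: Dict[str, List[str]] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or "<unknown>"
        # dist.files is None when the distribution has no file record.
        paths = [str(f) for f in (dist.files or []) if str(f).startswith(prefix)]
        if paths:
            result[name] = paths
    return result


if __name__ == "__main__":
    # Prints an empty dict when no two distributions share a file under nixl_ep/.
    print(find_shared_files(installed_files("nixl_ep/")))
```

For example, `find_shared_files({"nixl-cu12": ["nixl_ep/nixl_ep_cpp.so"], "nixl-cu13": ["nixl_ep/nixl_ep_cpp.so"]})` reports the shared path together with both owners, which is the situation the `--unsafe-package nixl-cu12` flag is meant to prevent in the lockfiles.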

doc/source/data/working-with-llms.rst

Lines changed: 9 additions & 10 deletions
@@ -609,21 +609,20 @@ Then reference the remote path in your config:
 :end-before: __s3_config_example_end__


-C/C++ runtime dependencies incompatibility
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+vLLM NIXL EP dependency incompatibility
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. admonition:: Known issue

-Ray 2.55 installs vLLM 0.18.0. Depending on the conda environment, you may encounter
-incompatibilities with native runtime libraries (for example, ``libstdc++``, ``CXXABI``, ``ICU``).
+Users who install Ray and vLLM directly may encounter a NIXL EP incompatibility error such as the following:

-In such cases, override just the ``libstdc++`` library from your conda environment with ``LD_LIBRARY_PATH``:
+.. code-block:: text

-.. code-block:: shell
+ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
+
+Remove the incompatible package or ensure the installed ``nixl_ep`` package is compatible with the CUDA runtime
+and vLLM build in your environment.

-mkdir -p "${CONDA_PREFIX}/lib-overrides"
-ln -sf "${CONDA_PREFIX}/lib/libstdc++.so.6" "${CONDA_PREFIX}/lib-overrides/libstdc++.so.6"
-export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib-overrides${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

 **Usage data collection**: Ray collects anonymous usage data to improve Ray Data LLM. To opt out, see :ref:`Ray usage stats <ref-usage-stats>`.

@@ -638,4 +637,4 @@ If you encounter issues not covered in this guide:
 - `Ray Discourse Forum <https://discuss.ray.io>`_ - Ask questions and share knowledge
 - `Ray LLM Office Hours <https://docs.google.com/document/d/1n3-Jw_4su8yilo9zdi5OciAduoz6H_VmdL8i9sL4f-E/edit?tab=t.e700ayqsx3v3>`_ - Learn about new features, ask questions, and get guidance from the team

-- `Past Office Hours Recordings <https://youtube.com/playlist?list=PLzTswPQNepXl2IYF8DcV35FdCoVbeL4_6&si=ik81bljIlasYAHKN>`_ - View recordings from previous sessions
+- `Past Office Hours Recordings <https://youtube.com/playlist?list=PLzTswPQNepXl2IYF8DcV35FdCoVbeL4_6&si=ik81bljIlasYAHKN>`_ - View recordings from previous sessions
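
The known issue documented in this file surfaces as an `ImportError` raised while loading a native extension. The following is a minimal sketch of a guarded optional import that turns such a failure into a `None` result. It is not the content of the vLLM patch applied by this commit (the patch body is not shown here), and `try_import` is a hypothetical helper name.

```python
import importlib
from types import ModuleType
from typing import Optional


def try_import(module_name: str) -> Optional[ModuleType]:
    """Import `module_name`, returning None instead of raising on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both a missing module and a native extension whose shared
        # library dependency (for example libcudart.so.12) cannot be loaded.
        print(f"optional module {module_name!r} unavailable: {exc}")
        return None


if __name__ == "__main__":
    nixl_ep = try_import("nixl_ep")
    print("nixl_ep available:", nixl_ep is not None)
```

A caller can use the result to gate optional features, which avoids failing at import time when an incompatible `nixl_ep` build is installed.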

doc/source/serve/llm/troubleshooting.md

Lines changed: 8 additions & 9 deletions
@@ -77,18 +77,18 @@ app = build_openai_app({"llm_configs": [llm_config]})
 serve.run(app, blocking=True)
 ```

-### C/C++ runtime dependencies incompatibility
+### vLLM NIXL EP dependency incompatibility

 :::{admonition} Known issue
-Ray 2.55 installs vLLM 0.18.0. Depending on the conda environment, you may encounter incompatibilities with native runtime libraries (for example, `libstdc++`, `CXXABI`, `ICU`).
+Users who install Ray and vLLM directly may encounter a NIXL EP incompatibility error such as the following:

-In such cases, override just the ``libstdc++`` library from your conda environment with `LD_LIBRARY_PATH`:
-
-```shell
-mkdir -p "${CONDA_PREFIX}/lib-overrides"
-ln -sf "${CONDA_PREFIX}/lib/libstdc++.so.6" "${CONDA_PREFIX}/lib-overrides/libstdc++.so.6"
-export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib-overrides${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
+```text
+ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
 ```
+
+Remove the incompatible package or ensure the installed ``nixl_ep`` package is compatible with the CUDA runtime
+and vLLM build in your environment.
+
 :::

 ## Get help
@@ -105,4 +105,3 @@ If you encounter issues not covered in this guide:

 - {doc}`Quickstart examples <quick-start>`
 - {doc}`Examples <examples>`
-
docker/ray-llm/Dockerfile

Lines changed: 38 additions & 5 deletions
@@ -4,10 +4,11 @@ ARG BASE_IMAGE
 FROM "$BASE_IMAGE"

 COPY python/deplocks/llm/rayllm_*.lock ./
+COPY python/requirements/llm/patches/vllm-trial-import-patch ./

 # vLLM version tag to use for EP kernel and DeepGEMM install scripts
 # Keep in sync with vllm version in python/requirements/llm/llm-requirements.txt
-ARG VLLM_SCRIPTS_REF="v0.20.0"
+ARG VLLM_SCRIPTS_REF="v0.21.0"

 RUN <<EOF
 #!/bin/bash
@@ -35,8 +36,33 @@ uv pip install --system --no-cache-dir --no-deps \
 --no-verify-hashes \
 -r "rayllm_${PYTHON_CODE}_${CUDA_CODE}.lock"

-# Export installed packages
-$HOME/anaconda3/bin/pip freeze > /home/ray/pip-freeze.txt
+# Temporarily apply the fixes from https://github.com/vllm-project/vllm/pull/39873
+# until the pinned vLLM release includes them.
+VLLM_IMPORT_UTILS_PATCH="$(pwd)/vllm-trial-import-patch"
+VLLM_SITE_PACKAGES="$(python - <<'PY'
+import site
+import sysconfig
+from pathlib import Path
+
+candidate_dirs = [
+    Path(sysconfig.get_paths()["purelib"]),
+    Path(sysconfig.get_paths()["platlib"]),
+    *(Path(path) for path in site.getsitepackages()),
+]
+
+for base_dir in dict.fromkeys(candidate_dirs):
+    import_utils = base_dir / "vllm" / "utils" / "import_utils.py"
+    if import_utils.exists():
+        print(base_dir)
+        break
+else:
+    raise SystemExit("vLLM import_utils.py not found")
+PY
+)"
+(
+cd "${VLLM_SITE_PACKAGES}"
+git apply "${VLLM_IMPORT_UTILS_PATCH}"
+)

 sudo apt-get update -y && sudo apt-get install -y curl kmod pkg-config librdmacm-dev cmake

@@ -57,10 +83,17 @@ curl -fsSL "${VLLM_RAW}/tools/ep_kernels/install_python_libraries.sh" | \
 # Install DeepGEMM
 curl -fsSL "${VLLM_RAW}/tools/install_deepgemm.sh" | bash

+# Export installed packages
+$HOME/anaconda3/bin/pip freeze > /home/ray/pip-freeze.txt
+
 sudo rm -rf /var/lib/apt/lists/*
 sudo apt-get clean

 EOF

-# Use the revamped ray executor backend in vLLM
-ENV VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1
+
+# vLLM 0.21.0 selects the FlashInfer top-k/top-p sampler during engine initialization
+# instead of the previous PyTorch-native/Triton sampling path. The FlashInfer sampler
+# adds a large one-time engine initialization cost. To avoid performance
+# surprises, we disable the FlashInfer sampler by default.
+ENV VLLM_USE_FLASHINFER_SAMPLER=0
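
Because the image only sets a default, a user who accepts the one-time initialization cost could opt back in for a single process. The following is a minimal sketch, assuming that vLLM reads `VLLM_USE_FLASHINFER_SAMPLER` from the process environment when the engine starts and that the value `1` enables the sampler; `with_flashinfer_sampler` is a hypothetical helper name.

```python
import os
from typing import Dict, Mapping, Optional


def with_flashinfer_sampler(
    enabled: bool, base: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with the sampler flag set."""
    env = dict(os.environ if base is None else base)
    # Assumption: "1" enables and "0" disables, mirroring the ENV line above.
    env["VLLM_USE_FLASHINFER_SAMPLER"] = "1" if enabled else "0"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching a worker, which leaves the image-wide default untouched for every other process.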

docker/ray-llm/cuda.wanda.yaml

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ froms: ["cr.ray.io/rayproject/ray-py$PYTHON_VERSION-cu$CUDA_VERSION-base"]
 dockerfile: docker/ray-llm/Dockerfile
 srcs:
 - python/requirements.txt
+- python/requirements/llm/patches/vllm-trial-import-patch
 - python/deplocks/llm/rayllm_py312_cu130.lock
 build_args:
 - BASE_IMAGE=cr.ray.io/rayproject/ray-py$PYTHON_VERSION-cu$CUDA_VERSION-base

python/deplocks/llm/rayllm_py312_cpu.lock

Lines changed: 39 additions & 32 deletions
@@ -240,6 +240,7 @@ apache-tvm-ffi==0.1.9 \
 # flashinfer-python
 # quack-kernels
 # tilelang
+# tokenspeed-mla
 # vllm
 # xgrammar
 astor==0.8.1 \
@@ -2547,37 +2548,22 @@ ninja==1.13.0 \
 # -r python/requirements/llm/llm-requirements.txt
 # flashinfer-python
 # vllm
-nixl==0.10.1 \
---hash=sha256:616465673dae5180d296525a03237af4cd5f2c00c3228d185bc06dbe621509b7
+nixl==1.1.0 \
+--hash=sha256:f46f65768770fa508eb52921c41b5dc52b754478b0ebb606fff6d80f41375d8b
 # via
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # -r python/requirements/llm/llm-requirements.txt
-nixl-cu12==0.10.1 \
---hash=sha256:0bb1b3532f95c2f376e21008e91e8ec5791304a29af19e75d29fd1bcc754c9bc \
---hash=sha256:15376c1527c68d77fff5c6bb7cf7466a16dae0ab3bb32de152a602ba9edaaa9d \
---hash=sha256:26e59f9841985cf5b547202865036f84ae6dc23184789446fe5833e7499e21a9 \
---hash=sha256:277cde28bc45f706df689ed399327d0cce5432382606a5fc1d19fc470fcc57b4 \
---hash=sha256:3dde565c9d6e1d5af139a4dca240e902d5dbb32ea622acb31cdca3fb25cb859f \
---hash=sha256:48d3d9cc882edaa0a323d0ddfed39e0864b873ef1fa56e774c5a793629bcf083 \
---hash=sha256:685a0b8c5cdaa9cdbd826ea54cf46b4b3e46b016ee73a919cd2cf489402c56fd \
---hash=sha256:7641bc2bd3aeeefcf2ea3a3fd9f940f54a62d985ce2426dde6b3860d0edce13e \
---hash=sha256:b4712c6e0f18f57fee34cd970faac01480f0caf12da33d4a40ef4e9096a4caf7 \
---hash=sha256:ba46837abee8e06c8d86bd9b2cd7dab8c3d1e8407e04d52e6db3a9b137478c4b
-# via
-# -c python/deplocks/llm/rayllm_test_py312_cpu.lock
-# -r python/requirements/llm/llm-requirements.txt
-# nixl
-nixl-cu13==0.10.1 \
---hash=sha256:0cbd4ffc25398f565a378e6da09b60d8d2625f1eda96bfe20d6d9c5f3fdba2c0 \
---hash=sha256:129f41f6855cf13837b55516319512ee1561e0a8cbbc2e4a9be3d839631ebf36 \
---hash=sha256:234752e979465e98aae5866e32777a4d98892c1d9fd2f59f25fef69e1e26716e \
---hash=sha256:322e4702606ad498a493d99a065af46c0c16ce84a4bed6495f85efab75670f0a \
---hash=sha256:66ad915a090da0b8928d9fea2aaac98c3468bb5e08a1a293ef211615bd49e460 \
---hash=sha256:67c913c0345f8703f1b3c96dd5f63c914bc6c173e3f283f62c46c07b5fcc5618 \
---hash=sha256:909d00ffc1929ef45cd3cfa0cd3585999274c90a6bcf0799000677cf83a8f0e2 \
---hash=sha256:9346f26d4b97088ee23921d567b7836eceb57473638455eac73b2f2b7388cbfc \
---hash=sha256:efa8f95ac57b9cf71fd5a0dcaa51ecef8d40510ca5ab3347e046a972905edbf1 \
---hash=sha256:f32bfd6f649ef1968e4f6d37d7c3cae61a58fe1f57a987d2fad324f18d5dc6e5
+nixl-cu13==1.1.0 \
+--hash=sha256:1991d7899603907099f3e3ac3bf59f950de194bfe8d92c01ef9f06ce1639efa4 \
+--hash=sha256:1c4e8142eff7cabe6107b3b65bc7a09da27ed585efb7972e45b1faabe74726c7 \
+--hash=sha256:3f623b77fd59199afd71edadebb79ab394ec5e035873efe8a9bc8b4716b34e73 \
+--hash=sha256:4e6031798b0a123d1821db698b1f9b3a1534c821af860ee0ef23601638c50d8f \
+--hash=sha256:52b1e33ed9613df277d957cf1282cda14fbdf7b73006d8f45904cc68619e7af9 \
+--hash=sha256:60cc00b12871d8c7d78c2385ad9380070424d5b07d3fe01680f222d6c4f1f428 \
+--hash=sha256:6549dcb4f405f70903534a0770970ab95ed9185a8b16522db1ab4e2d0cc60b37 \
+--hash=sha256:67149f7d2e3d471ca91499e5437d7b718e4e3e7a27f3b5b917f94b8992a4ed5a \
+--hash=sha256:90c27cfdae0932f8ecb96ce29474249dd25d7f7712b9f43821cdb57699888fcb \
+--hash=sha256:e8edf4b0d6a7549d8555fe1a99193aebd522b7737c405c3f8760f432d82e11df
 # via
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # -r python/requirements/llm/llm-requirements.txt
@@ -2721,6 +2707,7 @@ nvidia-cutlass-dsl==4.4.2 \
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # flashinfer-python
 # quack-kernels
+# tokenspeed-mla
 # vllm
 nvidia-cutlass-dsl-libs-base==4.4.2 \
 --hash=sha256:06acb3acff3dcf4bf6630476efac7de94de30b988ded4fa00b647bbcec4224ff \
@@ -4997,6 +4984,22 @@ tokenizers==0.22.2 \
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # transformers
 # vllm
+tokenspeed-mla==0.1.2 \
+--hash=sha256:592590f36d85e624ecdc5e357ff35e29e761e6d879900dce8b67a6785c8ce75c \
+--hash=sha256:c9466a351fe039792e56cf49f3e79744c1dc28c7af10306a02e62b8e92fa5985
+# via
+# -c python/deplocks/llm/rayllm_test_py312_cpu.lock
+# vllm
+tokenspeed-triton==3.7.10.post20260505 \
+--hash=sha256:060f657c78b5cd0c5645f01eb0f73b72cf385589235e3b96ca05f9b3d33a644f \
+--hash=sha256:06bad3e25ccaba22bb43eb8499f01008f9aaa0bfb3fbfb0cef1b37d2c006c6f0 \
+--hash=sha256:15e867fbc3dc7f5d1d2ec80b6b783c0e58d6d5c470cbfa99e87a035ec6af6212 \
+--hash=sha256:19618c7db01a9bd33885f7acbf8945adb2f5534668aa97629b56d481753cbcad \
+--hash=sha256:7a679e079f98023cf326f299c8150ebc8ef6f1d2cf744d5dc435bc0d9a6f8a5b \
+--hash=sha256:82c222755095db261e32e3964e009573f3360806088fa493be65404276866344
+# via
+# -c python/deplocks/llm/rayllm_test_py312_cpu.lock
+# tokenspeed-mla
 torch==2.11.0+cpu \
 --hash=sha256:1abeaa46fa7532ed35ed79146f4de5d7a9d4b30462c98052ea4ddfe781ea3eca \
 --hash=sha256:2db3ae5404e32cb42b5fcbd94f13607761eaec0cf1687fde95095289d1e26cfb \
@@ -5034,6 +5037,7 @@ torch==2.11.0+cpu \
 # nixl-cu13
 # quack-kernels
 # tilelang
+# tokenspeed-mla
 # torch-c-dlpack-ext
 # torchvision
 # vllm
@@ -5310,10 +5314,12 @@ virtualenv==21.2.4 \
 # via
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # -r python/requirements.txt
-vllm==0.20.0 \
---hash=sha256:24d28892e210200f6e1bd13f699c42a74cd2bb7364c11248e2348f677c7f6dfb \
---hash=sha256:29a135ca0d70650f057f15c7c0b560d24659524c771f70fbddc24597c861c118 \
---hash=sha256:a6d50152936ee292455af3ffbe359f7a284ac43bf3b68caccf29f368e196cc72
+vllm==0.21.0 \
+--hash=sha256:05ff89c3e926b88b77d7878e317a659ffba678afc21c1d48952037aa5457f058 \
+--hash=sha256:b241b085742cf04a68c82c089d12afe4d9ee729e0c7f81b2b2b9961d36105ee5 \
+--hash=sha256:d6e63955b595bd2aa364e90f85c0a2e99573e701146db58394da569ddc6f4eea \
+--hash=sha256:dc62135a50dc4b412b4f79549208e782f1665e49e8c13c2d29d2c3d94ff8ac97 \
+--hash=sha256:f4a75b1391f44c67dc1ca268f5ffed9f6b7fdbc657c93db64e6892c5d1bc320b
 # via
 # -c python/deplocks/llm/rayllm_test_py312_cpu.lock
 # -r python/requirements/llm/llm-requirements.txt
@@ -5692,3 +5698,4 @@ zipp==3.23.1 \

 # The following packages were excluded from the output:
 # setuptools
+# nixl-cu12
