
[deps] Pin torch to pytorch-cpu index in the vllm extra #4663

Open
AlienKevin wants to merge 2 commits into main from kevin/vllm-extra-pins-cpu-torch

Conversation

@AlienKevin
Contributor

The vllm extra installs vllm-tpu, which transitively depends on torch but does not pin which index that dependency should resolve against. Without an explicit binding, uv resolves the transitive dependency against the default PyPI index and installs the CUDA wheel. On TPU workers this crashes at module init with `libcublas.so.*[0-9] not found in the system path`, because torch's `_load_global_deps` preloads CUDA runtime libraries that don't exist on TPU workers.

The bug is normally hidden by uv.lock, which pins torch to the cpu index via the cpu/tpu extras. It surfaces on Iris workers because those workers drop uv.lock from the workspace bundle when it exceeds 1 MB (the Kubernetes ConfigMap limit) and fall back to a fresh `uv sync --extra vllm` resolve, which then picks the wrong torch wheel. This was first hit by the SWE-ZERO multi-language experiment in #4653: every preempted worker that had to do a fresh resolve crashed at vLLM startup, until the script was rewritten to spawn vllm via subprocess and the user was instructed to pass `--extra vllm --extra tpu` manually.
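The fallback that exposes the bug can be sketched as follows. This is a hypothetical illustration, not the real Iris code: the function and constant names are assumptions, and only the 1 MB ConfigMap limit and the fresh-resolve fallback come from the description above.

```python
# Hypothetical sketch of the Iris bundling decision described above.
# Only the ~1 MiB ConfigMap cap and the lockfile-drop behavior are from
# the PR description; names here are illustrative assumptions.
import os

CONFIGMAP_LIMIT_BYTES = 1 * 1024 * 1024  # Kubernetes ConfigMap payload cap


def should_bundle_lockfile(lockfile_path: str) -> bool:
    """Return True if uv.lock is small enough to ship in the workspace bundle.

    When this returns False, the worker falls back to a fresh
    `uv sync --extra vllm` resolve, which is where the unpinned
    torch index picks the wrong wheel.
    """
    return os.path.getsize(lockfile_path) <= CONFIGMAP_LIMIT_BYTES
```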

Fix: add an explicit `torch==2.9.0` (and matching torchvision) pin to the vllm extra and route it to the pytorch-cpu index via `[tool.uv.sources]`. Also declare a vllm/gpu mutual exclusion in `tool.uv.conflicts`, since marin only ships vllm-tpu (there is no vllm-cuda variant) and the two extras would otherwise conflict over which torch index to use during full-workspace locking.
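A sketch of what the pyproject.toml wiring could look like. Only the `torch==2.9.0` pin, the pytorch-cpu index routing, and the vllm/gpu conflict come from this PR; the index URL, the torchvision version, and the extras' other contents are illustrative assumptions.

```toml
# Illustrative sketch, not the actual marin pyproject.toml.
[project.optional-dependencies]
vllm = [
    "vllm-tpu",
    "torch==2.9.0",   # explicit pin so uv binds the index below
    "torchvision",    # matching torchvision pin (version assumed)
]

# Named index for CPU-only torch wheels (URL assumed).
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

# Route torch/torchvision in the vllm extra to the CPU index.
[tool.uv.sources]
torch = [{ index = "pytorch-cpu", extra = "vllm" }]
torchvision = [{ index = "pytorch-cpu", extra = "vllm" }]

# vllm and gpu can never be installed together, so uv does not have
# to reconcile their torch indexes during full-workspace locking.
[tool.uv]
conflicts = [
    [
        { extra = "vllm" },
        { extra = "gpu" },
    ],
]
```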

Verified by running `uv sync --package marin --extra vllm` in a clean worktree off main: torch resolves to `2.9.0+cpu`, `torch.version.cuda` is None, and `import torch` succeeds without libcublas. After this lands, `--extra vllm` alone is sufficient on Iris TPU workers and the `--extra vllm --extra tpu` workaround can be dropped.

Fixes the `libcublas.so.*[0-9] not found in the system path` crash hit by #4653.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AlienKevin AlienKevin added the agent-generated (Created by automation/agent) label on Apr 12, 2026
AlienKevin added a commit that referenced this pull request Apr 12, 2026
Pin shards away from europe-west4 region by default. Some workers in that
region have a broken vllm-tpu venv (CUDA torch instead of CPU torch_xla)
that crashes vLLM at engine-core init. Iris max-retries reassigns to the
SAME worker, so a single bad worker poisons all 5 retries of a shard.

us-east5 and us-east1 v6e-4 workers consistently work in our experience
(verified by the multilang topup and the Step 6 32K runs). Flag is
configurable so we can broaden the pool once #4663 lands and the
worker-image divergence is resolved.

Part of #4666
@AlienKevin AlienKevin marked this pull request as ready for review April 12, 2026 03:02