
[deps] Pin torch to pytorch-cpu index in the vllm extra #4663

Open
AlienKevin wants to merge 2 commits into main from kevin/vllm-extra-pins-cpu-torch

Conversation

@AlienKevin
Contributor

The vllm extra installs vllm-tpu, which transitively depends on torch but does not pin which index that dependency should resolve against. Without an explicit binding, uv resolves the transitive dependency against the default PyPI index and installs the CUDA wheel. On TPU workers this crashes at module init with `libcublas.so.*[0-9] not found in the system path`, because torch's `_load_global_deps` preloads CUDA runtime libraries that don't exist on TPU workers.

The bug is normally hidden by uv.lock, which pins torch to the cpu index via the cpu/tpu extras. It surfaces on Iris workers because those workers drop uv.lock from the workspace bundle when it exceeds 1 MB (the Kubernetes ConfigMap limit) and fall back to a fresh `uv sync --extra vllm` resolve, which then picks the wrong torch wheel. This was first hit by the SWE-ZERO multi-language experiment in #4653: every preempted worker that had to do a fresh resolve crashed at vLLM startup, until the script was rewritten to spawn vllm via subprocess and the user was instructed to pass `--extra vllm --extra tpu` manually.
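The fallback that exposes the bug can be sketched as follows. This is a hypothetical illustration, not the real Iris code: the function and constant names are assumptions, and only the 1 MB ConfigMap limit and the fresh-resolve fallback come from the description above.

```python
# Hypothetical sketch of the Iris bundling decision described above.
# Only the ~1 MiB ConfigMap cap and the lockfile-drop behavior are from
# the PR description; names here are illustrative assumptions.
import os

CONFIGMAP_LIMIT_BYTES = 1 * 1024 * 1024  # Kubernetes ConfigMap payload cap


def should_bundle_lockfile(lockfile_path: str) -> bool:
    """Return True if uv.lock is small enough to ship in the workspace bundle.

    When this returns False, the worker falls back to a fresh
    `uv sync --extra vllm` resolve, which is where the unpinned
    torch index picks the wrong wheel.
    """
    return os.path.getsize(lockfile_path) <= CONFIGMAP_LIMIT_BYTES
```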

Fix: add an explicit `torch==2.9.0` (and matching torchvision) pin to the vllm extra and route it to the pytorch-cpu index via `[tool.uv.sources]`. Also declare a vllm/gpu mutual exclusion in `tool.uv.conflicts`, since marin only ships vllm-tpu (there is no vllm-cuda variant) and the two extras would otherwise conflict over which torch index to use during full-workspace locking.
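A sketch of what the pyproject.toml wiring could look like. Only the `torch==2.9.0` pin, the pytorch-cpu index routing, and the vllm/gpu conflict come from this PR; the index URL, the torchvision version, and the extras' other contents are illustrative assumptions.

```toml
# Illustrative sketch, not the actual marin pyproject.toml.
[project.optional-dependencies]
vllm = [
    "vllm-tpu",
    "torch==2.9.0",   # explicit pin so uv binds the index below
    "torchvision",    # matching torchvision pin (version assumed)
]

# Named index for CPU-only torch wheels (URL assumed).
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

# Route torch/torchvision in the vllm extra to the CPU index.
[tool.uv.sources]
torch = [{ index = "pytorch-cpu", extra = "vllm" }]
torchvision = [{ index = "pytorch-cpu", extra = "vllm" }]

# vllm and gpu can never be installed together, so uv does not have
# to reconcile their torch indexes during full-workspace locking.
[tool.uv]
conflicts = [
    [
        { extra = "vllm" },
        { extra = "gpu" },
    ],
]
```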

Verified by running `uv sync --package marin --extra vllm` in a clean worktree off main: torch resolves to `2.9.0+cpu`, `torch.version.cuda` is None, and `import torch` succeeds without libcublas. After this lands, `--extra vllm` alone is sufficient on Iris TPU workers and the `--extra vllm --extra tpu` workaround can be dropped.

Fixes the `libcublas.so.*[0-9] not found in the system path` crash hit by #4653.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AlienKevin AlienKevin added the agent-generated (Created by automation/agent) label on Apr 12, 2026
AlienKevin added a commit that referenced this pull request Apr 12, 2026
Pin shards away from europe-west4 region by default. Some workers in that
region have a broken vllm-tpu venv (CUDA torch instead of CPU torch_xla)
that crashes vLLM at engine-core init. Iris max-retries reassigns to the
SAME worker, so a single bad worker poisons all 5 retries of a shard.

us-east5 and us-east1 v6e-4 workers consistently work in our experience
(verified by the multilang topup and the Step 6 32K runs). Flag is
configurable so we can broaden the pool once #4663 lands and the
worker-image divergence is resolved.

Part of #4666
@AlienKevin AlienKevin marked this pull request as ready for review April 12, 2026 03:02