
[inference] Pre-start cleanup of stale TPU lockfiles in vLLM native server #4675

Open
AlienKevin wants to merge 1 commit into main from kevin/vllm-prestart-lockfile-cleanup

Conversation

@AlienKevin
Contributor

The existing remove_tpu_lockfile_on_exit (atexit handler) only fires on clean Python exits. When an Iris worker is preempted (SIGKILL) or OOM-killed, the handler never runs and /tmp/libtpu_lockfile persists. The next task on the same worker then fails with "TPU initialization failed: open(/dev/vfio/N): Device or resource busy" and all --max-retries retries fail because Iris re-assigns to the same worker.
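The failure mode is easy to reproduce in isolation: an atexit handler never runs when the process is SIGKILLed, so whatever file it was supposed to remove survives. The sketch below is a minimal, self-contained repro (the lock path is a stand-in for /tmp/libtpu_lockfile, not the real libtpu lock):

```python
import atexit
import os
import signal
import sys
import tempfile
import time

# Stand-in for /tmp/libtpu_lockfile, unique per run.
lock = os.path.join(tempfile.gettempdir(), f"fake_lockfile_{os.getpid()}")

pid = os.fork()
if pid == 0:
    # Child: create the lock, register the cleanup handler, then hang
    # until the parent kills us.
    open(lock, "w").close()
    atexit.register(os.remove, lock)
    time.sleep(60)
    sys.exit(0)

time.sleep(0.5)               # give the child time to create the lock
os.kill(pid, signal.SIGKILL)  # simulate preemption / OOM kill
os.waitpid(pid, 0)
print(os.path.exists(lock))   # True: the atexit handler never fired
```

A clean exit (sys.exit or falling off the end of main) would have run the handler; SIGKILL gives the interpreter no chance to, which is exactly why the lockfile persists across preemptions.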

Fix: call _remove_stale_tpu_lockfiles() at the top of _start_vllm_native_server() before spawning the vllm process. This unconditionally deletes /tmp/libtpu_lockfile and /tmp/libtpu.so_lockfile if they exist, so a re-launched task on a recycled worker recovers instead of blocking on a stale lock.

Found and verified during the SWE-ZERO 1B-token scaling experiment (#4666): 50-shard distributed pipeline where preempted shards retried on the same workers and got stuck on the stale lockfile. Adding this pre-start cleanup resolved the issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AlienKevin AlienKevin added the agent-generated Created by automation/agent label Apr 12, 2026
@AlienKevin AlienKevin marked this pull request as ready for review April 12, 2026 02:59
@rjpower
Collaborator

rjpower commented Apr 12, 2026

@claude

/tmp shouldn't be shared between runs at all, so there shouldn't be a need to clean up the lockfile. Are we accidentally mounting docker runtime containers in lib/iris to the same /tmp/ mount point? They should be isolated even if they mount to the same root prefix on the running host.

FWIW:

"TPU initialization failed: open(/dev/vfio/N) is busy" is, I think, a different issue, and indicates another task has genuinely opened the TPU device. Only jobs which request the TPU attribute should be able to do this, so it shouldn't be a conflict with another user. Claude, can you also validate that our scheduling "deducts" available TPU counts when a user requests a given TPU type? Only a single TPU task should be assigned to a TPU VM.

@claude
Contributor

claude bot commented Apr 12, 2026

Claude finished @rjpower's task in 4m 1s


Investigation: /tmp isolation and TPU scheduling

  • Read the PR changes for context
  • Investigate Docker container /tmp mounting in lib/iris
  • Investigate Iris TPU scheduling / resource deduction logic
  • Summarize findings

1. /tmp IS isolated between Iris Docker containers

Russell is correct that /tmp should not be shared between runs. The Iris Docker runtime already isolates /tmp per-container:

lib/iris/src/iris/cluster/worker/task_attempt.py:702-704 — every task mounts /tmp as tmpfs:

mounts = [
    MountSpec("/app", kind=MountKind.WORKDIR),
    MountSpec("/tmp", kind=MountKind.TMPFS),   # ← isolated per container
    ...
]

lib/iris/src/iris/cluster/runtime/docker.py:704-706 — tmpfs mounts use Docker's --tmpfs flag:

if rm.kind == MountKind.TMPFS:
    cmd.extend(["--tmpfs", rm.container_path])

Each container gets a fresh, ephemeral tmpfs at /tmp. When the container exits, the tmpfs is discarded. A stale /tmp/libtpu_lockfile from a prior container cannot persist into the next container.

The premise of this PR appears incorrect for Iris Docker-mode tasks. If the lockfile issue was genuinely observed, it suggests the code was running in a context where /tmp wasn't isolated (e.g., native mode outside Docker, or a misconfigured container).

The _start_vllm_native_server() function starts vllm serve as a subprocess within the Iris Docker container. Since the container already has an isolated /tmp, removing lockfiles at this level shouldn't be necessary under normal Iris operation.


2. TPU scheduling: YES, the scheduler properly deducts TPU counts

Only a single TPU task is assigned per TPU VM (assuming the task requests all available chips). The deduction logic is sound:

Capacity tracking (lib/iris/src/iris/cluster/controller/scheduler.py:174-179):

WorkerCapacity(
    ...
    available_tpus=worker.total_tpu_count - worker.committed_tpu,
    ...
)

Fitness check (scheduler.py:192-229): can_fit() extracts the TPU count from the job's DeviceConfig and compares it against the worker's available TPUs. If insufficient, the task is rejected with RejectionKind.TPU_COUNT.

Deduction on assignment (scheduler.py:253-261):

def deduct(self, req: JobRequirements) -> None:
    ...
    self.available_tpus -= get_tpu_count(res.device)

Database commit (lib/iris/src/iris/cluster/controller/transitions.py:1578-1596): when a task is assigned, committed_tpu is incremented on the worker row. When a task completes, _decommit_worker_resources() (transitions.py:355-373) releases the chips back.

Test coverage confirms this (lib/iris/tests/cluster/controller/test_scheduler.py):

  • test_tpu_chip_count_deducted_from_capacity (line 1009): worker with 4 TPU chips → first 4-chip task assigned → second 4-chip task rejected (0 available)
  • test_tpu_job_rejected_when_insufficient_chips (line 1059): worker with 4 chips rejects an 8-chip request
  • test_tpu_count_released_after_task_completion (line 1086): after first task completes and resources decommit, second task schedules successfully

Summary: A TPU VM with N chips will only have one task assigned to it if that task requests all N chips. The scheduler correctly tracks committed vs. available TPU counts both in-memory during scheduling cycles and in the database for persistence. Two tasks requesting the full chip count cannot be assigned to the same worker.
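The can_fit / deduct / decommit cycle described above can be condensed into a toy model. Class and method names are simplified for illustration; the real WorkerCapacity carries CPU and memory fields as well, and persistence goes through the database:

```python
from dataclasses import dataclass


@dataclass
class WorkerCapacity:
    """Toy model of the scheduler's per-worker TPU bookkeeping."""
    available_tpus: int

    def can_fit(self, requested_tpus: int) -> bool:
        # Mirrors the fitness check that rejects with RejectionKind.TPU_COUNT.
        return requested_tpus <= self.available_tpus

    def deduct(self, requested_tpus: int) -> None:
        # Mirrors deduct() on assignment.
        assert self.can_fit(requested_tpus)
        self.available_tpus -= requested_tpus

    def release(self, requested_tpus: int) -> None:
        # Mirrors _decommit_worker_resources() on task completion.
        self.available_tpus += requested_tpus


cap = WorkerCapacity(available_tpus=4)
assert cap.can_fit(4)
cap.deduct(4)              # first full-chip task is assigned
assert not cap.can_fit(4)  # second full-chip task is rejected
cap.release(4)             # first task completes
assert cap.can_fit(4)      # now the second task can schedule
```

This mirrors the three cited tests: deduction to zero blocks a second full-chip task, oversized requests never fit, and release after completion lets the next task schedule.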


Recommendation

Given that /tmp is properly isolated via Docker tmpfs, this PR's lockfile cleanup is a no-op under normal Iris operation. If the "Device or resource busy" error was genuinely observed on Iris, the root cause is likely:

  1. A genuine TPU device contention issue (another process opened /dev/vfio/N) — not a lockfile problem
  2. A non-Docker execution path (e.g., running native mode directly on a VM without Iris)

The PR is harmless (defensive cleanup won't break anything), but it may be masking the real issue rather than fixing it. If this was observed during the SWE-ZERO scaling experiment, it would be worth investigating whether those tasks were actually running in Docker-isolated mode.
