
[inference] Pre-start cleanup of stale TPU lockfiles in vLLM native server #4675

Open
AlienKevin wants to merge 1 commit into main from kevin/vllm-prestart-lockfile-cleanup

Conversation

@AlienKevin
Contributor

The existing remove_tpu_lockfile_on_exit (atexit handler) only fires on clean Python exits. When an Iris worker is preempted (SIGKILL) or OOM-killed, the handler never runs and /tmp/libtpu_lockfile persists. The next task on the same worker then fails with "TPU initialization failed: open(/dev/vfio/N): Device or resource busy" and all --max-retries retries fail because Iris re-assigns to the same worker.
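The failure mode is easy to reproduce in isolation: an atexit handler never runs when the process is SIGKILLed, so whatever file it was supposed to remove survives. The sketch below is a minimal, self-contained repro (the lock path is a stand-in for /tmp/libtpu_lockfile, not the real libtpu lock):

```python
import atexit
import os
import signal
import sys
import tempfile
import time

# Stand-in for /tmp/libtpu_lockfile, unique per run.
lock = os.path.join(tempfile.gettempdir(), f"fake_lockfile_{os.getpid()}")

pid = os.fork()
if pid == 0:
    # Child: create the lock, register the cleanup handler, then hang
    # until the parent kills us.
    open(lock, "w").close()
    atexit.register(os.remove, lock)
    time.sleep(60)
    sys.exit(0)

time.sleep(0.5)               # give the child time to create the lock
os.kill(pid, signal.SIGKILL)  # simulate preemption / OOM kill
os.waitpid(pid, 0)
print(os.path.exists(lock))   # True: the atexit handler never fired
```

A clean exit (sys.exit or falling off the end of main) would have run the handler; SIGKILL gives the interpreter no chance to, which is exactly why the lockfile persists across preemptions.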

Fix: call _remove_stale_tpu_lockfiles() at the top of _start_vllm_native_server() before spawning the vllm process. This unconditionally deletes /tmp/libtpu_lockfile and /tmp/libtpu.so_lockfile if they exist, so a re-launched task on a recycled worker recovers instead of blocking on a stale lock.

Found and verified during the SWE-ZERO 1B-token scaling experiment (#4666): 50-shard distributed pipeline where preempted shards retried on the same workers and got stuck on the stale lockfile. Adding this pre-start cleanup resolved the issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AlienKevin AlienKevin added the agent-generated Created by automation/agent label Apr 12, 2026
@AlienKevin AlienKevin marked this pull request as ready for review April 12, 2026 02:59
@rjpower
Collaborator

rjpower commented Apr 12, 2026

@claude

/tmp shouldn't be shared between runs at all, so there shouldn't be a need to clean up the lockfile. Are we accidentally mounting docker runtime containers in lib/iris to the same /tmp/ mount point? They should be isolated even if they mount to the same root prefix on the running host.

FWIW:

"TPU initialization failed: open(/dev/vfio/N) is busy" is, I think, a different issue, and indicates another task has genuinely opened the TPU device. Only jobs which request the TPU attribute should be able to do this, so it shouldn't be a conflict with another user. Claude, can you also validate that our scheduling "deducts" available TPU counts when a user requests a given TPU type? Only a single TPU task should be assigned to a TPU VM.

@claude
Contributor

claude bot commented Apr 12, 2026

Claude finished @rjpower's task in 4m 1s


Investigation: /tmp isolation and TPU scheduling

  • Read the PR changes for context
  • Investigate Docker container /tmp mounting in lib/iris
  • Investigate Iris TPU scheduling / resource deduction logic
  • Summarize findings

1. /tmp IS isolated between Iris Docker containers

Russell is correct that /tmp should not be shared between runs. The Iris Docker runtime already isolates /tmp per-container:

lib/iris/src/iris/cluster/worker/task_attempt.py:702-704 — every task mounts /tmp as tmpfs:

mounts = [
    MountSpec("/app", kind=MountKind.WORKDIR),
    MountSpec("/tmp", kind=MountKind.TMPFS),   # ← isolated per container
    ...
]

lib/iris/src/iris/cluster/runtime/docker.py:704-706 — tmpfs mounts use Docker's --tmpfs flag:

if rm.kind == MountKind.TMPFS:
    cmd.extend(["--tmpfs", rm.container_path])

Each container gets a fresh, ephemeral tmpfs at /tmp. When the container exits, the tmpfs is discarded. A stale /tmp/libtpu_lockfile from a prior container cannot persist into the next container.

The premise of this PR appears incorrect for Iris Docker-mode tasks. If the lockfile issue was genuinely observed, it suggests the code was running in a context where /tmp wasn't isolated (e.g., native mode outside Docker, or a misconfigured container).

The _start_vllm_native_server() function starts vllm serve as a subprocess within the Iris Docker container. Since the container already has an isolated /tmp, removing lockfiles at this level shouldn't be necessary under normal Iris operation.


2. TPU scheduling: YES, the scheduler properly deducts TPU counts

Only a single TPU task is assigned per TPU VM (assuming the task requests all available chips). The deduction logic is sound:

Capacity tracking (lib/iris/src/iris/cluster/controller/scheduler.py:174-179):

WorkerCapacity(
    ...
    available_tpus=worker.total_tpu_count - worker.committed_tpu,
    ...
)

Fitness check (scheduler.py:192-229): can_fit() extracts the TPU count from the job's DeviceConfig and compares it against the worker's available TPUs. If insufficient, the task is rejected with RejectionKind.TPU_COUNT.

Deduction on assignment (scheduler.py:253-261):

def deduct(self, req: JobRequirements) -> None:
    ...
    self.available_tpus -= get_tpu_count(res.device)

Database commit (lib/iris/src/iris/cluster/controller/transitions.py:1578-1596): when a task is assigned, committed_tpu is incremented on the worker row. When a task completes, _decommit_worker_resources() (transitions.py:355-373) releases the chips back.

Test coverage confirms this (lib/iris/tests/cluster/controller/test_scheduler.py):

  • test_tpu_chip_count_deducted_from_capacity (line 1009): worker with 4 TPU chips → first 4-chip task assigned → second 4-chip task rejected (0 available)
  • test_tpu_job_rejected_when_insufficient_chips (line 1059): worker with 4 chips rejects an 8-chip request
  • test_tpu_count_released_after_task_completion (line 1086): after first task completes and resources decommit, second task schedules successfully

Summary: A TPU VM with N chips will only have one task assigned to it if that task requests all N chips. The scheduler correctly tracks committed vs. available TPU counts both in-memory during scheduling cycles and in the database for persistence. Two tasks requesting the full chip count cannot be assigned to the same worker.
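The can_fit / deduct / decommit cycle described above can be condensed into a toy model. Class and method names are simplified for illustration; the real WorkerCapacity carries CPU and memory fields as well, and persistence goes through the database:

```python
from dataclasses import dataclass


@dataclass
class WorkerCapacity:
    """Toy model of the scheduler's per-worker TPU bookkeeping."""
    available_tpus: int

    def can_fit(self, requested_tpus: int) -> bool:
        # Mirrors the fitness check that rejects with RejectionKind.TPU_COUNT.
        return requested_tpus <= self.available_tpus

    def deduct(self, requested_tpus: int) -> None:
        # Mirrors deduct() on assignment.
        assert self.can_fit(requested_tpus)
        self.available_tpus -= requested_tpus

    def release(self, requested_tpus: int) -> None:
        # Mirrors _decommit_worker_resources() on task completion.
        self.available_tpus += requested_tpus


cap = WorkerCapacity(available_tpus=4)
assert cap.can_fit(4)
cap.deduct(4)              # first full-chip task is assigned
assert not cap.can_fit(4)  # second full-chip task is rejected
cap.release(4)             # first task completes
assert cap.can_fit(4)      # now the second task can schedule
```

This mirrors the three cited tests: deduction to zero blocks a second full-chip task, oversized requests never fit, and release after completion lets the next task schedule.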


Recommendation

Given that /tmp is properly isolated via Docker tmpfs, this PR's lockfile cleanup is a no-op under normal Iris operation. If the "Device or resource busy" error was genuinely observed on Iris, the root cause is likely:

  1. A genuine TPU device contention issue (another process opened /dev/vfio/N) — not a lockfile problem
  2. A non-Docker execution path (e.g., running native mode directly on a VM without Iris)

The PR is harmless (defensive cleanup won't break anything), but it may be masking the real issue rather than fixing it. If this was observed during the SWE-ZERO scaling experiment, it would be worth investigating whether those tasks were actually running in Docker-isolated mode.
