
Commit 9c0e2b6

AlienKevin and claude committed
[inference] Pre-start cleanup of stale TPU lockfiles in vLLM native server
The existing remove_tpu_lockfile_on_exit (atexit handler) only fires on clean Python exits. When an Iris worker is preempted (SIGKILL) or OOM-killed, the handler never runs and /tmp/libtpu_lockfile persists. The next task on the same worker then fails with "TPU initialization failed: open(/dev/vfio/N): Device or resource busy" and all --max-retries retries fail because Iris re-assigns to the same worker.

Fix: call _remove_stale_tpu_lockfiles() at the top of _start_vllm_native_server() before spawning the vllm process. This unconditionally deletes /tmp/libtpu_lockfile and /tmp/libtpu.so_lockfile if they exist, so a re-launched task on a recycled worker recovers instead of blocking on a stale lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
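The failure mode described above can be reproduced without a TPU. The following is a minimal sketch, not code from this commit: a child process creates a stand-in lockfile in a temporary directory and registers an `atexit` handler to remove it, then the parent kills the child with SIGKILL and checks whether the file is still on disk. The function name `lockfile_survives_sigkill` and the file name `demo_lockfile` are hypothetical, and the sketch assumes a POSIX system where `Popen.kill()` sends SIGKILL.

```python
import os
import subprocess
import sys
import tempfile
import time

# Child program: create the lockfile, register an atexit cleanup, then idle.
_CHILD_SOURCE = """
import atexit, os, sys, time
path = sys.argv[1]
open(path, "w").close()
atexit.register(os.unlink, path)
time.sleep(60)
"""


def lockfile_survives_sigkill() -> bool:
    """Return True if the child's lockfile is still present after SIGKILL."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /tmp/libtpu_lockfile.
        path = os.path.join(tmp, "demo_lockfile")
        child = subprocess.Popen([sys.executable, "-c", _CHILD_SOURCE, path])
        try:
            # Wait until the child has created the file.
            deadline = time.monotonic() + 10
            while not os.path.exists(path):
                if time.monotonic() > deadline:
                    raise TimeoutError("child never created the lockfile")
                time.sleep(0.01)
            child.kill()  # SIGKILL on POSIX: the child cannot run atexit handlers
            child.wait(timeout=10)
        finally:
            # Make sure no child is left behind if something above failed.
            if child.poll() is None:
                child.kill()
                child.wait()
        return os.path.exists(path)


if __name__ == "__main__":
    print("lockfile survived SIGKILL:", lockfile_survives_sigkill())
```

Because the kernel terminates the process immediately on SIGKILL, the interpreter never reaches its shutdown sequence, so cleanup that depends on `atexit` has to be backed by a check at the next startup.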
1 parent f8d4889 commit 9c0e2b6

1 file changed: +26 -0 lines changed

lib/marin/src/marin/inference/vllm_server.py

Lines changed: 26 additions & 0 deletions
@@ -801,6 +801,31 @@ def _vllm_env() -> dict[str, str]:
     return env
 
 
+_TPU_LOCKFILE_PATHS = ("/tmp/libtpu_lockfile", "/tmp/libtpu.so_lockfile")
+
+
+def _remove_stale_tpu_lockfiles() -> None:
+    """Best-effort delete of stale TPU lockfiles left by a prior aborted vLLM run.
+
+    ``remove_tpu_lockfile_on_exit`` (registered via ``atexit``) only fires on
+    clean exits. When an Iris worker is preempted (SIGKILL) or OOM-killed,
+    the handler never runs and the lockfile persists. The next task on the
+    same worker then fails with "TPU initialization failed: open(/dev/vfio/N):
+    Device or resource busy" and all ``--max-retries`` retries fail because
+    Iris re-assigns to the same worker.
+
+    Calling this *before* starting vLLM ensures the lock is released even
+    after an unclean previous exit.
+    """
+    for path in _TPU_LOCKFILE_PATHS:
+        try:
+            if os.path.exists(path):
+                os.unlink(path)
+                logger.info("Removed stale TPU lockfile: %s", path)
+        except OSError as e:
+            logger.warning("Could not remove TPU lockfile %s: %s", path, e)
+
+
 def _start_vllm_native_server(
     *,
     model_name_or_path: str,
@@ -811,6 +836,7 @@ def _start_vllm_native_server(
 ) -> VllmServerHandle:
     """Start `vllm serve` in-process and wait until `/v1/models` responds."""
 
+    _remove_stale_tpu_lockfiles()
     resolved_port = port if port is not None else 8000
 
     vllm_bin = shutil.which("vllm") or "vllm"
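For readers who want to exercise the cleanup logic outside the server, here is a hedged sketch of the same best-effort pattern as a standalone helper. It is not the committed code: `remove_stale_lockfiles` is a hypothetical name, it takes the paths as a parameter so it can be pointed at temporary files, and it calls `os.unlink` directly and catches `FileNotFoundError` instead of checking `os.path.exists` first, which avoids a small window between the check and the delete.

```python
import logging
import os
from collections.abc import Iterable

logger = logging.getLogger(__name__)


def remove_stale_lockfiles(paths: Iterable[str]) -> list[str]:
    """Best-effort delete of each path; return the paths actually removed.

    Hypothetical variant of the commit's helper, parameterized for testing.
    """
    removed: list[str] = []
    for path in paths:
        try:
            os.unlink(path)
        except FileNotFoundError:
            # Nothing to clean up for this path.
            continue
        except OSError as e:
            # Permission problems and similar are logged, never raised.
            logger.warning("Could not remove lockfile %s: %s", path, e)
            continue
        logger.info("Removed stale lockfile: %s", path)
        removed.append(path)
    return removed
```

Called with a mix of existing and missing paths, the helper returns only the ones it deleted, and a second call with the same paths returns an empty list. Returning the removed paths is a design choice made here for testability; the function in the diff returns `None` and reports through the logger only.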
