[inference] Pre-start cleanup of stale TPU lockfiles in vLLM native server #4675
AlienKevin wants to merge 1 commit into main
Conversation
The existing `remove_tpu_lockfile_on_exit` (an `atexit` handler) only fires on clean Python exits. When an Iris worker is preempted (SIGKILL) or OOM-killed, the handler never runs and `/tmp/libtpu_lockfile` persists. The next task on the same worker then fails with "TPU initialization failed: open(/dev/vfio/N): Device or resource busy", and all `--max-retries` retries fail because Iris re-assigns the task to the same worker.

Fix: call `_remove_stale_tpu_lockfiles()` at the top of `_start_vllm_native_server()`, before spawning the vllm process. This unconditionally deletes `/tmp/libtpu_lockfile` and `/tmp/libtpu.so_lockfile` if they exist, so a re-launched task on a recycled worker recovers instead of blocking on a stale lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`/tmp` shouldn't be shared between runs at all, so there shouldn't be a need to clean up the lockfile. Are we accidentally mounting docker runtime containers in lib/iris to the same `/tmp` mount point? They should be isolated even if they mount to the same root prefix on the running host. FWIW: "TPU initialization failed: open(/dev/vfio/N) is busy" is, I think, a different issue, and indicates another task has genuinely opened the TPU device. Only jobs that request the TPU attribute should be able to do this, so it shouldn't be a conflict with another user. Claude, can you also validate that our scheduling "deducts" available TPU counts when a user requests a given TPU type? Only a single TPU task should be assigned to a TPU VM.
Claude finished @rjpower's task in 4m 1s. Investigation:
Found and verified during the SWE-ZERO 1B-token scaling experiment (#4666): a 50-shard distributed pipeline in which preempted shards retried on the same workers and got stuck on the stale lockfile. Adding this pre-start cleanup resolved the issue.