Commit 9c0e2b6
[inference] Pre-start cleanup of stale TPU lockfiles in vLLM native server
The existing remove_tpu_lockfile_on_exit (atexit handler) only fires on
clean Python exits. When an Iris worker is preempted (SIGKILL) or
OOM-killed, the handler never runs and /tmp/libtpu_lockfile persists.
The next task on the same worker then fails with "TPU initialization
failed: open(/dev/vfio/N): Device or resource busy", and every retry
allowed by --max-retries fails too, because Iris re-assigns the task to
the same worker.
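
The failure mode described above can be reproduced without a TPU. The sketch below is illustrative only and is not code from this commit: it assumes a POSIX system, and `CHILD_SOURCE`, `marker_survives_sigkill`, and the `fake_lockfile` name are hypothetical stand-ins for the real lockfile and process. It shows that a cleanup registered with `atexit` does not run when the process receives SIGKILL.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child program: creates a marker file (standing in for the
# TPU lockfile), registers an atexit handler to delete it, then waits.
CHILD_SOURCE = """
import atexit, os, sys, time
marker = sys.argv[1]
open(marker, "w").close()
atexit.register(os.unlink, marker)
print("ready", flush=True)
time.sleep(60)
"""


def marker_survives_sigkill():
    """Return True if the marker file is still present after SIGKILL."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "fake_lockfile")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        child.stdout.readline()  # block until the child has registered its handler
        child.send_signal(signal.SIGKILL)  # cannot be caught; atexit never runs
        child.wait()
        child.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("marker left behind:", marker_survives_sigkill())
```

Because the kernel terminates the process without giving the interpreter a chance to shut down, any cleanup has to happen somewhere else, for example before the next start.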
Fix: call _remove_stale_tpu_lockfiles() at the top of
_start_vllm_native_server() before spawning the vllm process. This
unconditionally deletes /tmp/libtpu_lockfile and
/tmp/libtpu.so_lockfile if they exist, so a re-launched task on a
recycled worker recovers instead of blocking on a stale lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent f8d4889 commit 9c0e2b6
1 file changed: 26 additions, 0 deletions.

[Diff content not preserved. Per the line-number columns, 25 lines were added at new lines 804-828 and 1 line was added at new line 839.]