Describe the Issue
In Single Copy Mode, vllm_manager.py creates a vllm_bridge_config.json file to share CUDA IPC handles between vLLM and the Trainer. However, this file was not being cleaned up when the process terminated.
If a training run crashed, was manually killed, or timed out, the stale config file would persist. Subsequent runs would attempt to attach using the handles stored in the old file, which are not valid for the new vLLM process. This results in an immediate RuntimeError: CUDA error: invalid handle, or a similar attachment failure, at startup.
Environment/API Details
- Environment Class/Name: example_trainer/vllm_manager.py
- Environment Configuration: --openai.server_type vllm (Single Copy Mode)
- API Endpoint/Method Involved: launch_vllm_server and cleanup_vllm
Steps to Reproduce
- Start a training run in Single Copy mode.
- Interrupt or force-kill the Trainer process (e.g., Ctrl+C or kill -9).
- Attempt to start a new training run immediately.
- Observe that the Trainer tries to load the stale vllm_bridge_config.json and fails to initialize.
Interaction Details (if applicable)
- Expected Behavior:
  - The manager should explicitly delete any existing vllm_bridge_config.json before launching a new vLLM server.
  - The manager should use an atexit or signal handler to delete the config file when the process terminates.
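The two expected behaviors above could be sketched roughly as follows. This is a minimal illustration, not the actual vllm_manager.py code: the config path and the helper names remove_bridge_config and install_cleanup_handlers are hypothetical. Note that kill -9 (SIGKILL) cannot be caught by any handler, so deleting the stale file before launch is what covers that case; the atexit and SIGTERM hooks only cover orderly shutdowns.

```python
import atexit
import signal
import sys
from pathlib import Path

# Assumed location; the real path is whatever vllm_manager.py writes.
BRIDGE_CONFIG = Path("vllm_bridge_config.json")


def remove_bridge_config(path: Path = BRIDGE_CONFIG) -> bool:
    """Delete the bridge config if present; return True if a file was removed."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False


def install_cleanup_handlers(path: Path = BRIDGE_CONFIG) -> None:
    """Remove the config on interpreter exit and on SIGTERM.

    Must be called from the main thread, because signal.signal requires it.
    """
    # Runs on normal exit and after an unhandled KeyboardInterrupt (Ctrl+C).
    atexit.register(remove_bridge_config, path)

    def _on_sigterm(signum, frame):
        # sys.exit raises SystemExit, which lets atexit callbacks run.
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

Under these assumptions, the manager would call remove_bridge_config() immediately before starting the vLLM server, then install_cleanup_handlers() once the new config file has been written.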
Setup Details
- OS: Linux
- Python Version: 3.10+
- Atropos Version: commit c20c852
- Relevant Libraries/Versions: torch, vllm
Additional Context & Logs
This fix improves the "re-runability" of Atropos in automated environments where jobs may be pre-empted or restarted frequently.