Describe the Issue
The kill_process_on_port utility in example_trainer/vllm_manager.py used a broad "find and kill" strategy that poses a significant risk to system stability.
The utility used lsof -t -i :port to find PIDs and then immediately issued SIGTERM/SIGKILL to those PIDs. If an unrelated system process (e.g., a database, an SSH session, or a monitoring agent) happened to be assigned that port by the OS after a previous vLLM process exited, Atropos would accidentally terminate it.
Environment/API Details
- Environment Class/Name: example_trainer/vllm_manager.py
- API Endpoint/Method Involved: kill_process_on_port
Steps to Reproduce
- Run an unrelated service on a port that Atropos intends to use (e.g., 9001).
- Start the Atropos Trainer.
- Observe that Atropos kills the unrelated service without any verification of its identity.
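For the first step, any process that holds the port will do. The following is a minimal sketch of such a stand-in service; the helper name `start_dummy_listener` is hypothetical and not part of Atropos, and 9001 is simply the example port from the steps above.

```python
import os
import socket
import time

PORT = 9001  # example port from the reproduction steps


def start_dummy_listener(port: int = PORT) -> socket.socket:
    """Bind and listen on localhost so the port is held by this process."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    return sock


if __name__ == "__main__":
    listener = start_dummy_listener()
    print(f"PID {os.getpid()} is holding port {listener.getsockname()[1]}")
    try:
        # Stay alive until interrupted or killed by another process.
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        listener.close()
```

Run this in one terminal, start the trainer in another, and check whether the listener is still alive afterwards.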
Interaction Details (if applicable)
- Expected Behavior:
- The manager should verify the identity of a process before killing it.
- Verification should check the process command line (/proc/{pid}/cmdline) for relevant keywords like vllm, python, or atropos.
- If the process does not match, Atropos should skip the kill and alert the user of a port collision.
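The expected behavior above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual fix: the names `read_cmdline`, `cmdline_matches`, `kill_if_expected`, and `EXPECTED_KEYWORDS` are hypothetical, and the keyword list is taken directly from the examples in this issue.

```python
import os
import signal
from typing import Iterable, Optional

# Hypothetical keyword list, copied from the examples in this issue.
EXPECTED_KEYWORDS = ("vllm", "python", "atropos")


def read_cmdline(pid: int) -> Optional[str]:
    """Return the command line of `pid`, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:
        # Process already gone, no permission, or no /proc on this system.
        return None
    # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def cmdline_matches(
    cmdline: Optional[str], keywords: Iterable[str] = EXPECTED_KEYWORDS
) -> bool:
    """True if the command line contains any expected keyword."""
    if not cmdline:
        return False
    lowered = cmdline.lower()
    return any(keyword in lowered for keyword in keywords)


def kill_if_expected(pid: int, port: int) -> bool:
    """Send SIGTERM only if the process looks like one of ours."""
    cmdline = read_cmdline(pid)
    if not cmdline_matches(cmdline):
        print(
            f"Port collision on {port}: PID {pid} ({cmdline!r}) "
            "does not look like a vLLM/Atropos process; skipping kill."
        )
        return False
    try:
        os.kill(pid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        return False
    return True
```

One caveat on the keyword choice: matching on python alone would also accept unrelated Python services, so a stricter list (or a check against PIDs that Atropos itself spawned) may be preferable on shared machines.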
Setup Details
- OS: Linux
- Python Version: 3.10+
- Atropos Version: commit c20c852
Additional Context & Logs
Hardening this logic is essential for running Atropos on shared multi-tenant clusters where port collisions are common and accidental termination of other users' processes is a severe violation of safety protocols.