Hardening: Prevent Accidental Termination of Unrelated Processes #460

@RUFFY-369

Describe the Issue

The kill_process_on_port utility in example_trainer/vllm_manager.py used a broad "find and kill" strategy that could terminate processes unrelated to Atropos.

The utility ran lsof -t -i :port to collect the PIDs bound to the port, then immediately sent SIGTERM/SIGKILL to each of them without checking what they were. If an unrelated system process (e.g., a database, an SSH session, or a monitoring agent) had been assigned that port by the OS after a previous vLLM process exited, Atropos would terminate it.

Environment/API Details

  • Environment Class/Name: example_trainer/vllm_manager.py
  • API Endpoint/Method Involved: kill_process_on_port

Steps to Reproduce

  1. Run an unrelated service on a port that Atropos intends to use (e.g., 9001).
  2. Start the Atropos Trainer.
  3. Observe that Atropos kills the unrelated service without any verification of its identity.
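For step 1, any process holding the port will do. A minimal sketch of a stand-in "unrelated service" is shown below; the helper name start_dummy_listener is hypothetical, and port 9001 is only the example port from the steps above.

```python
import socket


def start_dummy_listener(port: int = 9001, host: str = "127.0.0.1") -> socket.socket:
    """Bind and listen on `port`, standing in for an unrelated service.

    Hypothetical helper for reproduction only; it is not part of Atropos.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick re-binding between repeated reproduction attempts.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(1)
    return sock
```

Keep the returned socket open in a long-running Python session (for example, call start_dummy_listener() and then wait on input()), start the trainer, and check whether that session is still alive afterwards.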

Interaction Details (if applicable)

  • Expected Behavior:
    1. The manager should verify the identity of a process before killing it.
    2. Verification should check the process command line (/proc/{pid}/cmdline) for relevant keywords like vllm, python, or atropos.
    3. If the process does not match, Atropos should skip the kill and warn the user about the port collision.
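The expected behavior above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Atropos implementation: the function names (read_cmdline, cmdline_matches, terminate_verified) are hypothetical, the keyword list is taken from the issue text, and the PIDs are assumed to come from the existing lsof lookup.

```python
import os
import signal
from typing import Callable, Iterable, List, Optional, Tuple

# Keywords taken from the issue text; all helper names here are hypothetical.
EXPECTED_KEYWORDS: Tuple[str, ...] = ("vllm", "python", "atropos")


def read_cmdline(pid: int) -> Optional[str]:
    """Return the command line of `pid` from /proc, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            raw = fh.read()
    except OSError:
        # Process already exited, permission denied, or no /proc (non-Linux).
        return None
    # Arguments are NUL-separated; join them with spaces for matching.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def cmdline_matches(
    cmdline: Optional[str], keywords: Iterable[str] = EXPECTED_KEYWORDS
) -> bool:
    """True if the command line contains any expected keyword (case-insensitive)."""
    if not cmdline:
        return False
    lowered = cmdline.lower()
    return any(keyword in lowered for keyword in keywords)


def terminate_verified(
    pids: Iterable[int],
    read: Callable[[int], Optional[str]] = read_cmdline,
    kill: Callable[[int, int], None] = os.kill,
) -> Tuple[List[int], List[int]]:
    """Send SIGTERM only to PIDs whose command line matches; report the rest."""
    killed: List[int] = []
    skipped: List[int] = []
    for pid in pids:
        if not cmdline_matches(read(pid)):
            # Identity could not be confirmed: leave the process alone.
            print(f"Port collision: PID {pid} does not look like a vLLM/Atropos process; skipping.")
            skipped.append(pid)
            continue
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            # The process exited between the check and the signal.
            skipped.append(pid)
            continue
        killed.append(pid)
    return killed, skipped
```

The read and kill parameters are injectable so the logic can be exercised without touching real processes. Note that a bare python keyword matches any Python program, so a stricter keyword set may be preferable on shared machines.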

Setup Details

  • OS: Linux
  • Python Version: 3.10+
  • Atropos Version: commit c20c852

Additional Context & Logs

Hardening this logic is essential for running Atropos on shared multi-tenant clusters, where port collisions are common and accidentally terminating another user's process is a severe violation of safety protocols.