Hardening: Prevent Boot Failures from Stale CUDA IPC Handles #458

@RUFFY-369

Describe the Issue

In Single Copy Mode, vllm_manager.py creates a vllm_bridge_config.json file to share CUDA IPC handles between vLLM and the Trainer. However, this file is not cleaned up when the process terminates.

If a training run crashes, is manually killed, or times out, the stale config file persists. Subsequent runs then attempt to attach to the memory addresses stored in the old file, which no longer exist in the new vLLM process. The result is an immediate RuntimeError: CUDA error: invalid handle, or a similar attachment failure, at startup.

Environment/API Details

  • Environment Class/Name: example_trainer/vllm_manager.py
  • Environment Configuration: --openai.server_type vllm (Single Copy Mode)
  • API Endpoint/Method Involved: launch_vllm_server and cleanup_vllm

Steps to Reproduce

  1. Start a training run in Single Copy mode.
  2. Terminate the Trainer process abruptly (e.g., interrupt it with Ctrl+C or force-kill it with kill -9).
  3. Attempt to start a new training run immediately.
  4. Observe that the Trainer tries to load the stale vllm_bridge_config.json and fails to initialize.
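
The persistence of the file in the steps above can be imitated without a GPU. The following is a minimal sketch, not code from Atropos: the child script, the placeholder JSON contents, and the helper name config_survives_sigkill are all hypothetical. A child process writes a placeholder config, registers an atexit cleanup, and is then killed with SIGKILL (the signal sent by kill -9). Because SIGKILL cannot be caught, the atexit hook never runs and the file is left behind.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for the process that writes vllm_bridge_config.json.
# The JSON payload is a placeholder, not the real bridge config format.
CHILD = textwrap.dedent("""
    import atexit, json, os, sys, time
    path = sys.argv[1]
    atexit.register(lambda: os.path.exists(path) and os.remove(path))
    with open(path, "w") as f:
        json.dump({"ipc_handles": "placeholder"}, f)
    print("ready", flush=True)   # tell the parent the file now exists
    time.sleep(60)               # pretend to be a long-running job
""")


def config_survives_sigkill(path):
    """Start the child, SIGKILL it, and report whether the file is still there."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()            # block until the child has written the file
    proc.send_signal(signal.SIGKILL)  # same signal as `kill -9`
    proc.wait()
    proc.stdout.close()
    return os.path.exists(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stale = os.path.join(tmp, "vllm_bridge_config.json")
        print("stale file left behind:", config_survives_sigkill(stale))
```

This only demonstrates that the file outlives the process; the CUDA attachment failure itself requires a real Single Copy Mode run.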

Interaction Details (if applicable)

  • Expected Behavior:
    1. The manager should explicitly delete any existing vllm_bridge_config.json before launching a new vLLM server.
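    2. The manager should register an atexit hook and signal handlers (e.g., for SIGINT and SIGTERM) that delete the config file when the process terminates. kill -9 (SIGKILL) cannot be caught, so the pre-launch deletion in item 1 is still required.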
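
The two expected behaviors could be sketched roughly as follows. This is an illustration under assumptions, not the actual vllm_manager.py code: BRIDGE_CONFIG_PATH, remove_bridge_config, install_cleanup_handlers, and prepare_launch are hypothetical names, and the real config path may differ.

```python
import atexit
import os
import signal
import sys

# Assumed location; the path actually used by vllm_manager.py may differ.
BRIDGE_CONFIG_PATH = "vllm_bridge_config.json"


def remove_bridge_config(path=BRIDGE_CONFIG_PATH):
    """Delete the bridge config if present. Returns True if a file was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def install_cleanup_handlers(path=BRIDGE_CONFIG_PATH):
    """Remove the config on normal exit and on SIGINT/SIGTERM."""
    atexit.register(remove_bridge_config, path)

    def _on_signal(signum, frame):
        # sys.exit raises SystemExit, so the atexit hook above runs
        # during normal interpreter shutdown.
        sys.exit(128 + signum)

    # signal.signal must be called from the main thread.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _on_signal)


def prepare_launch(path=BRIDGE_CONFIG_PATH):
    """Hypothetical hook to call before launching a new vLLM server."""
    removed_stale = remove_bridge_config(path)  # behavior 1: clear leftovers
    install_cleanup_handlers(path)              # behavior 2: clean up on exit
    return removed_stale
```

Note that installing a SIGINT handler this way replaces Python's default KeyboardInterrupt behavior, which may or may not be desirable in the Trainer. Neither hook runs on SIGKILL, so the deletion at the start of prepare_launch is what protects the next run in that case.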

Setup Details

  • OS: Linux
  • Python Version: 3.10+
  • Atropos Version: commit c20c852
  • Relevant Libraries/Versions: torch, vllm

Additional Context & Logs

Fixing this would improve the "re-runability" of Atropos in automated environments where jobs may be pre-empted or restarted frequently.
