Hardening: Prevent Boot Failures from Stale CUDA IPC Handles #458

## Describe the Issue

In Single Copy Mode, `vllm_manager.py` creates a `vllm_bridge_config.json` file to share CUDA IPC handles between vLLM and the Trainer. However, this file is not cleaned up when the process terminates.

If a training run crashes, is manually killed, or times out, the stale config file persists. Subsequent runs then attempt to attach using the CUDA IPC handles stored in the old file, which are no longer valid because the vLLM process that exported them is gone. This results in an immediate `RuntimeError: CUDA error: invalid handle` or a similar attachment failure at startup.

## Environment/API Details

- **Environment Class/Name:** `example_trainer/vllm_manager.py`
- **Environment Configuration:** `--openai.server_type vllm` (Single Copy Mode)
- **API Endpoint/Method Involved:** `launch_vllm_server` and `cleanup_vllm`

## Steps to Reproduce

1. Start a training run in Single Copy mode.
2. Terminate the Trainer process abruptly (e.g., interrupt it with `Ctrl+C` or force-kill it with `kill -9`).
3. Attempt to start a new training run immediately.
4. Observe that the Trainer tries to load the stale `vllm_bridge_config.json` and fails to initialize.
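
The persistence of the file after an abrupt kill can be simulated without a GPU or vLLM. The sketch below is only a stand-in for the steps above: the child script, the placeholder JSON payload, and the helper name `config_survives_kill` are hypothetical and not part of Atropos. It starts a child process that writes a file named `vllm_bridge_config.json` in a temporary directory, kills the child, and reports whether the file is still there.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Hypothetical stand-in for the manager: writes a config file, registers an
# atexit cleanup, then idles as a long-running server would.
CHILD_SCRIPT = """
import atexit, pathlib, sys, time
path = pathlib.Path(sys.argv[1])
path.write_text('{"ipc_handles": "placeholder"}')
atexit.register(path.unlink)
time.sleep(60)
"""


def config_survives_kill() -> bool:
    """Return True if the config file is still on disk after the child is killed."""
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "vllm_bridge_config.json"
        child = subprocess.Popen([sys.executable, "-c", CHILD_SCRIPT, str(config)])

        # Wait (up to 10 s) for the child to create the file.
        deadline = time.monotonic() + 10
        while not config.exists() and time.monotonic() < deadline:
            time.sleep(0.05)

        child.kill()  # SIGKILL on POSIX: the child gets no chance to clean up
        child.wait()
        return config.exists()


if __name__ == "__main__":
    print("stale config left behind:", config_survives_kill())
```

Even with an `atexit` hook registered in the child, the file remains after the kill, which is the state a subsequent run would find on disk.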

## Interaction Details (if applicable)

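- **Expected Behavior:**
  1. The manager should explicitly delete any existing `vllm_bridge_config.json` *before* launching a new vLLM server.
  2. The manager should register an `atexit` hook or signal handler that deletes the config file when the process terminates. Because `kill -9` (SIGKILL) cannot be caught and bypasses `atexit`, the pre-launch deletion in step 1 is still required as a fallback.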
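
A minimal sketch of both behaviors, using only the Python standard library. The file location (`BRIDGE_CONFIG`) and the helper names `remove_bridge_config`, `install_cleanup`, and `prepare_launch` are assumptions for illustration. The actual path and the internals of `launch_vllm_server` and `cleanup_vllm` in `vllm_manager.py` may differ.

```python
import atexit
import signal
import sys
from pathlib import Path

# Assumed location; the real path used by vllm_manager.py may differ.
BRIDGE_CONFIG = Path("vllm_bridge_config.json")


def remove_bridge_config(path: Path = BRIDGE_CONFIG) -> bool:
    """Delete the bridge config if present; return True if a file was removed."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False


def _exit_on_signal(signum, frame):
    # Raise SystemExit so that atexit callbacks (including the cleanup) run.
    sys.exit(128 + signum)


def install_cleanup(path: Path = BRIDGE_CONFIG) -> None:
    """Arm cleanup for normal exit, SIGINT (Ctrl+C), and SIGTERM."""
    atexit.register(remove_bridge_config, path)
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _exit_on_signal)


def prepare_launch(path: Path = BRIDGE_CONFIG) -> None:
    """Call before starting vLLM: clear stale state, then arm exit cleanup."""
    remove_bridge_config(path)
    install_cleanup(path)
```

In this sketch, `prepare_launch()` would be called once from the main thread before the vLLM server is started. The pre-launch deletion covers runs that ended with SIGKILL, and the `atexit` and signal hooks cover orderly shutdowns.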

## Setup Details

- **OS:** Linux
- **Python Version:** 3.10+
- **Atropos Version:** commit c20c852
- **Relevant Libraries/Versions:** `torch`, `vllm`

## Additional Context & Logs

This fix improves the "re-runability" of Atropos in automated environments where jobs may be pre-empted or restarted frequently.

