Hardening: Prevent Silent Data Corruption in Single-Copy Mode #454

## Describe the Issue

In Single-Copy (shared memory) mode, the Trainer attaches to vLLM's GPU memory via CUDA IPC. The original implementation accepted the `dtype` from the vLLM metadata without validating it against the Trainer's configuration.

This creates a high risk of **silent data corruption**: if vLLM is running in `float16` and the Trainer is configured for `bfloat16`, the Trainer reinterprets the same 16 bits under a different layout (`float16` has 5 exponent bits and 10 mantissa bits; `bfloat16` has 8 exponent bits and 7 mantissa bits). Because both types are 2 bytes wide, buffer sizes and shapes still match, so nothing crashes. The result is "garbage" weights and training divergence that is extremely difficult to debug.
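To make the failure mode concrete, here is a minimal, standard-library-only sketch of what happens when bits written as `float16` are read back as `bfloat16`. The helper names `fp16_bits` and `bits_as_bf16` are hypothetical and exist only for this illustration; they are not part of the Trainer or vLLM.

```python
import struct


def fp16_bits(value: float) -> int:
    """Return the 16-bit pattern that stores `value` as IEEE 754 float16."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_as_bf16(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16.

    bfloat16 is the upper 16 bits of a float32, so shifting the pattern
    into the high half of a 32-bit word and decoding it as float32 gives
    the exact bfloat16 value.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


# A weight of 1.0 stored as float16 has the bit pattern 0x3C00.
# Read as bfloat16, the same bits decode to 2**-7.
print(bits_as_bf16(fp16_bits(1.0)))  # 0.0078125
```

No exception is raised at any point: every 16-bit pattern is a valid value in both formats, which is why the mismatch only shows up later as diverging loss.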

## Environment/API Details

- **Environment Class/Name:** `example_trainer/model.py`
- **Environment Configuration:** `--openai.server_type vllm` (Single Copy Mode)
- **API Endpoint/Method Involved:** `reconstruct_vllm_tensor`

## Steps to Reproduce

1. Launch vLLM in `fp16`.
2. Configure Atropos Trainer in `bf16`.
3. Start training in Single Copy mode.
4. Observe that the Trainer attaches successfully but interprets weight values incorrectly, leading to immediate loss divergence.

## Interaction Details (if applicable)

- **Expected Behavior:** Before attaching, the Trainer should compare the dtype reported in the vLLM metadata against its own configured dtype and raise a `RuntimeError` if the vLLM source and Trainer target do not match exactly.
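The expected check could look roughly like the sketch below. It is not the actual patch: it assumes the vLLM metadata is a dict with a string `dtype` entry, and the names `canonical_dtype`, `validate_shared_dtype`, and the alias table are hypothetical. It uses plain strings so it runs without `torch`; a real implementation would likely compare `torch.dtype` objects inside `reconstruct_vllm_tensor`.

```python
# Hypothetical alias table mapping common spellings to one canonical name.
_DTYPE_ALIASES = {
    "float16": "float16", "fp16": "float16", "half": "float16",
    "bfloat16": "bfloat16", "bf16": "bfloat16",
    "float32": "float32", "fp32": "float32", "float": "float32",
}


def canonical_dtype(name: str) -> str:
    """Normalize a dtype spelling such as 'torch.float16' or 'fp16'."""
    key = name.strip().lower().removeprefix("torch.")
    if key not in _DTYPE_ALIASES:
        # Unknown dtypes are rejected rather than guessed at.
        raise RuntimeError(f"Unrecognized dtype: {name!r}")
    return _DTYPE_ALIASES[key]


def validate_shared_dtype(metadata: dict, trainer_dtype: str) -> str:
    """Raise RuntimeError unless source and target dtypes match exactly."""
    source = canonical_dtype(str(metadata["dtype"]))
    target = canonical_dtype(trainer_dtype)
    if source != target:
        raise RuntimeError(
            f"dtype mismatch: vLLM exports {source}, Trainer expects {target}; "
            "refusing to attach to shared memory"
        )
    return source
```

With this in place, `validate_shared_dtype({"dtype": "torch.float16"}, "bf16")` raises immediately instead of letting training start on misread weights.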

## Setup Details

- **OS:** Linux
- **Python Version:** 3.10+
- **Atropos Version:** commit c20c852
- **Relevant Libraries/Versions:** `torch`, `vllm`

## Additional Context & Logs

This fix also removes broad `try-except` blocks that were masking CUDA initialization failures, moving the framework to a "Fail-Fast" architecture for better reliability.
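The difference between the two styles can be illustrated with a small sketch. The functions below are hypothetical stand-ins, not code from `example_trainer/model.py`, and `broken_init` merely simulates an initialization failure.

```python
def attach_masked(init):
    """Anti-pattern: a broad except swallows the real failure."""
    try:
        return init()
    except Exception:
        return None  # the caller cannot tell why attachment failed


def attach_fail_fast(init):
    """Fail-fast: no handler, so the original error propagates unchanged."""
    return init()


def broken_init():
    # Stand-in for a CUDA initialization step that fails.
    raise RuntimeError("simulated CUDA initialization failure")


# The masked version silently yields None and the failure resurfaces later,
# far from its cause.
print(attach_masked(broken_init))

# The fail-fast version stops at the point of failure with the real message.
try:
    attach_fail_fast(broken_init)
except RuntimeError as exc:
    print(f"startup aborted: {exc}")
```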
