Describe the Issue
In Single Copy (Shared Memory) mode, the Trainer attaches to vLLM's GPU memory via CUDA IPC. The original implementation blindly accepted the dtype from the vLLM metadata without validating it against the Trainer's configuration.
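This creates a high risk of silent data corruption. If vLLM is running in float16 and the Trainer is configured for bfloat16, the attach still succeeds because both formats are 16 bits wide, so shapes and byte sizes line up. The bit layouts differ, however (float16 has 5 exponent bits and 10 mantissa bits, bfloat16 has 8 and 7), so the Trainer misreads every value. This does not result in a crash; it results in "garbage" weights and training divergence that is extremely difficult to debug.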
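The effect can be illustrated without a GPU. The following is a minimal sketch using only the Python standard library, not the Trainer's code: it takes the 16-bit pattern that stores a value as float16 and reads the same bits as bfloat16 (treated here as the upper half of a float32). The helper names are hypothetical.

```python
import struct


def fp16_bits(value: float) -> int:
    """Return the 16-bit pattern that stores `value` as IEEE float16."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def read_bits_as_bf16(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16.

    bfloat16 is the upper 16 bits of a float32, so padding the pattern
    with 16 zero bits and decoding as float32 gives the exact value.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def misread(value: float) -> float:
    """Value seen by a bfloat16 reader when the writer stored float16."""
    return read_bits_as_bf16(fp16_bits(value))


if __name__ == "__main__":
    for weight in (1.0, 0.5, 3.0):
        print(f"stored as float16: {weight}  read as bfloat16: {misread(weight)!r}")
    # 1.0 is stored as 0x3C00, which decodes to 0.0078125 as bfloat16.
```

No exception is raised at any point, which is why the mismatch only surfaces later as diverging loss.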
Environment/API Details
- Environment Class/Name: example_trainer/model.py
- Environment Configuration: --openai.server_type vllm (Single Copy Mode)
- API Endpoint/Method Involved: reconstruct_vllm_tensor
Steps to Reproduce
- Launch vLLM in fp16.
- Configure Atropos Trainer in bf16.
- Start training in Single Copy mode.
- Observe that the Trainer attaches successfully but interprets weight values incorrectly, leading to immediate loss divergence.
Interaction Details (if applicable)
- Expected Behavior: The Trainer should perform a bit-level validation of the dtype and raise a RuntimeError if the vLLM source and Trainer target do not match exactly.
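The expected check could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual body of reconstruct_vllm_tensor: the function names, the alias table, and the use of plain strings for dtypes are all hypothetical, chosen so the example runs without torch.

```python
# Hypothetical alias table; the exact dtype spellings used in the vLLM
# metadata and the Trainer config are an assumption.
_ALIASES = {
    "fp16": "float16",
    "half": "float16",
    "bf16": "bfloat16",
    "fp32": "float32",
    "float": "float32",
}


def canonical_dtype(name: str) -> str:
    """Normalize spellings such as 'torch.bfloat16' or 'bf16'."""
    key = name.strip().lower().removeprefix("torch.")
    return _ALIASES.get(key, key)


def validate_dtype(source_dtype: str, target_dtype: str) -> None:
    """Raise RuntimeError unless both sides name exactly the same dtype."""
    source = canonical_dtype(source_dtype)
    target = canonical_dtype(target_dtype)
    if source != target:
        raise RuntimeError(
            f"dtype mismatch: vLLM exposes {source} but the Trainer is "
            f"configured for {target}; refusing to attach to shared memory."
        )


if __name__ == "__main__":
    validate_dtype("torch.bfloat16", "bf16")  # same dtype, passes silently
    try:
        validate_dtype("float16", "bfloat16")
    except RuntimeError as err:
        print(err)
```

Running the check before the IPC handle is opened means a misconfigured run stops at startup instead of producing corrupted weights.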
Setup Details
- OS: Linux
- Python Version: 3.10+
- Atropos Version: commit c20c852
- Relevant Libraries/Versions: torch, vllm
Additional Context & Logs
This fix also removes broad try-except blocks that were masking CUDA initialization failures, moving the framework to a "Fail-Fast" architecture for better reliability.
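As a rough illustration of the difference, the sketch below contrasts a broad handler with the fail-fast style. The functions are hypothetical stand-ins, not code from the repository, and the simulated failure is raised by hand rather than by CUDA.

```python
def init_device():
    """Hypothetical stand-in for an initialization step that fails."""
    raise OSError("CUDA driver not available")


def attach_masked():
    # Anti-pattern: the broad handler swallows the failure and returns a
    # placeholder, so the problem resurfaces later, far from its cause.
    try:
        return init_device()
    except Exception:
        return None


def attach_fail_fast():
    # Fail-fast: no handler, so the original error and its traceback
    # reach the caller immediately.
    return init_device()


if __name__ == "__main__":
    print("masked result:", attach_masked())
    try:
        attach_fail_fast()
    except OSError as err:
        print("fail-fast error:", err)
```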