Hardening: Prevent Silent Data Corruption in Single-Copy Mode #454

@RUFFY-369

Describe the Issue

In Single Copy (Shared Memory) mode, the Trainer attaches to vLLM's GPU memory via CUDA IPC. The original implementation blindly accepted the dtype from the vLLM metadata without validating it against the Trainer's configuration.

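This creates a high risk of silent data corruption. float16 and bfloat16 are both 16 bits wide, so the shared buffer has the expected size, but float16 uses 5 exponent bits and 10 mantissa bits while bfloat16 uses 8 exponent bits and 7 mantissa bits. If vLLM is running in float16 and the Trainer is configured for bfloat16, the Trainer decodes every weight to a wrong value. This does not result in a crash; it results in "garbage" weights and training divergence that is extremely difficult to debug.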
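
To make the failure mode concrete, here is a minimal sketch that uses only the Python standard library and does not touch the Trainer or vLLM code. It packs a few values as float16 and then decodes the same 16 bits as bfloat16, which is roughly what a mismatched Trainer would do with the shared buffer. The helper names `fp16_bits` and `bits_as_bf16` are hypothetical and exist only for this illustration.

```python
import struct


def fp16_bits(value: float) -> int:
    """Return the 16-bit IEEE 754 half-precision pattern for `value`."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_as_bf16(bits: int) -> float:
    """Decode a 16-bit pattern as bfloat16.

    bfloat16 is the upper 16 bits of a float32, so shifting the pattern
    left by 16 and decoding it as float32 gives the bfloat16 value.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


# Same bits, two interpretations: what vLLM wrote vs. what the Trainer reads.
for weight in (1.0, 0.5, -0.25):
    bits = fp16_bits(weight)
    print(f"fp16 {weight:>6} = 0x{bits:04x} -> read as bf16: {bits_as_bf16(bits):.6e}")
```

For example, 1.0 is stored as 0x3C00 in float16, and the same pattern decoded as bfloat16 is 0.0078125. The result is a valid finite number, which is why nothing crashes and the corruption goes unnoticed.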

Environment/API Details

  • Environment Class/Name: example_trainer/model.py
  • Environment Configuration: --openai.server_type vllm (Single Copy Mode)
  • API Endpoint/Method Involved: reconstruct_vllm_tensor

Steps to Reproduce

  1. Launch vLLM in fp16.
  2. Configure Atropos Trainer in bf16.
  3. Start training in Single Copy mode.
  4. Observe that the Trainer attaches successfully but interprets weight values incorrectly, leading to immediate loss divergence.

Interaction Details (if applicable)

  • Expected Behavior: Before attaching, the Trainer should validate the dtype reported in the vLLM metadata against its own configured dtype and raise a RuntimeError unless the vLLM source and Trainer target match exactly.
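
The expected behavior could be sketched roughly as below. This is an assumption-laden illustration, not the code in `reconstruct_vllm_tensor`: the function names `canonical_dtype` and `validate_shared_dtype`, the alias table, and the idea that both sides expose the dtype as a string are all hypothetical, since the issue does not show the metadata format.

```python
# Hypothetical alias table; the real metadata format is not shown in this issue.
_DTYPE_ALIASES = {
    "float16": "float16", "fp16": "float16", "half": "float16",
    "bfloat16": "bfloat16", "bf16": "bfloat16",
    "float32": "float32", "fp32": "float32", "float": "float32",
}


def canonical_dtype(name: str) -> str:
    """Map spellings such as 'torch.bfloat16' or 'bf16' to one canonical name."""
    key = name.strip().lower()
    if key.startswith("torch."):
        key = key[len("torch."):]
    if key not in _DTYPE_ALIASES:
        # Refuse to guess: an unknown dtype is as dangerous as a mismatch.
        raise RuntimeError(f"Unrecognized dtype {name!r}; refusing to attach")
    return _DTYPE_ALIASES[key]


def validate_shared_dtype(source_dtype: str, trainer_dtype: str) -> str:
    """Raise RuntimeError unless the vLLM and Trainer dtypes match exactly."""
    source = canonical_dtype(source_dtype)
    target = canonical_dtype(trainer_dtype)
    if source != target:
        raise RuntimeError(
            f"dtype mismatch in Single Copy mode: vLLM shares {source}, "
            f"Trainer is configured for {target}"
        )
    return source
```

Called before the shared tensor is reconstructed, `validate_shared_dtype("torch.float16", "bf16")` would stop the run with a clear message instead of letting training start on misread weights.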

Setup Details

  • OS: Linux
  • Python Version: 3.10+
  • Atropos Version: commit c20c852
  • Relevant Libraries/Versions: torch, vllm

Additional Context & Logs

This fix also removes the broad try-except blocks that were masking CUDA initialization failures, so those failures now surface immediately instead of being swallowed. This moves the framework to a "Fail-Fast" architecture for better reliability.
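
As a rough illustration of the difference, the sketch below contrasts a broad try-except with a fail-fast call. Everything here is a stand-in: `init_device`, `attach_masked`, `attach_fail_fast`, and the `cuda_ok` flag are hypothetical names, and no real CUDA call is made.

```python
from typing import Optional


def init_device(cuda_ok: bool) -> str:
    """Stand-in for CUDA initialization (hypothetical, no real CUDA call)."""
    if not cuda_ok:
        raise RuntimeError("CUDA initialization failed")
    return "cuda:0"


def attach_masked(cuda_ok: bool) -> Optional[str]:
    """Broad try-except: the failure is swallowed and becomes None."""
    try:
        return init_device(cuda_ok)
    except Exception:
        # The caller cannot tell why attachment did not happen.
        return None


def attach_fail_fast(cuda_ok: bool) -> str:
    """Fail-fast: the original error propagates to the caller unchanged."""
    return init_device(cuda_ok)
```

With the masked variant, a failed initialization only shows up later as an unrelated error on `None`; with the fail-fast variant, the run stops at the point of failure with the original message.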
