Hardening: Prevent API Server OOM via Rollout Backpressure #459

@RUFFY-369

Describe the Issue

The Atropos API server (atroposlib/api/server.py) has no mechanism to limit the size of its rollout trajectory queue.

In high-throughput setups where rollout workers generate data faster than the Trainer can consume it, trajectories accumulate indefinitely in the server's memory. RAM usage grows without bound until the system's OOM (Out of Memory) killer terminates the API server process.

Environment/API Details

  • Environment Class/Name: atroposlib/api/server.py
  • API Endpoint/Method Involved: /scored_data (submission of trajectories)

Steps to Reproduce

  1. Launch a training run with a high number of parallel rollout workers.
  2. Slow down the Trainer (e.g., by increasing gradient accumulation steps or using a very large model).
  3. Monitor the RAM usage of the API server process.
  4. Observe that memory increases linearly until the process crashes.
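
For step 3, one way to watch the server's memory on Linux is to poll the resident set size from `/proc/<pid>/status`. This is only a sketch: the helper names (`parse_vm_rss_kib`, `read_rss_kib`, `watch_rss`) are hypothetical and not part of Atropos, and the PID of the API server process has to be supplied by whoever runs it.

```python
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())


def watch_rss(pid, interval_s=5.0, samples=12):
    """Print the RSS of `pid` every `interval_s` seconds, `samples` times."""
    for _ in range(samples):
        print(f"pid={pid} rss_kib={read_rss_kib(pid)}")
        time.sleep(interval_s)
```

Calling `watch_rss(server_pid)` while the Trainer is throttled should show the steadily rising RSS described in step 4, assuming the queue is the dominant source of growth.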

Interaction Details (if applicable)

  • Expected Behavior:
    1. The API server should have a configurable MAX_QUEUE_SIZE.
    2. When the queue is full, the server should return an HTTP 503 Service Unavailable status to rollout workers, forcing them to wait or retry (Backpressure).
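
The expected behavior could look roughly like the sketch below. It is framework-agnostic and uses only the standard library; `BoundedRolloutQueue`, `handle_scored_data`, and the default of 1000 are hypothetical names and values chosen for illustration, not the existing code in atroposlib/api/server.py.

```python
import threading
from collections import deque
from http import HTTPStatus

# Hypothetical default; the issue only asks that the limit be configurable.
MAX_QUEUE_SIZE = 1000


class BoundedRolloutQueue:
    """Thread-safe FIFO that rejects new items once it holds max_size entries."""

    def __init__(self, max_size=MAX_QUEUE_SIZE):
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._items = deque()
        self._max_size = max_size
        self._lock = threading.Lock()

    def try_put(self, item):
        """Return True if the item was stored, False if the queue is full."""
        with self._lock:
            if len(self._items) >= self._max_size:
                return False
            self._items.append(item)
            return True

    def get_batch(self, n):
        """Remove and return up to n of the oldest items (Trainer side)."""
        with self._lock:
            count = min(n, len(self._items))
            return [self._items.popleft() for _ in range(count)]

    def __len__(self):
        with self._lock:
            return len(self._items)


def handle_scored_data(queue, payload):
    """Map a submission to a status code: 200 if accepted, 503 if the queue is full."""
    if queue.try_put(payload):
        return HTTPStatus.OK
    return HTTPStatus.SERVICE_UNAVAILABLE
```

In a FastAPI handler for /scored_data, the rejection branch would typically become `raise HTTPException(status_code=503, ...)`, possibly with a `Retry-After` header so workers know how long to wait. Rejecting at the door rather than dropping old items keeps already-accepted trajectories intact and pushes the wait onto the producers.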

Setup Details

  • OS: Linux
  • Python Version: 3.10+
  • Atropos Version: commit c20c852
  • Relevant Libraries/Versions: fastapi, uvicorn

Additional Context & Logs

Implementing backpressure ensures that the entire training system remains stable even when there is a mismatch between data generation and consumption rates.
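
Backpressure only helps if rollout workers react to the 503 by waiting instead of hammering the server. A minimal worker-side sketch, assuming a hypothetical `send(payload)` callable that performs the POST to /scored_data and returns the HTTP status code (the function name and all defaults below are illustrative):

```python
import random
import time


def submit_with_backoff(send, payload, max_attempts=5, base_delay_s=0.5,
                        max_delay_s=30.0, sleep=time.sleep):
    """Call send(payload) until it returns a status other than 503.

    Returns the last status code seen; this is 503 if every attempt
    was rejected because the server queue stayed full.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send(payload)
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay_s, with full jitter
            # so many workers do not retry in lockstep.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, delay))
    return status
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real delays. What a worker does after the final 503 (drop the batch, block, or buffer locally with its own cap) is a separate design decision not covered here.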
