Describe the Issue
The Atropos API server (atroposlib/api/server.py) lacked a mechanism to limit the size of its rollout trajectory queue.
In high-throughput environments where rollout workers generate data faster than the Trainer can process it, trajectories accumulated indefinitely in the server's memory. RAM usage grew without bound until the system's OOM (Out of Memory) killer terminated the API server process.
Environment/API Details
- Environment Class/Name: atroposlib/api/server.py
- API Endpoint/Method Involved: /scored_data (submission of trajectories)
Steps to Reproduce
- Launch a training run with a high number of parallel rollout workers.
- Slow down the Trainer (e.g., by increasing gradient accumulation steps or using a very large model).
- Monitor the RAM usage of the API server process.
- Observe that memory increases linearly until the process crashes.
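The linear growth described in these steps can be illustrated without Atropos at all. The following is a minimal simulation, assuming constant per-tick production and consumption rates; the function name and the rates are hypothetical and not taken from the server code.

```python
from collections import deque


def simulate_queue_growth(produce_per_tick, consume_per_tick, ticks):
    """Return the length of an unbounded queue after each tick."""
    queue = deque()
    lengths = []
    for tick in range(ticks):
        # Rollout workers push trajectories into the queue.
        for i in range(produce_per_tick):
            queue.append((tick, i))
        # The trainer drains as many items as it can handle this tick.
        for _ in range(min(consume_per_tick, len(queue))):
            queue.popleft()
        lengths.append(len(queue))
    return lengths


# Workers produce 8 trajectories per tick; the trainer consumes only 3.
lengths = simulate_queue_growth(produce_per_tick=8, consume_per_tick=3, ticks=5)
print(lengths)  # [5, 10, 15, 20, 25]
```

With nothing capping the queue, its length grows by the difference between the two rates on every tick, which matches the linear RAM increase seen on the real server.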
Interaction Details (if applicable)
- Expected Behavior:
  - The API server should have a configurable MAX_QUEUE_SIZE.
  - When the queue is full, the server should return an HTTP 503 Service Unavailable status to rollout workers, forcing them to wait or retry (backpressure).
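The expected behavior above could be sketched roughly as follows. This is a framework-agnostic illustration using only the standard library, not the actual server.py code: the class `BoundedTrajectoryQueue`, its methods, and the default value of `MAX_QUEUE_SIZE` are all assumptions made for the example.

```python
from collections import deque
from http import HTTPStatus

# Hypothetical default; the issue asks for this value to be configurable.
MAX_QUEUE_SIZE = 1000


class BoundedTrajectoryQueue:
    """Trajectory queue that rejects submissions once it is full."""

    def __init__(self, max_size=MAX_QUEUE_SIZE):
        self.max_size = max_size
        self._items = deque()

    def __len__(self):
        return len(self._items)

    def submit(self, trajectory):
        """Return the HTTP status a /scored_data handler would send back."""
        if len(self._items) >= self.max_size:
            # Queue is full: signal backpressure instead of growing further.
            return HTTPStatus.SERVICE_UNAVAILABLE
        self._items.append(trajectory)
        return HTTPStatus.OK

    def take_batch(self, n):
        """Remove and return up to n trajectories for the trainer."""
        batch = []
        while self._items and len(batch) < n:
            batch.append(self._items.popleft())
        return batch


queue = BoundedTrajectoryQueue(max_size=2)
statuses = [int(queue.submit({"id": i})) for i in range(3)]
print(statuses)  # [200, 200, 503]

# Once the trainer drains a batch, submissions are accepted again.
queue.take_batch(1)
print(int(queue.submit({"id": 3})))  # 200
```

In a FastAPI handler, the 503 branch would typically be expressed by raising an `HTTPException` with `status_code=503`, and rollout workers would retry after a delay. The sketch is not thread-safe; a real implementation would need to match the server's concurrency model.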
Setup Details
- OS: Linux
- Python Version: 3.10+
- Atropos Version: commit c20c852
- Relevant Libraries/Versions: fastapi, uvicorn
Additional Context & Logs
Implementing backpressure ensures that the entire training system remains stable even when there is a mismatch between data generation and consumption rates.