| Author | Gyubong Lee (gbl@lablup.com) |
|---|---|
| Status | Draft |
| Created | 2025-01-21 |
| Created-Version | 26.2.0 |
| Target-Version | 26.2.0 |
| Implemented-Version | |

This BEP defines a comprehensive request ID tracing system for Backend.AI's distributed architecture. It establishes standards for generating, propagating, and utilizing request IDs across all components (Manager, Agent, Storage-Proxy, App-Proxy, Web Server) to enable end-to-end request tracing, debugging, and observability.
Backend.AI is a distributed system where a single user request can traverse multiple components:
- A user request enters via the Manager's HTTP API
- The Manager may invoke RPCs on Agents, HTTP calls to Storage-Proxy, or communicate with App-Proxy
- Background tasks and event handlers process work asynchronously
- Each component generates logs independently
Without a unified request ID tracing system:
- Debugging is difficult: When an error occurs, correlating logs across components requires manual timestamp matching
- Root cause analysis is slow: Tracing the full path of a failed request requires examining logs from multiple services
- Observability is limited: Request flow visualizations and end-to-end latency measurements cannot be built
- Audit trails are incomplete: Related operations cannot be definitively linked across service boundaries
Request ID tracing is not systematically implemented:
- Manager has `request_id_middleware` for HTTP requests, but does not propagate the ID to other services
- Agent RPC does not support request_id
- Background tasks do not have request_id
- App-Proxy does not have request_id support
This leads to:
- No way to correlate logs across components
- Cannot trace the full path of a request through the system
- Context-based Propagation: Use Python's `contextvars` to carry the request ID through async call chains
- Automatic Generation: Generate request IDs at system entry points if not provided
- Standard Headers: Use consistent header names across HTTP and RPC protocols
- Extensible Structure: Design for future additions (correlation_id, trace_id, span_id for OpenTelemetry)
- Minimal Overhead: Request ID handling should have negligible performance impact
- Backward Compatibility: Support gradual migration without breaking existing deployments
| Component | Current | Proposed |
|---|---|---|
| Common Infrastructure | `request_id_middleware` exists | Add utilities, RPC headers model |
| Manager | HTTP middleware only | Propagate to Agent, Storage-Proxy, App-Proxy |
| Agent | No request_id support | Add RPC headers support |
| Storage-Proxy | HTTP middleware only | Add background task support |
| App-Proxy Coordinator | No request_id support | Add middleware and propagation |
| App-Proxy Worker | No request_id support | Add middleware and propagation |

Components not targeted by this proposal:

| Component | Notes |
|---|---|
| Account Manager | Separate service, may follow later |
| Web Server | Frontend proxy, may add middleware |
| WSProxy | WebSocket proxy, pass-through behavior |

    External Request                Background Task / Event Handler
            │                                      │
            ▼                                      ▼
      ┌───────────┐                    ┌─────────────────────────┐
      │   HTTP    │                    │ 1. Use propagated rid   │
      │ Middleware│                    │    (from event/context) │
      └─────┬─────┘                    │ 2. Auto-generate        │
            │                          │    (fallback only)      │
            │                          └───────────┬─────────────┘
            │                                      │
            └────────────────┬─────────────────────┘
                             ▼
         ┌───────────────────────────────────────┐
         │     _request_id_var (ContextVar)      │
         │                                       │
         │  current_request_id() → str | None    │
         │  bind_request_id(id) → context manager│
         └───────────────────┬───────────────────┘
                             │
           ┌─────────────────┼─────────────────┐
           │                 │                 │
           ▼                 ▼                 ▼
    ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
    │    Agent    │   │   Storage   │   │  App-Proxy  │
    │    (RPC)    │   │    Proxy    │   │   (HTTP)    │
    │             │   │   (HTTP)    │   │             │
    │ headers: {  │   │ X-Backend-  │   │ X-Backend-  │
    │  request_id │   │ Request-ID  │   │ Request-ID  │
    │ }           │   │             │   │             │
    └─────────────┘   └─────────────┘   └─────────────┘

Detailed specifications are organized into component-specific documents:
| Document | Description |
|---|---|
| manager.md | Manager component: entry points, outbound propagation |
| agent.md | Agent component: RPC headers design |
| storage-proxy.md | Storage-Proxy component |
| app-proxy.md | App-Proxy Coordinator and Worker |
| Component | Current State | Target State | Priority |
|---|---|---|---|
| Common | HTTP middleware only | Add utilities, RPC headers model | High |
| Manager | HTTP middleware only | Propagate to all outbound calls | High |
| Agent | No request_id support | RPC headers support | High |
| Storage-Proxy | HTTP middleware only | Add context binding for background tasks | Low |
| App-Proxy | No request_id support | Add middleware | Medium |
- Add decorator for background task/event handler context binding
- Enhance context utilities for request ID propagation
- Add RPC headers model to Agent (including agent_id for message routing)
- Implement new RPC APIs with built-in header support
- Update Manager to use new APIs with headers
- Update Manager to propagate request_id to App-Proxy (HTTP header)
- Add HTTP middleware for request ID to Coordinator and Worker
- Ensure Worker ↔ Coordinator propagation
- Apply context binding decorator to all background tasks and event handlers
- Add request_id to event system metadata
New RPC APIs are created with built-in header support, separate from the existing APIs. This avoids backward compatibility concerns entirely: legacy Agents continue using the existing APIs, while new Agents use the new APIs with header support.
HTTP services can adopt the request ID middleware incrementally, with no breaking changes.
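
One way to see why incremental HTTP adoption is safe: a receiving service can reuse the caller's ID when the header is present and generate its own when it is not, so upgraded and non-upgraded services can coexist. The sketch below is framework-agnostic; `resolve_request_id()` is a hypothetical helper, and ignoring blank header values is an assumption.

```python
import uuid
from collections.abc import Mapping

REQUEST_ID_HEADER = "X-Backend-Request-ID"


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Return the caller-supplied request ID, or a new UUID if none was sent."""
    wanted = REQUEST_ID_HEADER.lower()
    for name, value in headers.items():
        # HTTP header names are case-insensitive; blank values are ignored.
        if name.lower() == wanted and value.strip():
            return value.strip()
    # The caller has not been upgraded yet (or is an external client).
    return str(uuid.uuid4())
```

A middleware would call this once per request and bind the result to the context for the lifetime of the handler.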
Ideas under consideration for future iterations. These are not part of the current proposal.
Instead of plain UUID strings, use a structured type to carry origin metadata:

    from dataclasses import dataclass

    @dataclass
    class RequestIdData:
        request_id: str  # UUID string
        component_source: str  # Where generated: "manager", "agent", "storage-proxy"
        source_detail: str | None = None  # Optional detail: "event_handler", "background_task", "rpc_handler"

| component_source | source_detail (examples) | Description |
|---|---|---|
| `manager` | `None` | HTTP API request |
| `manager` | `event_handler` | Event handler processing |
| `manager` | `background_task` | Background task execution |
| `agent` | `rpc_handler` | Agent RPC handler |
| `storage-proxy` | `background_task` | Storage background task |

Benefits:
- Quickly identify request origin when debugging
- Filter logs by component and detail
- Structured data enables better querying
- No string parsing required (unlike prefix approach)
Considerations:
- Requires updating context utilities to handle structured data
- Serialization format for RPC/HTTP headers (JSON or separate headers)