## Description
It would be great to have native OpenTelemetry (OTel) support in vllm-mlx for production observability.
## Motivation
OpenTelemetry has become the industry standard for distributed tracing, metrics, and logs. Many organizations use OTel-compatible backends (Jaeger, Prometheus, Grafana, Datadog, etc.) for monitoring their ML inference services.
## Proposed Features
### 1. Metrics
- Request latency (P50, P95, P99)
- Tokens per second (input/output)
- Queue length / pending requests
- GPU/memory utilization
- Batch size distribution
- Time-to-first-token (TTFT)
### 2. Traces
- Request lifecycle (receive → queue → prefill → decode → response)
- Model inference spans
- Token generation steps
- Tool call execution (for MCP)
### 3. Logs (optional)
- Structured logging with trace correlation
## Configuration
Environment variables following OTel conventions:

```shell
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=vllm-mlx
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```
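One note: the exporter endpoint, service name, and sampler variables are read by the OpenTelemetry SDK itself, but `OTEL_ENABLED` is not a standard SDK variable (the spec's kill switch is `OTEL_SDK_DISABLED`), so vllm-mlx would have to read it explicitly. A minimal, stdlib-only gate might look like:

```python
# Sketch: a project-specific on/off switch; name and accepted values are
# a proposal, not an existing vllm-mlx flag.
import os

def otel_enabled() -> bool:
    """Return True when the hypothetical OTEL_ENABLED variable opts in."""
    return os.environ.get("OTEL_ENABLED", "false").strip().lower() in ("1", "true", "yes")

os.environ["OTEL_ENABLED"] = "true"
print(otel_enabled())  # True
```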
## Use Cases
- Production monitoring dashboards (Grafana)
- Debugging slow requests with distributed tracing
- Performance optimization with detailed metrics
- SLA/SLO monitoring
- Cost attribution per model/request
## References
- OpenTelemetry Python SDK
- vLLM's OTel implementation (upstream reference)
- Langfuse integration (alternative approach)
## Additional Context
This would complement existing monitoring approaches and enable seamless integration with modern observability stacks without vendor lock-in.
Happy to contribute or discuss implementation details!