Background
I previously submitted PR #130 to enhance the OpenTelemetry tracing implementation in OpenDerisk, which was reviewed and merged by @csunny. However, PR #133 subsequently simplified the opentelemetry.py file and removed most of the enhancements from PR #130.
I'd like to open this discussion to align on the desired direction for OpenTelemetry tracing in OpenDerisk, and propose a plan to incrementally re-introduce production-grade tracing features.
Current State
The current opentelemetry.py (after PR #133) provides basic span creation and export, but lacks several features that are important for production observability:
| Feature | Status | Impact |
| --- | --- | --- |
| Semantic Conventions (HTTP/GenAI) | ❌ Removed | Traces not queryable in Jaeger/Tempo by standard attributes |
| SpanKind mapping (SERVER/CLIENT/INTERNAL) | ❌ Removed | No distinction between HTTP entry, LLM calls, and internal ops |
| Span Events (errors, token usage) | ❌ Removed | No error details or LLM usage tracking in traces |
| Span Status (OK/ERROR) | ❌ Removed | Cannot filter failed spans |
| Stale span cleanup | ❌ Removed | Memory leak risk from orphaned spans |
| Multi-exporter support (gRPC/HTTP/Console) | ❌ Removed | Only gRPC exporter available |
| Graceful degradation | ❌ Removed | Hard ImportError if opentelemetry not installed |
| Auto-discovery constructor compatibility | ❌ Removed | Incompatible with model_scan mechanism |
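For readers less familiar with the Span Status and Span Events rows, here is a rough sketch of what they involve. It is not the code from PR #130. The helper names (`token_usage_event`, `finish_span`) and the event name `llm.token_usage` are assumptions made for illustration. The `Status`, `StatusCode`, `add_event`, `record_exception`, and `set_status` calls are from the OpenTelemetry Python API.

```python
from typing import Any, Dict, Optional, Tuple

try:
    from opentelemetry.trace import Status, StatusCode
    OTEL_AVAILABLE = True
except ImportError:  # without the package, finish_span becomes a no-op
    OTEL_AVAILABLE = False


def token_usage_event(input_tokens: int, output_tokens: int) -> Tuple[str, Dict[str, int]]:
    """Return (event_name, attributes). The event name is hypothetical."""
    return "llm.token_usage", {
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def finish_span(
    span: Any,
    error: Optional[BaseException] = None,
    usage: Optional[Tuple[int, int]] = None,
) -> None:
    """Attach usage and error details, set the final status, then end the span."""
    if not OTEL_AVAILABLE or span is None:
        return
    if usage is not None:
        name, attrs = token_usage_event(*usage)
        span.add_event(name, attributes=attrs)
    if error is not None:
        # record_exception adds an "exception" event with type, message, stacktrace.
        span.record_exception(error)
        span.set_status(Status(StatusCode.ERROR, str(error)))
    else:
        span.set_status(Status(StatusCode.OK))
    span.end()
```

With an ERROR status set this way, a backend such as Jaeger or Tempo can filter for failed spans, which is the gap the Span Status row describes.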
Proposal
I propose re-introducing these features incrementally through smaller, focused PRs, making each change easier to review and discuss:
Phase 1: Foundation (Small PRs)
Graceful degradation — Silently disable when opentelemetry is not installed, instead of crashing
Constructor compatibility — Support both model_scan(system_app, tracer_parameters) and legacy (service_name) signatures
Rich Resource attributes — Add service.namespace, deployment.environment, host.name, etc.
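The three Phase 1 items could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the actual OpenDerisk code. The class name `TracerStorageSketch`, the default service name, and the environment variable names are hypothetical. The real `model_scan` contract may differ from the `(system_app, tracer_parameters)` shape assumed here.

```python
import os
import socket
from typing import Any, Dict, Optional

try:
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    OTEL_AVAILABLE = True
except ImportError:  # graceful degradation: tracing is simply disabled
    OTEL_AVAILABLE = False


def build_resource_attributes(service_name: str) -> Dict[str, str]:
    """Resource attributes named in the proposal; env var names are hypothetical."""
    return {
        "service.name": service_name,
        "service.namespace": os.getenv("OTEL_SERVICE_NAMESPACE", "openderisk"),
        "deployment.environment": os.getenv("DEPLOY_ENV", "development"),
        "host.name": socket.gethostname(),
    }


class TracerStorageSketch:
    """Accepts both (system_app, tracer_parameters) and legacy (service_name)."""

    def __init__(
        self,
        system_app: Any = None,
        tracer_parameters: Any = None,
        service_name: Optional[str] = None,
    ) -> None:
        # Legacy call style: TracerStorageSketch("my-service")
        if isinstance(system_app, str) and service_name is None:
            system_app, service_name = None, system_app
        self.system_app = system_app
        self.tracer_parameters = tracer_parameters
        self.service_name = service_name or "openderisk"
        self.enabled = OTEL_AVAILABLE
        self.provider = None
        if self.enabled:
            resource = Resource.create(build_resource_attributes(self.service_name))
            # Not registered globally here; a real implementation would also
            # attach span processors and exporters.
            self.provider = TracerProvider(resource=resource)
```

Both `TracerStorageSketch("my-service")` and `TracerStorageSketch(system_app, tracer_parameters)` construct without raising, whether or not the opentelemetry packages are installed.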
Phase 2: Standards Compliance
SpanKind mapping — SERVER for HTTP entry (Webserver), CLIENT for LLM calls (ModelWorker), INTERNAL for others
Semantic Conventions — Standard OTel attributes for HTTP (http.request.method, http.route) and GenAI (gen_ai.request.model, gen_ai.usage.*)
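To make the two Phase 2 items concrete, the mapping and attribute names could be expressed along these lines. The run-name strings (`"Webserver"`, `"ModelWorker"`) follow the wording above and are an assumption about how OpenDerisk labels its spans. The helper names are hypothetical. `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` are the token-count attributes defined in the OpenTelemetry GenAI semantic conventions.

```python
from typing import Any, Dict, Optional

# Assumed run names; anything not listed falls back to INTERNAL.
SPAN_KIND_BY_RUN_NAME = {
    "Webserver": "SERVER",    # HTTP entry point
    "ModelWorker": "CLIENT",  # outbound LLM call
}


def span_kind_name(run_name: Optional[str]) -> str:
    """Map a run name to the name of an OpenTelemetry SpanKind member."""
    return SPAN_KIND_BY_RUN_NAME.get(run_name or "", "INTERNAL")


def to_otel_span_kind(run_name: Optional[str]) -> Any:
    """Convert to the real enum; imported lazily so this module loads without OTel."""
    from opentelemetry.trace import SpanKind
    return SpanKind[span_kind_name(run_name)]


def http_attributes(method: str, route: str) -> Dict[str, str]:
    """HTTP semantic-convention attributes for a server span."""
    return {"http.request.method": method.upper(), "http.route": route}


def gen_ai_attributes(
    model: str,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """GenAI semantic-convention attributes for an LLM call span."""
    attrs: Dict[str, Any] = {"gen_ai.request.model": model}
    if input_tokens is not None:
        attrs["gen_ai.usage.input_tokens"] = input_tokens
    if output_tokens is not None:
        attrs["gen_ai.usage.output_tokens"] = output_tokens
    return attrs
```

The resulting dictionaries can be passed as `attributes=` when starting a span, together with `kind=to_otel_span_kind(run_name)`.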
Phase 3: Production Hardening
Stale span cleanup — Daemon thread to prevent memory leaks (configurable TTL)
Multi-exporter support — OTLP/gRPC (default), OTLP/HTTP, Console (dev mode)
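A sketch of how stale span cleanup and exporter selection might look is given below. It is an illustration under assumptions, not the removed implementation. `StaleSpanRegistry`, `create_exporter`, the default TTL, and the cleanup interval are all hypothetical. The exporter import paths are the ones shipped in the opentelemetry-sdk and opentelemetry-exporter-otlp packages.

```python
import threading
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


class StaleSpanRegistry:
    """Tracks open spans and evicts those older than a configurable TTL."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        on_evict: Optional[Callable[[Any], None]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._on_evict = on_evict  # e.g. lambda span: span.end()
        self._clock = clock
        self._lock = threading.Lock()
        self._spans: Dict[str, Tuple[float, Any]] = {}

    def add(self, span_id: str, span: Any) -> None:
        with self._lock:
            self._spans[span_id] = (self._clock(), span)

    def pop(self, span_id: str) -> Any:
        with self._lock:
            entry = self._spans.pop(span_id, None)
        return entry[1] if entry else None

    def evict_stale(self) -> List[str]:
        """Remove spans older than the TTL and return their ids."""
        cutoff = self._clock() - self._ttl
        with self._lock:
            stale = [sid for sid, (started, _) in self._spans.items() if started < cutoff]
            evicted = [(sid, self._spans.pop(sid)[1]) for sid in stale]
        for _, span in evicted:  # callbacks run outside the lock
            if self._on_evict is not None:
                self._on_evict(span)
        return [sid for sid, _ in evicted]

    def start_cleanup_thread(self, interval_seconds: float = 60.0) -> threading.Event:
        """Start a daemon thread; set the returned event to stop it."""
        stop = threading.Event()

        def loop() -> None:
            while not stop.wait(interval_seconds):
                self.evict_stale()

        threading.Thread(target=loop, name="stale-span-cleanup", daemon=True).start()
        return stop


def create_exporter(kind: str = "grpc", endpoint: Optional[str] = None) -> Any:
    """Pick an exporter by name; imports are lazy so unused ones need not be installed."""
    if kind not in ("grpc", "http", "console"):
        raise ValueError(f"unknown exporter kind: {kind!r}")
    if kind == "console":
        from opentelemetry.sdk.trace.export import ConsoleSpanExporter
        return ConsoleSpanExporter()
    if kind == "http":
        from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    else:
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    return OTLPSpanExporter(endpoint=endpoint)
```

Because the cleanup thread is a daemon, it does not block interpreter shutdown. The injectable `clock` parameter exists only to make the eviction logic easy to test.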
Why This Matters
For traces to be useful in tools like Jaeger, Grafana Tempo, or Datadog, they must follow OpenTelemetry Semantic Conventions. Without proper SpanKind, semantic attributes, and error handling, traces are hard to query, filter, and visualize.
This is especially important for OpenDerisk's multi-agent architecture, where understanding the full request lifecycle across HTTP → Agent → LLM → Tool calls is critical for debugging and performance optimization.
Reference
Questions for Maintainers
I'm happy to discuss and adjust the plan based on the team's preferences. Looking forward to your feedback! 🙏