[Discussion] OpenTelemetry Tracing Enhancement - Semantic Conventions, SpanKind, and Production-Grade Features #172

### Background

I previously submitted PR #130 to enhance the OpenTelemetry tracing implementation in OpenDerisk, which was reviewed and merged by @csunny. However, PR #133 subsequently simplified the `opentelemetry.py` file and removed most of the enhancements from PR #130.

I'd like to open this discussion to align on the desired direction for OpenTelemetry tracing in OpenDerisk, and propose a plan to incrementally re-introduce production-grade tracing features.

### Current State

The current `opentelemetry.py` (after PR #133) provides basic span creation and export, but lacks several features that are important for production observability:

| Feature | Status | Impact |
|---------|--------|--------|
| Semantic Conventions (HTTP/GenAI) | ❌ Removed | Traces not queryable in Jaeger/Tempo by standard attributes |
| SpanKind mapping (SERVER/CLIENT/INTERNAL) | ❌ Removed | No distinction between HTTP entry, LLM calls, and internal ops |
| Span Events (errors, token usage) | ❌ Removed | No error details or LLM usage tracking in traces |
| Span Status (OK/ERROR) | ❌ Removed | Cannot filter failed spans |
| Stale span cleanup | ❌ Removed | Memory leak risk from orphaned spans |
| Multi-exporter support (gRPC/HTTP/Console) | ❌ Removed | Only gRPC exporter available |
| Graceful degradation | ❌ Removed | Hard ImportError if opentelemetry not installed |
| Auto-discovery constructor compatibility | ❌ Removed | Incompatible with model_scan mechanism |

### Proposal

I propose re-introducing these features **incrementally** through smaller, focused PRs, making each change easier to review and discuss:

#### Phase 1: Foundation (Small PRs)
1. **Graceful degradation** — Silently disable when `opentelemetry` is not installed, instead of crashing
2. **Constructor compatibility** — Support both `model_scan(system_app, tracer_parameters)` and legacy `(service_name)` signatures
3. **Rich Resource attributes** — Add `service.namespace`, `deployment.environment`, `host.name`, etc.
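
As a rough sketch of what Phase 1 could look like (not the code from PR #130), the block below guards the optional import, builds the Resource attributes as a plain dict, and accepts both constructor styles. The names `SpanStorageSketch`, `build_resource_attributes`, and `create_tracer_provider`, the `"derisk"` and `"development"` defaults, and the assumption that `tracer_parameters` exposes a `service_name` attribute are all hypothetical.

```python
import socket
from typing import Any, Dict, Optional

try:
    # Optional dependency: these imports fail cleanly when the SDK is absent.
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False


def build_resource_attributes(
    service_name: str,
    namespace: str = "derisk",  # placeholder default, not taken from the project
    environment: str = "development",  # placeholder default
) -> Dict[str, str]:
    """Collect Resource attributes as a plain dict."""
    return {
        "service.name": service_name,
        "service.namespace": namespace,
        "deployment.environment": environment,
        "host.name": socket.gethostname(),
    }


def create_tracer_provider(service_name: str) -> Optional[Any]:
    """Return a TracerProvider, or None so callers can skip tracing."""
    if not OTEL_AVAILABLE:
        return None
    resource = Resource.create(build_resource_attributes(service_name))
    return TracerProvider(resource=resource)


class SpanStorageSketch:
    """Hypothetical storage class accepting both constructor styles."""

    def __init__(
        self,
        system_app: Any = None,
        tracer_parameters: Any = None,
        service_name: Optional[str] = None,
    ):
        # Legacy style: SpanStorageSketch("my-service") passes a string first.
        if isinstance(system_app, str):
            service_name, system_app = system_app, None
        # Auto-discovery style: read the name from the parameters object,
        # assuming it has a `service_name` attribute.
        if service_name is None:
            service_name = getattr(tracer_parameters, "service_name", None)
        self.system_app = system_app
        self.service_name = service_name or "derisk"
        # None when opentelemetry is not installed; tracing is then a no-op.
        self.provider = create_tracer_provider(self.service_name)
```

With this shape, callers only need to check `provider is None` before exporting, so a missing `opentelemetry` package disables tracing instead of raising at import time.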

#### Phase 2: Standards Compliance
4. **SpanKind mapping** — SERVER for HTTP entry (Webserver), CLIENT for LLM calls (ModelWorker), INTERNAL for others
5. **Semantic Conventions** — Standard OTel attributes for HTTP (`http.request.method`, `http.route`) and GenAI (`gen_ai.request.model`, `gen_ai.usage.*`)
6. **Span Status & Events** — OK/ERROR status, exception events, token usage events
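
To make the Phase 2 items concrete, here is a minimal, dependency-free sketch of the mapping logic. It assumes spans can be classified by a component name such as `"Webserver"` or `"ModelWorker"`; the helper names and that lookup key are illustrative, not the existing OpenDerisk API. The attribute keys follow the HTTP and GenAI semantic conventions linked under Reference.

```python
from typing import Any, Dict, Optional

# Assumption: each span carries a component name; anything unlisted is INTERNAL.
_KIND_BY_COMPONENT = {"Webserver": "SERVER", "ModelWorker": "CLIENT"}


def span_kind_name(component: Optional[str]) -> str:
    """Return the SpanKind member name for a component."""
    return _KIND_BY_COMPONENT.get(component or "", "INTERNAL")


def http_attributes(method: str, route: str, status_code: int) -> Dict[str, Any]:
    """Build HTTP semantic-convention attributes for a server span."""
    return {
        "http.request.method": method.upper(),
        "http.route": route,
        "http.response.status_code": status_code,
    }


def genai_attributes(model: str, input_tokens: int, output_tokens: int) -> Dict[str, Any]:
    """Build GenAI semantic-convention attributes for an LLM call span."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def status_for(error: Optional[BaseException]) -> Dict[str, Optional[str]]:
    """Decide the span status: OK without an error, ERROR with a description."""
    if error is None:
        return {"code": "OK", "description": None}
    return {"code": "ERROR", "description": f"{type(error).__name__}: {error}"}
```

When `opentelemetry` is installed, these results would plug into the real API roughly as `getattr(SpanKind, span_kind_name(component))`, `span.set_attributes(...)`, `span.set_status(Status(StatusCode.ERROR, description))`, and `span.record_exception(exc)`.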

#### Phase 3: Production Hardening
7. **Stale span cleanup** — Daemon thread to prevent memory leaks (configurable TTL)
8. **Multi-exporter support** — OTLP/gRPC (default), OTLP/HTTP, Console (dev mode)
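
The Phase 3 items could be sketched as follows. `SpanRegistry`, its default TTL and interval, and `load_exporter_class` are hypothetical names chosen for illustration; the exporter module paths are the ones published by the `opentelemetry-exporter-otlp` and `opentelemetry-sdk` packages, and they are imported lazily so the sketch itself has no hard dependency.

```python
import importlib
import threading
import time
from typing import Any, Dict, List, Optional, Tuple

# Exporter kind -> (module path, class name), imported only when requested.
EXPORTER_PATHS = {
    "grpc": ("opentelemetry.exporter.otlp.proto.grpc.trace_exporter", "OTLPSpanExporter"),
    "http": ("opentelemetry.exporter.otlp.proto.http.trace_exporter", "OTLPSpanExporter"),
    "console": ("opentelemetry.sdk.trace.export", "ConsoleSpanExporter"),
}


def load_exporter_class(kind: str = "grpc") -> Any:
    """Import and return the exporter class; raises KeyError for unknown kinds."""
    module_path, class_name = EXPORTER_PATHS[kind.lower()]
    return getattr(importlib.import_module(module_path), class_name)


class SpanRegistry:
    """Tracks open spans and evicts those older than a configurable TTL."""

    def __init__(self, ttl_seconds: float = 300.0, interval_seconds: float = 60.0):
        self._ttl = ttl_seconds
        self._interval = interval_seconds
        self._spans: Dict[str, Tuple[Any, float]] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def add(self, span_id: str, span: Any, now: Optional[float] = None) -> None:
        """Register an open span together with its start time."""
        with self._lock:
            self._spans[span_id] = (span, time.monotonic() if now is None else now)

    def pop(self, span_id: str) -> Optional[Any]:
        """Remove and return a span when it ends normally, or None if unknown."""
        with self._lock:
            entry = self._spans.pop(span_id, None)
        return entry[0] if entry else None

    def evict_stale(self, now: Optional[float] = None) -> List[str]:
        """End and drop spans older than the TTL; return their ids."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [sid for sid, (_, t) in self._spans.items() if now - t > self._ttl]
            evicted = [self._spans.pop(sid)[0] for sid in stale]
        for span in evicted:  # end spans outside the lock
            end = getattr(span, "end", None)
            if callable(end):
                end()
        return stale

    def start(self) -> None:
        """Start the daemon cleanup thread (idempotent)."""
        if self._thread is None:
            self._thread = threading.Thread(target=self._run, name="span-cleanup", daemon=True)
            self._thread.start()

    def stop(self) -> None:
        """Ask the cleanup loop to exit after its current wait."""
        self._stop.set()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop runs until stop().
        while not self._stop.wait(self._interval):
            self.evict_stale()
```

Because the thread is a daemon, it does not block interpreter shutdown; `evict_stale` can also be called directly, which keeps the eviction logic testable without waiting on real time.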

### Why This Matters

For traces to be useful in tools like **Jaeger**, **Grafana Tempo**, or **Datadog**, they must follow [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/). Without proper SpanKind, semantic attributes, and error handling, traces are hard to query, filter, and visualize.

This is especially important for OpenDerisk's multi-agent architecture, where understanding the full request lifecycle across **HTTP → Agent → LLM → Tool** calls is critical for debugging and performance optimization.

### Reference

- **PR #130** (merged; most of it later removed by PR #133): [#130](https://github.com/derisk-ai/OpenDerisk/pull/130). Full implementation with all features.
- **PR #133** (simplified): [#133](https://github.com/derisk-ai/OpenDerisk/pull/133). Removed most of the PR #130 enhancements.
- **PR #128** (merged): Prometheus metrics, which complement tracing for complete observability.
- **OTel HTTP Semantic Conventions**: https://opentelemetry.io/docs/specs/semconv/http/
- **OTel GenAI Semantic Conventions**: https://opentelemetry.io/docs/specs/semconv/gen-ai/

### Questions for Maintainers

1. Is the incremental approach acceptable, or do you prefer a single comprehensive PR?
2. Are there specific features from PR #130 that you'd like to keep simplified?
3. Should advanced features (multi-exporter, stale cleanup) be behind a configuration flag?

I'm happy to discuss and adjust the plan based on the team's preferences. Looking forward to your feedback! 🙏
