@goyamegh goyamegh commented Dec 18, 2025

Summary

Add OpenTelemetry (OTEL) instrumentation to the experimental AG-UI chat endpoint for observability and distributed tracing. Traces are exported to AWS OSIS (OpenSearch Ingestion Service) using SigV4 authentication.

Motivation

  • Enable observability for the AG-UI chat endpoint to monitor agent runs and tool executions
  • Track tool call durations and success/failure rates
  • Support correlation across distributed traces using Gen AI semantic conventions
  • Integrate with AWS OpenSearch for trace storage and visualization

Key changes:

  • New experimental/otel module with tracer initialization and attributes
  • Instrument AG-UI /api/agui/chat endpoint with root spans and tool spans
  • Support AWS SigV4 authentication for OSIS endpoints
  • Follow Gen AI semantic conventions for span attributes
  • Add unit tests and integration tests with OpenSearch verification

Environment variables:

  • OTEL_ENABLED: Set to 'true' to enable tracing
  • OTEL_EXPORTER_OTLP_ENDPOINT: OSIS pipeline URL
  • OTEL_AWS_PROFILE: Optional AWS profile for OSIS auth
  • OTEL_AWS_REGION: Optional region (auto-detected from endpoint)

Changes

New Module: experimental/otel/

  • tracing.py: Tracer initialization with AWS SigV4 auth for OSIS endpoints
  • attributes.py: Gen AI semantic convention attribute constants (REQUEST_ID, TOOL_NAME, etc.)
  • __init__.py: Package exports

AG-UI Integration (server-agui.py)

  • Root span (agent.run) created for each chat request with correlation attributes
  • Child spans (tool.execute) for each tool call with duration tracking
  • Error recording on spans when exceptions occur

Tests

  • test_otel.py: Unit tests (no external dependencies required)
  • test_otel_integration.py: Integration tests with real OSIS endpoint + OpenSearch verification

Dependencies

  • Added opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http

Configuration

Environment variables (all optional - tracing disabled by default):

Variable | Description
OTEL_ENABLED | Set to true to enable tracing
OTEL_EXPORTER_OTLP_ENDPOINT | OSIS pipeline URL (e.g., https://xxx.us-east-1.osis.amazonaws.com/path/v1/traces)
OTEL_AWS_PROFILE | AWS profile for OSIS authentication (optional)
OTEL_AWS_REGION | AWS region (optional, auto-detected from endpoint)
OTEL_SERVICE_NAME | Service name in traces (default: holmesgpt)
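
The region auto-detection mentioned for OTEL_AWS_REGION could look roughly like the sketch below. It assumes OSIS hostnames have the shape `<pipeline>.<region>.osis.amazonaws.com` and falls back to us-east-1, matching the unit test output further down. The name `extract_region` and the regular expression are illustrative assumptions, not the PR's actual `_extract_region_from_endpoint` implementation.

```python
import re
from urllib.parse import urlparse

DEFAULT_REGION = "us-east-1"  # fallback reported by the unit test output

# Assumed hostname shape: <pipeline>.<region>.osis.amazonaws.com
_OSIS_HOST = re.compile(r"^[^.]+\.([a-z0-9-]+)\.osis\.amazonaws\.com$")


def extract_region(endpoint: str) -> str:
    """Return the AWS region embedded in an OSIS endpoint, or the default."""
    try:
        host = urlparse(endpoint).hostname or ""
    except ValueError:
        # urlparse rejects some malformed URLs (e.g. broken IPv6 literals).
        return DEFAULT_REGION
    match = _OSIS_HOST.match(host)
    return match.group(1) if match else DEFAULT_REGION
```

An explicit OTEL_AWS_REGION would presumably take precedence over the detected value; that precedence is not shown here.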

Testing

Unit Tests

poetry run python experimental/otel/test_otel.py
Output:
============================================================
OTEL Module Verification
============================================================

Testing imports...
  ✅ All imports successful

Testing truncate function...
  ✅ truncate(None) returns ''
  ✅ truncate('short string') returns unchanged
  ✅ truncate(string of 8192 chars) returns unchanged
  ✅ truncate(string of 8292 chars) truncates correctly
  ✅ truncate('') returns ''

Testing region extraction...
  ✅ Extracts 'us-west-2' from valid OSIS endpoint
  ✅ Extracts 'eu-central-1' from valid OSIS endpoint
  ✅ Falls back to 'us-east-1' for non-OSIS endpoint
  ✅ Falls back to 'us-east-1' for malformed URL
  ✅ Falls back to 'us-east-1' for empty string

Testing tracer initialization (disabled)...
  ✅ init_otel_tracer() returns False when OTEL_ENABLED not set
  ✅ get_tracer() returns a no-op tracer when disabled
  ✅ No-op tracer can create and end spans without error

Testing set_span_error...
  ✅ set_span_error() handles exceptions without raising

Testing attribute constants...
  ✅ Gen AI standard attributes defined correctly
  ✅ Tool attributes defined correctly
  ✅ Span names defined correctly

Testing server-agui.py compatibility...
  ✅ All attributes used by server-agui.py are available

============================================================
✅ All 7 tests passed!
============================================================

Integration Tests (with real OSIS endpoint)

  export OTEL_ENABLED=true
  export OTEL_EXPORTER_OTLP_ENDPOINT=https://xxx.us-east-1.osis.amazonaws.com/path/v1/traces
  export OTEL_AWS_PROFILE=your-profile
  export OPENSEARCH_ENDPOINT=https://your-opensearch.on.aws
  export OPENSEARCH_USERNAME=admin
  export OPENSEARCH_PASSWORD=your-password

  poetry run python experimental/otel/test_otel_integration.py
  Output:
  ============================================================
  OTEL Integration Test with OSIS + OpenSearch Verification
  ============================================================

  ✅ Environment configured correctly
  ✅ Tracer initialized successfully
  ✅ All spans created successfully
  ✅ Root span ended
  ✅ Error span test completed
  ✅ Spans flushed successfully

  Test: Verify Traces in OpenSearch
✅ Found 5 traces in index 'otel-v1-apm-span-*'!

Sample trace data:
  - Trace ID: 622865aacd67a42a691020935a1be700
    Span Name: agent.run
    Service: holmesgpt
  - Trace ID: a9eb3f375f5ca2688ceeec5b1500ff94
    Span Name: tool.execute
    Service: holmesgpt

  ============================================================
  ✅ All 4 tests passed!
  ============================================================

  Pre-commit Checks

  All checks pass:
  poetry-check.............................................................Passed
  ruff.....................................................................Passed
  ruff-format..............................................................Passed
  detect private key.......................................................Passed
  fix end of files.........................................................Passed
  mypy.....................................................................Passed

Notes

  • Tracing is disabled by default - no impact unless OTEL_ENABLED=true
  • Supports separate AWS profiles for OSIS vs other services (e.g., Bedrock)
  • Thread safety warning added for env var manipulation during startup
  • No hardcoded credentials - all sensitive config via environment variables

Summary by CodeRabbit

  • New Features

    • Added OpenTelemetry distributed tracing for AI agent executions, enabling monitoring of agent runs and tool invocations with correlation tracking.
    • Captures tool execution metrics and error details for enhanced observability.
  • Tests

    • Added comprehensive test coverage for tracing functionality and OpenSearch integration.
  • Chores

    • Added OpenTelemetry dependencies.

Add OTEL instrumentation to the experimental AG-UI chat endpoint for
observability and distributed tracing. Traces are exported to AWS OSIS
(OpenSearch Ingestion Service) using SigV4 authentication.

Signed-off-by: Megha Goyal <[email protected]>

CLAassistant commented Dec 18, 2025

CLA assistant check
All committers have signed the CLA.

coderabbitai bot commented Dec 18, 2025

Walkthrough

Introduces OpenTelemetry tracing infrastructure into the experimental AG-UI server. Adds tracer initialization with optional AWS SigV4 authentication for OSIS endpoints, semantic attribute constants, and integrates tracing into the server's streaming event generator to capture agent runs, tool executions, and error states.

Changes

Cohort / File(s) | Change Summary
OTEL Core Infrastructure (experimental/otel/tracing.py, experimental/otel/attributes.py, experimental/otel/__init__.py) | Implements OpenTelemetry tracer management with OTLP HTTP exporter and AWS SigV4 signing support. Defines semantic attribute constants (correlation IDs, token counts, tool metadata, span names). Provides public API exports for tracer lifecycle and attribute utilities.
AG-UI Server Integration (experimental/ag-ui/server-agui.py) | Integrates OTEL tracing into streaming event generator. Creates root span per chat run with correlation attributes. Records child spans for tool invocations with durations and metadata. Tracks tool call counts and handles successful completion and error states.
Test Suites (experimental/otel/test_otel.py, experimental/otel/test_otel_integration.py) | Unit tests validating imports, truncation logic, tracer initialization, and attribute constants. Integration tests verify span creation, error handling, OpenSearch verification, and tracer flushing.
Dependencies (pyproject.toml) | Adds three OpenTelemetry packages: opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp-proto-http (^1.20.0).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Server as AG-UI Server
    participant Tracer as OTEL Tracer
    participant Exporter as OTLP/OSIS Backend
    
    Client->>Server: Chat request
    activate Server
    Server->>Tracer: init_otel_tracer()
    activate Tracer
    Tracer->>Exporter: Configure OTLP HTTP exporter
    Tracer-->>Server: Tracer ready
    deactivate Tracer
    
    Server->>Tracer: start_root_span(agent_run)
    activate Tracer
    note over Tracer: Set REQUEST_ID, CONVERSATION_ID<br/>AGENT_TYPE, MODEL attributes
    
    loop For each tool invocation
        Server->>Tracer: start_child_span(tool_execute)
        note over Tracer: Set TOOL_NAME, TOOL_CALL_ID<br/>Record TOOL_DURATION_MS<br/>Record TOOL_OUTPUT
        Server->>Tracer: end_span()
        note over Tracer: Increment tool_call_count
    end
    
    alt Success
        Server->>Tracer: mark_root_span_success()
        note over Tracer: Set RESULT_SUCCESS = true
    else Error
        Server->>Tracer: set_span_error(exception)
        note over Tracer: Record ERROR_MESSAGE<br/>ERROR_TYPE
    end
    
    Server->>Tracer: end_root_span()
    deactivate Tracer
    Tracer->>Exporter: flush_spans()
    note over Exporter: Spans exported to backend
    
    Server-->>Client: Chat response
    deactivate Server

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • AWS SigV4 authentication logic in tracing.py — Session creation, credential handling, and fallback to unauthenticated export
  • Span lifecycle management in server-agui.py — Proper handling of root and child spans, error propagation, and finally-block cleanup in async streaming context
  • Attribute truncation and size limits — Validation that truncate() function properly handles edge cases and prevents OTEL payload errors
  • Integration test coverage — Verify that test_otel_integration.py correctly validates real traces in OpenSearch and handles retries appropriately

Pre-merge checks

✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding OpenTelemetry tracing to the AG-UI endpoint, which is the primary objective of this PR.
Docstring Coverage | ✅ Passed | Docstring coverage is 87.88% which is sufficient. The required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (9)
experimental/otel/tracing.py (2)

158-159: Debug logging may expose sensitive headers.

The Authorization header (containing AWS credentials signature) will be logged. While debug level, this could leak to log files in production if log levels are misconfigured.

🔎 Consider masking sensitive headers:
-        # Debug: show signed headers
-        logging.debug(f"[OTEL SIGN] Headers after signing: {dict(aws_request.headers)}")
+        # Debug: show signed headers (mask Authorization)
+        debug_headers = {k: v if k.lower() != 'authorization' else '[MASKED]' 
+                        for k, v in aws_request.headers.items()}
+        logging.debug(f"[OTEL SIGN] Headers after signing: {debug_headers}")

351-360: Use attribute constants for consistency.

set_span_error uses hardcoded "error.type" and "error.message" while attributes.py defines ERROR_TYPE and ERROR_MESSAGE constants. Using the constants ensures consistency.

🔎 Apply this diff:
+from experimental.otel.attributes import ERROR_TYPE, ERROR_MESSAGE
+
 def set_span_error(span: trace.Span, error: Exception) -> None:
     """Set error status and attributes on a span.

     Args:
         span: The span to set error on
         error: The exception that occurred
     """
     span.set_status(Status(StatusCode.ERROR, str(error)))
-    span.set_attribute("error.type", type(error).__name__)
-    span.set_attribute("error.message", str(error))
+    span.set_attribute(ERROR_TYPE, type(error).__name__)
+    span.set_attribute(ERROR_MESSAGE, str(error))
experimental/otel/attributes.py (1)

55-72: Truncation result exceeds max_size.

The function truncates to max_size characters and then appends "...[TRUNCATED]" (14 characters), so the result is max_size + 14 characters long. If the goal is to enforce a strict size limit, account for the marker length.

🔎 Strict size enforcement:
+TRUNCATION_MARKER = "...[TRUNCATED]"
+
 def truncate(value: Optional[str], max_size: int = MAX_ATTRIBUTE_SIZE) -> str:
     if value is None:
         return ""
     if len(value) <= max_size:
         return value
-    return value[:max_size] + "...[TRUNCATED]"
+    return value[:max_size - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER
experimental/ag-ui/server-agui.py (1)

21-29: Consider proper package structure over sys.path manipulation.

The sys.path.insert is fragile and can cause import issues. Since this is experimental code, it's acceptable, but consider adding the experimental directory as a proper package or using relative imports when stabilizing.

experimental/otel/test_otel.py (2)

1-9: Consider converting to pytest format.

The script works for quick verification but integrating with pytest would provide better CI/CD integration, fixtures for setup/teardown, and consistent test discovery.


113-117: Direct module state manipulation is fragile.

Modifying tracing._initialized and _tracer_provider directly works but is brittle. When converting to pytest, consider using importlib.reload() or providing a proper reset() function in the tracing module for testing.

experimental/otel/__init__.py (1)

1-51: Clean package API surface design.

This module correctly aggregates and re-exports the public OTEL instrumentation API. The explicit __all__ makes the public interface clear.

Optionally, consider sorting __all__ alphabetically for easier maintenance as the list grows (per static analysis hint RUF022).

experimental/otel/test_otel_integration.py (2)

111-116: Fragile coupling to private module state.

Directly resetting tracing._initialized and tracing._tracer_provider couples this test to internal implementation details. If the module internals change (e.g., renamed variables, different state management), this test will break silently or require updates.

Consider exposing a reset_tracer() test utility in the tracing module if this pattern is needed elsewhere, or document this coupling with a comment.


255-261: Direct access to private _tracer_provider.

Similar to test_tracer_initialization, this accesses private module state. Consider whether shutdown_otel_tracer() (which is publicly exported) could be used, or expose a force_flush() wrapper in the public API.

The 10-second timeout for force_flush is reasonable for integration tests.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f1d3a3 and 28dd21c.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • experimental/ag-ui/server-agui.py (8 hunks)
  • experimental/otel/__init__.py (1 hunks)
  • experimental/otel/attributes.py (1 hunks)
  • experimental/otel/test_otel.py (1 hunks)
  • experimental/otel/test_otel_integration.py (1 hunks)
  • experimental/otel/tracing.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
experimental/otel/test_otel.py (3)
experimental/otel/tracing.py (1)
  • _extract_region_from_endpoint (71-90)
experimental/otel/attributes.py (1)
  • truncate (55-72)
holmes/core/tracing.py (2)
  • start_span (104-105)
  • end (110-111)
experimental/otel/test_otel_integration.py (2)
experimental/otel/tracing.py (3)
  • init_otel_tracer (241-323)
  • get_tracer (326-340)
  • set_span_error (351-360)
experimental/otel/attributes.py (1)
  • truncate (55-72)
experimental/ag-ui/server-agui.py (3)
experimental/otel/tracing.py (3)
  • init_otel_tracer (241-323)
  • get_tracer (326-340)
  • set_span_error (351-360)
holmes/core/tracing.py (2)
  • start_span (104-105)
  • end (110-111)
experimental/otel/attributes.py (1)
  • truncate (55-72)
experimental/otel/__init__.py (2)
experimental/otel/tracing.py (4)
  • init_otel_tracer (241-323)
  • get_tracer (326-340)
  • shutdown_otel_tracer (343-348)
  • set_span_error (351-360)
experimental/otel/attributes.py (1)
  • truncate (55-72)
experimental/otel/tracing.py (1)
holmes/version.py (1)
  • get_version (48-130)
🪛 Ruff (0.14.8)
experimental/otel/test_otel.py

225-225: Do not catch blind exception: Exception

(BLE001)

experimental/otel/test_otel_integration.py

235-235: Abstract raise to an inner function

(TRY301)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


237-237: Do not catch blind exception: Exception

(BLE001)


264-264: Consider moving this statement to an else block

(TRY300)


265-265: Do not catch blind exception: Exception

(BLE001)


453-453: Do not catch blind exception: Exception

(BLE001)


485-485: Do not catch blind exception: Exception

(BLE001)

experimental/ag-ui/server-agui.py

294-294: Use explicit conversion flag

Replace with conversion flag

(RUF010)

experimental/otel/__init__.py

29-51: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

experimental/otel/tracing.py

88-89: try-except-pass detected, consider logging the exception

(S110)


88-88: Do not catch blind exception: Exception

(BLE001)


106-106: Avoid specifying long messages outside the exception class

(TRY003)


234-234: Consider moving this statement to an else block

(TRY300)


318-318: Consider moving this statement to an else block

(TRY300)


369-369: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (16)
experimental/otel/tracing.py (2)

181-213: Thread-safety concern is well-documented.

The environment variable manipulation for profile isolation is a known limitation. The documentation is clear about startup-only usage. Consider adding an assertion or runtime check to prevent accidental concurrent calls if this becomes a concern.
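
One way the suggested runtime check could be sketched is a non-blocking lock around the temporary environment change, so a second concurrent caller fails fast instead of silently observing the wrong profile. The helper name `temporary_aws_profile`, the lock, and the choice of the `AWS_PROFILE` variable are assumptions for illustration; the PR's actual code in tracing.py may manipulate the environment differently.

```python
import os
import threading
from contextlib import contextmanager
from typing import Iterator, Optional

_profile_lock = threading.Lock()  # hypothetical guard, not part of the PR


@contextmanager
def temporary_aws_profile(profile: Optional[str]) -> Iterator[None]:
    """Temporarily set AWS_PROFILE, failing fast on concurrent use."""
    if not _profile_lock.acquire(blocking=False):
        raise RuntimeError("temporary_aws_profile() is already in use")
    previous = os.environ.get("AWS_PROFILE")
    try:
        if profile is not None:
            os.environ["AWS_PROFILE"] = profile
        yield
    finally:
        # Restore the prior value (or remove the variable if it was unset).
        if previous is None:
            os.environ.pop("AWS_PROFILE", None)
        else:
            os.environ["AWS_PROFILE"] = previous
        _profile_lock.release()
```

The lock only protects callers that go through this helper; code that reads the environment directly while the profile is swapped would still see the temporary value, which is why startup-only use remains the safer convention.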


264-272: Initialization semantics are correct.

Setting _initialized = True even when disabled or missing endpoint is intentional — it prevents repeated initialization attempts on subsequent get_tracer() calls. The early returns with False correctly indicate tracing won't be active.

experimental/otel/attributes.py (1)

8-48: Well-structured attribute constants.

The constants follow Gen AI semantic conventions and are clearly organized by category. Good documentation linking to AG-UI field mappings.

experimental/ag-ui/server-agui.py (4)

85-91: LGTM - Tracer initialization at startup.

Module-level initialization ensures tracing is configured before handling requests. The get_tracer() call safely returns a no-op tracer when disabled.


142-153: LGTM - Root span with correlation attributes.

Creating the span inside the generator ensures it covers the streaming lifecycle. The correlation attributes (REQUEST_ID, CONVERSATION_ID) properly link traces to AG-UI request/thread identifiers.


208-229: LGTM - Tool execution spans with proper parent-child linking.

The implementation correctly:

  • Links child spans to the root using trace.set_span_in_context
  • Calculates duration from tracked start times
  • Truncates potentially large tool outputs
  • Handles missing start times gracefully with pop(..., None)
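
The duration bookkeeping in the last two bullets can be sketched with the standard library alone. The names `tool_start_times`, `on_tool_start`, and `on_tool_end` are illustrative assumptions rather than identifiers from server-agui.py, and the monotonic clock is a choice made for this sketch.

```python
import time
from typing import Dict, Optional

# Maps tool call IDs to the monotonic timestamp at which the call started.
tool_start_times: Dict[str, float] = {}


def on_tool_start(tool_call_id: str) -> None:
    tool_start_times[tool_call_id] = time.monotonic()


def on_tool_end(tool_call_id: str) -> Optional[float]:
    """Return the elapsed time in milliseconds, or None if no start was seen."""
    started = tool_start_times.pop(tool_call_id, None)
    if started is None:
        return None  # end event without a matching start: skip the duration
    return (time.monotonic() - started) * 1000.0
```

Using `pop` rather than a plain lookup also removes the entry, so the dictionary does not grow across a long-running stream.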

288-298: LGTM - Robust error handling and span lifecycle.

The finally block ensures the root span is always ended, preventing resource leaks. Error details are properly recorded on the span before the error event is yielded.
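
The shape of that lifecycle in a streaming generator can be shown with a stand-in span. `FakeSpan`, `stream_events`, and the event strings are placeholders for this sketch only; the real code uses OpenTelemetry spans and AG-UI events.

```python
from typing import Iterator, List


class FakeSpan:
    """Stand-in for a tracing span; records what happened to it."""

    def __init__(self) -> None:
        self.errors: List[str] = []
        self.ended = False

    def record_error(self, exc: Exception) -> None:
        self.errors.append(f"{type(exc).__name__}: {exc}")

    def end(self) -> None:
        self.ended = True


def stream_events(span: FakeSpan, fail: bool) -> Iterator[str]:
    try:
        yield "RUN_STARTED"
        if fail:
            raise ValueError("tool crashed")
        yield "RUN_FINISHED"
    except Exception as exc:  # broad on purpose: any failure is recorded
        span.record_error(exc)
        yield "RUN_ERROR"
    finally:
        span.end()  # runs on success, on error, and on early generator close
```

Because `GeneratorExit` is not an `Exception` subclass, a client that disconnects mid-stream skips the error branch but still reaches the `finally` block, so the span is ended without being marked as failed.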

experimental/otel/test_otel.py (2)

40-71: Good edge case coverage for truncate.

Tests cover all key scenarios: None input, strings within limit, at exact limit, over limit, and empty strings. The assertions include helpful failure messages.


220-228: Test runner correctly aggregates failures.

The broad Exception catch is intentional here to ensure all tests run even if one fails. Consider adding traceback.print_exc() for better debugging of failures.
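
A minimal sketch of such a runner with the suggested traceback output might look like this; `run_all` is a hypothetical name and the PR's runner is structured differently in detail.

```python
import traceback
from typing import Callable, List


def run_all(tests: List[Callable[[], None]]) -> int:
    """Run every test, print tracebacks for failures, return the failure count."""
    failures = 0
    for test in tests:
        try:
            test()
        except Exception:  # broad on purpose so later tests still run
            failures += 1
            print(f"FAILED: {test.__name__}")
            traceback.print_exc()
    return failures
```

Returning the count lets the script exit non-zero when anything failed, for example with `sys.exit(1 if run_all(tests) else 0)`.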

experimental/otel/test_otel_integration.py (7)

1-28: Good documentation for integration test usage.

The docstring clearly documents all required and optional environment variables, usage instructions, and the purpose of the test. This helps developers run the test correctly.


69-73: Good practice: Masking sensitive credentials in output.

The password is correctly masked with "***" to prevent accidental exposure in logs/output.


180-194: Correct span hierarchy and context propagation.

Child tool spans are properly created with trace.set_span_in_context(root_span) ensuring correct parent-child relationships in the trace. Each tool span is correctly ended within the loop.


230-243: Intentional error simulation for testing.

The broad Exception catch and intentional ValueError raise are appropriate here—this is a test exercising the error-recording path. The finally block correctly ensures the span is ended regardless of outcome.


344-349: Good retry pattern with exponential backoff potential.

The retry loop with configurable max_retries and retry_delay is appropriate for handling indexing delays. All HTTP requests correctly use timeout=10 to prevent hangs.
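
If the fixed delay were ever swapped for exponential backoff, the loop could be sketched as follows. The `max_retries` and `retry_delay` parameter names are taken from the comment above; the function name, the doubling factor, and the injectable `sleep` are assumptions made so the sketch can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], Optional[T]],
    max_retries: int = 5,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() until it returns a non-None result or retries run out."""
    delay = retry_delay
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries - 1:
            sleep(delay)
            delay *= 2  # double the wait after each miss
    return None
```

Here `fetch` would wrap the OpenSearch query and return `None` while the spans are not yet indexed.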


489-500: Good conditional verification with credential check.

The OpenSearch verification is correctly gated behind a check for all three required variables (OPENSEARCH_ENDPOINT, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD), and failure doesn't break the overall test result—appropriate for optional verification that may have indexing delays.


273-276: Credentials sourced from environment variables.

OpenSearch credentials are correctly read from environment variables rather than hardcoded. Empty string defaults are appropriate since the verification is optional.

Comment on lines +308 to +323
    search_body = {
        "size": 10,
        "query": {
            "bool": {
                "should": [
                    {"match": {"resource.attributes.gen_ai@request@id": test_id}},
                    {"match": {"attributes.gen_ai@request@id": test_id}},
                    {"match": {"gen_ai.request.id": test_id}},
                    {"wildcard": {"traceId": "*"}},
                ],
                "minimum_should_match": 1,
                "filter": [{"range": {"startTime": {"gte": "now-5m"}}}],
            }
        },
        "sort": [{"startTime": {"order": "desc"}}],
    }

⚠️ Potential issue | 🟡 Minor

Overly broad wildcard query may return unrelated traces.

The {"wildcard": {"traceId": "*"}} clause matches any document with a traceId. Combined with minimum_should_match: 1, this query could return traces unrelated to the test, relying only on the 5-minute time filter for relevance.

Consider removing the wildcard clause or making it more specific—the other match clauses should be sufficient to find the test traces by test_id.

🔎 Suggested fix
     search_body = {
         "size": 10,
         "query": {
             "bool": {
                 "should": [
                     {"match": {"resource.attributes.gen_ai@request@id": test_id}},
                     {"match": {"attributes.gen_ai@request@id": test_id}},
                     {"match": {"gen_ai.request.id": test_id}},
-                    {"wildcard": {"traceId": "*"}},
                 ],
                 "minimum_should_match": 1,
                 "filter": [{"range": {"startTime": {"gte": "now-5m"}}}],
             }
         },
         "sort": [{"startTime": {"order": "desc"}}],
     }
🤖 Prompt for AI Agents
In experimental/otel/test_otel_integration.py around lines 308 to 323, the
search_body uses a very broad {"wildcard": {"traceId": "*"}} combined with
minimum_should_match: 1 which can return unrelated traces; remove the wildcard
clause (or replace it with a specific pattern tied to test_id if traceId pattern
is known) so the query relies on the explicit match clauses for
resource/attributes/gen_ai request id and the time filter, ensuring results are
scoped to the test.

Comment on lines +58 to +60
opentelemetry-api = "^1.20.0"
opentelemetry-sdk = "^1.20.0"
opentelemetry-exporter-otlp-proto-http = "^1.20.0"

⚠️ Potential issue | 🔴 Critical

Upgrade OpenTelemetry dependencies to a recent version.

The three packages are correctly selected for OTLP HTTP export and the ^1.20.0 constraint format is appropriate. However, the current stable release is 1.39.1, which is 19 minor versions ahead. Since security fixes are only applied to the latest minor version, a floor of 1.20.0 still allows the project to resolve to releases that no longer receive security updates (the caret range itself admits any version from 1.20.0 up to, but not including, 2.0.0). Raise the floor to a recent version like ^1.39.0 to ensure continued maintenance and security patch availability.

🤖 Prompt for AI Agents
In pyproject.toml around lines 58 to 60, the OpenTelemetry package versions are
pinned to ^1.20.0 which is outdated; update opentelemetry-api,
opentelemetry-sdk, and opentelemetry-exporter-otlp-proto-http to a recent
maintained minor version (e.g., ^1.39.0 or the current latest) so the project
receives security and maintenance fixes; change the version specifiers
accordingly and run dependency resolution (poetry lock/install or equivalent) to
ensure compatibility and update any lockfile.
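
For context on what the two specifiers admit, the sketch below reimplements caret matching for major versions of 1 or higher, where ^X.Y.Z means >=X.Y.Z and <(X+1).0.0. It is a simplified illustration with hypothetical helper names, not Poetry's resolver, and it ignores pre-release tags and the special rules for 0.x versions.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def caret_allows(constraint: str, candidate: str) -> bool:
    """True if candidate satisfies a caret constraint such as '^1.20.0'.

    Only handles the major >= 1 case, where ^X.Y.Z means >=X.Y.Z,<(X+1).0.0.
    """
    floor = parse(constraint.lstrip("^"))
    if floor[0] < 1:
        raise ValueError("this sketch only covers major versions >= 1")
    ceiling = (floor[0] + 1, 0, 0)
    return floor <= parse(candidate) < ceiling
```

Under these rules `caret_allows("^1.20.0", "1.39.1")` is true, so the version actually installed depends on the lockfile; raising the constraint to ^1.39.0 changes the minimum that can be resolved rather than the maximum.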
