agentevals evaluates AI agents by consuming their OpenTelemetry traces. Any agent that emits OTel spans can be evaluated.
This guide covers the instrumentation patterns agentevals supports, with a recommendation for new projects. Each example in this directory is a working agent you can run and modify.
The simplest way to connect any agent to agentevals is zero-code OTLP export: point your standard OTel OTLP exporter at the agentevals receiver and you're done. No agentevals dependency is needed in your agent code.
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_RESOURCE_ATTRIBUTES="agentevals.session_name=my-agent,agentevals.eval_set_id=my-eval"
python your_agent.py
```

For OTLP/gRPC exporters, use:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

agentevals accepts OTLP/HTTP on port 4318 (http/protobuf and http/json) and OTLP/gRPC on port 4317. Sessions are auto-created from incoming traces and grouped by `agentevals.session_name`.
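To check that the receiver is reachable before wiring up a real agent, you can post a minimal hand-written OTLP/JSON span with curl. This is an illustrative smoke test, not part of the examples; the trace/span IDs and timestamps below are arbitrary placeholder values:

```shell
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"agentevals.session_name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"scope":{"name":"manual-test"},"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"ping","kind":1,"startTimeUnixNano":"1700000000000000000","endTimeUnixNano":"1700000001000000000"}]}]}]}'
```

If ingestion succeeds, a session named `smoke-test` should appear, since sessions are grouped by `agentevals.session_name`.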
| Example | Framework | LLM Provider |
|---|---|---|
| zero-code-examples/langchain/ | LangChain | OpenAI |
| zero-code-examples/ollama/ | LangChain | Ollama |
| zero-code-examples/strands/ | Strands | OpenAI |
| zero-code-examples/adk/ | Google ADK | Gemini |
| zero-code-examples/pydantic-ai/ | Pydantic AI | OpenAI |
This approach works with any framework that has OTel instrumentation: LangChain, Strands, Google ADK, etc. If your framework already emits OTel spans, you only need to add OTLPSpanExporter (and OTLPLogExporter if it uses GenAI log-based content delivery).
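As a minimal sketch of that wiring (assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-http` packages are installed), adding the exporter looks roughly like this:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# OTLPSpanExporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment,
# so no agentevals-specific code is required here.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
```

Your framework's OTel instrumentation then emits spans through this global provider, and they are exported to the agentevals receiver.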
| Attribute | Required | Description |
|---|---|---|
| `agentevals.session_name` | No | Groups spans into a named session. Without it, sessions are named `otlp-<traceId prefix>`. |
| `agentevals.eval_set_id` | No | Associates the session with an eval set for scoring. |
Set them via `OTEL_RESOURCE_ATTRIBUTES` (env var) or `Resource.create()` in code.
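The in-code variant is a small sketch (assuming the `opentelemetry-sdk` package; attribute values are the same ones used in the env-var example above):

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Equivalent of OTEL_RESOURCE_ATTRIBUTES, set in code instead:
resource = Resource.create({
    "agentevals.session_name": "my-agent",
    "agentevals.eval_set_id": "my-eval",
})
provider = TracerProvider(resource=resource)
```

`Resource.create()` merges these attributes with any set via the environment, so both mechanisms can coexist.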
For tighter control over session lifecycle, or if you prefer a Python API over environment variables, the AgentEvals SDK wraps all OTel boilerplate into a context manager:
```python
from agentevals import AgentEvals

app = AgentEvals()
with app.session(eval_set_id="my-eval"):
    result = my_agent.invoke("Hello!")
```

Works with LangChain, Strands, Google ADK, and any OTel-instrumented agent. For frameworks that create their own TracerProvider (like Strands), pass it explicitly:
```python
telemetry = StrandsTelemetry()
with app.session(eval_set_id="strands-eval", tracer_provider=telemetry.tracer_provider):
    agent("Roll a die")
```

For simple prompt-to-response agents, there's also a decorator shorthand:
```python
app = AgentEvals(eval_set_id="my-eval")

@app.agent
def my_agent(prompt):
    return llm.invoke(prompt).content

app.run(["Hello!", "Tell me a joke"])
```

To skip streaming when the dev server isn't running, set `streaming=False`:

```python
app = AgentEvals(streaming=os.getenv("AGENTEVALS_STREAM", "1") == "1")
```

When disabled, `session()` and `session_async()` become no-ops and your agent runs normally without any WebSocket connection or OTel setup.
Requires the `[streaming]` extra: `pip install "agentevals[streaming]"`. See `sdk_example/` for complete working examples.
Trace format is auto-detected. Agents don't need to declare which format they use.
- **OTel GenAI Semantic Conventions** (recommended for new agents). Standard `gen_ai.*` span attributes defined by the OpenTelemetry GenAI working group. Framework-agnostic and interoperable. Works with LangChain, Strands, and any framework that supports the conventions.
- **Framework-Native OTel Tracing.** Some frameworks (like Google ADK) emit their own proprietary span attributes. agentevals has dedicated converters for these formats.

Detection checks for `gen_ai.request.model` / `gen_ai.input.messages` (GenAI semconv) or `otel.scope.name == "gcp.vertex.agent"` (ADK).
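The detection rule above can be sketched as a small classifier. This is an illustrative simplification, not the actual agentevals implementation; the function name and dict-based span representation are assumptions for the example:

```python
def detect_trace_format(span_attributes: dict, scope_name: str) -> str:
    """Classify a span by the markers the detection rule describes."""
    # Framework-native ADK spans are identified by their OTel scope name.
    if scope_name == "gcp.vertex.agent":
        return "adk"
    # GenAI semconv spans carry standard gen_ai.* attributes.
    if ("gen_ai.request.model" in span_attributes
            or "gen_ai.input.messages" in span_attributes):
        return "genai-semconv"
    return "unknown"
```

For example, a span with `gen_ai.request.model` set would be classified as GenAI semconv regardless of which framework emitted it.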
| Example | Framework | LLM Provider | Instrumentation | Content Delivery |
|---|---|---|---|---|
| zero-code-examples/langchain/ | LangChain | OpenAI | GenAI semconv (logs) | Standard OTLP export |
| zero-code-examples/ollama/ | LangChain | Ollama | GenAI semconv (logs) | Standard OTLP export |
| zero-code-examples/strands/ | Strands | OpenAI | GenAI semconv (events*) | Standard OTLP export |
| zero-code-examples/adk/ | Google ADK | Gemini | ADK built-in | Standard OTLP export |
| zero-code-examples/pydantic-ai/ | Pydantic AI | OpenAI | GenAI semconv (span attrs) | Standard OTLP export |
| langchain_agent | LangChain | OpenAI | GenAI semconv (logs) | SDK WebSocket |
| strands_agent | Strands | OpenAI | GenAI semconv (events*) | SDK WebSocket |
| dice_agent | Google ADK | Gemini | ADK built-in | SDK WebSocket |
*Span events are being deprecated in favor of log-based events. agentevals supports both. See docs/otel-compatibility.md for details.
The zero-code and SDK examples implement the same toy agent (dice rolling + prime checking) so you can compare the two approaches directly.
| Example | Description |
|---|---|
| kubernetes/ | Deploy agentevals with kagent on Kubernetes using native OTLP gRPC ingestion (or optionally an OTel Collector). Includes a walkthrough for comparing two kagent agents (different models) and evaluating them with tool trajectory and response match scores. |
> **Tip**
> The sections below apply to the SDK WebSocket examples (`langchain_agent`, `strands_agent`, `dice_agent`). For the zero-code OTLP examples, none of this manual wiring is needed.
The OTel GenAI semantic conventions define what data is captured (gen_ai.request.model, gen_ai.input.messages, gen_ai.output.messages, token counts, etc.) but allow flexibility in how message content is delivered. agentevals supports both approaches:
**Logs-Based Content** (`langchain_agent`)
Used by auto-instrumentation libraries like opentelemetry-instrumentation-openai-v2. Spans carry metadata (model, tokens, finish reasons), while message content is emitted as separate OTel Log Records.
This pattern requires both a TracerProvider and a LoggerProvider, with matching processors:
```python
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"

tracer_provider = TracerProvider()
logger_provider = LoggerProvider()

processor = AgentEvalsStreamingProcessor(ws_url=..., session_id=..., trace_id=...)
tracer_provider.add_span_processor(processor)

log_processor = AgentEvalsLogStreamingProcessor(processor)  # shares WebSocket connection
logger_provider.add_log_record_processor(log_processor)

OpenAIInstrumentor().instrument()  # auto-instruments the OpenAI SDK
```

Without `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true`, only metadata is captured and no conversation text will appear.
See langchain_agent/README.md for the full walkthrough.
**Events-Based Content** (`strands_agent`)
> **Note**
> The OTel community is deprecating span events in favor of log-based events emitted via the Logs API. Frameworks currently using span events (like Strands) are expected to migrate to log-based events in future versions. agentevals supports both patterns and will continue to handle span events for backward compatibility.
Used by frameworks that emit message content as span events rather than separate log records. The AgentEvalsStreamingProcessor automatically promotes gen_ai.input.messages and gen_ai.output.messages from event attributes to span attributes, so downstream processing sees a uniform shape.
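That promotion step can be sketched roughly as follows. This is a simplified illustration of the idea, not the real `AgentEvalsStreamingProcessor`; the function name and dict-based event representation are assumptions for the example:

```python
def promote_message_events(span_attrs: dict, events: list) -> dict:
    """Copy gen_ai message content from span events onto span attributes."""
    promoted = dict(span_attrs)
    for event in events:
        for key in ("gen_ai.input.messages", "gen_ai.output.messages"):
            # Promote event attributes, without overwriting attributes
            # the span already carries.
            if key in event.get("attributes", {}) and key not in promoted:
                promoted[key] = event["attributes"][key]
    return promoted
```

After promotion, logs-based and events-based spans present the same attribute shape to downstream converters.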
This pattern needs only a TracerProvider, no LoggerProvider or log processor:
```python
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"

telemetry = StrandsTelemetry()  # creates TracerProvider internally
processor = AgentEvalsStreamingProcessor(ws_url=..., session_id=..., trace_id=...)
telemetry.tracer_provider.add_span_processor(processor)
```

- For new instrumentation, prefer the logs-based pattern. The OTel community recommends emitting events as log records rather than span events going forward.
- Check your framework/library docs first. They will tell you whether message content is emitted as logs or span events.
- If your instrumentation library requires a `LoggerProvider` (like `opentelemetry-instrumentation-openai-v2`), use the logs-based pattern.
- If your framework currently emits GenAI span events (like Strands with `StrandsTelemetry`), the events-based pattern works today. When the framework migrates to log-based events, switch to the logs-based pattern.
- If you're using Google ADK, skip GenAI semconv entirely. See the next section.
For a detailed overview of OTel compatibility and the ongoing migration, see docs/otel-compatibility.md.
Google ADK instruments agents automatically under the gcp.vertex.agent OTel scope. It emits proprietary attributes (gcp.vertex.agent.llm_request, gcp.vertex.agent.llm_response, etc.) directly on spans. agentevals has a dedicated converter for this format.
No GenAI semconv environment variables or log providers are needed:
```python
provider = TracerProvider()
trace.set_tracer_provider(provider)
processor = AgentEvalsStreamingProcessor(ws_url=..., session_id=..., trace_id=...)
provider.add_span_processor(processor)
# ADK agents automatically emit spans through the global TracerProvider
```

See `dice_agent/README.md` for a complete example.
```bash
agentevals serve --dev
```

```bash
cd ui && npm run dev
# Open http://localhost:5173, select "I am developing an agent"
```

```bash
# Zero-code OTLP (recommended):
python examples/zero-code-examples/langchain/run.py
python examples/zero-code-examples/ollama/run.py
python examples/zero-code-examples/strands/run.py
python examples/zero-code-examples/adk/run.py
python examples/zero-code-examples/pydantic-ai/run.py

# SDK examples:
python examples/sdk_example/context_manager_example.py
python examples/sdk_example/decorator_example.py
python examples/sdk_example/async_example.py

# Manual OTel setup examples:
python examples/dice_agent/main.py
python examples/langchain_agent/main.py
python examples/strands_agent/main.py
```

Traces stream to the dev server in real time. Evaluation runs automatically when the session completes.
See each example's README for prerequisites and detailed instructions:
- zero-code-examples/ (LangChain, Strands, ADK, OpenAI Agents, Pydantic AI — standard OTLP)
- dice_agent/README.md (Google ADK + Gemini)
- langchain_agent/README.md (LangChain + OpenAI, SDK)
- strands_agent/ (Strands + OpenAI, SDK)
For details on the WebSocket streaming protocol, see docs/streaming.md.