Skip to content

Latest commit

Β 

History

History
673 lines (453 loc) Β· 29.2 KB

File metadata and controls

673 lines (453 loc) Β· 29.2 KB

Guide: OTEL Observability β€” .NET Aspire Dashboard

← Back to README | Prerequisites & Setup | Architecture Overview | AI Toolkit variant β†’

This step-by-step guide walks you through adding OpenTelemetry (OTEL) distributed tracing to the Children's Story Studio backend and viewing traces in the .NET Aspire Dashboard β€” a Docker-based trace viewer. By the end, you'll be able to see the full lifecycle of every story generation request β€” across all five agents β€” in a visual trace waterfall.

Prefer staying in VS Code? The AI Toolkit variant of this guide shows the same OTEL instrumentation but uses the AI Toolkit for VS Code to view traces directly in your editor β€” no Docker required.

Note: Unlike the Activity Page Agents and TTS guides, this guide does not use GitHub Copilot to generate the implementation. Every code change is provided directly β€” you'll copy the code, understand what it does, and wire it in yourself.


Table of Contents


What You'll Build

After completing this guide, every story generation request will emit distributed traces that flow through the entire multi-agent workflow:

  POST /api/generate-story          ← FastAPI auto-instrumented span
    β”‚
    β”œβ”€β”€ Workflow: run                ← Agent Framework workflow span
    β”‚   β”œβ”€β”€ Executor: orchestrator   ← Agent Framework executor spans
    β”‚   β”‚   └── LLM: chat/completions
    β”‚   β”œβ”€β”€ Executor: story_architect
    β”‚   β”‚   └── LLM: chat/completions
    β”‚   β”œβ”€β”€ Executor: art_director
    β”‚   β”‚   β”œβ”€β”€ LLM: chat/completions
    β”‚   β”‚   β”œβ”€β”€ Image: generate (page 1)   ← Parallel image generation
    β”‚   β”‚   β”œβ”€β”€ Image: generate (page 2)
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ Executor: story_reviewer
    β”‚   β”‚   └── LLM: chat/completions
    β”‚   └── Executor: decision
    β”‚
    └── Response streamed via SSE

These traces are exported via the OpenTelemetry Protocol (OTLP) to the .NET Aspire Dashboard β€” a lightweight, local trace viewer that runs in Docker. No Azure resources are required.


Why This Matters

Concept What It Shows
Distributed Tracing for Agents How OTEL traces flow through a multi-agent workflow, giving visibility into every executor invocation
Per-Agent Latency Analysis Which agents are the bottleneck β€” e.g., image generation in ArtDirector vs. LLM calls in StoryArchitect
Debugging Revision Loops When the StoryReviewer rejects a draft, the revision loop back to Orchestrator is clearly visible as repeated spans
LLM Call Correlation Each LLM call is nested under its parent executor, showing token usage and response times in context
Production Observability Patterns The same OTEL setup works with any OTLP-compatible backend β€” Aspire today, Application Insights or Jaeger tomorrow
Zero-Code Auto-Instrumentation Agent Framework and FastAPI both emit spans automatically when a TracerProvider is configured β€” no manual span creation needed

NOTE: If you do not want to implement OTEL observability in the app manually, feel free to experiment with prompting GitHub Copilot to follow the documentation in this file to perform the implementation.

Before You Start

1. Complete Base Prerequisites

Ensure you've completed all steps in Prerequisites & Setup and that the base application is working (Running the Demo).

2. Install Docker Desktop

The Aspire Dashboard runs as a Docker container. If you don't already have Docker Desktop installed:

Verify Docker is running:

docker --version

3. Create a Working Branch

git checkout main
git pull origin main
git checkout -b my-otel-observability

Step 1: Start the Aspire Dashboard

The .NET Aspire Dashboard is a lightweight OpenTelemetry trace viewer that runs locally in Docker. It accepts OTLP data over gRPC and provides a browser-based UI for exploring traces, logs, and metrics.

Run the following command to start it:

docker run --rm -it -d \
  -p 18888:18888 \
  -p 4317:18889 \
  --name aspire-dashboard \
  mcr.microsoft.com/dotnet/aspire-dashboard:latest

Port mapping:

Host Port Container Port Purpose
18888 18888 Aspire Dashboard web UI
4317 18889 OTLP gRPC receiver (where the backend sends traces)

Open the dashboard in your browser:

http://localhost:18888

The dashboard requires a login token on first access. Retrieve it from the container logs:

docker logs aspire-dashboard

Look for a line like:

Login to the dashboard at http://localhost:18888/login?t=<YOUR_TOKEN>

Copy that full URL into your browser to authenticate.

Tip: The Aspire Dashboard will show "No resources found" until the backend starts sending traces. That's expected β€” we'll configure the backend in the next steps.


Step 2: Add OTEL Python Packages

Good news: agent-framework-core already bundles opentelemetry-api, opentelemetry-sdk, and opentelemetry-semantic-conventions-ai as transitive dependencies. You only need to add the OTLP exporter (to send data to Aspire) and the FastAPI instrumentor (to auto-trace HTTP requests).

Open backend/requirements.txt and append the following lines at the end of the file:

# OpenTelemetry (api + sdk are already included via agent-framework-core)
opentelemetry-exporter-otlp-proto-grpc>=1.28.0
opentelemetry-instrumentation-fastapi>=0.49b0

Then install the new dependencies:

cd backend
source .venv/bin/activate
pip install -r requirements.txt

Or run the Backend: Install Python deps VS Code task from the Command Palette.

What these packages do:

Package Purpose
opentelemetry-exporter-otlp-proto-grpc Exports spans to any OTLP-compatible receiver (Aspire, Jaeger, etc.) over gRPC
opentelemetry-instrumentation-fastapi Auto-instruments FastAPI β€” creates spans for every incoming HTTP request automatically

Step 3: Add OTEL Settings

Agent Framework uses a combination of standard OpenTelemetry environment variables and Agent Framework–specific environment variables to control observability. Add the following to your backend/.env file:

# ── Agent Framework Observability ──────────────────────────────────────────
# Activates Agent Framework's built-in instrumentation (spans for agent
# invocations, LLM chat calls, and tool executions).
ENABLE_INSTRUMENTATION=true

# Includes prompts, completions, function arguments and results in span
# attributes.  See the WARNING below before enabling.
ENABLE_SENSITIVE_DATA=true

# ── Standard OpenTelemetry ─────────────────────────────────────────────────
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=children-story-studio

⚠️ WARNING β€” Sensitive Data: When ENABLE_SENSITIVE_DATA=true, Agent Framework records full prompt text, LLM responses, function call arguments, and function results as span attributes. This is extremely useful for debugging during development, but may expose personally identifiable information (PII), API keys embedded in prompts, or other confidential data in your trace viewer. Only enable this in development or test environments. Set ENABLE_SENSITIVE_DATA=false (or remove the variable entirely) before deploying to any shared or production environment.

What these variables do:

Variable Default Description
ENABLE_INSTRUMENTATION false Activates Agent Framework's OpenTelemetry instrumentation code paths β€” without this, the framework will not emit invoke_agent, chat, or execute_tool spans
ENABLE_SENSITIVE_DATA false When true, includes prompt/response content and function arguments in span attributes. Development only β€” see warning above
OTEL_EXPORTER_OTLP_ENDPOINT (none) The OTLP gRPC endpoint. Points to the Aspire Dashboard's receiver on port 4317
OTEL_SERVICE_NAME agent_framework The service name that appears in the trace viewer

Next, open backend/app/config.py and add a single new field to the Settings class β€” a master switch that lets you disable OTEL entirely without removing environment variables:

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="ignore",
    )

    foundry_project_endpoint: str = ""
    foundry_model_deployment_name: str = "gpt-4o"
    foundry_image_model_deployment_name: str = "gpt-image-1"

    # CORS origin for the React dev server
    cors_origin: str = "http://localhost:5173"

    # OpenTelemetry β€” master switch (set to False to disable without removing env vars)
    otel_enabled: bool = True


settings = Settings()

The only new field is otel_enabled. All other OTEL configuration is handled by the standard environment variables above, which Agent Framework's configure_otel_providers() reads automatically.


Step 4: Create the Telemetry Module

Create a new file at backend/app/telemetry.py with the following contents:

"""
telemetry.py β€” OpenTelemetry bootstrap for the story-generation backend.

Uses Agent Framework's built-in ``configure_otel_providers()`` to set up
the TracerProvider, exporters, and Agent Framework instrumentation from
environment variables.  Also auto-instruments FastAPI so every incoming
HTTP request gets its own trace span automatically.

See: https://learn.microsoft.com/en-us/agent-framework/agents/observability
"""

import logging

from agent_framework.observability import configure_otel_providers
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from .config import settings

logger = logging.getLogger(__name__)


def configure_telemetry(app: FastAPI) -> None:
    """Set up OTEL providers via Agent Framework and instrument FastAPI.

    This must be called **before** any agent-framework imports that create
    workflows or executors, so the framework can pick up the active
    TracerProvider and emit its own spans.

    Does nothing if ``settings.otel_enabled`` is False.
    """
    if not settings.otel_enabled:
        logger.info("OpenTelemetry is disabled (OTEL_ENABLED=false)")
        return

    # 1. Configure OTEL providers β€” reads OTEL_EXPORTER_OTLP_ENDPOINT,
    #    OTEL_SERVICE_NAME, ENABLE_INSTRUMENTATION, and ENABLE_SENSITIVE_DATA
    #    from environment variables automatically.
    configure_otel_providers()

    # 2. Auto-instrument FastAPI (creates a parent span for every HTTP request).
    #    This is NOT covered by configure_otel_providers(), so we add it here.
    FastAPIInstrumentor.instrument_app(app)

    logger.info("OpenTelemetry configured via Agent Framework")

What this does:

  1. configure_otel_providers() β€” Agent Framework's built-in bootstrap function. It:
    • Reads OTEL_EXPORTER_OTLP_ENDPOINT and creates an OTLP gRPC exporter targeting the Aspire Dashboard
    • Creates a TracerProvider (plus log and metric providers) with OTEL_SERVICE_NAME as the service resource
    • Reads ENABLE_INSTRUMENTATION β€” when true, activates the framework's instrumentation code paths so it emits invoke_agent, chat, and execute_tool spans automatically
    • Reads ENABLE_SENSITIVE_DATA β€” when true, includes prompt text, LLM responses, and function arguments/results as span attributes
    • Registers everything as the global OTEL providers
  2. FastAPIInstrumentor.instrument_app(app) β€” Wraps every FastAPI route handler to create a parent span for each HTTP request. This is separate from Agent Framework's instrumentation and must be added explicitly.

Step 5: Wire Telemetry into the App

Now modify backend/app/main.py to call configure_telemetry() at startup.

Important β€” Import Ordering: The story_workflow object in workflow.py is a module-level singleton β€” it's created the moment workflow.py is imported. That import chain starts when main.py imports StoryGenerator (which imports workflow.py). For Agent Framework to emit its own spans, the global TracerProvider must be active before the workflow is built. This means we need to configure telemetry before importing StoryGenerator.

Replace the contents of backend/app/main.py with:

"""
main.py β€” FastAPI application entry point.

Endpoints:
  GET  /api/health              β€” health check
  POST /api/generate-story      β€” runs the story workflow; streams SSE progress events
"""

import logging

from dotenv import load_dotenv
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from sse_starlette.sse import EventSourceResponse

load_dotenv()  # Agent Framework reads env vars directly β€” ensure .env is loaded early

from .config import settings  # noqa: E402
from .models import StoryRequest  # noqa: E402

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(name)s β€” %(message)s",
)
logger = logging.getLogger(__name__)

# ─── App ──────────────────────────────────────────────────────────────────────

app = FastAPI(
    title="Children's Story Multi-Agent API",
    description=(
        "Multi-agent orchestration for generating illustrated children's stories "
        "using Microsoft Agent Framework."
    ),
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=[settings.cors_origin, "http://localhost:5173", "http://localhost:5174"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# ─── Telemetry (must be configured BEFORE importing StoryGenerator) ───────────

from .telemetry import configure_telemetry  # noqa: E402

configure_telemetry(app)

# ─── Service instances ────────────────────────────────────────────────────────

from .story_generator import StoryGenerator  # noqa: E402

_story_generator = StoryGenerator()

# ─── Health check ─────────────────────────────────────────────────────────────


@app.get("/api/health")
async def health() -> dict:
    return {"status": "ok", "service": "children-story-multi-agent"}


# ─── Story generation (SSE) ───────────────────────────────────────────────────


@app.post("/api/generate-story")
async def generate_story(request: StoryRequest) -> EventSourceResponse:
    """Accepts story parameters and streams back SSE events as the multi-agent
    workflow progresses.  The final event (type: 'complete') contains the
    full illustrated StoryResponse.
    """
    return _story_generator.event_source_response(request)

What changed (compared to the original main.py):

  1. load_dotenv() was added near the top of the file. Agent Framework's configure_otel_providers() reads environment variables directly (via os.environ), not through Pydantic settings β€” so we must ensure the .env file is loaded into the process environment before calling it.
  2. The from .story_generator import StoryGenerator import was moved down β€” it now appears after configure_telemetry(app) is called.
  3. Two new lines were added between the CORS middleware and the service instances:
    from .telemetry import configure_telemetry
    configure_telemetry(app)
  4. The # noqa: E402 comments suppress the linter warning about imports not being at the top of the file. This is intentional β€” the import ordering is required for correct OTEL initialization.

Everything else β€” the endpoints, CORS config, logging β€” is unchanged.


Step 6: Generate a Story and View Traces

1. Start the backend

cd backend
source .venv/bin/activate
uvicorn app.main:app --reload --port 8000

You should see a log line confirming telemetry is active:

OpenTelemetry configured via Agent Framework

2. Start the frontend

cd frontend
npm run dev

3. Generate a story

Open the app at http://localhost:5173, fill in the story form, and click Generate Story. Wait for the story to complete.

4. View traces in the Aspire Dashboard

Open the Aspire Dashboard at http://localhost:18888 and navigate to the Traces tab.

You should see a trace for the POST /api/generate-story request. Click on it to open the trace waterfall view, which shows a timeline of all spans in the request.


What to Look For

When examining a trace in the Aspire Dashboard, look for these patterns:

Span Hierarchy

The top-level span is the FastAPI HTTP request (POST /api/generate-story). Beneath it, you should see spans emitted by Agent Framework for the workflow and each agent. The framework uses OpenTelemetry GenAI Semantic Conventions for span naming:

Span Name Pattern What It Represents
POST /api/generate-story The full HTTP request lifecycle (auto-instrumented by FastAPI)
invoke_agent <agent_name> Each agent invocation β€” the top-level span for an agent's work within an executor
chat <model_name> An LLM chat completion call. When ENABLE_SENSITIVE_DATA=true, the prompt and response text appear as span attributes
execute_tool <function_name> A function tool execution (if your agents use tools). Includes arguments and results when sensitive data is enabled

Per-Agent Latency

The waterfall view shows each executor's duration as a horizontal bar. You'll typically see:

  • Orchestrator β€” Fast (single LLM call to create an outline)
  • StoryArchitect β€” Moderate (one LLM call to write the full narrative)
  • ArtDirector β€” Longest (LLM call for image prompts + parallel image generation calls)
  • StoryReviewer β€” Moderate (one LLM call to review the draft)
  • Decision β€” Near-instant (routing logic only)

Revision Loops

If the StoryReviewer rejects a draft, you'll see the workflow loop back to Orchestrator. In the trace waterfall, this appears as repeated executor spans β€” a second pass through Orchestrator β†’ StoryArchitect β†’ ArtDirector β†’ StoryReviewer β†’ Decision. The revision count (max 2) is visible as repeated cycles.

Parallel Image Generation

Inside the ArtDirector executor span, look for multiple image generation spans running concurrently (up to 5 in parallel, controlled by the semaphore in the existing code). These will appear as overlapping horizontal bars in the waterfall.

Sensitive Data in Span Attributes

If you set ENABLE_SENSITIVE_DATA=true, expand any chat span's attributes to see:

  • gen_ai.request.instructions β€” The system prompt sent to the LLM
  • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens β€” Token counts for cost tracking
  • gen_ai.response.id β€” The LLM response identifier

This is invaluable for debugging prompt issues but remember to disable it before sharing traces or moving to a shared environment.


Alternative: Exporting to Azure Application Insights

The Aspire Dashboard is great for local development. For production or shared environments, you can export the same OTEL traces to Azure Application Insights instead β€” with no changes to the agent or workflow code.

What Application Insights Adds

Feature Description
Application Map Visual diagram showing the call flow between agents β€” automatically generated from trace data
Live Metrics Real-time view of incoming requests, failures, and performance
Transaction Search Search and filter across all traces by executor name, duration, status, etc.
KQL Querying Write custom queries against trace data (e.g., "show me all stories where ArtDirector took > 30 seconds")
Smart Detection Automatic alerts for anomalies in failure rates or response times
Long-Term Retention Traces are retained for 90 days by default (configurable) β€” vs. Aspire which only holds data while the container is running

Implementation

1. Provision an Application Insights Resource

Create an Application Insights resource in the Azure Portal (or via CLI). Copy the Connection String from the resource's Overview page.

2. Swap the Exporter Package

In backend/requirements.txt, replace the OTLP gRPC exporter with the Azure Monitor package:

# OpenTelemetry (api + sdk are already included via agent-framework-core)
opentelemetry-instrumentation-fastapi>=0.49b0
azure-monitor-opentelemetry>=1.6.4

Note: The azure-monitor-opentelemetry meta-package bundles the Azure Monitor exporter with auto-instrumentation for HTTP clients and more.

Install the updated dependencies:

pip install -r requirements.txt

3. Add the Connection String

Add the Application Insights connection string to your backend/.env:

APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=xxx;IngestionEndpoint=https://xxx.in.applicationinsights.azure.com/;...

And add a corresponding field in backend/app/config.py inside the Settings class:

    # Azure Application Insights (alternative to local Aspire Dashboard)
    applicationinsights_connection_string: str = ""

4. Update the Telemetry Module

Modify backend/app/telemetry.py to use configure_azure_monitor along with Agent Framework's enable_instrumentation(). Replace the file contents with:

"""
telemetry.py β€” OpenTelemetry bootstrap (Application Insights variant).

Uses azure-monitor-opentelemetry to export traces to Application Insights,
then activates Agent Framework instrumentation.

See: https://learn.microsoft.com/en-us/agent-framework/agents/observability
"""

import logging

from azure.monitor.opentelemetry import configure_azure_monitor
from agent_framework.observability import create_resource, enable_instrumentation
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from .config import settings

logger = logging.getLogger(__name__)


def configure_telemetry(app: FastAPI) -> None:
    if not settings.otel_enabled:
        logger.info("OpenTelemetry is disabled (OTEL_ENABLED=false)")
        return

    if not settings.applicationinsights_connection_string:
        logger.warning(
            "APPLICATIONINSIGHTS_CONNECTION_STRING is not set β€” telemetry disabled"
        )
        return

    # 1. Configure Azure Monitor with the Agent Framework resource
    configure_azure_monitor(
        connection_string=settings.applicationinsights_connection_string,
        resource=create_resource(),  # Uses OTEL_SERVICE_NAME from env
        enable_live_metrics=True,
    )

    # 2. Activate Agent Framework instrumentation (reads ENABLE_SENSITIVE_DATA
    #    from env vars, or pass it explicitly here)
    enable_instrumentation()

    # 3. Auto-instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    logger.info("OpenTelemetry configured β€” exporting to Application Insights")

The key differences from the Aspire version are:

  1. configure_azure_monitor() replaces configure_otel_providers() β€” it sets up the Azure Monitor exporter with the connection string
  2. create_resource() β€” Agent Framework helper that builds an OTEL Resource using OTEL_SERVICE_NAME and the framework version
  3. enable_instrumentation() β€” Explicitly activates Agent Framework's instrumentation code paths (alternatively, this is handled by the ENABLE_INSTRUMENTATION env var, but calling it explicitly ensures it's active regardless of env configuration)

main.py does not need any changes β€” the import ordering, load_dotenv(), and configure_telemetry(app) call stay the same.


Troubleshooting

Aspire Dashboard is not reachable at http://localhost:18888

Ensure Docker is running and the container is up:

docker ps --filter name=aspire-dashboard

If the container isn't listed, start it again with the docker run command from Step 1.

No traces appear in the Aspire Dashboard

  1. Check the backend logs β€” You should see OpenTelemetry configured via Agent Framework. If you see OpenTelemetry is disabled, check that OTEL_ENABLED is not set to false in your .env.

  2. Verify load_dotenv() is called β€” Agent Framework's configure_otel_providers() reads environment variables directly from os.environ, not from Pydantic settings. If load_dotenv() is missing from main.py, the .env values for OTEL_EXPORTER_OTLP_ENDPOINT, ENABLE_INSTRUMENTATION, etc. won't be available.

  3. Verify the OTLP endpoint β€” The exporter must be able to reach localhost:4317, which Docker maps to the Aspire container's OTLP receiver on port 18889. If you changed the port mapping, update OTEL_EXPORTER_OTLP_ENDPOINT accordingly.

  4. Generate at least one story β€” Traces only appear after a request has been made. The health check endpoint (GET /api/health) will also generate a span if you want a quick test.

FastAPI spans appear but no Agent Framework spans

This usually means one of two things:

  1. ENABLE_INSTRUMENTATION is not set to true β€” This is the most common cause. Without this environment variable, Agent Framework will not emit invoke_agent, chat, or execute_tool spans even if a TracerProvider is active. Check your backend/.env file.

  2. Import ordering β€” The TracerProvider must be registered as the global provider before the story_workflow singleton is created. This happens at module import time in workflow.py. Double-check that your main.py follows the import ordering from Step 5:

    • load_dotenv() is called first (so env vars are available)
    • configure_telemetry(app) is called second
    • from .story_generator import StoryGenerator comes after

If the StoryGenerator import is above the configure_telemetry() call, the workflow will be built before the provider is active.

Aspire Dashboard asks for a login token

Retrieve the token from the container logs:

docker logs aspire-dashboard 2>&1 | grep "login"

Copy the full URL (including the ?t= parameter) into your browser.

Stopping the Aspire Dashboard

When you're done, stop and remove the container:

docker stop aspire-dashboard

The --rm flag in the original docker run command ensures the container is automatically removed when stopped.


Next Steps

  • Experiment with custom spans β€” add manual tracing to specific operations (e.g., JSON parsing, image prompt construction) using tracer.start_as_current_span().
  • Explore the Metrics and Structured Logs tabs in the Aspire Dashboard β€” OTEL supports all three signals, and the SDK can be extended to export metrics and logs alongside traces.
  • Try the Application Insights alternative to see how the same traces look in a production-grade monitoring tool with Application Map, KQL queries, and alerting.
  • If you haven't already, try the Activity Page Agents guide or Text-to-Speech guide to extend the application with new capabilities.

← Back to README