This document describes the observability features implemented in the Partnership Agent solution, including OpenTelemetry integration and AI response evaluation metrics.
The Partnership Agent now includes comprehensive observability features that automatically capture and send evaluation metrics to Application Insights via OpenTelemetry. This enables continuous monitoring of AI response quality across all user interactions.
Each AI response is scored on the following metrics:
- Coherence: Measures how well the response flows logically and maintains internal consistency
- Fluency: Evaluates the grammatical correctness and natural language quality
- Equivalence: (When ground truth provided) Measures semantic similarity to expected answers
- Groundedness: (When ground truth provided) Evaluates factual accuracy against reference material
Key observability features (a wiring sketch follows this list):
- OpenTelemetry integration with Application Insights
- Automatic activity tracing for all AI interactions
- Custom metrics for evaluation scores
- Structured logging with correlation IDs
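How this wiring looks depends on the host project; the following is a minimal sketch of registering the tracing pipeline against Application Insights, assuming the OpenTelemetry.Extensions.Hosting and Azure.Monitor.OpenTelemetry.Exporter packages. Apart from the activity source names and configuration keys documented elsewhere in this file, the details are illustrative.

```csharp
using Azure.Monitor.OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Sketch of Program.cs in the host (ASP.NET Core assumed).
var builder = WebApplication.CreateBuilder(args);

// Resolve the connection string from appsettings.json or the environment
// variable, whichever is available (see the configuration section below).
var connectionString =
    builder.Configuration["ApplicationInsights:ConnectionString"]
    ?? Environment.GetEnvironmentVariable("APPLICATIONINSIGHTS_CONNECTION_STRING");

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("PartnershipAgent"))
    .WithTracing(tracing => tracing
        // Subscribe to the agent's activity sources (listed under the
        // architecture notes below) and export spans to Application Insights.
        .AddSource("PartnershipAgent.StepOrchestration")
        .AddSource("PartnershipAgent.Agents")
        .AddSource("PartnershipAgent.Evaluation")
        .AddAzureMonitorTraceExporter(options => options.ConnectionString = connectionString));

var app = builder.Build();
app.Run();
```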
Set your Application Insights connection string in one of the following ways:
```json
{
  "ApplicationInsights": {
    "ConnectionString": "InstrumentationKey=your-key;IngestionEndpoint=https://your-region.in.applicationinsights.azure.com/"
  }
}
```

Or set the environment variable:

```bash
export APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=your-key;IngestionEndpoint=https://your-region.in.applicationinsights.azure.com/"
```

Enable or disable AI response evaluation:
```json
{
  "Evaluation": {
    "Enabled": true,
    "LogToConsole": true
  }
}
```

To view and analyze the metrics:

- Navigate to Application Insights in the Azure portal
- Go to Logs and run KQL queries to analyze AI quality metrics:
```kusto
// View all evaluation metrics
traces
| where customDimensions contains "gen_ai.evaluation"
| project timestamp, operation_Id, customDimensions
| sort by timestamp desc
```

```kusto
// Average coherence scores by time
traces
| where customDimensions has "gen_ai.evaluation.Coherence.score"
| extend CoherenceScore = todouble(customDimensions["gen_ai.evaluation.Coherence.score"])
| summarize AvgCoherence = avg(CoherenceScore) by bin(timestamp, 1h)
| sort by timestamp desc
```

```kusto
// Response quality distribution
traces
| where customDimensions has "gen_ai.evaluation.Fluency.score"
| extend FluencyScore = todouble(customDimensions["gen_ai.evaluation.Fluency.score"])
| summarize count() by bin(FluencyScore, 0.1)
| sort by FluencyScore asc
```

- Create dashboards to monitor:
  - Average evaluation scores over time
  - Response quality distribution
  - Failed evaluations and error rates
  - User interaction patterns
Each AI interaction generates the following telemetry tags (a sketch of how they are attached to the evaluation activity follows the lists):
- `gen_ai.full_nl_response`: The complete AI response text
- `gen_ai.evaluation.module`: Which AI component generated the response (e.g., "FAQAgent", "EntityResolutionAgent")
- `gen_ai.evaluation.user_prompt`: The original user input
- `gen_ai.evaluation.Coherence.score`: Coherence evaluation score (0.0-1.0)
- `gen_ai.evaluation.Fluency.score`: Fluency evaluation score (0.0-1.0)
- `gen_ai.evaluation.has_ground_truth`: Whether ground truth comparison was performed
When ground truth is available:
- `gen_ai.evaluation.Equivalence.score`: Semantic similarity score
- `gen_ai.evaluation.Groundedness.score`: Factual accuracy score
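These tags are emitted as attributes on the evaluation activity, which is what surfaces them as customDimensions in Application Insights. A minimal sketch of how that attachment might look, using System.Diagnostics directly; the helper name and parameters are assumptions for illustration:

```csharp
using System.Diagnostics;

public static class EvaluationTelemetry
{
    // Must match the source registered with the OpenTelemetry tracing pipeline.
    private static readonly ActivitySource Source = new("PartnershipAgent.Evaluation");

    // Illustrative helper: records one evaluation as tags on a child activity
    // so the values show up as customDimensions in Application Insights.
    public static void Record(
        string module, string userPrompt, string response,
        double coherence, double fluency,
        double? equivalence = null, double? groundedness = null)
    {
        using var activity = Source.StartActivity("EvaluateResponse");
        activity?.SetTag("gen_ai.full_nl_response", response);
        activity?.SetTag("gen_ai.evaluation.module", module);
        activity?.SetTag("gen_ai.evaluation.user_prompt", userPrompt);
        activity?.SetTag("gen_ai.evaluation.Coherence.score", coherence);
        activity?.SetTag("gen_ai.evaluation.Fluency.score", fluency);
        activity?.SetTag("gen_ai.evaluation.has_ground_truth", equivalence is not null);

        if (equivalence is not null)
            activity?.SetTag("gen_ai.evaluation.Equivalence.score", equivalence);
        if (groundedness is not null)
            activity?.SetTag("gen_ai.evaluation.Groundedness.score", groundedness);
    }
}
```

The KQL queries in this document read these same keys back out of customDimensions.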
- IAssistantResponseEvaluator: Interface for AI response evaluation (an assumed shape is sketched after this list)
- AssistantResponseEvaluator: Implementation using Microsoft.Extensions.AI.Evaluation
- OpenTelemetry Configuration: Automatic tracing and metrics collection
- Activity Sources:
  - `PartnershipAgent.StepOrchestration`: Main orchestration flow
  - `PartnershipAgent.Agents`: Individual agent activities
  - `PartnershipAgent.Evaluation`: Evaluation activities
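IAssistantResponseEvaluator and AssistantResponseEvaluator are internal to the solution, so their exact contract is not reproduced here; the shape below is an assumed sketch used by the later examples in this document, not the actual interface:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Assumed result shape: one score per metric, with the ground-truth metrics
// present only when an expected answer was supplied.
public sealed record EvaluationScores(
    double Coherence,
    double Fluency,
    double? Equivalence = null,
    double? Groundedness = null);

// Assumed contract for the evaluator; the real interface may differ.
public interface IAssistantResponseEvaluator
{
    Task<EvaluationScores> EvaluateAsync(
        string module,
        string userPrompt,
        string response,
        string? expectedAnswer = null,
        CancellationToken cancellationToken = default);
}
```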
Evaluation is invoked at the following points in the pipeline (see the invocation sketch after this list):
- StepOrchestrationService: Evaluates complete conversation quality
- ResponseGenerationStep: Evaluates FAQ agent responses
- EntityResolutionStep: Evaluates entity extraction quality
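A sketch of how one of these steps might hand its output to the evaluator without blocking the user response, using the assumed interface above; the `Evaluation:Enabled` key comes from this document, while the class layout and member names are illustrative:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;

public sealed class ResponseGenerationStep
{
    private readonly IAssistantResponseEvaluator _evaluator;
    private readonly ILogger<ResponseGenerationStep> _logger;
    private readonly bool _evaluationEnabled;

    public ResponseGenerationStep(
        IAssistantResponseEvaluator evaluator,
        IConfiguration configuration,
        ILogger<ResponseGenerationStep> logger)
    {
        _evaluator = evaluator;
        _logger = logger;
        _evaluationEnabled = configuration.GetValue<bool>("Evaluation:Enabled");
    }

    // Queue evaluation in the background: it never delays the user's answer,
    // and a failed evaluation is logged rather than surfaced to the caller.
    private void QueueEvaluation(string userPrompt, string response, string? expectedAnswer)
    {
        if (!_evaluationEnabled)
        {
            return;
        }

        _ = Task.Run(async () =>
        {
            try
            {
                await _evaluator.EvaluateAsync("FAQAgent", userPrompt, response, expectedAnswer);
            }
            catch (Exception ex)
            {
                _logger.LogWarning(ex, "AI response evaluation failed");
            }
        });
    }
}
```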
Operational recommendations:
- Set up alerts for low evaluation scores (< 0.7)
- Monitor evaluation failure rates
- Track response time impact of evaluation
- Evaluation runs asynchronously and doesn't block user responses
- Failed evaluations are logged but don't affect user experience
- Consider disabling evaluation in high-throughput scenarios
- Provide expected answers when available for more comprehensive metrics
- Use integration tests with known good responses (a test sketch follows this list)
- Implement feedback loops to improve model performance
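One possible shape for such an integration test, sketched with xUnit against the assumed evaluator interface above; the factory helper, prompt, and threshold are all illustrative:

```csharp
using System.Threading.Tasks;
using Xunit;

public class ResponseQualityTests
{
    [Fact]
    public async Task KnownGoodResponse_ScoresAboveAlertThreshold()
    {
        // Hypothetical test helper that builds the evaluator with real model access.
        IAssistantResponseEvaluator evaluator = EvaluatorTestHost.Create();

        // Evaluate a canned response against its known ground truth.
        EvaluationScores scores = await evaluator.EvaluateAsync(
            module: "FAQAgent",
            userPrompt: "How do partners submit a support request?",
            response: "Partners can submit a support request through the partner portal.",
            expectedAnswer: "Support requests are submitted via the partner portal.");

        // 0.7 mirrors the alerting threshold suggested earlier in this document.
        Assert.True(scores.Coherence >= 0.7);
        Assert.True(scores.Fluency >= 0.7);
    }
}
```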
Common issues and how to address them:

Evaluation metrics not appearing in Application Insights:
- Verify connection string is set correctly
- Check that `Evaluation:Enabled` is set to `true`
- Ensure the Microsoft.Extensions.AI.Evaluation package is compatible
High latency:
- Monitor evaluation execution time in logs
- Consider disabling evaluation for high-traffic endpoints
- Use sampling to reduce evaluation frequency (a sampling sketch follows this list)
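One way to implement sampling is to evaluate only a configurable fraction of requests; the sketch below and the `Evaluation:SamplingRate` setting it assumes are illustrative, not existing options:

```csharp
using System;

public static class EvaluationSampling
{
    // Assumed setting, e.g. "Evaluation:SamplingRate": 0.25 evaluates ~25% of requests.
    // Random.Shared is thread-safe, so no extra synchronization is needed here.
    public static bool ShouldEvaluate(double samplingRate) =>
        samplingRate >= 1.0 || Random.Shared.NextDouble() < samplingRate;
}
```

This check would slot in next to the `Evaluation:Enabled` guard shown in the earlier step sketch.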
Missing ground truth metrics:
- Equivalence and Groundedness evaluators only run when expectedAnswer is provided
- Check that ground truth data is being passed to evaluation methods
The following KQL queries are useful for deeper analysis and troubleshooting:

```kusto
// Monitor recent dependencies - commonly used for troubleshooting
dependencies
| where timestamp >= ago(30m)
| where cloud_RoleName == "PartnershipAgent"
| project timestamp, name, target, duration, success, customDimensions
| order by timestamp desc
| limit 20
```

```kusto
traces
| where customDimensions has "gen_ai.evaluation"
| extend
Coherence = todouble(customDimensions["gen_ai.evaluation.Coherence.score"]),
Fluency = todouble(customDimensions["gen_ai.evaluation.Fluency.score"]),
Module = tostring(customDimensions["gen_ai.evaluation.module"])
| where isnotnull(Coherence) and isnotnull(Fluency)
| summarize
AvgCoherence = avg(Coherence),
AvgFluency = avg(Fluency),
Count = count()
by Module, bin(timestamp, 1h)
| sort by timestamp desc
```

```kusto
traces
| where customDimensions has "gen_ai.evaluation.Coherence.score"
| extend
Coherence = todouble(customDimensions["gen_ai.evaluation.Coherence.score"]),
Response = tostring(customDimensions["gen_ai.full_nl_response"]),
UserPrompt = tostring(customDimensions["gen_ai.evaluation.user_prompt"])
| where Coherence < 0.7
| project timestamp, UserPrompt, Response, Coherence
| sort by timestamp desc
```

```kusto
// Analyze dependency performance patterns
dependencies
| where timestamp >= ago(1h)
| where cloud_RoleName == "PartnershipAgent"
| summarize
Count = count(),
AvgDuration = avg(duration),
FailureRate = countif(success == false) * 100.0 / count()
by name, target
| order by AvgDuration desc
```

```kusto
// Investigate failed dependencies in the last hour
dependencies
| where timestamp >= ago(1h)
| where cloud_RoleName == "PartnershipAgent"
| where success == false
| project timestamp, name, target, duration, resultCode, customDimensions
| order by timestamp desc
```

```kusto
// Dependencies over the last 24 hours for trend analysis
dependencies
| where timestamp >= ago(24h)
| where cloud_RoleName == "PartnershipAgent"
| project timestamp, name, target, duration, success, customDimensions
| order by timestamp desc
| limit 100
```

```kusto
// Success rates by dependency target
dependencies
| where timestamp >= ago(2h)
| where cloud_RoleName == "PartnershipAgent"
| summarize
Total = count(),
Successful = countif(success == true),
Failed = countif(success == false),
SuccessRate = round(countif(success == true) * 100.0 / count(), 2)
by target
| order by SuccessRate asc
```