This document describes the observability features implemented in the Partnership Agent solution, including OpenTelemetry integration and AI response evaluation metrics.
The Partnership Agent now includes comprehensive observability features that automatically capture and send evaluation metrics to Application Insights via OpenTelemetry. This enables continuous monitoring of AI response quality across all user interactions.
Each AI response is scored on the following metrics:
- Coherence: Measures how well the response flows logically and maintains internal consistency
- Fluency: Evaluates the grammatical correctness and natural language quality
- Equivalence: (When ground truth provided) Measures semantic similarity to expected answers
- Groundedness: (When ground truth provided) Evaluates factual accuracy against reference material
Key observability features (a wiring sketch follows this list):
- OpenTelemetry integration with Application Insights
- Automatic activity tracing for all AI interactions
- Custom metrics for evaluation scores
- Structured logging with correlation IDs
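How this wiring looks depends on the host project; the following is a minimal sketch of registering the tracing pipeline against Application Insights, assuming the OpenTelemetry.Extensions.Hosting and Azure.Monitor.OpenTelemetry.Exporter packages. Apart from the activity source names and configuration keys documented elsewhere in this file, the details are illustrative.

```csharp
using Azure.Monitor.OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Sketch of Program.cs in the host (ASP.NET Core assumed).
var builder = WebApplication.CreateBuilder(args);

// Resolve the connection string from appsettings.json or the environment
// variable, whichever is available (see the configuration section below).
var connectionString =
    builder.Configuration["ApplicationInsights:ConnectionString"]
    ?? Environment.GetEnvironmentVariable("APPLICATIONINSIGHTS_CONNECTION_STRING");

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("PartnershipAgent"))
    .WithTracing(tracing => tracing
        // Subscribe to the agent's activity sources (listed under the
        // architecture notes below) and export spans to Application Insights.
        .AddSource("PartnershipAgent.StepOrchestration")
        .AddSource("PartnershipAgent.Agents")
        .AddSource("PartnershipAgent.Evaluation")
        .AddAzureMonitorTraceExporter(options => options.ConnectionString = connectionString));

var app = builder.Build();
app.Run();
```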
Set your Application Insights connection string in one of the following ways:
```json
{
  "ApplicationInsights": {
    "ConnectionString": "InstrumentationKey=your-key;IngestionEndpoint=https://your-region.in.applicationinsights.azure.com/"
  }
}
```

Or set the environment variable:

```bash
export APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=your-key;IngestionEndpoint=https://your-region.in.applicationinsights.azure.com/"
```

Enable or disable AI response evaluation:
```json
{
  "Evaluation": {
    "Enabled": true,
    "LogToConsole": true
  }
}
```

To view and analyze the metrics:

- Navigate to Application Insights in the Azure portal
- Go to Logs and run KQL queries to analyze AI quality metrics:
```kusto
// View all evaluation metrics
traces
| where customDimensions contains "gen_ai.evaluation"
| project timestamp, operation_Id, customDimensions
| sort by timestamp desc
```

```kusto
// Average coherence scores by time
traces
| where customDimensions has "gen_ai.evaluation.Coherence.score"
| extend CoherenceScore = todouble(customDimensions["gen_ai.evaluation.Coherence.score"])
| summarize AvgCoherence = avg(CoherenceScore) by bin(timestamp, 1h)
| sort by timestamp desc
```

```kusto
// Response quality distribution
traces
| where customDimensions has "gen_ai.evaluation.Fluency.score"
| extend FluencyScore = todouble(customDimensions["gen_ai.evaluation.Fluency.score"])
| summarize count() by bin(FluencyScore, 0.1)
| sort by FluencyScore asc
```

- Create dashboards to monitor:
  - Average evaluation scores over time
  - Response quality distribution
  - Failed evaluations and error rates
  - User interaction patterns
Each AI interaction generates the following telemetry tags (a sketch of how they are attached to the evaluation activity follows the lists):
- `gen_ai.full_nl_response`: The complete AI response text
- `gen_ai.evaluation.module`: Which AI component generated the response (e.g., "FAQAgent", "EntityResolutionAgent")
- `gen_ai.evaluation.user_prompt`: The original user input
- `gen_ai.evaluation.Coherence.score`: Coherence evaluation score (0.0-1.0)
- `gen_ai.evaluation.Fluency.score`: Fluency evaluation score (0.0-1.0)
- `gen_ai.evaluation.has_ground_truth`: Whether ground truth comparison was performed
When ground truth is available:
- `gen_ai.evaluation.Equivalence.score`: Semantic similarity score
- `gen_ai.evaluation.Groundedness.score`: Factual accuracy score
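These tags are emitted as attributes on the evaluation activity, which is what surfaces them as customDimensions in Application Insights. A minimal sketch of how that attachment might look, using System.Diagnostics directly; the helper name and parameters are assumptions for illustration:

```csharp
using System.Diagnostics;

public static class EvaluationTelemetry
{
    // Must match the source registered with the OpenTelemetry tracing pipeline.
    private static readonly ActivitySource Source = new("PartnershipAgent.Evaluation");

    // Illustrative helper: records one evaluation as tags on a child activity
    // so the values show up as customDimensions in Application Insights.
    public static void Record(
        string module, string userPrompt, string response,
        double coherence, double fluency,
        double? equivalence = null, double? groundedness = null)
    {
        using var activity = Source.StartActivity("EvaluateResponse");
        activity?.SetTag("gen_ai.full_nl_response", response);
        activity?.SetTag("gen_ai.evaluation.module", module);
        activity?.SetTag("gen_ai.evaluation.user_prompt", userPrompt);
        activity?.SetTag("gen_ai.evaluation.Coherence.score", coherence);
        activity?.SetTag("gen_ai.evaluation.Fluency.score", fluency);
        activity?.SetTag("gen_ai.evaluation.has_ground_truth", equivalence is not null);

        if (equivalence is not null)
            activity?.SetTag("gen_ai.evaluation.Equivalence.score", equivalence);
        if (groundedness is not null)
            activity?.SetTag("gen_ai.evaluation.Groundedness.score", groundedness);
    }
}
```

The KQL queries in this document read these same keys back out of customDimensions.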
- IAssistantResponseEvaluator: Interface for AI response evaluation (an assumed shape is sketched after this list)
- AssistantResponseEvaluator: Implementation using Microsoft.Extensions.AI.Evaluation
- OpenTelemetry Configuration: Automatic tracing and metrics collection
- Activity Sources:
  - `PartnershipAgent.StepOrchestration`: Main orchestration flow
  - `PartnershipAgent.Agents`: Individual agent activities
  - `PartnershipAgent.Evaluation`: Evaluation activities
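IAssistantResponseEvaluator and AssistantResponseEvaluator are internal to the solution, so their exact contract is not reproduced here; the shape below is an assumed sketch used by the later examples in this document, not the actual interface:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Assumed result shape: one score per metric, with the ground-truth metrics
// present only when an expected answer was supplied.
public sealed record EvaluationScores(
    double Coherence,
    double Fluency,
    double? Equivalence = null,
    double? Groundedness = null);

// Assumed contract for the evaluator; the real interface may differ.
public interface IAssistantResponseEvaluator
{
    Task<EvaluationScores> EvaluateAsync(
        string module,
        string userPrompt,
        string response,
        string? expectedAnswer = null,
        CancellationToken cancellationToken = default);
}
```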
Evaluation is invoked at the following points in the pipeline (see the invocation sketch after this list):
- StepOrchestrationService: Evaluates complete conversation quality
- ResponseGenerationStep: Evaluates FAQ agent responses
- EntityResolutionStep: Evaluates entity extraction quality
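A sketch of how one of these steps might hand its output to the evaluator without blocking the user response, using the assumed interface above; the `Evaluation:Enabled` key comes from this document, while the class layout and member names are illustrative:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;

public sealed class ResponseGenerationStep
{
    private readonly IAssistantResponseEvaluator _evaluator;
    private readonly ILogger<ResponseGenerationStep> _logger;
    private readonly bool _evaluationEnabled;

    public ResponseGenerationStep(
        IAssistantResponseEvaluator evaluator,
        IConfiguration configuration,
        ILogger<ResponseGenerationStep> logger)
    {
        _evaluator = evaluator;
        _logger = logger;
        _evaluationEnabled = configuration.GetValue<bool>("Evaluation:Enabled");
    }

    // Queue evaluation in the background: it never delays the user's answer,
    // and a failed evaluation is logged rather than surfaced to the caller.
    private void QueueEvaluation(string userPrompt, string response, string? expectedAnswer)
    {
        if (!_evaluationEnabled)
        {
            return;
        }

        _ = Task.Run(async () =>
        {
            try
            {
                await _evaluator.EvaluateAsync("FAQAgent", userPrompt, response, expectedAnswer);
            }
            catch (Exception ex)
            {
                _logger.LogWarning(ex, "AI response evaluation failed");
            }
        });
    }
}
```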
Operational recommendations:
- Set up alerts for low evaluation scores (< 0.7)
- Monitor evaluation failure rates
- Track response time impact of evaluation
- Evaluation runs asynchronously and doesn't block user responses
- Failed evaluations are logged but don't affect user experience
- Consider disabling evaluation in high-throughput scenarios
- Provide expected answers when available for more comprehensive metrics
- Use integration tests with known good responses (a test sketch follows this list)
- Implement feedback loops to improve model performance
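One possible shape for such an integration test, sketched with xUnit against the assumed evaluator interface above; the factory helper, prompt, and threshold are all illustrative:

```csharp
using System.Threading.Tasks;
using Xunit;

public class ResponseQualityTests
{
    [Fact]
    public async Task KnownGoodResponse_ScoresAboveAlertThreshold()
    {
        // Hypothetical test helper that builds the evaluator with real model access.
        IAssistantResponseEvaluator evaluator = EvaluatorTestHost.Create();

        // Evaluate a canned response against its known ground truth.
        EvaluationScores scores = await evaluator.EvaluateAsync(
            module: "FAQAgent",
            userPrompt: "How do partners submit a support request?",
            response: "Partners can submit a support request through the partner portal.",
            expectedAnswer: "Support requests are submitted via the partner portal.");

        // 0.7 mirrors the alerting threshold suggested earlier in this document.
        Assert.True(scores.Coherence >= 0.7);
        Assert.True(scores.Fluency >= 0.7);
    }
}
```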
Common issues and how to address them:

Evaluation metrics not appearing in Application Insights:
- Verify connection string is set correctly
- Check that `Evaluation:Enabled` is set to `true`
- Ensure the Microsoft.Extensions.AI.Evaluation package is compatible
High latency:
- Monitor evaluation execution time in logs
- Consider disabling evaluation for high-traffic endpoints
- Use sampling to reduce evaluation frequency (a sampling sketch follows this list)
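One way to implement sampling is to evaluate only a configurable fraction of requests; the sketch below and the `Evaluation:SamplingRate` setting it assumes are illustrative, not existing options:

```csharp
using System;

public static class EvaluationSampling
{
    // Assumed setting, e.g. "Evaluation:SamplingRate": 0.25 evaluates ~25% of requests.
    // Random.Shared is thread-safe, so no extra synchronization is needed here.
    public static bool ShouldEvaluate(double samplingRate) =>
        samplingRate >= 1.0 || Random.Shared.NextDouble() < samplingRate;
}
```

This check would slot in next to the `Evaluation:Enabled` guard shown in the earlier step sketch.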
Missing ground truth metrics:
- Equivalence and Groundedness evaluators only run when expectedAnswer is provided
- Check that ground truth data is being passed to evaluation methods
The following KQL queries are useful for deeper analysis and troubleshooting:

```kusto
// Monitor recent dependencies - commonly used for troubleshooting
dependencies
| where timestamp >= ago(30m)
| where cloud_RoleName == "PartnershipAgent"
| project timestamp, name, target, duration, success, customDimensions
| order by timestamp desc
| limit 20
```

```kusto
traces
| where customDimensions has "gen_ai.evaluation"
| extend
Coherence = todouble(customDimensions["gen_ai.evaluation.Coherence.score"]),
Fluency = todouble(customDimensions["gen_ai.evaluation.Fluency.score"]),
Module = tostring(customDimensions["gen_ai.evaluation.module"])
| where isnotnull(Coherence) and isnotnull(Fluency)
| summarize
AvgCoherence = avg(Coherence),
AvgFluency = avg(Fluency),
Count = count()
by Module, bin(timestamp, 1h)
| sort by timestamp desc
```

```kusto
traces
| where customDimensions has "gen_ai.evaluation.Coherence.score"
| extend
Coherence = todouble(customDimensions["gen_ai.evaluation.Coherence.score"]),
Response = tostring(customDimensions["gen_ai.full_nl_response"]),
UserPrompt = tostring(customDimensions["gen_ai.evaluation.user_prompt"])
| where Coherence < 0.7
| project timestamp, UserPrompt, Response, Coherence
| sort by timestamp desc
```

```kusto
// Analyze dependency performance patterns
dependencies
| where timestamp >= ago(1h)
| where cloud_RoleName == "PartnershipAgent"
| summarize
Count = count(),
AvgDuration = avg(duration),
FailureRate = countif(success == false) * 100.0 / count()
by name, target
| order by AvgDuration desc
```

```kusto
// Investigate failed dependencies in the last hour
dependencies
| where timestamp >= ago(1h)
| where cloud_RoleName == "PartnershipAgent"
| where success == false
| project timestamp, name, target, duration, resultCode, customDimensions
| order by timestamp desc
```

```kusto
// Dependencies over the last 24 hours for trend analysis
dependencies
| where timestamp >= ago(24h)
| where cloud_RoleName == "PartnershipAgent"
| project timestamp, name, target, duration, success, customDimensions
| order by timestamp desc
| limit 100
```

```kusto
// Success rates by dependency target
dependencies
| where timestamp >= ago(2h)
| where cloud_RoleName == "PartnershipAgent"
| summarize
Total = count(),
Successful = countif(success == true),
Failed = countif(success == false),
SuccessRate = round(countif(success == true) * 100.0 / count(), 2)
by target
| order by SuccessRate asc
```