Add continuous agent evaluation engine by Cataldir · Pull Request #974 · Azure-Samples/holiday-peak-hub

Cataldir · 2026-05-06T19:14:35Z

Adds the foundation for continuous agent response evaluation using Azure AI Foundry evaluation with deterministic local fallback.

Summary:

Add shared evaluation contracts, dataset discovery, runner strategy selection, drift detection, and result events.
Wire evaluation config and runtime endpoints into the shared agent builder/app factory path for agentic services.
Add self-healing quality-drift classification as a manual-review escalation path.
Add pilot .foundry evaluation configs/datasets for ecommerce catalog search, search enrichment, and truth enrichment.
Add CI and scheduled continuous evaluation workflows.
Fix ecommerce catalog streaming fallback so optional model stream failures emit degraded events instead of fatal error events.
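To make the drift-detection item above concrete, here is a minimal sketch. It assumes drift means a metric's mean score falling more than a fixed tolerance below a stored baseline. The names `DriftReport` and `detect_drift` and the `tolerance` default are hypothetical and are not taken from this PR's implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DriftReport:
    metric: str
    baseline: float
    current: float
    drifted: bool


def detect_drift(
    baseline: dict[str, float],
    scores: dict[str, list[float]],
    tolerance: float = 0.05,  # assumed threshold, not from the PR
) -> list[DriftReport]:
    """Flag metrics whose mean score dropped more than `tolerance` below baseline."""
    reports = []
    for metric, expected in sorted(baseline.items()):
        samples = scores.get(metric, [])
        if not samples:
            continue  # no samples for this metric, so nothing to compare
        current = mean(samples)
        reports.append(
            DriftReport(metric, expected, current, (expected - current) > tolerance)
        )
    return reports
```

A run that reports "no drift detected" would correspond to every `DriftReport.drifted` being `False` under this rule.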

Validation:

Pre-push gate passed locally: isort, Black, pylint error/fatal gate, mypy, governance markdown links, canonical event schema contracts.
Pre-push lib tests passed: 1325 passed.
Pre-push app tests passed: 702 passed.
Focused catalog stream tests passed: 44 passed.
Earlier full repository pytest passed locally: 2076 passed.
Pilot evaluation CLI passed for ecommerce-catalog-search, search-enrichment-agent, and truth-enrichment with 10 seed cases each and no drift detected.

Live validation note:

Dev APIM currently still serves the old image for ecommerce catalog stream behavior. After this branch is deployed, the live stream probe should show event: degraded for optional model failures and no fatal event: error in that fallback path.
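The intended fallback behavior can be sketched as follows, assuming the stream uses Server-Sent Events framing and that deterministic results are sent before the optional model output. Only the `degraded` and `error` event names come from this PR. The `results`, `token`, and `done` events and the function names are hypothetical.

```python
import json
from collections.abc import Callable, Iterable, Iterator


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_with_fallback(
    deterministic_results: list[dict],
    model_stream: Callable[[], Iterable[str]],
) -> Iterator[str]:
    """Yield SSE frames, treating the model stream as optional."""
    yield sse("results", {"items": deterministic_results})
    try:
        for token in model_stream():
            yield sse("token", {"text": token})
    except Exception as exc:  # broad on purpose: enrichment is optional in this sketch
        # Report the failure as a degraded event instead of a fatal error event.
        yield sse("degraded", {"reason": type(exc).__name__})
    yield sse("done", {})
```

A probe against such a stream would see `event: degraded` when the model stream fails, and never `event: error` on that path.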

Fixes #897

github-actions · 2026-05-06T19:15:42Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.
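A non-blocking gate of this kind can be reduced to a small exit-code decision. The helper below is a hypothetical sketch, not the workflow's actual logic: in advisory mode drift is reported but never fails the job.

```python
def gate_exit_code(drift_detected: bool, advisory: bool = True) -> int:
    """Return a process exit code for a CI evaluation gate (hypothetical helper)."""
    if drift_detected and not advisory:
        return 1  # blocking mode: fail the job on drift
    return 0  # advisory mode, or no drift: report only
```

Switching the gate to blocking once baselines are calibrated would then only require passing `advisory=False`.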

github-actions · 2026-05-06T19:44:31Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

github-actions · 2026-05-06T20:21:12Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

github-actions · 2026-05-06T20:42:56Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

Cataldir · 2026-05-06T21:31:35Z

Continuous evaluation PR validation status

Current branch head: 53df0f1f99a1f1724b8292aa0bb0398df02d6da1.

Validated so far:

Local/pre-push gates passed for the evaluation-engine changes before push.
PR checks are green for backend tests, lint, dependency audit, CodeQL, UI contract, and all advisory evaluation jobs.
Live dev APIM paths for the currently deployed ecommerce-catalog-search service return 200 for /health, /ready, /invoke, and /invoke/stream.

Deployment validation still pending:

Replacement branch-preview deployment run 25461729174 is queued before deploy / detect-changes and has not been assigned a GitHub-hosted ubuntu-latest runner yet.
There are no pending environment approvals and no other queued/in-progress repository runs blocking it.
Live AKS still shows ecommerce-catalog-search on the old image tag deterministic-fanout-20260420-190823, and Flux still points at main@sha1:133f87f23eff98b0756606d6a54892161de6425d; therefore the Foundry strict runtime contract has not yet been validated on the new branch image.

Known PR blocker:

ui-quality is failing, but this PR has no apps/ui file changes and recent main test workflow runs were already failing.
Tracked separately as Fix ui-quality regression blocking PR validation #975 so the evaluation-engine PR does not absorb unrelated UI test/runtime drift.

Cataldir · 2026-05-06T21:38:03Z

Deployment validation update

Re-dispatched after the previous queued replacement run was cancelled cleanly.

Fresh run: 25462486136
State: queued before deploy / detect-changes, no assigned ubuntu-latest runner yet
Pending deployments: none observed
Active repo runs: this run only
Branch image exists in ACR: holidaypeakhub405devacr.azurecr.io/ecommerce-catalog-search:53df0f1f99a1f1724b8292aa0bb0398df02d6da1 -> sha256:e59d2094b26e83318a095e957fdbb3534052110921663366e3ac09fadc85f868

Conclusion: image build availability is confirmed, and the remaining live validation gap is workflow execution through rendered manifest commit, Flux reconciliation, and post-reconcile Foundry strict runtime validation.

github-actions · 2026-05-07T17:47:53Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

…ates

Refactors wait-flux-reconciliation to actively poll until both kustomizations apply the published manifest revision (post commit-rendered-manifests SHA) AND any migrated HelmReleases reach Ready=UpgradeSucceeded. Without this, ensure-foundry-agents and validate-agc-readiness observed the previous main revision while preview manifests sat unapplied due to dependency cascades (holiday-peak-gitops-holiday-peak-agents waits on holiday-peak-gitops-holiday-peak-crud).

Changes:

- commit-rendered-manifests: expose published_sha output (post-commit HEAD SHA)
- wait-flux-reconciliation:
  * Force-reconcile GitRepository source until artifact contains published_sha
  * Active poll for both kustomizations to lastAppliedRevision==published_sha AND Ready=True (with reconcile triggered in dependency order: crud first, agents after)
  * For each migrated HelmRelease (changed agent service), force reconcile the HR and wait for Ready=True with InstallSucceeded/UpgradeSucceeded/ReconciliationSucceeded/TestSucceeded reason

restore-flux-source-default-branch already runs after gates via needs:, so no reorder is required there.

Issue: #897
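The polling idea in this commit can be sketched in Python as below. The actual change lives in a GitHub Actions workflow, so this is an illustration only. `KustomizationStatus`, `wait_for_revision`, and the injected `fetch` callable are hypothetical, and the sketch matches the SHA as a suffix because the revision quoted earlier in this thread has the form `main@sha1:<sha>`.

```python
import time
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class KustomizationStatus:
    last_applied_revision: str  # e.g. "main@sha1:<sha>"
    ready: bool


def wait_for_revision(
    fetch: Callable[[str], KustomizationStatus],
    names: list[str],
    published_sha: str,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every kustomization has applied published_sha and is Ready."""
    deadline = time.monotonic() + timeout_s
    while True:
        # `names` is queried in order, so callers can list crud before agents.
        statuses = [fetch(name) for name in names]
        if all(
            s.ready and s.last_applied_revision.endswith(published_sha)
            for s in statuses
        ):
            return True
        if time.monotonic() >= deadline:
            return False  # caller decides how to fail the job
        sleep(interval_s)
```

Injecting `fetch` and `sleep` keeps the loop testable without a cluster. In a real job, `fetch` would wrap whatever command reads the kustomization status.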

github-actions · 2026-05-07T21:22:23Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

…s secret-masked

GitHub Actions secret-masks AGC_SUBNET_ID outputs that contain a subscription GUID matching a configured secret. The render publication-context steps then fall through to live-cluster recovery, which fails when the ApplicationLoadBalancer was previously pruned, producing a CRUD manifest without ALB+Gateway. Flux applies the truncated manifest, prunes them again, and validate-agc-readiness fails.

Fix: add an Azure CLI fallback that resolves the AGC delegated subnet directly via 'az network vnet subnet show' so the render context is authoritative even when both the masked output and the live cluster recovery are empty. Hard-fail when AGC_SUBNET_ID still cannot be resolved instead of silently rendering an incomplete manifest.
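The fallback chain with a hard failure can be sketched generically as below. This assumes each source is a plain callable that returns an empty value when it cannot resolve the setting. `resolve_required` and the source labels are hypothetical, and in the real workflow the last source would be the `az network vnet subnet show` lookup named in the commit.

```python
from collections.abc import Callable, Sequence
from typing import Optional


def resolve_required(
    name: str,
    sources: Sequence[tuple[str, Callable[[], Optional[str]]]],
) -> tuple[str, str]:
    """Return (value, source_label) from the first source with a non-empty value."""
    for label, source in sources:
        value = (source() or "").strip()
        if value:
            return value, label
    tried = ", ".join(label for label, _ in sources)
    # Hard-fail rather than continue with an incomplete render context.
    raise RuntimeError(f"{name} could not be resolved (tried: {tried})")
```

Returning the label alongside the value lets the job log which source was authoritative without printing the value itself.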

github-actions · 2026-05-07T23:06:44Z

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

github-code-quality Bot found potential problems May 6, 2026

View reviewed changes

Comment thread lib/src/holiday_peak_lib/app_factory_components/endpoints.py

        evaluation_history.append(result_payload)
        try:
            tracer.record_evaluation(result_payload)
        except (AttributeError, TypeError):

Add continuous agent evaluation engine

4b02af8

Cataldir force-pushed the issue/897-evaluation-engine-foundation branch from 976c340 to 4b02af8 Compare May 6, 2026 19:43

Cataldir had a problem deploying to dev May 6, 2026 19:45 — with GitHub Actions Failure

github-actions Bot and others added 2 commits May 6, 2026 19:53

deploy: update rendered manifests [skip ci]

c8573be

fix: run foundry validation after flux reconcile

16b8550

fix: route catalog preview deploys through branch environment

53df0f1

Cataldir temporarily deployed to branch May 6, 2026 20:42 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 20:45 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 20:46 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 20:48 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 20:57 — with GitHub Actions Inactive

Cataldir had a problem deploying to branch May 6, 2026 21:17 — with GitHub Actions Error

Cataldir mentioned this pull request May 6, 2026

Fix ui-quality regression blocking PR validation #975

Closed

6 tasks

Cataldir temporarily deployed to branch May 6, 2026 22:09 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 22:34 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 22:42 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 22:47 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 6, 2026 22:48 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 7, 2026 15:51 — with GitHub Actions Inactive

deploy: update Flux GitOps artifacts [skip ci]

d9d2554

Cataldir temporarily deployed to branch May 7, 2026 17:13 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 7, 2026 17:16 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 7, 2026 17:18 — with GitHub Actions Inactive

fix(deploy): resolve UAMI client IDs at deploy time for chart workload identity

85bc580

github-actions Bot and others added 3 commits May 7, 2026 17:56

deploy: update Flux GitOps artifacts [skip ci]

ea7e09c

deploy: update Flux GitOps artifacts [skip ci]

dbe6cac

github-actions Bot and others added 2 commits May 7, 2026 21:28

deploy: update Flux GitOps artifacts [skip ci]

a8801fa

github-actions Bot added 3 commits May 8, 2026 01:47

deploy: update Flux GitOps artifacts [skip ci]

d2ec119

deploy: update Flux GitOps artifacts [skip ci]

d1ffea0

deploy: update Flux GitOps artifacts [skip ci]

283f681

Cataldir wants to merge 28 commits into main from issue/897-evaluation-engine-foundation
