Add continuous agent evaluation engine #974

Open
Cataldir wants to merge 28 commits into main from issue/897-evaluation-engine-foundation

Conversation

@Cataldir
Contributor

@Cataldir Cataldir commented May 6, 2026

Adds the foundation for continuous agent response evaluation using Azure AI Foundry evaluation with deterministic local fallback.

Summary:

  • Add shared evaluation contracts, dataset discovery, runner strategy selection, drift detection, and result events.
  • Wire evaluation config and runtime endpoints into the shared agent builder/app factory path for agentic services.
  • Add self-healing quality-drift classification as a manual-review escalation path.
  • Add pilot .foundry evaluation configs/datasets for ecommerce catalog search, search enrichment, and truth enrichment.
  • Add CI and scheduled continuous evaluation workflows.
  • Fix ecommerce catalog streaming fallback so optional model stream failures emit degraded events instead of fatal error events.
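The drift detection and runner strategy selection listed above could look roughly like the sketch below. It is not taken from the PR diff: the class names echo the linked issue title (EvalBaseline, DriftReport), but every field, the tolerance value, and the runner labels are assumptions made for illustration, and plain dataclasses stand in for the Pydantic models.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class EvalBaseline:
    """Expected metric scores for one agent (hypothetical shape)."""
    agent: str
    scores: Dict[str, float]
    tolerance: float = 0.05  # assumed maximum allowed absolute drop per metric


@dataclass(frozen=True)
class DriftReport:
    """Metrics that fell below the baseline by more than the tolerance."""
    agent: str
    regressions: Dict[str, float] = field(default_factory=dict)

    @property
    def drift_detected(self) -> bool:
        return bool(self.regressions)


def detect_drift(baseline: EvalBaseline, current: Dict[str, float]) -> DriftReport:
    """Flag each baseline metric whose current score dropped past the tolerance.

    A metric missing from `current` is treated as a score of 0.0.
    """
    regressions: Dict[str, float] = {}
    for metric, expected in baseline.scores.items():
        drop = expected - current.get(metric, 0.0)
        if drop > baseline.tolerance:
            regressions[metric] = round(drop, 4)
    return DriftReport(agent=baseline.agent, regressions=regressions)


def select_runner(foundry_endpoint: Optional[str]) -> str:
    """Prefer the remote runner when an endpoint is configured, else fall back locally."""
    return "foundry" if foundry_endpoint else "local-deterministic"
```

Keeping the report separate from the decision to fail a build lets the same drift result feed an advisory CI gate, a scheduled run, or a manual-review escalation.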

Validation:

  • Pre-push gate passed locally: isort, Black, pylint error/fatal gate, mypy, governance markdown links, canonical event schema contracts.
  • Pre-push lib tests passed: 1325 passed.
  • Pre-push app tests passed: 702 passed.
  • Focused catalog stream tests passed: 44 passed.
  • Earlier full repository pytest passed locally: 2076 passed.
  • Pilot evaluation CLI passed for ecommerce-catalog-search, search-enrichment-agent, and truth-enrichment with 10 seed cases each and no drift detected.

Live validation note:

  • Dev APIM still serves the old image, so the live ecommerce catalog stream does not yet reflect this fix. After this branch is deployed, the live stream probe should show event: degraded for optional model failures and no fatal event: error in that fallback path.
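A minimal sketch of the intended fallback behavior, assuming a Server-Sent Events stream: when the optional model stream fails, the handler emits a degraded event followed by a fallback result instead of a fatal error event. Only the degraded and error event names come from this PR description. The function names, the token, result, and done events, and the broad exception handling are assumptions for illustration.

```python
from typing import Callable, Iterable, Iterator


def sse(event: str, data: str) -> str:
    """Format one Server-Sent Events frame (single-line data only)."""
    return f"event: {event}\ndata: {data}\n\n"


def stream_with_fallback(
    model_stream: Callable[[], Iterable[str]],
    fallback_text: str,
) -> Iterator[str]:
    """Yield model tokens; if the optional model fails, degrade instead of aborting."""
    try:
        for token in model_stream():
            yield sse("token", token)
    except Exception as exc:  # sketch only: real code would catch narrower errors
        # The model is optional, so report degradation and serve the fallback.
        yield sse("degraded", f"model stream unavailable: {type(exc).__name__}")
        yield sse("result", fallback_text)
    yield sse("done", "")
```

A probe against such a stream can then assert that a degraded frame is present and that no frame starts with event: error.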

Fixes #897

@github-actions

github-actions Bot commented May 6, 2026

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.
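One way to express a non-blocking gate is to separate the drift result from the exit code, as in the sketch below. The function name and the advisory flag are hypothetical. The actual workflow may instead rely on a CI-level setting to keep the job from failing.

```python
def gate_exit_code(drift_detected: bool, advisory: bool = True) -> int:
    """Return the process exit code for an evaluation gate.

    In advisory mode drift is reported elsewhere but never fails the job;
    in enforcing mode detected drift yields a non-zero exit code.
    """
    if drift_detected and not advisory:
        return 1
    return 0
```

Switching the gate to blocking once baselines are calibrated would then be a one-flag change.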

evaluation_history.append(result_payload)
try:
    tracer.record_evaluation(result_payload)
except (AttributeError, TypeError):
@Cataldir Cataldir force-pushed the issue/897-evaluation-engine-foundation branch from 976c340 to 4b02af8 Compare May 6, 2026 19:43

@Cataldir
Contributor Author

Cataldir commented May 6, 2026

Continuous evaluation PR validation status

Current branch head: 53df0f1f99a1f1724b8292aa0bb0398df02d6da1.

Validated so far:

  • Local/pre-push gates passed for the evaluation-engine changes before push.
  • PR checks are green for backend tests, lint, dependency audit, CodeQL, UI contract, and all advisory evaluation jobs.
  • Live dev APIM paths for the currently deployed ecommerce-catalog-search service return 200 for /health, /ready, /invoke, and /invoke/stream.

Deployment validation still pending:

  • Replacement branch-preview deployment run 25461729174 is queued before deploy / detect-changes and has not been assigned a GitHub-hosted ubuntu-latest runner yet.
  • There are no pending environment approvals and no other queued/in-progress repository runs blocking it.
  • Live AKS still shows ecommerce-catalog-search on the old image tag deterministic-fanout-20260420-190823, and Flux still points at main@sha1:133f87f23eff98b0756606d6a54892161de6425d; therefore the Foundry strict runtime contract has not yet been validated on the new branch image.

Known PR blocker:

  • ui-quality is failing, but this PR changes no files under apps/ui, and recent test workflow runs on main were already failing.
  • Tracked separately as "Fix ui-quality regression blocking PR validation" (#975) so the evaluation-engine PR does not absorb unrelated UI test/runtime drift.

@Cataldir
Contributor Author

Cataldir commented May 6, 2026

Deployment validation update

Re-dispatched after the previous queued replacement run was cancelled cleanly.

  • Fresh run: 25462486136
  • State: queued before deploy / detect-changes, no assigned ubuntu-latest runner yet
  • Pending deployments: none observed
  • Active repo runs: this run only
  • Branch image exists in ACR: holidaypeakhub405devacr.azurecr.io/ecommerce-catalog-search:53df0f1f99a1f1724b8292aa0bb0398df02d6da1 -> sha256:e59d2094b26e83318a095e957fdbb3534052110921663366e3ac09fadc85f868

Conclusion: image build availability is confirmed, and the remaining live validation gap is workflow execution through rendered manifest commit, Flux reconciliation, and post-reconcile Foundry strict runtime validation.

@github-actions

github-actions Bot commented May 7, 2026

Agent evaluation advisory gate

Pilot agent evaluations completed in advisory mode. Download the eval-* artifacts from this workflow run for detailed metrics.

This gate is intentionally non-blocking while baselines are calibrated.

github-actions Bot and others added 3 commits May 7, 2026 17:56
…ates

Refactors wait-flux-reconciliation to actively poll until both kustomizations
apply the published manifest revision (the post-commit SHA from
commit-rendered-manifests) AND every migrated HelmRelease reaches Ready=True
with a success reason such as UpgradeSucceeded. Without this,
ensure-foundry-agents and validate-agc-readiness observed the previous main
revision while preview manifests sat unapplied due to dependency cascades
(holiday-peak-gitops-holiday-peak-agents waits on holiday-peak-gitops-holiday-peak-crud).

Changes:
- commit-rendered-manifests: expose published_sha output (post-commit HEAD SHA)
- wait-flux-reconciliation:
  * Force-reconcile GitRepository source until artifact contains published_sha
  * Active poll for both kustomizations to lastAppliedRevision==published_sha
    AND Ready=True (with reconcile triggered in dependency order: crud first,
    agents after)
  * For each migrated HelmRelease (changed agent service), force reconcile
    the HR and wait for Ready=True with InstallSucceeded/UpgradeSucceeded/
    ReconciliationSucceeded/TestSucceeded reason

restore-flux-source-default-branch already runs after gates via needs:, so
no reorder is required there.

Issue: #897
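The active polling described in this commit message can be sketched as a loop that returns only once every kustomization is Ready and reports the published revision. This is an illustration in Python rather than the workflow's own shell steps. The status type, the injected get_status callable, and the timeout defaults are assumptions. The suffix match assumes revisions of the form main@sha1:<sha>, as quoted earlier in this conversation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class KustomizationStatus:
    """Subset of status fields the poll needs (hypothetical shape)."""
    last_applied_revision: str
    ready: bool


def wait_for_revision(
    names: Sequence[str],
    get_status: Callable[[str], KustomizationStatus],
    published_sha: str,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until all kustomizations are Ready at published_sha, or time out.

    `names` should be given in dependency order (for example crud before agents).
    Returns True on success and False when the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        statuses = [get_status(name) for name in names]
        if all(
            s.ready and s.last_applied_revision.endswith(published_sha)
            for s in statuses
        ):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting get_status, sleep, and clock keeps the loop testable without a cluster; in a real workflow get_status would wrap whatever query reads the Kustomization status.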

github-actions Bot and others added 2 commits May 7, 2026 21:28
…s secret-masked

GitHub Actions secret-masks AGC_SUBNET_ID outputs that contain a subscription GUID matching a configured secret. The render publication-context steps then fall through to live-cluster recovery, which fails when the ApplicationLoadBalancer was previously pruned, producing a CRUD manifest without ALB+Gateway. Flux applies the truncated manifest, prunes them again, and validate-agc-readiness fails.

Fix: add an Azure CLI fallback that resolves the AGC delegated subnet directly via 'az network vnet subnet show' so the render context is authoritative even when both the masked output and the live cluster recovery are empty. Hard-fail when AGC_SUBNET_ID still cannot be resolved instead of silently rendering an incomplete manifest.
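The resolution order described in this commit message amounts to a fallback chain that ends in a hard failure. The sketch below illustrates that pattern in Python. The function name, the source labels, and the resolver callables are hypothetical, and no real Azure CLI call is made here.

```python
from typing import Callable, Optional, Sequence, Tuple

Resolver = Callable[[], Optional[str]]


def resolve_subnet_id(sources: Sequence[Tuple[str, Resolver]]) -> Tuple[str, str]:
    """Return (source_label, value) from the first source yielding a non-empty value.

    Sources are tried in order, for example: workflow output, live-cluster
    recovery, then a direct Azure CLI lookup. Raises instead of returning an
    empty value, so an incomplete manifest is never rendered silently.
    """
    for label, resolver in sources:
        value = (resolver() or "").strip()
        if value:
            return label, value
    raise RuntimeError("AGC_SUBNET_ID could not be resolved from any source")
```

Returning the label alongside the value makes it easy to log which source was authoritative for a given render.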

Development

Successfully merging this pull request may close these issues.

[P0] evaluation: Pydantic models — EvalConfig, EvalCase, EvalBaseline, DriftReport