Operational response for UI proxy failures in apps/ui/app/api/[...path]/route.ts where users see 502 bursts or degraded fallback behavior.
This playbook preserves existing payload contracts and APIM-first routing policy semantics.
Primary triggers:
- Sustained 502 rate on critical UI endpoints exceeds threshold.
- Spike in proxy failure events grouped by failureKind.
Required telemetry dimensions for triage:
- failureKind
- upstreamPath
- sourceKey
- status
- fallbackUsed
Alert thresholds:
- Warn: 502 rate >= 3% over 10 minutes, request count >= 30.
- Critical: 502 rate >= 8% over 5 minutes, request count >= 50.
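The two thresholds above can be read as a small decision function. The following is a minimal sketch, not the deployed alert rule: the `WindowStats` shape and the `evaluateSeverity` name are assumptions made for illustration, and the real evaluation would normally live in the alerting platform.

```typescript
// Hypothetical shape: aggregated counts for one trailing evaluation window.
interface WindowStats {
  total: number;    // all proxied requests in the window
  failures: number; // requests that returned 502
}

type Severity = "none" | "warn" | "critical";

// 502 rate as a fraction; an empty window counts as 0 so it never alerts.
function failureRate(w: WindowStats): number {
  return w.total === 0 ? 0 : w.failures / w.total;
}

// last5m and last10m are the trailing 5-minute and 10-minute windows.
function evaluateSeverity(last5m: WindowStats, last10m: WindowStats): Severity {
  // Critical: >= 8% over 5 minutes with at least 50 requests.
  if (last5m.total >= 50 && failureRate(last5m) >= 0.08) return "critical";
  // Warn: >= 3% over 10 minutes with at least 30 requests.
  if (last10m.total >= 30 && failureRate(last10m) >= 0.03) return "warn";
  return "none";
}
```

The minimum request counts keep low-traffic windows from paging on a handful of failures.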
- Confirm incident scope.
  - Identify impacted endpoint groups from upstreamPath.
  - Validate whether failures are isolated to critical endpoints from issue #546 contract set.
- Segment by failure class.
  - Group by failureKind and rank by event volume.
  - Check whether fallbackUsed=true is absorbing user impact.
- Run class-specific checks.
Signals:
- failureKind=config
- sourceKey missing or unset
- Immediate 502 without upstream call evidence
Checks:
- Validate UI runtime env aliases are present and non-empty.
- Confirm deployment slot/environment contains expected APIM base URL values.
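The alias check above can be automated with a small helper. This is a sketch under stated assumptions: the function name `missingAliases` is hypothetical, and the alias names used in the usage example are placeholders, since this playbook does not list the actual variable names.

```typescript
// Returns the alias names that are absent or blank in the given environment.
// `required` is supplied by the caller; the real alias names are not defined here.
function missingAliases(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Usage with placeholder names (assumptions, not the real aliases):
const gaps = missingAliases(
  { EXAMPLE_APIM_URL: "https://example.azure-api.net", EXAMPLE_APIM_URL_ALT: "  " },
  ["EXAMPLE_APIM_URL", "EXAMPLE_APIM_URL_ALT", "EXAMPLE_APIM_URL_LEGACY"],
);
```

Treating whitespace-only values as missing matches the "present and non-empty" wording of the check.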
Signals:
- failureKind=policy
- Base URL present but rejected by proxy policy
Checks:
- Verify configured proxy target matches APIM host policy (*.azure-api.net) unless explicit local-dev override is enabled.
- Confirm no recent config drift introduced non-APIM endpoints.
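For reference when verifying a target by hand, the host policy can be approximated as below. This is an illustrative sketch only: `isAllowedProxyTarget` is a hypothetical name, and the assumption that the local-dev override admits only loopback hosts is not confirmed by this playbook.

```typescript
// Hypothetical helper: true when the target satisfies the APIM-first host policy.
function isAllowedProxyTarget(baseUrl: string, allowLocalDev = false): boolean {
  let parsed: URL;
  try {
    parsed = new URL(baseUrl);
  } catch {
    return false; // not an absolute URL, so it cannot match the policy
  }
  const host = parsed.hostname.toLowerCase();
  // Suffix match on the parsed hostname, including the leading dot,
  // so "azure-api.net.example.com" and "evilazure-api.net" are rejected.
  if (host.endsWith(".azure-api.net")) return true;
  // Assumed override behavior: accept loopback hosts only.
  return allowLocalDev && (host === "localhost" || host === "127.0.0.1");
}
```

Matching on the parsed hostname rather than the raw string avoids false positives from paths or query strings that happen to contain the APIM domain.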
Signals:
- failureKind=network
- Proxy fetch exceptions, no upstream status code
- Catalog reads (/api/products, /api/categories) may enter this class after a 10 second per-attempt proxy timeout aborts a stalled upstream call before falling back to 200 with [].
Checks:
- Validate DNS resolution and outbound connectivity from UI runtime to APIM endpoint.
- Check firewall/NSG/egress policy changes in preceding deployment window.
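The catalog-read timeout described in the network signals can be pictured with the sketch below. It is not the code in route.ts: `FetchLike`, `CatalogResult`, and `readCatalog` are hypothetical names, and the fetch function is injected so the behavior can be exercised with a stub. Only the timeout and exception path is modeled; upstream error statuses pass through unchanged here.

```typescript
// Minimal fetch-like signature (assumption) so a stub can stand in for fetch.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number; json(): Promise<unknown> }>;

interface CatalogResult {
  status: number;
  body: unknown;
  fallbackUsed: boolean;
  failureKind?: "network";
}

async function readCatalog(
  fetchImpl: FetchLike,
  url: string,
  timeoutMs = 10000, // 10 second per-attempt timeout
): Promise<CatalogResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    return { status: res.status, body: await res.json(), fallbackUsed: false };
  } catch {
    // Abort or any fetch exception: no upstream status exists, so the
    // event is classed as network and the caller receives 200 with [].
    return { status: 200, body: [], fallbackUsed: true, failureKind: "network" };
  } finally {
    clearTimeout(timer);
  }
}
```

During triage, a cluster of failureKind=network events on catalog paths with durations near 10 seconds points to stalled upstream calls rather than DNS or connection refusals, which usually fail much faster.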
Signals:
- failureKind=upstream
- Upstream status present and often 502
Checks:
- Inspect APIM backend health and recent policy changes.
- Correlate with AGC/AKS service health and backend dependency failures.
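The four failure classes above follow an implied order of evaluation: configuration is resolved first, then policy, then the fetch itself, then the upstream response. A minimal sketch of that ordering follows. The `ProxyAttempt` shape and `classifyFailure` name are hypothetical, and treating any 5xx upstream status as the upstream class is an assumption.

```typescript
type FailureKind = "config" | "policy" | "network" | "upstream";

// Hypothetical summary of one proxy attempt.
interface ProxyAttempt {
  baseUrl?: string;        // resolved from env aliases; undefined or empty when missing
  policyAllowed: boolean;  // outcome of the APIM host policy check
  fetchThrew: boolean;     // fetch raised before any response arrived
  upstreamStatus?: number; // status returned by APIM, when one was received
}

function classifyFailure(a: ProxyAttempt): FailureKind | undefined {
  if (!a.baseUrl) return "config";         // nothing to call: immediate 502
  if (!a.policyAllowed) return "policy";   // base URL present but rejected
  if (a.fetchThrew) return "network";      // exception, no upstream status
  // Assumption: any 5xx from upstream counts as the upstream class.
  if (a.upstreamStatus !== undefined && a.upstreamStatus >= 500) return "upstream";
  return undefined;                        // not a proxy failure
}
```

Because the checks are ordered, each event carries exactly one failureKind, which is what makes ranking by event volume meaningful during triage.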
- Config/policy class:
  - Restore APIM gateway URL aliases in UI environment configuration.
  - Redeploy UI runtime to apply corrected variables.
- Network class:
  - Roll back or fix blocking network policy.
  - Validate APIM reachability from UI runtime after change.
- Upstream class:
  - Engage owning API/agent service team.
  - Execute backend recovery runbooks and APIM smoke validation.
- During mitigation:
  - Monitor fallbackUsed volume to ensure degraded mode remains explicit and temporary.
- Add non-prod synthetic probes for critical /api/* endpoints with per-endpoint 502 ratio tracking.
- Gate releases on proxy policy conformance and environment alias completeness checks.
- Track weekly trend of failureKind distribution and fallback usage to identify recurring drift.
- Primary owner: Platform Engineering (incident coordination).
- Secondary owner: Frontend platform (UI runtime configuration).
- Upstream owner: API/agent domain team when failureKind=upstream dominates.
Escalate to Sev2 when:
- Critical threshold is breached for >15 minutes, or
- More than two critical endpoint groups are impacted simultaneously.
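The escalation rule above reduces to a two-condition predicate. A minimal sketch, with a hypothetical function name and inputs that an on-call engineer or automation would have to supply:

```typescript
// criticalBreachMinutes: how long the Critical threshold has been breached.
// impactedCriticalGroups: critical endpoint groups impacted at the same time.
function shouldEscalateToSev2(
  criticalBreachMinutes: number,
  impactedCriticalGroups: number,
): boolean {
  return criticalBreachMinutes > 15 || impactedCriticalGroups > 2;
}
```

Both comparisons are strict: exactly 15 minutes or exactly two groups does not escalate on its own.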
Use these as templates in Application Insights / Log Analytics, adapting table names to deployed schema.
let lookback = 15m;
requests
| where timestamp > ago(lookback)
| where url has "/api/"
| extend status = tostring(resultCode)
| extend failureKind = tostring(customDimensions.failureKind)
| extend upstreamPath = tostring(customDimensions.upstreamPath)
| extend sourceKey = tostring(customDimensions.sourceKey)
| extend fallbackUsed = tostring(customDimensions.fallbackUsed)
| summarize total=count(), failures=countif(status == "502") by upstreamPath, failureKind, sourceKey, fallbackUsed
| extend failureRate = iff(total == 0, 0.0, todouble(failures) / todouble(total))
| order by failures desc

let lookback = 30m;
requests
| where timestamp > ago(lookback)
| where url has "/api/"
| extend upstreamPath = tostring(customDimensions.upstreamPath)
| extend fallbackUsed = tostring(customDimensions.fallbackUsed)
| summarize requests=count(), fallbackCount=countif(fallbackUsed == "true") by upstreamPath, bin(timestamp, 5m)
| order by timestamp desc

Run the following controlled tests in dev before promoting alert rules:
- Config failure injection.
  - Remove proxy env aliases and call a critical endpoint.
  - Verify failureKind=config and status=502 telemetry.
- Policy failure injection.
  - Point proxy target to non-APIM URL with policy override disabled.
  - Verify failureKind=policy and threshold behavior under repeated traffic.
- Network failure injection.
  - Apply temporary egress deny from UI runtime to APIM or force a stalled upstream response beyond the 10 second catalog-read timeout.
  - Verify failureKind=network and critical alert trigger timing.
- Upstream failure injection.
  - Force APIM/backend to return 502 on one critical endpoint.
  - Verify failureKind=upstream, upstreamStatus=502, and escalation routing.
- Fallback verification.
  - Force fallback-enabled route upstream 404/5xx.
  - Verify success response with fallback marker and no false 502 paging.