UI Proxy Failures Playbook

Scope

Operational response for UI proxy failures in apps/ui/app/api/[...path]/route.ts where users see 502 bursts or degraded fallback behavior.

This playbook preserves existing payload contracts and APIM-first routing policy semantics.

Trigger conditions and detection metrics

Primary triggers:

Sustained 502 rate on critical UI endpoints exceeds threshold.
Spike in proxy failure events grouped by failureKind.

Required telemetry dimensions for triage:

failureKind
upstreamPath
sourceKey
status
fallbackUsed

Alert thresholds:

Warn: 502 rate >= 3% over 10 minutes, request count >= 30.
Critical: 502 rate >= 8% over 5 minutes, request count >= 50.
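
The thresholds above can be expressed as a small evaluator. This is a minimal sketch, assuming request counts have already been aggregated per window; `WindowStats`, `RULES`, and `evaluateAlert` are hypothetical names for illustration, not part of the proxy code or the alerting platform.

```typescript
// Hypothetical aggregate for one evaluation window (not from the proxy code).
interface WindowStats {
  windowMinutes: number; // length of the aggregation window
  total: number;         // requests observed in the window
  failures502: number;   // requests that returned 502
}

type AlertLevel = "none" | "warn" | "critical";

// Threshold values copied from this playbook; critical is checked first.
const RULES = [
  { level: "critical" as const, windowMinutes: 5, minRate: 0.08, minRequests: 50 },
  { level: "warn" as const, windowMinutes: 10, minRate: 0.03, minRequests: 30 },
];

function evaluateAlert(stats: WindowStats): AlertLevel {
  for (const rule of RULES) {
    if (stats.windowMinutes !== rule.windowMinutes) continue;
    if (stats.total < rule.minRequests) continue; // too little traffic to judge
    if (stats.failures502 / stats.total >= rule.minRate) return rule.level;
  }
  return "none";
}
```

The minimum request count keeps a handful of failures during low traffic from paging anyone.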

Triage sequence

1. Confirm incident scope.
   - Identify impacted endpoint groups from upstreamPath.
   - Validate whether failures are isolated to the critical endpoints in the issue #546 contract set.
2. Segment by failure class.
   - Group by failureKind and rank by event volume.
   - Check whether fallbackUsed=true is absorbing user impact.
3. Run the checks for the dominant failure class (Classes A through D below).
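
The segmentation step can be sketched in code when events are exported for offline analysis. This is a minimal sketch under assumptions: `ProxyFailureEvent`, `ClassSummary`, and `rankFailureClasses` are hypothetical names, and the event fields simply mirror the telemetry dimensions listed earlier.

```typescript
// Hypothetical shape of one exported proxy failure event.
interface ProxyFailureEvent {
  failureKind: "config" | "policy" | "network" | "upstream";
  upstreamPath: string;
  fallbackUsed: boolean;
}

interface ClassSummary {
  failureKind: string;
  events: number;
  fallbackShare: number; // fraction of events where fallbackUsed was true
}

// Groups events by failureKind and ranks classes by event volume, highest first.
function rankFailureClasses(events: ProxyFailureEvent[]): ClassSummary[] {
  const counts = new Map<string, { events: number; fallbacks: number }>();
  for (const e of events) {
    const c = counts.get(e.failureKind) ?? { events: 0, fallbacks: 0 };
    c.events += 1;
    if (e.fallbackUsed) c.fallbacks += 1;
    counts.set(e.failureKind, c);
  }
  return Array.from(counts.entries())
    .map(([failureKind, c]) => ({
      failureKind,
      events: c.events,
      fallbackShare: c.fallbacks / c.events,
    }))
    .sort((a, b) => b.events - a.events);
}
```

A high `fallbackShare` for the top class suggests the fallback is absorbing user impact, which matters when deciding how urgently to escalate.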

Class A: config failures

Signals:

failureKind=config
sourceKey missing or unset
Immediate 502 without upstream call evidence

Checks:

Validate UI runtime env aliases are present and non-empty.
Confirm deployment slot/environment contains expected APIM base URL values.
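
The alias check can be automated as a simple presence test. This is a minimal sketch: the alias names below are placeholders invented for illustration, since this playbook does not list the actual variable names, and `findMissingAliases` is a hypothetical helper.

```typescript
// Placeholder alias names (assumption); substitute the aliases the UI runtime actually reads.
const PROXY_BASE_URL_ALIASES = ["EXAMPLE_APIM_BASE_URL", "EXAMPLE_API_BASE_URL"];

// Returns the aliases that are unset, empty, or whitespace-only in the given environment.
function findMissingAliases(
  env: Record<string, string | undefined>,
  aliases: string[] = PROXY_BASE_URL_ALIASES,
): string[] {
  return aliases.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node.js runtime the function could be called with `process.env`; a non-empty result would be consistent with the failureKind=config signals above.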

Class B: policy failures

Signals:

failureKind=policy
Base URL present but rejected by proxy policy

Checks:

Verify configured proxy target matches APIM host policy (*.azure-api.net) unless explicit local-dev override is enabled.
Confirm no recent config drift introduced non-APIM endpoints.
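
The host policy can be checked locally before redeploying. This is a minimal sketch, not the proxy's actual implementation: `isAllowedProxyTarget` and the `allowLocalDevOverride` flag are hypothetical names, and treating the override as "localhost only" is an assumption.

```typescript
interface PolicyOptions {
  allowLocalDevOverride: boolean; // hypothetical flag name for the local-dev override
}

// Accepts hosts matching *.azure-api.net; optionally accepts loopback hosts
// when the local-dev override is enabled (assumed semantics).
function isAllowedProxyTarget(
  baseUrl: string,
  options: PolicyOptions = { allowLocalDevOverride: false },
): boolean {
  let host: string;
  try {
    host = new URL(baseUrl).hostname.toLowerCase();
  } catch {
    return false; // an unparseable URL never passes policy
  }
  if (host.endsWith(".azure-api.net")) return true;
  const isLocal = host === "localhost" || host === "127.0.0.1";
  return options.allowLocalDevOverride && isLocal;
}
```

Matching on the parsed hostname rather than the raw string avoids accepting look-alike targets such as `https://azure-api.net.example.com`.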

Class C: network failures

Signals:

failureKind=network
Proxy fetch exceptions, no upstream status code
Catalog reads (/api/products, /api/categories) may enter this class when the 10-second per-attempt proxy timeout aborts a stalled upstream call; the proxy then falls back to a 200 response with an empty array ([]).

Checks:

Validate DNS resolution and outbound connectivity from UI runtime to APIM endpoint.
Check firewall/NSG/egress policy changes in preceding deployment window.
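
The catalog-read behavior described in the signals (timeout, abort, fallback to 200 with []) can be illustrated as follows. This is a sketch under assumptions, not the code in route.ts: `readCatalogWithFallback` and `CatalogResult` are hypothetical names, the upstream call is injected so the example stays self-contained, and only a single attempt is shown.

```typescript
interface CatalogResult {
  status: number;
  body: unknown[];
  fallbackUsed: boolean;
  failureKind?: "network";
}

// Runs one upstream attempt with a timeout. If the call is aborted or throws
// before a response arrives, return 200 with [] and mark the fallback.
async function readCatalogWithFallback(
  upstreamCall: (signal: AbortSignal) => Promise<unknown[]>,
  timeoutMs: number = 10_000, // 10-second per-attempt timeout from this playbook
): Promise<CatalogResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const body = await upstreamCall(controller.signal);
    return { status: 200, body, fallbackUsed: false };
  } catch {
    return { status: 200, body: [], fallbackUsed: true, failureKind: "network" };
  } finally {
    clearTimeout(timer);
  }
}
```

The injected call must honor the abort signal (as `fetch` does when passed `{ signal }`); otherwise a stalled upstream would keep the request open past the timeout.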

Class D: upstream failures

Signals:

failureKind=upstream
Upstream status present and often 502

Checks:

Inspect APIM backend health and recent policy changes.
Correlate with AGC/AKS service health and backend dependency failures.
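
The signals for Classes A through D can be summarized as a decision order. This is a minimal sketch for reasoning about the classes, assuming a hypothetical `ProxyAttempt` summary; it is not how the proxy is known to assign failureKind, and treating any 5xx as an upstream failure is an assumption.

```typescript
type FailureKind = "config" | "policy" | "network" | "upstream";

// Hypothetical summary of one proxy attempt.
interface ProxyAttempt {
  baseUrl?: string;         // resolved APIM base URL, if any
  policyAccepted?: boolean; // result of the host policy check
  upstreamStatus?: number;  // HTTP status from upstream, if a response arrived
  threw?: boolean;          // fetch raised before any response
}

// Checks run in the order a request would hit them: config, policy, network, upstream.
function classifyFailure(a: ProxyAttempt): FailureKind | undefined {
  if (!a.baseUrl) return "config"; // no upstream call is possible
  if (a.policyAccepted === false) return "policy"; // base URL present but rejected
  if (a.threw && a.upstreamStatus === undefined) return "network"; // exception, no status
  if (a.upstreamStatus !== undefined && a.upstreamStatus >= 500) return "upstream";
  return undefined; // not a proxy failure under these assumptions
}
```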

Mitigation steps

Config/policy class:
- Restore APIM gateway URL aliases in UI environment configuration.
- Redeploy UI runtime to apply corrected variables.
Network class:
- Roll back or fix blocking network policy.
- Validate APIM reachability from UI runtime after change.
Upstream class:
- Engage owning API/agent service team.
- Execute backend recovery runbooks and APIM smoke validation.
During mitigation:
- Monitor fallbackUsed volume to ensure degraded mode remains explicit and temporary.

Prevention actions

Add non-prod synthetic probes for critical /api/* endpoints with per-endpoint 502 ratio tracking.
Gate releases on proxy policy conformance and environment alias completeness checks.
Track weekly trend of failureKind distribution and fallback usage to identify recurring drift.
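
Per-endpoint 502 ratio tracking for synthetic probes can be computed from raw probe results. This is a minimal sketch: `ProbeResult` and `ratio502ByEndpoint` are hypothetical names, and how probes are scheduled or stored is outside its scope.

```typescript
// Hypothetical record of one synthetic probe call.
interface ProbeResult {
  endpoint: string; // for example "/api/products"
  status: number;   // HTTP status returned to the probe
}

// Returns the share of probe calls per endpoint that ended in 502.
function ratio502ByEndpoint(results: ProbeResult[]): Map<string, number> {
  const totals = new Map<string, { total: number; bad: number }>();
  for (const r of results) {
    const t = totals.get(r.endpoint) ?? { total: 0, bad: 0 };
    t.total += 1;
    if (r.status === 502) t.bad += 1;
    totals.set(r.endpoint, t);
  }
  const ratios = new Map<string, number>();
  totals.forEach((t, endpoint) => ratios.set(endpoint, t.bad / t.total));
  return ratios;
}
```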

Escalation path and ownership

Primary owner: Platform Engineering (incident coordination).
Secondary owner: Frontend platform (UI runtime configuration).
Upstream owner: API/agent domain team when failureKind=upstream dominates.

Escalate to Sev2 when:

Critical threshold is breached for >15 minutes, or
More than two critical endpoint groups are impacted simultaneously.
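
The Sev2 rule is a simple disjunction, sketched below. `IncidentState` and `shouldEscalateToSev2` are hypothetical names; the two numeric limits come from the conditions above.

```typescript
// Hypothetical snapshot of an ongoing incident.
interface IncidentState {
  criticalBreachMinutes: number;  // how long the critical threshold has been breached
  impactedCriticalGroups: number; // critical endpoint groups impacted at the same time
}

// Sev2 when the breach lasts more than 15 minutes, or more than two groups are impacted.
function shouldEscalateToSev2(s: IncidentState): boolean {
  return s.criticalBreachMinutes > 15 || s.impactedCriticalGroups > 2;
}
```

Both comparisons are strict: a breach of exactly 15 minutes with exactly two impacted groups does not yet escalate.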

Implementation snippets (KQL templates)

Use these as templates in Application Insights / Log Analytics, adapting table names to the deployed schema.

// 502 count and failure rate per triage dimension over the last 15 minutes.
let lookback = 15m;
requests
| where timestamp > ago(lookback)
| where url has "/api/"
// customDimensions values are dynamic; cast them to strings before grouping.
| extend status = tostring(resultCode)
| extend failureKind = tostring(customDimensions.failureKind)
| extend upstreamPath = tostring(customDimensions.upstreamPath)
| extend sourceKey = tostring(customDimensions.sourceKey)
| extend fallbackUsed = tostring(customDimensions.fallbackUsed)
| summarize total=count(), failures=countif(status == "502") by upstreamPath, failureKind, sourceKey, fallbackUsed
| extend failureRate = iff(total == 0, 0.0, todouble(failures) / todouble(total))
| order by failures desc

// Fallback volume per upstreamPath in 5-minute bins over the last 30 minutes.
let lookback = 30m;
requests
| where timestamp > ago(lookback)
| where url has "/api/"
| extend upstreamPath = tostring(customDimensions.upstreamPath)
| extend fallbackUsed = tostring(customDimensions.fallbackUsed)
| summarize requests=count(), fallbackCount=countif(fallbackUsed == "true") by upstreamPath, bin(timestamp, 5m)
| order by timestamp desc

Non-prod verification plan

Run the following controlled tests in dev before promoting alert rules:

1. Config failure injection.
   - Remove proxy env aliases and call a critical endpoint.
   - Verify failureKind=config and status=502 telemetry.
2. Policy failure injection.
   - Point the proxy target to a non-APIM URL with the policy override disabled.
   - Verify failureKind=policy and threshold behavior under repeated traffic.
3. Network failure injection.
   - Apply a temporary egress deny from the UI runtime to APIM, or force a stalled upstream response beyond the 10-second catalog-read timeout.
   - Verify failureKind=network and critical alert trigger timing.
4. Upstream failure injection.
   - Force APIM/backend to return 502 on one critical endpoint.
   - Verify failureKind=upstream, upstreamStatus=502, and escalation routing.
5. Fallback verification.
   - Force an upstream 404/5xx on a fallback-enabled route.
   - Verify a success response with the fallback marker and no false 502 paging.
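
The expected telemetry for each test can be captured as a small table and compared against what was observed. This is a minimal sketch under assumptions: `EXPECTED` and `mismatches` are hypothetical names, all values are treated as strings, and mapping the "fallback marker" to fallbackUsed=true is an assumption rather than something this playbook states.

```typescript
type Telemetry = Record<string, string>;

// Expected dimensions per scenario, taken from the verification steps above.
// Only the dimensions the plan names are checked.
const EXPECTED: Record<string, Telemetry> = {
  config: { failureKind: "config", status: "502" },
  policy: { failureKind: "policy" },
  network: { failureKind: "network" },
  upstream: { failureKind: "upstream", upstreamStatus: "502" },
  fallback: { fallbackUsed: "true" }, // assumed meaning of "fallback marker"
};

// Returns a human-readable list of differences; empty means the test passed.
function mismatches(scenario: string, observed: Telemetry): string[] {
  const expected = EXPECTED[scenario];
  if (!expected) return [`unknown scenario: ${scenario}`];
  return Object.keys(expected)
    .filter((key) => observed[key] !== expected[key])
    .map((key) => `${key}: expected ${expected[key]}, got ${observed[key] ?? "(missing)"}`);
}
```

Timing checks (alert trigger latency, escalation routing) still need to be verified by hand; the table only covers telemetry values.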
