Index of operational playbooks for the Holiday Peak Hub accelerator.
- Incident response baseline: severity model, ownership, escalation, and communication flow

| Playbook | Area | Trigger |
|---|---|---|
| Agent latency spikes | Agents | P95 latency breach or timeouts |
| Tool call failures | Agents/Tools | Tool error rate > threshold |
| Model degradation | Agents/Models | Quality regression or increased hallucinations |
| Adapter failure | Adapters | Upstream API errors/outage |
| Adapter latency spikes | Adapters | P95 adapter latency breach |
| Adapter schema changes | Adapters | Contract mismatch or parsing errors |
| Redis OOM | Memory/Hot | Evictions or OOM errors |
| Cosmos high RU consumption | Memory/Warm | 429s or RU saturation |
| Blob throttling | Memory/Cold | 503s or slow downloads/uploads |
| Connection pool exhaustion | Memory | Connection timeouts/too many connections |
| TTL not expiring | Memory | Stale data or unbounded growth |
| UI proxy failures | Frontend/Platform | Sustained `/api/*` 502 bursts or fallback anomalies |
| Observability query templates | Platform/Observability | Correlated triage across APIM, AKS, data, and agent services |
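
Several triggers in the table are phrased as a "P95 latency breach". As a rough illustration of what such a check could look like, the sketch below computes a nearest-rank 95th percentile over a window of latency samples and compares it to a threshold. The function names, the nearest-rank method, and the threshold values are assumptions made for this example; the actual detection logic and thresholds are defined in the individual playbooks and monitoring stack.

```python
def p95(samples_ms):
    """Return the 95th percentile of the samples using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    ordered = sorted(samples_ms)
    # Nearest rank is ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def breaches_p95(samples_ms, threshold_ms):
    """Hypothetical trigger check: True when P95 latency exceeds the threshold."""
    return p95(samples_ms) > threshold_ms


if __name__ == "__main__":
    # Illustrative window: 18 fast calls and 2 slow ones, 500 ms assumed threshold.
    window = [100] * 18 + [900, 900]
    print(breaches_p95(window, threshold_ms=500))
```

A single slow call in a window of 20 would not trip this check, which is the usual reason for alerting on P95 rather than on the maximum.
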
Every operational playbook should include the following sections:
- Scope
- Trigger conditions and detection metrics
- Triage sequence
- Mitigation steps
- Prevention actions
- Escalation path and ownership
- Implementation snippets (when applicable)
This checklist aligns with the governance requirements in `docs/governance/README.md` and with the infrastructure and back-end governance policies.
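
The checklist above could be enforced mechanically, for example in a documentation lint step. The following is a minimal sketch, assuming each playbook is a Markdown file whose section headings use the checklist names verbatim; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names introduced here and are not part of the accelerator.

```python
import re

# Sections every playbook should contain. "Implementation snippets" is left out
# because the checklist marks it as applicable only in some playbooks.
REQUIRED_SECTIONS = [
    "Scope",
    "Trigger conditions and detection metrics",
    "Triage sequence",
    "Mitigation steps",
    "Prevention actions",
    "Escalation path and ownership",
]

# Matches ATX-style Markdown headings such as "## Scope".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text):
    """Return the required section names that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


if __name__ == "__main__":
    draft = "# Redis OOM\n\n## Scope\n\n## Triage sequence\n"
    for name in missing_sections(draft):
        print(f"missing section: {name}")
```

Matching is case-insensitive but otherwise exact, so a heading such as "Mitigation" would still be reported as missing; a looser match may suit playbooks that abbreviate section names.
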