Operational Playbooks

Index of operational playbooks for the Holiday Peak Hub accelerator.

Baseline

| Playbook | Area | Trigger |
| --- | --- | --- |
| Agent latency spikes | Agents | P95 latency breach or timeouts |
| Tool call failures | Agents/Tools | Tool error rate > threshold |
| Model degradation | Agents/Models | Quality regression or increased hallucinations |
| Adapter failure | Adapters | Upstream API errors/outage |
| Adapter latency spikes | Adapters | P95 adapter latency breach |
| Adapter schema changes | Adapters | Contract mismatch or parsing errors |
| Redis OOM | Memory/Hot | Evictions or OOM errors |
| Cosmos high RU consumption | Memory/Warm | 429s or RU saturation |
| Blob throttling | Memory/Cold | 503s or slow downloads/uploads |
| Connection pool exhaustion | Memory | Connection timeouts/too many connections |
| TTL not expiring | Memory | Stale data or unbounded growth |
| UI proxy failures | Frontend/Platform | Sustained /api/* 502 bursts or fallback anomalies |
| Observability query templates | Platform/Observability | Correlated triage across APIM, AKS, data, and agent services |

Playbook Policy Checklist

Every operational playbook should include the following sections:

  1. Scope
  2. Trigger conditions and detection metrics
  3. Triage sequence
  4. Mitigation steps
  5. Prevention actions
  6. Escalation path and ownership
  7. Implementation snippets (when applicable)

This checklist aligns with governance requirements in docs/governance/README.md and infrastructure/back-end governance policies.
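
The checklist lends itself to an automated check. The following is a minimal sketch, not part of the accelerator: it assumes each playbook is a markdown file whose headings start with the section names listed above (optionally numbered), and it treats item 7 as optional because it applies only "when applicable". The function name `missing_sections` is hypothetical.

```python
import re

# Required sections, copied from the checklist. Item 7 ("Implementation
# snippets") is left out because the checklist marks it as optional.
REQUIRED_SECTIONS = [
    "Scope",
    "Trigger conditions and detection metrics",
    "Triage sequence",
    "Mitigation steps",
    "Prevention actions",
    "Escalation path and ownership",
]

# Matches ATX-style markdown headings such as "## Scope" or "### 3. Triage sequence".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)

# Strips an optional leading number such as "1." or "2)" from a heading.
NUMBERING = re.compile(r"^\d+[.)]\s*")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no matching heading.

    Assumption: a heading matches when it starts with the section name,
    compared case-insensitively after removing any leading numbering.
    """
    headings = [
        NUMBERING.sub("", heading).lower()
        for heading in HEADING.findall(markdown_text)
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(h.startswith(name.lower()) for h in headings)
    ]
```

A check like this could run in CI over every playbook file and fail the build when the returned list is non-empty. Where the playbook files live and how headings are actually worded are assumptions to confirm against the repository.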