Performance SLA, SLO, and KPI Framework

Category: Cross-Cutting Concern — Applies to all six OpenEAGO specification phases

Overview

Performance accountability is a first-class specification goal in OpenEAGO. Regulated enterprise environments require measurable, contractually enforceable service commitments, not aspirational guidelines. This document defines:

The normative distinction between SLA and SLO as used throughout the specification
A formal SLO objective taxonomy covering the four dimensions that MUST be specified for every execution plan
The canonical sla_guarantees schema — the single normative definition that unifies all per-phase references
A normative SLA breach state machine governing the lifecycle of a breach from detection to resolution
The specification-level KPI catalog — what the specification itself MUST track and emit, grouped by category
The agent registry minimum performance bar that every registered agent MUST meet

This document is the canonical cross-phase reference for performance. For the risk scoring model that incorporates SLA breach probability into Operational Risk, see Risk Management Framework.

SLA vs. SLO — Normative Definitions

OpenEAGO uses these terms with precise, distinct meanings throughout all specification documents and schemas. Implementations MUST NOT use them interchangeably.

Term	OpenEAGO Definition
SLA (Service Level Agreement)	A contractual commitment made between an agent provider and its consumer, agreed during Phase 2 (Planning & Negotiation). Breaching an SLA is a reportable event requiring escalation per the breach state machine.
SLO (Service Level Objective)	An internal specification target for a specific measurable property at a defined percentile. SLOs are the mechanism by which SLA commitments are operationalized. An SLA is only met when all constituent SLOs are met.
SLI (Service Level Indicator)	The raw measured value of a specific property (e.g., observed p99 latency in ms). SLIs are compared against SLO targets at runtime to determine SLO status.

Relationship: SLA ⊇ one or more SLOs. Each SLO is backed by one SLI. An SLA is met only when all of its constituent SLOs are met.
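The containment relationship can be pictured with a small data model. This is an illustrative sketch only: the class names `Slo` and `Sla`, the `upper_bound` flag, and the observed values are assumptions made for the example and are not defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One objective backed by exactly one SLI (hypothetical model)."""
    name: str
    target: float
    sli_observed: float
    upper_bound: bool = True  # True: SLI must stay <= target; False: SLI must stay >= target

    def is_met(self) -> bool:
        if self.upper_bound:
            return self.sli_observed <= self.target
        return self.sli_observed >= self.target


@dataclass(frozen=True)
class Sla:
    """An SLA groups one or more SLOs and is met only if all of them are met."""
    sla_id: str
    slos: tuple[Slo, ...]

    def is_met(self) -> bool:
        return all(slo.is_met() for slo in self.slos)


sla = Sla(
    sla_id="SLA_A7B8C9",
    slos=(
        Slo("latency_p99_ms", target=800, sli_observed=640),
        Slo("availability_pct", target=0.9950, sli_observed=0.9971, upper_bound=False),
        Slo("error_rate_max", target=0.02, sli_observed=0.031),
    ),
)
print(sla.is_met())  # False: the error-rate SLO is missed, so the SLA is not met
```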

SLO Objective Taxonomy

Every sla_guarantees object MUST specify all four SLO objective types. These are the four dimensions that Phase 2 negotiation MUST assess for feasibility, and that Phase 4 execution MUST monitor in real time.

1. Latency SLO

Measures the elapsed time from request submission to the delivery of a valid response at a given percentile.

Property	Definition
`latency_p50_ms`	Median (50th percentile) response time, in milliseconds
`latency_p95_ms`	95th percentile response time, in milliseconds
`latency_p99_ms`	99th percentile response time, in milliseconds — this is the primary SLA gate

Phase-specific normative targets (default; MAY be overridden per contract):

Phase	`latency_p99_ms` Target
Phase 1 — Contract Management	≤ 200 ms
Phase 2 — Planning & Negotiation	≤ 1,000 ms
Phase 3 — Validation & Compliance (automated path)	≤ 500 ms
Phase 4 — Execution & Resilience (per-task latency)	≤ 5,000 ms (task-level); ≤ 30,000 ms (workflow)
Phase 5 — Context & State Management	≤ 100 ms
Phase 6 — Communication & Delivery	≤ 300 ms
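A minimal sketch of how the three latency SLIs might be derived from raw samples and compared against the default targets. It assumes the nearest-rank percentile method, which the specification does not prescribe; the helper names and the synthetic samples are hypothetical.

```python
import math

# Default p99 targets from the table above, in milliseconds.
# Phase 4 uses the task-level value here.
DEFAULT_P99_TARGET_MS = {1: 200, 2: 1000, 3: 500, 4: 5000, 5: 100, 6: 300}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed method, not mandated by the spec)."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_slis(samples: list[float]) -> dict[str, float]:
    """Compute the three latency SLIs named in the property table."""
    return {
        "latency_p50_ms": percentile(samples, 50),
        "latency_p95_ms": percentile(samples, 95),
        "latency_p99_ms": percentile(samples, 99),
    }


# 100 synthetic Phase 5 samples: 98 fast requests and two slow outliers.
samples = [20.0] * 98 + [90.0, 140.0]
slis = latency_slis(samples)
print(slis["latency_p99_ms"] <= DEFAULT_P99_TARGET_MS[5])  # True: 90.0 ms is within 100 ms
```

Because p99 is the primary SLA gate, a single outlier in 100 samples (the 140 ms request) does not by itself cause a miss, but a second outlier above the target would.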

2. Availability SLO

Measures the fraction of time in a rolling window during which an agent is reachable and providing valid responses.

Property	Definition
`availability_pct`	Required availability as a decimal fraction, e.g., `0.9900` = 99.00%
`measurement_window`	Rolling window over which availability is measured, e.g., `"30d"`

Normative minimum: availability_pct ≥ 0.9900 (99.00%) for all agents registered in the Agent Registry. This corresponds to a maximum allowed downtime of 7.2 hours per 30-day period (1% of 720 hours).
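The downtime budget implied by an availability target is the window length multiplied by (1 − availability_pct). The sketch below works through that arithmetic; `parse_window` and `downtime_budget` are hypothetical helpers, and only the `d` and `h` suffixes are handled because `"30d"` is the only window format the text shows.

```python
from datetime import timedelta


def parse_window(window: str) -> timedelta:
    """Parse a window such as "30d" or "12h" (the "h" suffix is an assumption)."""
    value, unit = int(window[:-1]), window[-1]
    if unit == "d":
        return timedelta(days=value)
    if unit == "h":
        return timedelta(hours=value)
    raise ValueError(f"unsupported window unit: {unit!r}")


def downtime_budget(availability_pct: float, measurement_window: str) -> timedelta:
    """Maximum downtime permitted within the window at the given availability."""
    return parse_window(measurement_window) * (1 - availability_pct)


budget = downtime_budget(0.9900, "30d")
print(round(budget.total_seconds() / 3600, 2))  # 7.2 hours, i.e. 1% of 720 hours
```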

3. Throughput SLO

Measures the sustained rate of requests an agent MUST be capable of handling.

Property	Definition
`throughput_rps`	Minimum guaranteed requests per second under normal load conditions
`burst_rps`	Maximum requests per second the agent MUST handle without degrading below Latency SLO

4. Error Rate SLO

Measures the fraction of requests that result in an error response (5xx equivalent or specification-level failure).

Property	Definition
`error_rate_max`	Maximum tolerable fraction of requests that may result in an error, e.g., `0.05` = 5.00% max error rate

Normative maximum: error_rate_max ≤ 0.05 (5.00%) for all registered agents. An error_rate_max = 0.01 (1.00%) is RECOMMENDED for agents handling regulated or high-impact workflows.
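As a sketch, the two bounds above can be applied to a declared `error_rate_max`, and the error-rate SLI can be computed from request counts. The function names and the `regulated` flag are assumptions for illustration; the specification does not define how an implementation identifies a regulated or high-impact workflow.

```python
NORMATIVE_ERROR_RATE_CEILING = 0.05   # normative maximum for all registered agents
RECOMMENDED_REGULATED_CEILING = 0.01  # RECOMMENDED for regulated or high-impact workflows


def observed_error_rate(total_requests: int, failed_requests: int) -> float:
    """Error-rate SLI: fraction of requests that ended in an error."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


def check_declared_error_rate(error_rate_max: float, regulated: bool = False) -> list[str]:
    """Return findings for a declared error_rate_max (empty list means no findings)."""
    findings = []
    if error_rate_max > NORMATIVE_ERROR_RATE_CEILING:
        findings.append("violation: error_rate_max exceeds the normative maximum of 0.05")
    elif regulated and error_rate_max > RECOMMENDED_REGULATED_CEILING:
        findings.append("advisory: 0.01 is RECOMMENDED for regulated or high-impact workflows")
    return findings


sli = observed_error_rate(total_requests=10_000, failed_requests=150)
print(sli <= 0.02)                                      # True: 1.5% is within a 2% SLO
print(check_declared_error_rate(0.02, regulated=True))  # one advisory finding
```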

Canonical `sla_guarantees` Schema

This is the single normative definition of sla_guarantees used across all specification phases. All per-phase references (planning.md, execution.md, spec.json, and the schemas) MUST conform to this structure.

{
  "sla_guarantees": {
    "sla_id": "SLA_A7B8C9",
    "sla_version": "1.0",
    "agreed_at": "2026-02-27T10:00:00Z",
    "valid_until": "2027-02-27T10:00:00Z",
    "provider_agent_id": "spiffe://example.org/agent/pii-validator",
    "consumer_contract_id": "CONTRACT_E7D3A1",
    "latency": {
      "p50_ms": 150,
      "p95_ms": 400,
      "p99_ms": 800
    },
    "availability": {
      "availability_pct": 0.9950,
      "measurement_window": "30d"
    },
    "throughput": {
      "throughput_rps": 50,
      "burst_rps": 200
    },
    "error_rate": {
      "error_rate_max": 0.02
    },
    "breach_policy": {
      "at_risk_threshold_pct": 0.90,
      "breach_response": "pause_and_review",
      "escalation_contact": "sre-oncall@example.org"
    }
  }
}

Required Fields

All fields in latency, availability, throughput, and error_rate are REQUIRED. breach_policy is REQUIRED when sla_guarantees is part of a regulated-profile execution plan.
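A minimal sketch of a required-field check that follows the rule above. It operates on the inner `sla_guarantees` object as a plain dictionary; the function name and the `regulated_profile` parameter are hypothetical, and a real implementation would more likely validate against the published JSON schemas.

```python
REQUIRED_SLO_FIELDS = {
    "latency": ("p50_ms", "p95_ms", "p99_ms"),
    "availability": ("availability_pct", "measurement_window"),
    "throughput": ("throughput_rps", "burst_rps"),
    "error_rate": ("error_rate_max",),
}


def missing_fields(sla_guarantees: dict, regulated_profile: bool = False) -> list[str]:
    """List dotted paths of REQUIRED fields absent from an sla_guarantees object."""
    missing = []
    for section, fields in REQUIRED_SLO_FIELDS.items():
        body = sla_guarantees.get(section)
        if not isinstance(body, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{name}" for name in fields if name not in body)
    # breach_policy is REQUIRED only for regulated-profile execution plans.
    if regulated_profile and "breach_policy" not in sla_guarantees:
        missing.append("breach_policy")
    return missing


example = {
    "latency": {"p50_ms": 150, "p95_ms": 400, "p99_ms": 800},
    "availability": {"availability_pct": 0.9950, "measurement_window": "30d"},
    "throughput": {"throughput_rps": 50},
    "error_rate": {"error_rate_max": 0.02},
}
print(missing_fields(example, regulated_profile=True))
# ['throughput.burst_rps', 'breach_policy']
```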

SLA Breach State Machine

The SLA breach state machine governs the lifecycle of an SLA from healthy operation through potential breach and resolution. All Phase 4 (Execution & Resilience) implementations MUST implement this state machine.

stateDiagram-v2
    [*] --> Active : SLA agreed in Phase 2; execution begins

    Active --> AtRisk : SLI approaches threshold\n(SLI > at_risk_threshold_pct × SLO target)
    AtRisk --> Active : SLI recovers within window
    AtRisk --> Breached : SLI crosses SLO target

    Active --> Breached : SLI crosses SLO target directly (step change)
    Breached --> PauseAndReview : breach_response = "pause_and_review"
    Breached --> FallbackActivated : breach_response = "activate_fallback"
    Breached --> Escalated : breach_response = "escalate"

    PauseAndReview --> Active : Human clears; execution resumes with modified SLA
    PauseAndReview --> Terminated : Human rejects; execution terminated
    FallbackActivated --> Active : Fallback agent meets SLO; monitoring continues
    FallbackActivated --> Escalated : Fallback also breaches SLO
    Escalated --> Terminated : Risk management review; execution aborted
    Escalated --> Active : Emergency override approved; SLA renegotiated

    Active --> Completed : Execution completes; all SLOs met
    Terminated --> [*] : Breach recorded; risk event emitted
    Completed --> [*] : SLA compliance = "met" recorded in Phase 6 output
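The diagram above can be encoded as a transition table. The following is a sketch rather than a reference implementation: the class name, the lowercase state identifiers (taken from the state definitions table), and the `history` list are assumptions, and event emission is omitted.

```python
# Allowed transitions, transcribed from the state diagram.
ALLOWED_TRANSITIONS = {
    "active": {"at_risk", "breached", "completed"},
    "at_risk": {"active", "breached"},
    "breached": {"pause_and_review", "fallback_activated", "escalated"},
    "pause_and_review": {"active", "terminated"},
    "fallback_activated": {"active", "escalated"},
    "escalated": {"active", "terminated"},
    "completed": set(),   # terminal
    "terminated": set(),  # terminal
}

# breach_policy.breach_response value -> state entered from "breached".
BREACH_RESPONSE_TARGET = {
    "pause_and_review": "pause_and_review",
    "activate_fallback": "fallback_activated",
    "escalate": "escalated",
}


class SlaStateMachine:
    """Tracks one SLA's lifecycle and rejects transitions the diagram does not allow."""

    def __init__(self) -> None:
        self.state = "active"
        self.history = ["active"]

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

    def apply_breach_response(self, breach_response: str) -> None:
        """Leave the breached state according to the configured breach_response."""
        self.transition(BREACH_RESPONSE_TARGET[breach_response])


machine = SlaStateMachine()
machine.transition("at_risk")
machine.transition("breached")
machine.apply_breach_response("activate_fallback")
machine.transition("active")  # fallback agent meets the SLO
print(machine.history)
```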

Breach State Definitions

State	Definition	Required Action
`active`	All SLIs are within their SLO targets	Continue execution; standard monitoring
`at_risk`	One or more SLIs have exceeded `at_risk_threshold_pct` of their SLO target	Increase monitoring frequency; emit `sla_at_risk` event
`breached`	One or more SLIs have crossed their SLO target	Emit `sla_breach_event`; execute `breach_response`
`pause_and_review`	Execution paused; awaiting human decision	Freeze cost-generating tasks; notify escalation contact
`fallback_activated`	Fallback agent substituted for failing agent	Emit `fallback_activation_event`; re-evaluate SLO feasibility
`escalated`	Breach cannot be resolved automatically	Emit high-priority alert; trigger risk escalation process
`completed`	Execution finished with all SLOs met throughout	Record `sla_compliance_status = "met"`
`terminated`	Execution aborted due to unresolvable breach	Record `sla_compliance_status = "breached"`; include in risk report
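The first three rows of the table can be sketched as a classification function. This example assumes an upper-bound SLI such as latency or error rate, where a lower value is better; how the at-risk threshold applies to lower-bound objectives such as availability and throughput is not stated here, so they are left out. The function names are hypothetical.

```python
SEVERITY = {"active": 0, "at_risk": 1, "breached": 2}


def classify_sli(sli_observed: float, slo_target: float,
                 at_risk_threshold_pct: float = 0.90) -> str:
    """Classify one upper-bound SLI as active, at_risk, or breached."""
    if not 0 < at_risk_threshold_pct <= 1:
        raise ValueError("at_risk_threshold_pct must be in (0, 1]")
    if sli_observed > slo_target:
        return "breached"
    if sli_observed > at_risk_threshold_pct * slo_target:
        return "at_risk"
    return "active"


def overall_state(per_slo_states: list[str]) -> str:
    """One or more SLIs at a given severity put the whole SLA at that severity."""
    return max(per_slo_states, key=SEVERITY.__getitem__)


states = [
    classify_sli(750, 800),      # latency_p99_ms: above 90% of 800 ms
    classify_sli(0.010, 0.02),   # error rate: well within target
]
print(states, overall_state(states))  # ['at_risk', 'active'] at_risk
```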

SLA Breach Event Payload

When the breach state machine transitions to breached or beyond, implementations MUST emit the following event to the audit trail:

{
  "event_type": "sla_breach_event",
  "event_timestamp": "2026-02-27T11:45:22.000Z",
  "execution_id": "EXEC_X5Y6Z7",
  "breach_state": "breached",
  "sla_id": "SLA_A7B8C9",
  "breached_slo": "latency_p99_ms",
  "slo_target": 800,
  "sli_observed": 1240,
  "breach_response": "activate_fallback",
  "risk_event_emitted": true,
  "risk_dimension": "operational_risk"
}
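A minimal sketch of assembling this payload. The builder function and its parameters are hypothetical; the field names and the timestamp format follow the example above, and `risk_event_emitted` is hard-coded to mirror that example rather than derived from any real emission step.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_sla_breach_event(execution_id: str, sla_id: str, breached_slo: str,
                           slo_target: float, sli_observed: float,
                           breach_response: str, breach_state: str = "breached",
                           now: Optional[datetime] = None) -> dict:
    """Build an sla_breach_event dict; `now` must be timezone-aware if given."""
    timestamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "event_type": "sla_breach_event",
        # Millisecond precision with a trailing "Z", as in the example payload.
        "event_timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "execution_id": execution_id,
        "breach_state": breach_state,
        "sla_id": sla_id,
        "breached_slo": breached_slo,
        "slo_target": slo_target,
        "sli_observed": sli_observed,
        "breach_response": breach_response,
        "risk_event_emitted": True,  # assumption: mirrors the example above
        "risk_dimension": "operational_risk",
    }


event = build_sla_breach_event(
    execution_id="EXEC_X5Y6Z7", sla_id="SLA_A7B8C9", breached_slo="latency_p99_ms",
    slo_target=800, sli_observed=1240, breach_response="activate_fallback",
    now=datetime(2026, 2, 27, 11, 45, 22, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```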

Specification-Level KPI Catalog

These are the KPIs the OpenEAGO specification itself MUST track and emit. They are specification-level observability requirements — distinct from business-level KPIs defined by individual implementations.

All implementations MUST expose these KPIs via the declared observability stack (OpenTelemetry + Prometheus per spec/v0.1.0/spec.json). KPI data MUST be available for query by authorized monitoring systems.

Reliability KPIs

KPI	Definition	Normative Target	Phase
`phase_success_rate`	Fraction of phase executions that complete without error, per phase	≥ 0.99	Each phase
`workflow_e2e_success_rate`	Fraction of six-phase workflows that complete successfully end-to-end	≥ 0.95	All phases
`agent_uptime`	Availability of each registered agent over a 30-day rolling window	≥ 0.9900	Phase 4
`circuit_breaker_trip_rate`	Fraction of executions that trip the Phase 4 circuit breaker	≤ 0.01 (1%)	Phase 4
`fallback_activation_rate`	Fraction of executions that activate a fallback agent	≤ 0.05 (5%)	Phase 4
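As an illustration, four of the reliability KPIs can be derived from per-workflow outcome records. The `WorkflowRecord` shape and the per-phase key naming are assumptions for this sketch; the specification does not define a record format, and `agent_uptime` is omitted because it needs availability probes rather than workflow outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowRecord:
    """Outcome of one workflow run (hypothetical record shape)."""
    phase_ok: tuple[bool, ...]  # one flag per phase that actually ran, in phase order
    circuit_breaker_tripped: bool = False
    fallback_activated: bool = False


def fraction(hits: int, total: int) -> float:
    return hits / total if total else 0.0


def reliability_kpis(records: list[WorkflowRecord]) -> dict[str, float]:
    total = len(records)
    kpis = {
        # A workflow succeeds end-to-end only if all six phases ran and passed.
        "workflow_e2e_success_rate": fraction(
            sum(len(r.phase_ok) == 6 and all(r.phase_ok) for r in records), total),
        "circuit_breaker_trip_rate": fraction(
            sum(r.circuit_breaker_tripped for r in records), total),
        "fallback_activation_rate": fraction(
            sum(r.fallback_activated for r in records), total),
    }
    for phase in range(1, 7):
        # Only workflows that reached this phase count toward its success rate.
        outcomes = [r.phase_ok[phase - 1] for r in records if len(r.phase_ok) >= phase]
        kpis[f"phase_success_rate_phase{phase}"] = fraction(sum(outcomes), len(outcomes))
    return kpis


records = [WorkflowRecord(phase_ok=(True,) * 6) for _ in range(19)]
records.append(WorkflowRecord(phase_ok=(True, True, True, False), circuit_breaker_tripped=True))
kpis = reliability_kpis(records)
print(kpis["workflow_e2e_success_rate"], kpis["phase_success_rate_phase4"])  # 0.95 0.95
```

In this synthetic sample the end-to-end rate sits exactly at the 0.95 target, while Phase 4 at 0.95 falls short of the 0.99 per-phase target.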

Performance KPIs

KPI	Definition	Normative Target	Phase
`phase_latency_p99_ms`	99th percentile end-to-end latency for each phase	See Latency SLO table	Each phase
`phase_latency_p95_ms`	95th percentile end-to-end latency for each phase	≤ 80% of the phase's `latency_p99_ms` target	Each phase
`agent_queue_depth`	Current pending request queue depth per agent	≤ 100 (alert threshold)	Phase 4
`workflow_throughput_rps`	Completed workflows per second across the deployment	Deployment-specific; declare in conformance profile	All phases
`planning_agent_selection_time_ms`	Time for the Planning Agent to complete agent discovery and scoring	≤ 500 ms p99	Phase 2
`validation_latency_ms`	Time for the Validation Agent to compute composite risk score and decision	≤ 500 ms p99 (automated path)	Phase 3

Compliance KPIs

KPI	Definition	Normative Target	Phase
`policy_pass_rate`	Fraction of execution plans that pass all policy checks in Phase 3	Baseline; alert on ≥ 10% drop week-over-week	Phase 3
`hitl_intervention_rate`	Fraction of executions that trigger the HITL gate	Track; target determined by organizational risk appetite	Phase 3
`hitl_response_time_hours`	Time from HITL trigger to human decision	≤ 4 hours (SLA)	Phase 3
`risk_prediction_accuracy`	Fraction of risk tier assessments that match post-execution actual outcomes	≥ 0.85 (85%)	Phase 3, measured in arrears
`sla_compliance_rate`	Fraction of executions where all SLOs remain in `active` or `completed` state	≥ 0.95	Phase 4
`audit_completeness_rate`	Fraction of executions with fully populated `risk_context` in audit trail	1.00 (100% — mandatory)	Phase 5/6

Financial KPIs

KPI	Definition	Normative Target	Phase
`acu_budget_adherence_rate`	Fraction of executions where actual ACU consumption stays within approved budget	≥ 0.95	Phase 4
`cost_overrun_rate`	Fraction of executions where actual USD cost exceeds approved limit	≤ 0.05	Phase 4
`cost_per_successful_workflow`	Average USD cost of a successfully completed six-phase workflow	Deployment-specific baseline; alert on ≥ 20% week-over-week increase	All phases
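Several targets in this catalog are week-over-week alert rules rather than fixed thresholds, for example a drop of 10% or more in `policy_pass_rate` or an increase of 20% or more in `cost_per_successful_workflow`. The sketch below assumes these percentages are relative changes against the previous week's value (not percentage points), which the catalog does not state explicitly; the function names are hypothetical.

```python
def relative_change(previous: float, current: float) -> float:
    """Relative week-over-week change; +0.25 means a 25% increase."""
    if previous == 0:
        raise ValueError("previous value must be non-zero to compute a relative change")
    return (current - previous) / previous


def alert_on_increase(previous: float, current: float, threshold: float = 0.20) -> bool:
    """True when the metric rose by at least `threshold` (e.g. cost per workflow)."""
    return relative_change(previous, current) >= threshold


def alert_on_drop(previous: float, current: float, threshold: float = 0.10) -> bool:
    """True when the metric fell by at least `threshold` (e.g. policy pass rate)."""
    return relative_change(previous, current) <= -threshold


print(alert_on_increase(previous=1.00, current=1.25))  # True: cost rose by 25%
print(alert_on_drop(previous=0.92, current=0.80))      # True: pass rate fell by about 13%
print(alert_on_drop(previous=0.92, current=0.90))      # False: a drop of about 2%
```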

Security KPIs

KPI	Definition	Normative Target	Phase
`auth_failure_rate`	Fraction of inter-agent communication attempts that fail authentication	≤ 0.001 (0.1%)	All phases
`certificate_rotation_compliance_rate`	Fraction of agent certificates rotated within their 48-hour TTL	1.00 (100% — mandatory)	Phase 1 setup
`anomaly_detection_rate`	Fraction of executions flagged by anomaly detection	Track; alert on sudden increase	Phase 4
`policy_override_rate`	Fraction of `critical`-tier rejections that were subsequently overridden	≤ 0.005 (0.5%); each override MUST carry board/legal approval reference	Phase 3

Agent Registry Minimum Performance Bar

Every agent registered in the OpenEAGO Agent Registry MUST meet the following minimum performance bar. The Planning Agent MUST NOT select an agent that fails any of these minimums for an execution plan, regardless of cost or capability fit score.

Property	Minimum Required Value	Eviction Rule
`reliability_score`	≥ 0.95	Agent removed from registry if rolling 7-day score drops below 0.90
`availability_pct` (30d)	≥ 0.9900 (99.00%)	Agent marked `degraded` if drops below 0.9900; removed if below 0.9500
`error_rate` (7d rolling)	≤ 0.05 (5.00%)	Agent marked `degraded` if exceeds 0.05; removed if exceeds 0.10
`latency_p99_ms` (7d rolling)	≤ declared `latency.p99_ms` × 1.20	Agent marked `degraded` if exceeds declared SLO; removed if exceeds 2× declared SLO
`compliance_certification_valid`	`true`	Agent suspended immediately if certification expires

Registry states: healthy → degraded → suspended → removed. Agents in degraded state MAY be selected for non-critical workflows only. Agents in suspended or removed state MUST NOT be selected.
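The eviction rules and the selection rule can be sketched as follows. Two points are assumptions of this example, not statements from the table: when several rules fire, the most severe state wins, and a `reliability_score` between 0.90 and 0.95 triggers no state change because the table gives no eviction rule for that range. The `AgentMetrics` shape and function names are hypothetical.

```python
from dataclasses import dataclass

SEVERITY = ["healthy", "degraded", "suspended", "removed"]  # least to most severe


@dataclass(frozen=True)
class AgentMetrics:
    reliability_score_7d: float
    availability_pct_30d: float
    error_rate_7d: float
    latency_p99_ms_7d: float
    declared_p99_ms: float
    compliance_certification_valid: bool


def registry_state(m: AgentMetrics) -> str:
    """Apply each eviction rule and return the most severe resulting state."""
    findings = ["healthy"]
    if m.reliability_score_7d < 0.90:
        findings.append("removed")
    if m.availability_pct_30d < 0.95:
        findings.append("removed")
    elif m.availability_pct_30d < 0.99:
        findings.append("degraded")
    if m.error_rate_7d > 0.10:
        findings.append("removed")
    elif m.error_rate_7d > 0.05:
        findings.append("degraded")
    if m.latency_p99_ms_7d > 2 * m.declared_p99_ms:
        findings.append("removed")
    elif m.latency_p99_ms_7d > m.declared_p99_ms:
        findings.append("degraded")
    if not m.compliance_certification_valid:
        findings.append("suspended")
    return max(findings, key=SEVERITY.index)


def selectable(state: str, critical_workflow: bool) -> bool:
    """Degraded agents MAY serve non-critical workflows; suspended/removed never."""
    if state == "healthy":
        return True
    if state == "degraded":
        return not critical_workflow
    return False


slow = AgentMetrics(0.97, 0.995, 0.01, 900, 800, True)
print(registry_state(slow), selectable(registry_state(slow), critical_workflow=True))
# degraded False
```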

SLA/SLO Feasibility Check in Phase 2

During Phase 2 (Planning & Negotiation), the Planning Agent MUST perform an explicit SLA/SLO feasibility check as a required negotiation sub-step before forwarding the execution plan to Phase 3.

The feasibility check MUST verify all four SLO objective types for every selected agent:

For each selected_agent in execution_plan.selected_agents:
  1. Retrieve agent.sla_guarantees from Agent Registry
  2. Check latency_p99_ms: agent.sla_guarantees.latency.p99_ms ≤ plan.latency_requirement.p99_ms
  3. Check availability_pct: agent.sla_guarantees.availability.availability_pct ≥ plan.availability_requirement
  4. Check throughput_rps: agent.sla_guarantees.throughput.throughput_rps ≥ plan.throughput_requirement
  5. Check error_rate_max: agent.sla_guarantees.error_rate.error_rate_max ≤ plan.error_rate_tolerance
  6. Derive sla_breach_probability = f(historical_sli_variance, agent_reliability_score)
  7. If sla_breach_probability > 0.20: mark agent as "sla_at_risk"; select fallback
  8. If no compliant agent available: return negotiation.status = "rejected" with reason "sla_slo_infeasible"

Negotiation checks MUST include "sla_slo" in the checks array. A negotiation result with status = "accepted" MUST mean that all four SLO objective checks passed for every selected agent.
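The numbered steps above can be sketched in Python. Several details are assumptions: the function `f` for `sla_breach_probability` is not defined in this document, so precomputed probabilities are passed in; agents marked `sla_at_risk` are set aside so that other candidates act as the fallback; and the plan and result dictionaries only approximate the shapes implied by the pseudocode.

```python
BREACH_PROBABILITY_LIMIT = 0.20


def feasibility_failures(agent_sla: dict, plan: dict) -> list[str]:
    """Return the names of SLO objective checks (steps 2-5) that an agent fails."""
    failures = []
    if agent_sla["latency"]["p99_ms"] > plan["latency_requirement"]["p99_ms"]:
        failures.append("latency_p99_ms")
    if agent_sla["availability"]["availability_pct"] < plan["availability_requirement"]:
        failures.append("availability_pct")
    if agent_sla["throughput"]["throughput_rps"] < plan["throughput_requirement"]:
        failures.append("throughput_rps")
    if agent_sla["error_rate"]["error_rate_max"] > plan["error_rate_tolerance"]:
        failures.append("error_rate_max")
    return failures


def negotiate(candidates: dict[str, dict], plan: dict,
              breach_probability: dict[str, float]) -> dict:
    """Run the feasibility check over candidate agents keyed by agent id."""
    compliant, at_risk = [], []
    for agent_id, agent_sla in candidates.items():
        if feasibility_failures(agent_sla, plan):
            continue
        # Step 7: a high breach probability marks the agent "sla_at_risk".
        if breach_probability[agent_id] > BREACH_PROBABILITY_LIMIT:
            at_risk.append(agent_id)
        else:
            compliant.append(agent_id)
    if not compliant:
        # Step 8: no compliant agent is available.
        return {"status": "rejected", "reason": "sla_slo_infeasible", "checks": ["sla_slo"]}
    return {"status": "accepted", "selected_agents": compliant,
            "sla_at_risk": at_risk, "checks": ["sla_slo"]}


plan = {"latency_requirement": {"p99_ms": 1000}, "availability_requirement": 0.99,
        "throughput_requirement": 40, "error_rate_tolerance": 0.05}
fast = {"latency": {"p99_ms": 800}, "availability": {"availability_pct": 0.995},
        "throughput": {"throughput_rps": 50}, "error_rate": {"error_rate_max": 0.02}}
slow_agent = {**fast, "latency": {"p99_ms": 1500}}
result = negotiate({"agent-a": fast, "agent-b": slow_agent}, plan,
                   {"agent-a": 0.05, "agent-b": 0.05})
print(result["status"], result["selected_agents"])  # accepted ['agent-a']
```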

Performance Metrics for This Framework

Metric	Definition	Target
SLO Feasibility Check Latency	Time to complete the Phase 2 SLA/SLO feasibility sub-step	≤ 100 ms p99
Breach Detection Latency	Time from SLO target breach to `sla_breach_event` emission	≤ 5 seconds
Breach Recovery Time (Fallback)	Time from fallback activation to `sla_state = active` restoration	≤ 30 seconds
KPI Export Latency	Maximum age of KPI data available to Prometheus scrape	≤ 60 seconds

Summary

Performance accountability in OpenEAGO is enforced through three concrete mechanisms:

SLA Negotiation (Phase 2): Every execution plan MUST pass a formal four-dimension SLO feasibility check. Plans where agent SLOs cannot be met are rejected at negotiation, not discovered at runtime.
SLA Monitoring (Phase 4): The breach state machine provides a normative lifecycle — active → at_risk → breached → [pause/fallback/escalate] — with mandatory event emission at each transition. SLA status is a first-class field in execution outputs.
KPI Catalog (All Phases): The specification mandates 20+ KPIs across reliability, performance, compliance, financial, and security dimensions, exposed via OpenTelemetry/Prometheus. These KPIs are specification-level requirements, not optional instrumentation.

For the risk model that incorporates SLA breach probability into the Operational Risk dimension, see Risk Management Framework.

For machine-readable schema definitions of sla_guarantees and sla_compliance_status, see spec/v0.1.0/schemas/planning-negotiation.schema.json and spec/v0.1.0/schemas/execution-resilience.schema.json.
