Standardize SLO metrics to p90 and add minimum QPS constraint
Changed all E2E latency SLO targets from p95 to p90 for consistency
across the system. All SLO targets now use the 90th percentile (TTFT p90,
TPOT p90, E2E p90).
Key changes:
- Updated SLO templates with calculated e2e_p90 values using formula:
ttft_p90 + (generation_tokens_mean - 1) * tpot_p90
- Modified schemas to use e2e_p90_target_ms and predicted_e2e_p90_ms
- Updated capacity planner to calculate p90 E2E latency (removed the 1.2x
p95 buffer)
- Changed all UI labels and metrics from "E2E p95" to "E2E p90"
- Updated documentation, test files, and data files for consistency
- Added minimum QPS of 0.1 in traffic profile generation to handle
small workloads (prevents errors in scenarios with fewer than 100 users)
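For illustration, the minimum-QPS constraint could be sketched as a simple floor on the computed rate. The function name `clamp_qps` and the constant `MIN_QPS` are hypothetical and not taken from the repository; only the 0.1 floor comes from this commit message.

```python
# Hypothetical sketch: names below are illustrative, not the project's actual API.
MIN_QPS = 0.1  # floor stated in the commit message


def clamp_qps(raw_qps: float, min_qps: float = MIN_QPS) -> float:
    """Return raw_qps, raised to min_qps if the computed rate is smaller.

    Small workloads can yield a QPS at or near zero; flooring the value
    keeps downstream capacity calculations from receiving a degenerate rate.
    """
    return max(raw_qps, min_qps)


if __name__ == "__main__":
    print(clamp_qps(0.0))   # floored to the minimum
    print(clamp_qps(0.04))  # floored to the minimum
    print(clamp_qps(5.0))   # left unchanged
```

How the raw QPS is derived from user count is not stated here, so the sketch takes it as an input rather than guessing at that calculation.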
This change makes SLO percentiles consistent across all metrics and
provides more realistic E2E latency targets based on actual token
generation requirements.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andre Fredette <afredette@redhat.com>
- Rationale: E2E latency varies by workload (generation length, streaming mode, use case), so it's calculated per-request rather than stored as a fixed benchmark value
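A minimal sketch of the per-request calculation, using the formula from the commit message (ttft_p90 + (generation_tokens_mean - 1) * tpot_p90). The function name `estimate_e2e_p90_ms` and the sample numbers are assumptions for illustration, not values from the repository.

```python
# Hypothetical sketch of the E2E p90 estimate; the function name and
# example inputs are illustrative assumptions.


def estimate_e2e_p90_ms(
    ttft_p90_ms: float,
    tpot_p90_ms: float,
    generation_tokens_mean: float,
) -> float:
    """Estimate p90 end-to-end latency in milliseconds.

    The first token costs the time-to-first-token (TTFT); each of the
    remaining (generation_tokens_mean - 1) tokens costs one
    time-per-output-token (TPOT) interval. No extra buffer is applied.
    """
    return ttft_p90_ms + (generation_tokens_mean - 1) * tpot_p90_ms


if __name__ == "__main__":
    # Example: 200 ms TTFT, 50 ms TPOT, 150 generated tokens on average.
    e2e = estimate_e2e_p90_ms(200.0, 50.0, 150)
    print(e2e)  # 200 + 149 * 50 = 7650.0
```

Because the generation length is an input, the same TTFT and TPOT benchmarks give different E2E estimates for different workloads, which is why the value is computed rather than stored.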