This playbook applies to agents built on BaseRetailAgent and Foundry integration.
Latency spikes usually come from one of four sources: model inference, tool calls, memory tier access, or adapter I/O. This playbook guides you to isolate the slow span and implement guardrails that keep p95/p99 within SLA.
- P95 or P99 latency exceeds SLA
- Spike in request timeouts
- Check model response time (SLM/LLM split).
- Check tool-call latency and error rate.
- Check memory tier latency (hot/warm/cold).
- Check downstream adapter latency.
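The four checks above amount to asking which span consumed the most time on a slow request. A minimal sketch of per-span timing follows, assuming nothing about the real tracing stack: `SpanTimer` and the span names `model`, `tools`, `memory`, and `adapters` are hypothetical names chosen for illustration, not part of BaseRetailAgent.

```python
import asyncio
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Optional


class SpanTimer:
    """Accumulates wall-clock time per named span for one request (hypothetical helper)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def span(self, name: str):
        # Time everything inside the `with` body, including awaited calls.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals_ms[name] += (time.perf_counter() - start) * 1000

    def slowest(self) -> Optional[str]:
        # Name of the span that consumed the most time, or None if nothing was recorded.
        if not self.totals_ms:
            return None
        return max(self.totals_ms, key=self.totals_ms.get)


async def handle_request(timer: SpanTimer) -> None:
    # Stand-ins for real work: sleeps simulate a fast model call and a slow memory read.
    with timer.span("model"):
        await asyncio.sleep(0.01)
    with timer.span("memory"):
        await asyncio.sleep(0.05)


if __name__ == "__main__":
    timer = SpanTimer()
    asyncio.run(handle_request(timer))
    print(timer.slowest(), dict(timer.totals_ms))
```

In a real deployment the same totals would more likely come from the existing tracing backend; the point of the sketch is only that each of the four sources gets its own measured span.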
- Model selection skewed to LLM unexpectedly
- Tool calls running sequentially instead of in parallel
- Redis/Cosmos/Blob latency increases
- Rate limits or throttling from upstream systems
- Temporarily raise `complexity_threshold` to favor SLM.
- Disable or defer non-critical tools for hot paths.
- Increase timeouts slightly only if upstream is healthy.
- Reduce payload size and prompt length.
- Add latency budgets per tool.
- Configure circuit breakers for slow tools.
- Add tracing to identify hot spans.
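One way to combine per-tool latency budgets with a circuit breaker is sketched below. It is an illustration under stated assumptions, not the platform's implementation: `CircuitBreaker`, `call_with_breaker`, and the parameters `max_failures`, `reset_after_s`, and `fallback` are hypothetical names. A call counts as a failure when it raises or when it exceeds its budget.

```python
import asyncio
import time
from typing import Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after_s` seconds have passed (hypothetical helper)."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, allow trial calls
        # until the next result is recorded.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


async def call_with_breaker(breaker, tool, payload, budget_ms: float, fallback=None):
    # Skip the tool entirely while the breaker is open.
    if not breaker.allow():
        return fallback
    start = time.perf_counter()
    try:
        result = await tool(payload)
    except Exception:
        breaker.record(False)
        raise
    elapsed_ms = (time.perf_counter() - start) * 1000
    # A call that succeeds but blows its budget still counts against the tool.
    breaker.record(elapsed_ms <= budget_ms)
    return result


if __name__ == "__main__":
    async def slow_tool(payload):
        await asyncio.sleep(0.05)
        return payload

    async def main():
        breaker = CircuitBreaker(max_failures=2, reset_after_s=5.0)
        for _ in range(3):
            print(await call_with_breaker(breaker, slow_tool, "query", budget_ms=10, fallback="skipped"))

    asyncio.run(main())
```

Note that a slow call still returns its result here; the breaker only affects later calls. Whether to also cancel the slow call in flight is a separate decision that depends on whether the tool is safe to abandon.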
- Add per-tool latency measurement and budgets.
- Parallelize independent tool calls.
- Route to SLM for low-complexity queries.
- Add timeouts around memory tier calls.
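For the last item, timeouts around memory tier calls, a minimal sketch using `asyncio.wait_for` is shown here. The helper names `read_with_timeout` and `read_tiered`, and the idea of passing tiers as `(read_fn, timeout_s)` pairs, are assumptions made for illustration; the real memory tier clients may expose a different interface.

```python
import asyncio


async def read_with_timeout(read_fn, key, timeout_s: float, default=None):
    """Await `read_fn(key)` but give up after `timeout_s` seconds."""
    try:
        return await asyncio.wait_for(read_fn(key), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending read is cancelled by wait_for; treat it as a miss.
        return default


async def read_tiered(key, tiers):
    """Try each (read_fn, timeout_s) pair in order, e.g. hot, then warm, then cold.

    Returns the first non-None value, or None if every tier misses or times out.
    """
    for read_fn, timeout_s in tiers:
        value = await read_with_timeout(read_fn, key, timeout_s)
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    async def slow_hot_tier(key):
        # Simulates a degraded hot tier.
        await asyncio.sleep(0.5)
        return "hot:" + key

    async def warm_tier(key):
        return "warm:" + key

    tiers = [(slow_hot_tier, 0.02), (warm_tier, 0.1)]
    print(asyncio.run(read_tiered("session-123", tiers)))
```

With this shape a degraded tier costs at most its own timeout before the next tier is tried, so the worst-case memory latency is bounded by the sum of the configured timeouts.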
```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def call_tool_with_budget(tool, payload, budget_ms: int):
    # Measure the call and warn when it exceeds its latency budget.
    start = time.perf_counter()
    result = await tool(payload)
    duration_ms = (time.perf_counter() - start) * 1000
    if duration_ms > budget_ms:
        name = getattr(tool, "__name__", repr(tool))
        logger.warning("tool budget exceeded", extra={"tool": name, "ms": duration_ms})
    return result


async def run_tools_parallel(tool_calls):
    # Run independent tool calls concurrently; results keep the input order.
    return await asyncio.gather(*[tc() for tc in tool_calls])


def should_use_slm(request: dict, threshold: float) -> bool:
    # Requests scoring below the threshold are routed to the SLM.
    complexity = request.get("complexity_score", 0.0)
    return complexity < threshold
```

```mermaid
flowchart TD
    A[Latency spike detected] --> B{Where is time spent?}
    B -->|Model| C[Check SLM/LLM latency]
    B -->|Tools| D[Check tool call p95]
    B -->|Memory| E[Check hot/warm/cold latency]
    B -->|Adapters| F[Check external I/O]
    C --> G[Mitigate: route to SLM or roll back model]
    D --> H[Mitigate: parallelize, add budgets]
    E --> I[Mitigate: cache, reduce tier calls]
    F --> J[Mitigate: retries, pooling, rate limits]
```
If latency persists for more than 30 minutes, open an incident with the model provider and the platform owner.