
Commit 5cadc96

Merge pull request #14 from SaridakisStamatisChristos/codex/evaluate-production-readiness-and-assign-grade
Add operational runbook documentation
2 parents 260e358 + e048ffa commit 5cadc96

File tree

2 files changed: +75 -0 lines changed

README.md

Lines changed: 6 additions & 0 deletions
@@ -133,6 +133,12 @@ Visualise the outputs via `benchmarks/bench_plots.ipynb`, and read [`docs/benchm

---

## 🛡️ Operations

For day-2 guidance—monitoring, alerting, capacity planning, and a step-by-step upgrade playbook—see the [Operational Guide](docs/operations.md).

---

## 🗺️ Roadmap

* Optional NumPy/C hot paths for sort/merge.

docs/operations.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@

# Operational Guide

This guide summarizes the day-2 operational practices for running the KLL sketch in production services and batch pipelines.

## Monitoring & Observability

### Key service-level indicators

- **Ingestion throughput**: items/second (overall and per tenant). Track moving averages and high-water marks to detect ingestion stalls (see the sketch after this list).
- **Sketch merge latency**: wall-clock latency and failure rate for merge jobs. Watch for sustained increases (>2× baseline), which indicate undersized capacity.
- **Quantile query latency**: P50/P95/P99 latency per query type (`quantile`, `quantiles`, `rank`).
- **Serialized blob size**: mean and max payload size emitted by `to_bytes()`. Sudden jumps usually indicate skewed data or a capacity misconfiguration.
- **Compaction counters**: number of compactions per level. Rising compaction frequency implies either a bursty workload or an undersized `capacity`.
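
A minimal sketch of the moving-average and high-water-mark tracking mentioned in the first indicator, assuming throughput is sampled once per second; the class and method names are illustrative, not part of the library.

```python
# Illustrative helper (not part of the library): tracks a moving average and
# high-water mark of items/second so an ingestion stall is easy to spot.
from collections import deque
from typing import Deque


class ThroughputTracker:
    def __init__(self, window_seconds: int = 60) -> None:
        self.samples: Deque[float] = deque(maxlen=window_seconds)
        self.high_water_mark = 0.0

    def record(self, items_per_second: float) -> None:
        """Record one per-second throughput sample."""
        self.samples.append(items_per_second)
        self.high_water_mark = max(self.high_water_mark, items_per_second)

    def moving_average(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def looks_stalled(self) -> bool:
        # A full window of zero throughput, after traffic was previously seen,
        # is the "ingestion stalled" signal described above.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.high_water_mark > 0.0 and self.moving_average() == 0.0
```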

### Recommended instrumentation

- Wrap ingestion entry points (e.g., `add`, `extend`) and query methods with metrics (`Counter`/`Histogram`), as sketched after this list. Expose them via Prometheus or your platform’s native telemetry.
- Log sketch metadata at debug level: `capacity`, `size`, level configuration, and deterministic seed. Mask PII by logging only aggregates.
- Emit structured events when merges occur, including source shard identifiers and duration.
- Sample serialized blobs and validate round-trips in background tasks to catch corruption early.
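
A minimal instrumentation sketch, assuming the `prometheus_client` package and a hypothetical `KLL` class exposing `add` and `quantile`; the import path and metric names are assumptions to adapt to your service.

```python
# Hypothetical wrapper; `from kll import KLL` is an assumed import path.
import time

from prometheus_client import Counter, Histogram

from kll import KLL

ITEMS_INGESTED = Counter("kll_items_ingested_total", "Items added to the sketch")
QUERY_LATENCY = Histogram("kll_query_latency_seconds", "Quantile query latency", ["method"])


class InstrumentedKLL:
    """Thin wrapper that counts ingested items and times quantile queries."""

    def __init__(self, capacity: int = 200) -> None:
        self._sketch = KLL(capacity)

    def add(self, value: float) -> None:
        self._sketch.add(value)
        ITEMS_INGESTED.inc()

    def quantile(self, q: float) -> float:
        start = time.perf_counter()
        try:
            return self._sketch.quantile(q)
        finally:
            QUERY_LATENCY.labels(method="quantile").observe(time.perf_counter() - start)
```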

### Alerting policies

- **Ingestion stalled**: alert when ingestion throughput drops to zero for more than 1 minute while upstream traffic persists.
- **Merge backlog**: alert when merge queue length exceeds 2× its normal baseline for 5 consecutive intervals.
- **Query SLO breach**: alert when P95 query latency exceeds the agreed SLO (e.g., 50 ms) for 3 consecutive intervals (a consecutive-interval check is sketched after this list).
- **Serialization errors**: alert on any deserialization failure or checksum mismatch.
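
A sketch of the "N consecutive intervals" rule used by the merge-backlog and SLO-breach policies, assuming one measurement per evaluation interval; the class is illustrative, not a library API.

```python
# Illustrative alert-evaluation helper: fire only when every one of the last
# `intervals` samples breaches the threshold, per the policies above.
from collections import deque
from typing import Deque


class ConsecutiveBreachDetector:
    def __init__(self, threshold: float, intervals: int = 3) -> None:
        self.threshold = threshold
        self.samples: Deque[float] = deque(maxlen=intervals)

    def observe(self, value: float) -> bool:
        """Record one interval's measurement; return True when an alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(sample > self.threshold for sample in self.samples)


# Example: alert when P95 latency stays above a 50 ms SLO for 3 intervals.
slo_breach = ConsecutiveBreachDetector(threshold=0.050, intervals=3)
```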

### Dashboards & diagnostics

- Chart ingestion throughput, query latency, and serialized blob sizes on a single operational dashboard.
- Maintain a table of per-level buffer sizes and compaction counts to aid debugging.
- Surface release version, git SHA, and configuration flags in dashboard annotations.

## Capacity & Configuration Management

- Size `capacity` based on the required rank error ε using `ε ≈ 1 / capacity` (see the sizing sketch after this list). Double the capacity if you observe compaction hot spots or serialized blobs exceeding transport limits.
- For workloads with heavy merges, align shard capacities; a single undersized shard can dominate the overall error.
- Document default seeds and level configuration in configuration management. Keep environment-specific overrides in version control.
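
A sizing sketch based on the `ε ≈ 1 / capacity` rule of thumb above; treat the result as a starting point, not a guarantee of the implementation's exact error bound.

```python
# Illustrative sizing helper derived from the rule of thumb stated above.
import math


def capacity_for_rank_error(epsilon: float) -> int:
    """Return a starting capacity for a target rank error, e.g. 0.005 -> 200."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must be in (0, 1)")
    return math.ceil(1.0 / epsilon)


print(capacity_for_rank_error(0.005))  # 200
```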

## Upgrade Playbook

1. **Review release notes**
   - Read `CHANGELOG.md` for breaking changes, new features, and migration steps.
   - Check dependency bumps, especially the minimum supported Python version and `setuptools` constraints.

2. **Stage the upgrade**
   - Pin the target version in your dependency management tool and deploy it to a staging environment.
   - Run the full pytest suite plus representative workload benchmarks (`benchmarks/bench_kll.py`) on staging data.
   - Validate that serialized blobs created by the previous version deserialize correctly with the new release (forwards compatibility) and vice versa (backwards compatibility when rolling back); a minimal round-trip check is sketched after this playbook.

3. **Production rollout**
   - Perform a canary deployment (5–10% traffic) and monitor ingestion throughput, query latency, and error rates for at least one compaction window.
   - If metrics remain within SLOs, proceed with progressive rollout to all shards or services.

4. **Rollback procedure**
   - Keep the previous version pinned and ready to redeploy.
   - Because serialization is version-stable, downgrades are safe provided no breaking schema change is noted in the changelog. Always confirm with staged rollback tests.
   - After rollback, clear metrics annotations and document the incident.

5. **Post-upgrade validation**
   - Confirm dashboards show the new version identifiers.
   - Update operational runbooks with any new configuration flags or behaviours introduced in the release.
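
A minimal round-trip check for the compatibility validation in step 2, assuming the sketch exposes `to_bytes()`, a `from_bytes()` constructor, and `quantile()` (only `to_bytes()` is referenced elsewhere in this guide); adapt the names to the actual API before use.

```python
# Hypothetical compatibility check; `from kll import KLL` and `KLL.from_bytes`
# are assumptions -- substitute the real constructor and import path.
from kll import KLL


def check_round_trip(blob: bytes, probe_quantiles=(0.5, 0.95, 0.99)) -> None:
    """Deserialize a blob written by the previous release, query it, re-serialize it."""
    sketch = KLL.from_bytes(blob)
    for q in probe_quantiles:
        # Queries should succeed after the upgrade.
        assert sketch.quantile(q) is not None
    # Re-serialization should also succeed so a rollback can read new blobs.
    assert isinstance(sketch.to_bytes(), bytes)
```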

## Incident Response Checklist

- Capture failing serialized blobs and store them with timestamps and shard identifiers (see the capture sketch after this list).
- Dump per-level buffer states via `KLL.debug_state()` (if enabled) or the equivalent introspection helper for forensic analysis.
- Reconstruct workloads that triggered failures using recorded input batches and replay them in a sandbox before patching production.
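
An illustrative capture helper for the first checklist item; the output directory and filename scheme are assumptions, not project conventions.

```python
# Illustrative incident-capture helper: persist a failing blob with a timestamp
# and shard identifier so it can be replayed in a sandbox later.
import time
from pathlib import Path


def capture_failing_blob(blob: bytes, shard_id: str, out_dir: str = "/var/tmp/kll-incidents") -> Path:
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{int(time.time())}-{shard_id}.kll"
    path.write_bytes(blob)
    return path
```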

## Documentation & Runbook Hygiene

- Store this guide alongside other operational runbooks in your organization’s knowledge base.
- Schedule quarterly reviews to update thresholds, metrics, and playbooks in line with observed production behaviour.
- When onboarding new services, link this document from their service runbooks to ensure consistent operational standards.
