# Operational Guide

This guide summarizes the day-2 operational practices for running the KLL sketch in production services and batch pipelines.

## Monitoring & Observability

### Key service level indicators
- **Ingestion throughput**: items/second (overall and per tenant). Track moving averages and high-water marks to detect ingestion stalls.
- **Sketch merge latency**: wall-clock latency and failure rate for merge jobs. Watch for sustained increases (>2× baseline), which indicate undersized capacity.
- **Quantile query latency**: P50/P95/P99 latency per query type (`quantile`, `quantiles`, `rank`).
- **Serialized blob size**: mean and max payload size emitted by `to_bytes()`. Sudden jumps usually mean skewed data or capacity misconfiguration.
- **Compaction counters**: number of compactions per level. Rising compaction frequency implies either a bursty workload or an undersized `capacity`.

### Recommended instrumentation
- Wrap ingestion entry points (e.g., `add`, `extend`) and query methods with metrics (`Counter`/`Histogram`) and expose them via Prometheus or your platform’s native telemetry; see the sketch after this list.
- Log sketch metadata at debug level: `capacity`, `size`, level configuration, and the deterministic seed. Mask PII and log only aggregates.
- Emit structured events when merges occur, including source shard identifiers and duration.
- Sample serialized blobs and validate round-trips in background tasks to catch corruption early.
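
A minimal instrumentation sketch, assuming the `KLL` class exposes `add`, `extend`, `quantile`, and `to_bytes()` as described in this guide; the metric names, labels, and the `InstrumentedKLL` wrapper are illustrative, not part of the library.

```python
from prometheus_client import Counter, Histogram

INGESTED = Counter(
    "kll_items_ingested_total", "Items ingested into the sketch", ["tenant"]
)
QUERY_LATENCY = Histogram(
    "kll_query_latency_seconds", "Quantile query latency", ["query_type"]
)
BLOB_SIZE = Histogram("kll_blob_bytes", "Serialized sketch payload size in bytes")


class InstrumentedKLL:
    """Thin wrapper that records the service level indicators listed above."""

    def __init__(self, sketch, tenant="default"):
        self._sketch = sketch
        self._tenant = tenant

    def add(self, item):
        self._sketch.add(item)
        INGESTED.labels(tenant=self._tenant).inc()

    def extend(self, items):
        items = list(items)
        self._sketch.extend(items)
        INGESTED.labels(tenant=self._tenant).inc(len(items))

    def quantile(self, q):
        # Record per-query-type latency for the P50/P95/P99 panels.
        with QUERY_LATENCY.labels(query_type="quantile").time():
            return self._sketch.quantile(q)

    def to_bytes(self):
        blob = self._sketch.to_bytes()
        BLOB_SIZE.observe(len(blob))
        return blob
```

The same pattern extends to `quantiles` and `rank`; expose the metrics with `prometheus_client.start_http_server` or your platform's scrape endpoint.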

### Alerting policies
- **Ingestion stalled**: alert when ingestion throughput drops to zero for more than 1 minute while upstream traffic persists (a minimal detection sketch follows this list).
- **Merge backlog**: alert when the merge queue length exceeds 2× the normal baseline for 5 consecutive intervals.
- **Query SLO breach**: alert when P95 query latency exceeds the agreed SLO (e.g., 50 ms) for 3 consecutive intervals.
- **Serialization errors**: alert on any deserialization failure or checksum mismatch.
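
A minimal detection sketch for the ingestion-stall rule, assuming throughput samples are collected once per second from the counters above; the `StallDetector` class and its window length are illustrative, and in practice this logic usually lives in your alerting system rather than in application code.

```python
from collections import deque

WINDOW_SECONDS = 60  # "more than 1 minute" from the policy above


class StallDetector:
    def __init__(self):
        self.ingest_rates = deque(maxlen=WINDOW_SECONDS)    # items/s samples
        self.upstream_rates = deque(maxlen=WINDOW_SECONDS)  # upstream requests/s

    def record(self, ingest_rate: float, upstream_rate: float) -> None:
        self.ingest_rates.append(ingest_rate)
        self.upstream_rates.append(upstream_rate)

    def stalled(self) -> bool:
        # Fire only when the window is full, ingestion is flat at zero,
        # and upstream traffic persists (which rules out a quiet upstream).
        return (
            len(self.ingest_rates) == WINDOW_SECONDS
            and sum(self.ingest_rates) == 0
            and sum(self.upstream_rates) > 0
        )
```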

### Dashboards & diagnostics
- Chart ingestion throughput, query latency, and serialized blob sizes on a single operational dashboard.
- Maintain a table of per-level buffer sizes and compaction counts to aid debugging.
- Surface the release version, git SHA, and configuration flags in dashboard annotations.

## Capacity & Configuration Management
- Size `capacity` from the required rank error ε using `ε ≈ 1 / capacity`; a sizing sketch follows this list. Double the capacity if you observe compaction hot spots or serialized blobs exceeding transport limits.
- For workloads with heavy merges, align shard capacities; a single small shard can dominate the error.
- Document default seeds and level configuration in configuration management, and keep environment-specific overrides in version control.
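
A sizing sketch based on the rule of thumb above; `capacity_for_rank_error` is an illustrative helper name, not part of the library API.

```python
import math


def capacity_for_rank_error(epsilon: float) -> int:
    """Smallest capacity whose approximate rank error is at most epsilon (ε ≈ 1 / capacity)."""
    return math.ceil(1.0 / epsilon)


# Example: a 0.1% rank error target needs a capacity of roughly 1000,
# while a 1% target needs roughly 100.
assert capacity_for_rank_error(0.001) == 1000
assert capacity_for_rank_error(0.01) == 100
```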

## Upgrade Playbook

1. **Review release notes**
   - Read `CHANGELOG.md` for breaking changes, new features, and migration steps.
   - Check dependency bumps, especially the minimum supported Python version and `setuptools` constraints.

2. **Stage the upgrade**
   - Pin the target version in your dependency management tool and deploy to a staging environment.
   - Run the full pytest suite plus representative workload benchmarks (`benchmarks/bench_kll.py`) on staging data.
   - Validate that serialized blobs created by the previous version deserialize correctly with the new release (backwards compatibility) and vice versa (forwards compatibility when rolling back), for example with a check like the one below.
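
   A hedged sketch of such a staging check, assuming blobs written by the previous release are stored as fixtures and that the sketch exposes a `from_bytes()` constructor alongside `to_bytes()`; adjust the import path and constructor name to the library's actual deserialization entry point.

   ```python
   from pathlib import Path

   from kll import KLL  # assumed import path

   FIXTURE_DIR = Path("fixtures/blobs_previous_release")


   def test_previous_release_blobs_round_trip():
       for blob_path in FIXTURE_DIR.glob("*.bin"):
           sketch = KLL.from_bytes(blob_path.read_bytes())
           # A cheap sanity query proves the payload decoded into a usable sketch.
           assert sketch.quantile(0.5) is not None
           # Re-serializing must also succeed so fresh blobs can be fed to the
           # previous release when exercising the rollback path.
           assert sketch.to_bytes()
   ```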

3. **Production rollout**
   - Perform a canary deployment (5–10% of traffic) and monitor ingestion throughput, query latency, and error rates for at least one compaction window.
   - If metrics remain within SLOs, proceed with a progressive rollout to all shards or services.

4. **Rollback procedure**
   - Keep the previous version pinned and ready to redeploy.
   - Because serialization is version-stable, downgrades are safe provided no breaking schema change is noted in the changelog. Always confirm with staged rollback tests.
   - After a rollback, clear metrics annotations and document the incident.

5. **Post-upgrade validation**
   - Confirm dashboards show the new version identifiers.
   - Update operational runbooks with any new configuration flags or behaviours introduced in the release.

## Incident Response Checklist
- Capture failing serialized blobs and store them with timestamps and shard identifiers.
- Dump per-level buffer states via `KLL.debug_state()` (if enabled) or the equivalent introspection helper for forensic analysis.
- Reconstruct the workloads that triggered failures from recorded input batches and replay them in a sandbox before patching production; a capture-and-replay sketch follows this list.
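
A hedged capture-and-replay sketch; the file layout under `CAPTURE_DIR` and the `make_sketch` factory are assumptions for illustration, and `extend` is the ingestion entry point mentioned earlier in this guide.

```python
import time
from pathlib import Path

CAPTURE_DIR = Path("/var/incidents/kll")  # illustrative location


def capture_failing_blob(blob: bytes, shard_id: str) -> Path:
    """Persist an offending blob with a timestamp and shard identifier for forensics."""
    CAPTURE_DIR.mkdir(parents=True, exist_ok=True)
    path = CAPTURE_DIR / f"{int(time.time())}_{shard_id}.bin"
    path.write_bytes(blob)
    return path


def replay(batches, make_sketch):
    """Re-ingest recorded input batches into a fresh sandbox sketch for inspection."""
    sketch = make_sketch()
    for batch in batches:
        sketch.extend(batch)
    return sketch
```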

## Documentation & Runbook Hygiene
- Store this guide alongside other operational runbooks in your organization’s knowledge base.
- Schedule quarterly reviews to update thresholds, metrics, and playbooks in line with observed production behaviour.
- When onboarding new services, link this document from their service runbooks to ensure consistent operational standards.