Commit 3e7cadf

refactor: update metrics tracking and dashboard queries for E2E latency
1 parent 27a129d commit 3e7cadf

6 files changed

Lines changed: 203 additions & 93 deletions

METRICS.md

Lines changed: 20 additions & 22 deletions
@@ -13,24 +13,25 @@ SpiceBench (OTel instruments)
1313

1414
## Metric Checklist
1515

16-
| # | Metric | OTel Instrument | Source | Emitted to telemetry | Status |
17-
| --- | ------------------------------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ------------------------ | --------------- |
18-
| 1 | **Data Size** (total bytes ingested) | `ingestion_bytes_total` (Gauge\<u64\>) | SUT adapter `metrics``ingestion.bytes_ingested` | ✅ via `Telemetry.emit()` | ✅ Implemented |
19-
| 2 | **Ingestion records/s** | `ingestion_rows_per_sec` (Gauge\<f64\>) | SUT adapter `metrics``ingestion.rows_per_sec` | ✅ via `Telemetry.emit()` | ✅ Implemented |
20-
| 3 | **Ingestion rows total** | `ingestion_rows_total` (Gauge\<u64\>) | SUT adapter `metrics``ingestion.rows_ingested` | ✅ via `Telemetry.emit()` | ✅ Implemented |
21-
| 4 | **Connections / Clients** | `active_connections` (Gauge\<u64\>) | CLI `--concurrency` + SUT adapter `metrics``ingestion.active_connections` | ✅ via `Telemetry.emit()` | ✅ Implemented |
22-
| 5 | **Queries/s, Requests/s** | `queries_per_sec` (Gauge\<f64\>), `queries_total` (Counter\<u64\>) | Computed from total iterations / test duration | ✅ via `Telemetry.emit()` | ✅ Implemented |
23-
| 6 | **Query Latency (p50)** | `median_duration_ms` (Gauge\<u64\>) | Query driver per-query statistics | ✅ via `Telemetry.emit()` | ✅ Implemented |
24-
| 7 | **Query Latency (p99)** | `p99_duration_ms` (Gauge\<u64\>) | Query driver per-query statistics | ✅ via `Telemetry.emit()` | ✅ Implemented |
25-
| 8 | **Efficiency (cores)** | `efficiency_queries_per_core` (Gauge\<f64\>) | Computed: `queries_per_sec / cpu_cores` | ✅ via `Telemetry.emit()` | ✅ Implemented |
26-
| 9 | **Resource Usage – CPU** | `sut_cpu_usage_percent` (Gauge\<f64\>) | SUT adapter `metrics``resource.cpu_usage_percent` | ✅ via `Telemetry.emit()` | ✅ Implemented |
27-
| 10 | **Resource Usage – Memory** | `peak_memory_usage_mb` / `median_memory_usage_mb` (Gauge\<f64\>), `sut_memory_usage_bytes` (Gauge\<u64\>) | Local process via `sysinfo` + SUT adapter `metrics` | ✅ via `Telemetry.emit()` | ✅ Implemented |
28-
| 11 | **Resource Usage – Disk** | `sut_disk_read_bytes` / `sut_disk_write_bytes` (Gauge\<u64\>) | SUT adapter `metrics``resource.disk_read_bytes` / `disk_write_bytes` | ✅ via `Telemetry.emit()` | ✅ Implemented |
29-
| 12 | **Resource Usage – IOPS** | `sut_disk_read_iops` / `sut_disk_write_iops` (Gauge\<u64\>) | SUT adapter `metrics``resource.disk_read_iops` / `disk_write_iops` | ✅ via `Telemetry.emit()` | ✅ Implemented |
30-
| 13 | **E2E Latency** | `e2e_latency_ms` (Histogram\<f64\>) | **Instrument defined; not yet recorded** — requires timestamped events + query-back verification | ⚠️ Instrument only | 🔲 Not yet wired |
31-
| 14 | **E2E Duration** | `test_duration_ms` (Gauge\<u64\>) | Wall-clock time of benchmark phase | ✅ via `Telemetry.emit()` | ✅ Implemented |
32-
| 15 | **Query Queue Length** | `query_queue_length` (Gauge\<u64\>) | Query worker queue depth at query execution start (attributes: `query_name`, `client_id`) | ✅ via `Telemetry.emit()` | ✅ Implemented |
33-
| 16 | **Query Queue Duration** | `query_queue_duration_ms` (Histogram\<f64\>) | Query worker queue wait time before execution (attributes: `query_name`, `client_id`) | ✅ via `Telemetry.emit()` | ✅ Implemented |
16+
| # | Metric | OTel Instrument | Source | Emitted to telemetry | Status |
17+
| --- | ------------------------------------ | --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | ------------------------ | ------------- |
18+
| 1 | **Data Size** (total bytes ingested) | `ingestion_bytes_total` (Gauge\<u64\>) | SUT adapter `metrics``ingestion.bytes_ingested` | ✅ via `Telemetry.emit()` | ✅ Implemented |
19+
| 2 | **Ingestion records/s** | `ingestion_rows_per_sec` (Gauge\<f64\>) | SUT adapter `metrics``ingestion.rows_per_sec` | ✅ via `Telemetry.emit()` | ✅ Implemented |
20+
| 3 | **Ingestion rows total** | `ingestion_rows_total` (Gauge\<u64\>) | SUT adapter `metrics``ingestion.rows_ingested` | ✅ via `Telemetry.emit()` | ✅ Implemented |
21+
| 4 | **Connections / Clients** | `active_connections` (Gauge\<u64\>) | CLI `--concurrency` + SUT adapter `metrics``ingestion.active_connections` | ✅ via `Telemetry.emit()` | ✅ Implemented |
22+
| 5 | **Queries/s, Requests/s** | `queries_per_sec` (Gauge\<f64\>), `queries_total` (Counter\<u64\>) | Computed from total iterations / test duration | ✅ via `Telemetry.emit()` | ✅ Implemented |
23+
| 6 | **Query Latency (p50)** | `median_duration_ms` (Gauge\<u64\>) | Query driver per-query statistics | ✅ via `Telemetry.emit()` | ✅ Implemented |
24+
| 7 | **Query Latency (p99)** | `p99_duration_ms` (Gauge\<u64\>) | Query driver per-query statistics | ✅ via `Telemetry.emit()` | ✅ Implemented |
25+
| 8 | **Efficiency (cores)** | `efficiency_queries_per_core` (Gauge\<f64\>) | Computed: `queries_per_sec / cpu_cores` | ✅ via `Telemetry.emit()` | ✅ Implemented |
26+
| 9 | **Resource Usage – CPU** | `sut_cpu_usage_percent` (Gauge\<f64\>) | SUT adapter `metrics``resource.cpu_usage_percent` | ✅ via `Telemetry.emit()` | ✅ Implemented |
27+
| 10 | **Resource Usage – Memory** | `peak_memory_usage_mb` / `median_memory_usage_mb` (Gauge\<f64\>), `sut_memory_usage_bytes` (Gauge\<u64\>) | Local process via `sysinfo` + SUT adapter `metrics` | ✅ via `Telemetry.emit()` | ✅ Implemented |
28+
| 11 | **Resource Usage – Disk** | `sut_disk_read_bytes` / `sut_disk_write_bytes` (Gauge\<u64\>) | SUT adapter `metrics``resource.disk_read_bytes` / `disk_write_bytes` | ✅ via `Telemetry.emit()` | ✅ Implemented |
29+
| 12 | **Resource Usage – IOPS** | `sut_disk_read_iops` / `sut_disk_write_iops` (Gauge\<u64\>) | SUT adapter `metrics``resource.disk_read_iops` / `disk_write_iops` | ✅ via `Telemetry.emit()` | ✅ Implemented |
30+
| 13 | **E2E Latency** | `e2e_latency_ms` (Histogram\<f64\>) | Raw freshness scraper samples (`MAX(__created_at)` deltas); percentiles are computed in dashboard queries | ✅ via `Telemetry.emit()` | ✅ Implemented |
31+
| 14 | **E2E Duration** | `test_duration_ms` (Gauge\<u64\>) | Wall-clock time of benchmark phase | ✅ via `Telemetry.emit()` | ✅ Implemented |
32+
| 15 | **Query Queue Length** | `query_queue_length` (Gauge\<u64\>) | Query worker queue depth at query execution start (attributes: `query_name`, `client_id`) | ✅ via `Telemetry.emit()` | ✅ Implemented |
33+
| 16 | **Query Queue Duration** | `query_queue_duration_ms` (Histogram\<f64\>) | Query worker queue wait time before execution (attributes: `query_name`, `client_id`) | ✅ via `Telemetry.emit()` | ✅ Implemented |
34+
| 17 | **Checkpoint In-flight Queries** | `checkpoint_in_flight_queries` (Gauge\<u64\>) | Active in-flight query count while checkpoint validation windows are enabled (`client_id`) | ✅ via `Telemetry.emit()` | ✅ Implemented |
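
The `MAX(__created_at)` delta described in row 13 can be illustrated with a minimal Rust sketch. It assumes the scraper compares the newest `__created_at` value returned by the SUT against the wall-clock time of the scrape, both expressed as Unix epoch milliseconds. The helper name `freshness_sample_ms` and the clamp at zero are hypothetical and are not taken from the SpiceBench source.

```rust
/// Hypothetical helper: turns one scrape into one raw freshness sample.
///
/// `scrape_time_ms`    - wall-clock time of the scrape (Unix epoch ms).
/// `max_created_at_ms` - result of `MAX(__created_at)` from the SUT (Unix epoch ms).
///
/// The result is the kind of value that would be recorded as one
/// observation on the `e2e_latency_ms` histogram.
fn freshness_sample_ms(scrape_time_ms: i64, max_created_at_ms: i64) -> f64 {
    // Assumption: clamp at zero so clock skew between the benchmark host
    // and the event producer never yields a negative latency.
    (scrape_time_ms - max_created_at_ms).max(0) as f64
}
```

Recording raw samples this way keeps the instrument simple; any percentile is derived later, at query time.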
3435

3536
## Streaming Metrics (real-time, optional)
3637

@@ -89,7 +90,4 @@ The default `Handler::metrics()` implementation returns empty metrics, so existi
8990

9091
## Remaining Work
9192

92-
- [ ] **E2E Latency**: Implement event-creation-to-queryable latency measurement. This requires:
93-
1. Timestamping generated events at creation time
94-
2. Querying the SUT for those events after ingestion
95-
3. Recording the delta as `e2e_latency_ms` histogram observations
93+
- [ ] **E2E Latency dashboard expansion**: Add optional additional percentile panels (e.g., p50/p90/p99.9) computed from `e2e_latency_ms` in Flux.
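
As a rough illustration of what those percentile panels compute, the sketch below derives a percentile from raw latency samples in Rust. It uses the nearest-rank method, which is an assumption made for clarity: Flux's `quantile()` supports several methods, so dashboard values may differ slightly. The function name `percentile` is hypothetical.

```rust
/// Nearest-rank percentile over raw latency samples (milliseconds).
/// Returns `None` for empty input or when `p` is outside 0..=100.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    // Sort a copy so the caller's slice is left untouched.
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: rank = ceil(p / 100 * n), clamped to 1..=n.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}
```

For the samples `[120.0, 80.0, 95.0, 300.0, 110.0]`, `percentile(&samples, 50.0)` returns `Some(110.0)` and `percentile(&samples, 90.0)` returns `Some(300.0)`.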

README.md

Lines changed: 22 additions & 21 deletions
@@ -160,27 +160,28 @@ Common CLI/workflow usage:
160160

161161
### Metrics
162162

163-
| Metric | OTel Instrument | Description | Status |
164-
| ----------------------- | ------------------------------------------------ | ----------------------------------------------------- | --------------- |
165-
| Iterations | `iterations` (Gauge) | Number of query iterations per query | ✅ Implemented |
166-
| Query Status | `query_status` (Gauge) | Pass/fail status per query | ✅ Implemented |
167-
| Query Latency (p50) | `median_duration_ms` (Gauge) | Median duration per query | ✅ Implemented |
168-
| Query Latency (min/max) | `min_duration_ms`, `max_duration_ms` | Min and max duration per query | ✅ Implemented |
169-
| Query Latency (p99) | `p99_duration_ms` (Gauge) | 99th percentile duration per query | ✅ Implemented |
170-
| Health Latency | `health_latency_ms` (Histogram) | Latency of `/health` and `/v1/ready` probes | ✅ Implemented |
171-
| E2E Duration | `test_duration_ms` (Gauge) | Total wall-clock time for the benchmark phase | ✅ Implemented |
172-
| Peak/Median Memory | `peak_memory_usage_mb`, `median_memory_usage_mb` | Memory usage of the spiced process | ✅ Implemented |
173-
| Ingestion Rows/Bytes | `ingestion_rows_total`, `ingestion_bytes_total` | Total data ingested (from SUT adapter) | ✅ Implemented |
174-
| Ingestion records/s | `ingestion_rows_per_sec` (Gauge) | Sustained ingestion throughput (from SUT adapter) | ✅ Implemented |
175-
| Queries/s | `queries_per_sec` (Gauge) | Query throughput under load | ✅ Implemented |
176-
| Total Queries | `queries_total` (Counter) | Total queries executed during the run | ✅ Implemented |
177-
| Active Connections | `active_connections` (Gauge) | Number of concurrent connections/clients | ✅ Implemented |
178-
| SUT CPU | `sut_cpu_usage_percent` (Gauge) | SUT CPU utilization (from adapter `metrics`) | ✅ Implemented |
179-
| SUT Memory | `sut_memory_usage_bytes` (Gauge) | SUT memory usage (from adapter `metrics`) | ✅ Implemented |
180-
| SUT Disk I/O | `sut_disk_{read,write}_bytes` (Gauge) | SUT disk read/write bytes (from adapter `metrics`) | ✅ Implemented |
181-
| SUT Disk IOPS | `sut_disk_{read,write}_iops` (Gauge) | SUT disk IOPS (from adapter `metrics`) | ✅ Implemented |
182-
| Efficiency | `efficiency_queries_per_core` (Gauge) | Query throughput normalized by CPU cores | ✅ Implemented |
183-
| E2E Latency | `e2e_latency_ms` (Histogram) | Time from event creation to the event being queryable | 🔲 Not yet wired |
163+
| Metric | OTel Instrument | Description | Status |
164+
| ----------------------- | ------------------------------------------------ | ------------------------------------------------------------------------------------- | ------------- |
165+
| Iterations | `iterations` (Gauge) | Number of query iterations per query | ✅ Implemented |
166+
| Query Status | `query_status` (Gauge) | Pass/fail status per query | ✅ Implemented |
167+
| Query Latency (p50) | `median_duration_ms` (Gauge) | Median duration per query | ✅ Implemented |
168+
| Query Latency (min/max) | `min_duration_ms`, `max_duration_ms` | Min and max duration per query | ✅ Implemented |
169+
| Query Latency (p99) | `p99_duration_ms` (Gauge) | 99th percentile duration per query | ✅ Implemented |
170+
| Health Latency | `health_latency_ms` (Histogram) | Latency of `/health` and `/v1/ready` probes | ✅ Implemented |
171+
| E2E Duration | `test_duration_ms` (Gauge) | Total wall-clock time for the benchmark phase | ✅ Implemented |
172+
| Peak/Median Memory | `peak_memory_usage_mb`, `median_memory_usage_mb` | Memory usage of the spiced process | ✅ Implemented |
173+
| Ingestion Rows/Bytes | `ingestion_rows_total`, `ingestion_bytes_total` | Total data ingested (from SUT adapter) | ✅ Implemented |
174+
| Ingestion records/s | `ingestion_rows_per_sec` (Gauge) | Sustained ingestion throughput (from SUT adapter) | ✅ Implemented |
175+
| Queries/s | `queries_per_sec` (Gauge) | Query throughput under load | ✅ Implemented |
176+
| Total Queries | `queries_total` (Counter) | Total queries executed during the run | ✅ Implemented |
177+
| Active Connections | `active_connections` (Gauge) | Number of concurrent connections/clients | ✅ Implemented |
178+
| SUT CPU | `sut_cpu_usage_percent` (Gauge) | SUT CPU utilization (from adapter `metrics`) | ✅ Implemented |
179+
| SUT Memory | `sut_memory_usage_bytes` (Gauge) | SUT memory usage (from adapter `metrics`) | ✅ Implemented |
180+
| SUT Disk I/O | `sut_disk_{read,write}_bytes` (Gauge) | SUT disk read/write bytes (from adapter `metrics`) | ✅ Implemented |
181+
| SUT Disk IOPS | `sut_disk_{read,write}_iops` (Gauge) | SUT disk IOPS (from adapter `metrics`) | ✅ Implemented |
182+
| Efficiency | `efficiency_queries_per_core` (Gauge) | Query throughput normalized by CPU cores | ✅ Implemented |
183+
| E2E Latency | `e2e_latency_ms` (Histogram) | Raw event-to-queryable freshness samples; percentile is computed in dashboard queries | ✅ Implemented |
184+
| Checkpoint In-flight | `checkpoint_in_flight_queries` (Gauge) | In-flight query count during checkpoint validation | ✅ Implemented |
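
The derived throughput metrics in this table reduce to simple arithmetic. The sketch below assumes `queries_per_sec` is total iterations divided by the test duration and that `efficiency_queries_per_core` divides that rate by the core count, as METRICS.md describes. The function signatures and the `Option` return for zero divisors are illustrative choices, not the project's actual API.

```rust
/// Query throughput: total iterations divided by test duration in seconds.
/// Returns `None` when the duration is zero.
fn queries_per_sec(total_iterations: u64, test_duration_ms: u64) -> Option<f64> {
    if test_duration_ms == 0 {
        return None;
    }
    let seconds = test_duration_ms as f64 / 1000.0;
    Some(total_iterations as f64 / seconds)
}

/// Throughput normalized by CPU cores: `queries_per_sec / cpu_cores`.
/// Returns `None` when the core count is zero.
fn efficiency_queries_per_core(qps: f64, cpu_cores: u32) -> Option<f64> {
    if cpu_cores == 0 {
        return None;
    }
    Some(qps / f64::from(cpu_cores))
}
```

For example, 12,000 iterations over a 60,000 ms run give 200 queries/s, which on 8 cores is 25 queries per core per second.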
184185

185186
#### Grafana Dashboard
186187
