Commit eb043c6

feat(observability): add runner VM hostmetrics Grafana dashboard (#187)
* feat(observability): add runner VM hostmetrics Grafana dashboard

  Adds a read-only Grafana dashboard (`editable: false`) for runner VM host-level metrics, served via cos-configuration-k8s using the grafana-dashboard relation, which provisions it as an immutable filesystem dashboard in Grafana.

  The dashboard covers:
  - CPU utilisation by state and load averages
  - Memory usage by state
  - Disk I/O throughput and operations
  - Filesystem usage % by mount point
  - Network traffic, errors and drops

  Template variables:
  - github_job_id: filter by GitHub Actions workflow run job ID
  - instance: filter by runner hostname

  Metric names follow the OpenTelemetry hostmetrics receiver Prometheus convention (e.g. system_cpu_time_seconds_total). The github_job_id label is expected to be set as a resource attribute by the otelcol pipeline collecting metrics from the runner VMs.

  Related: ISD-5152

* docs: document observability layout and rename dashboard directory

  Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the purpose explicit at the repo root level (runner VM host metrics, not charm workload metrics).

  Update README with:
  - Repository layout overview
  - Observability section explaining the cos-configuration-k8s delivery mechanism and the immutability guarantee
  - Table of conventions for where dashboards live and which grafana_dashboards_path value to use in Terraform

* fix: align dashboard labels with OTel config from github-runner-operator

  Replace github_job_id with github_job and instance with github_runner to match the actual attribute labels set by the pre-job OTel config (see canonical/github-runner-operator#781). Add github_repository and github_workflow template variables so the dashboard can be filtered the same way as the existing PS6 hostmetrics dashboard.

* refactor(observability): mirror upstream OTel hostmetrics dashboard layout

  Restructure the runner VM hostmetrics dashboard to follow the upstream OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): an Overview row of CPU/Memory/Root FS gauges plus Load/Cores/Total Memory stats, then CPU, Memory, Disk I/O, Filesystem and Network sections with read/write and rx/tx split axes. Make every templating variable support "All" via includeAll, multi-select and allValue ".*", and switch all label matchers to =~ so regex interpolation works.

* fix(observability): correct multi-runner aggregations in hostmetrics dashboard

  When the runner variable resolves to multiple series (multi-select or "All"), several panels previously produced misleading values:
  - CPU Cores stat / System Load "cores" reference: count(count by (cpu) ...) collapses cpu indexes across runners, returning the maximum cores on any host rather than the fleet total. Group by github_runner so cpu indexes stay distinct, then expose total cores in the stat panel and per-runner cores on the load panel (so the reference aligns with the averaged load lines).
  - System Load 1m/5m/15m: the bare metric returns one series per runner with identical legends ("1m"/"5m"/"15m"), making the chart unreadable. Wrap in avg() to get one fleet-average line per period.
  - Disk Busy %: sum by (device) of fractional busy time can exceed 1 with multiple runners and gets silently clamped by max: 1. Switch to avg by (device) so the value stays a meaningful 0-1 fleet average.

  Also soften the README guidance on editable: false. cos-configuration-k8s provisions dashboards from the filesystem, which makes them read-only in Grafana regardless of the flag, so the explicit "must" requirement was contradicted by existing dashboards in charms/planner-operator/.

* docs(observability): clarify hostmetrics dashboard variable usage

  Expand the dashboard description to spell out the expected usage of the github_runner variable: scope it to a flavor regex (e.g. flavor-x-.*) when comparing fleets, or pick a single runner for per-host inspection. Aggregating by device/mountpoint without grouping by github_runner is intentional — it produces meaningful fleet totals/averages when the matched runners share device semantics — but assumes operators don't mix heterogeneous flavors under "All".

* fix(observability): align Root FS gauge and System Load cores override

  - Root FS gauge: restrict the denominator to state=~"used|free" to match the Filesystem Utilization bargauge and df semantics. Without this, reserved blocks (e.g. ext4's 5% root reservation) inflate the denominator and the gauge reads artificially low.
  - System Load cores override: the field matcher still pointed at the old "cores" legend after the per-runner rename, so the red dashed styling never applied. Update the matcher to "cores (per runner)".

* refactor(observability): switch hostmetrics dashboard to single-select

  Drop multi-select on the GitHub-context variables (keeping includeAll + allValue: ".*" so picking "All" still widens the scope as a regex). Single-select matches the upstream OpenTelemetry hostmetrics design and makes per-host attribution work — multi-runner aggregations under sum by (device) collapsed identically-named devices across hosts and hid which runner was responsible for any given spike.

  With single-select assured, simplify the dense per-device/per-mountpoint panels back to bare metrics (drop sum by device on disk I/O throughput, disk IOPS, disk busy %, memory usage, filesystem usage, network throughput/packets/errors). Revert the multi-runner-defensive variants of CPU Cores, System Load 1m/5m/15m and the cores reference series.

  Aggregations are kept where they are inherent to the metric: overview gauges (CPU/Memory/Root FS), Memory Utilization (sum/sum ratio), Filesystem Utilization (sum by mountpoint ratio) and TCP Connections (sum by state). Drop the cross-host-aggregation note from the dashboard description since the design no longer relies on it.
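As an illustration of the templating design described above (includeAll with allValue ".*" and `=~` label matchers so "All" interpolates as a regex), a simplified dashboard fragment might look like the sketch below. This is not the committed JSON; the variable query, metric name and panel layout are assumptions based on the commit message.

```json
{
  "templating": {
    "list": [
      {
        "name": "github_runner",
        "label": "GitHub runner",
        "type": "query",
        "datasource": "${DS_PROMETHEUS}",
        "query": "label_values(system_cpu_time_seconds_total, github_runner)",
        "includeAll": true,
        "allValue": ".*",
        "multi": false
      }
    ]
  },
  "panels": [
    {
      "title": "CPU Utilization",
      "type": "timeseries",
      "datasource": "${DS_PROMETHEUS}",
      "targets": [
        {
          "expr": "sum by (state) (rate(system_cpu_time_seconds_total{github_runner=~\"$github_runner\"}[5m]))",
          "legendFormat": "{{state}}"
        }
      ]
    }
  ]
}
```

With `=~` matchers, selecting "All" substitutes the `.*` regex and widens the scope, while selecting a single runner still matches exactly one host.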
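The Root FS gauge correction (restricting the denominator to state=~"used|free" so reserved blocks do not inflate it) can be expressed as a ratio of the kind sketched here. The metric name system_filesystem_usage_bytes follows the hostmetrics Prometheus naming convention, but the exact expression in the committed dashboard may differ.

```json
{
  "title": "Root FS Used",
  "type": "gauge",
  "datasource": "${DS_PROMETHEUS}",
  "targets": [
    {
      "expr": "sum(system_filesystem_usage_bytes{github_runner=~\"$github_runner\", mountpoint=\"/\", state=\"used\"}) / sum(system_filesystem_usage_bytes{github_runner=~\"$github_runner\", mountpoint=\"/\", state=~\"used|free\"})"
    }
  ]
}
```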
1 parent 3cc3822 commit eb043c6

2 files changed

Lines changed: 1135 additions & 3 deletions

File tree

README.md

Lines changed: 35 additions & 3 deletions
@@ -1,9 +1,41 @@
 # GitHub runner operators
 
-
 ![WIP](https://img.shields.io/badge/status-WIP-yellow)
 
 A monorepo containing charms to operate Self-Hosted GitHub Action Runners.
 
-At the moment, it contains initial code for the `webhook-gateway`
-application, that receives and forwards GitHub webhooks to an AMQP queue.
+## Repository layout
+
+```
+charms/
+  planner-operator/             # Juju charm: GitHub runner planner
+    cos_custom/
+      grafana_dashboards/       # Grafana dashboards for the planner charm
+                                # (served via cos-configuration-k8s, path: charms/planner-operator/cos_custom/grafana_dashboards)
+  webhook-gateway-operator/     # Juju charm: GitHub webhook gateway
+
+runner_grafana_dashboards/      # Grafana dashboards for runner VM host metrics
+                                # (served via cos-configuration-k8s, path: runner_grafana_dashboards)
+```
+
+## Observability: Grafana dashboards
+
+Dashboards in this repo are delivered to Grafana through
+[`cos-configuration-k8s`](https://charmhub.io/cos-configuration-k8s), which syncs
+JSON files from this Git repository and provisions them via the `grafana-dashboard`
+relation. Provisioned dashboards are **immutable** in Grafana regardless of user
+role — they cannot be edited or deleted through the UI.
+
+### Conventions
+
+| Directory | Purpose | `grafana_dashboards_path` config value |
+|---|---|---|
+| `charms/<charm>/cos_custom/grafana_dashboards/` | Dashboards for a specific charm's workload metrics | `charms/<charm>/cos_custom/grafana_dashboards` |
+| `runner_grafana_dashboards/` | Dashboards for runner VM host-level metrics (CPU, memory, disk, network) | `runner_grafana_dashboards` |
+
+Dashboard JSON files should use `__inputs` to declare the datasource (type `prometheus`).
+Setting `"editable": false` is recommended for clarity, but is not strictly required:
+dashboards delivered through `cos-configuration-k8s` are filesystem-provisioned and
+therefore read-only in Grafana regardless of the JSON flag. Metric names follow the
+[OpenTelemetry hostmetrics receiver](https://opentelemetry.io/docs/collector/components/#receiver)
+Prometheus naming convention (e.g. `system_cpu_time_seconds_total`).
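For reference, an `__inputs` datasource declaration of the kind the README convention describes typically looks like the sketch below; the names and title are illustrative, not taken from the committed dashboard.

```json
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "prometheus",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "editable": false,
  "title": "Runner VM host metrics"
}
```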
