Commit eb043c6 authored
feat(observability): add runner VM hostmetrics Grafana dashboard (#187)
* feat(observability): add runner VM hostmetrics Grafana dashboard
Adds a read-only Grafana dashboard (editable: false) for runner VM
host-level metrics to be served via cos-configuration-k8s using the
grafana-dashboard relation, which provisions it as an immutable
filesystem dashboard in Grafana.
The dashboard covers:
- CPU utilisation by state and load averages
- Memory usage by state
- Disk I/O throughput and operations
- Filesystem usage % by mount point
- Network traffic, errors and drops
Template variables:
- github_job_id: filter by GitHub Actions workflow run job ID
- instance: filter by runner hostname
Metric names follow the OpenTelemetry hostmetrics receiver prometheus
convention (e.g. system_cpu_time_seconds_total). The github_job_id
label is expected to be set as a resource attribute by the otelcol
pipeline collecting metrics from the runner VMs.
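Under that convention, a panel query might look like the following sketch (the job ID value is hypothetical; the metric and label names are the ones described above):

```promql
# Per-state CPU time rate for a single workflow run job, filtered on
# the github_job_id resource attribute attached by the otelcol pipeline
sum by (state) (
  rate(system_cpu_time_seconds_total{github_job_id="1234567890"}[5m])
)
```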
Related: ISD-5152
* docs: document observability layout and rename dashboard directory
Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the
purpose explicit at the repo root level (runner VM host metrics, not
charm workload metrics).
Update README with:
- Repository layout overview
- Observability section explaining the cos-configuration-k8s delivery
mechanism and the immutability guarantee
- Table of conventions for where dashboards live and what
grafana_dashboards_path value to use in Terraform
* fix: align dashboard labels with OTel config from github-runner-operator
Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).
Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.
* refactor(observability): mirror upstream OTel hostmetrics dashboard layout
Restructure the runner VM hostmetrics dashboard to follow the upstream
OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): Overview row
of CPU/Memory/Root FS gauges plus Load/Cores/Total Memory stats, then
CPU, Memory, Disk I/O, Filesystem and Network sections with read/write
and rx/tx split axes.
Make every templating variable support "All" via includeAll, multi-select
and allValue ".*", and switch all label matchers to =~ so regex
interpolation works.
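A minimal sketch of one such templating variable in the dashboard JSON, assuming the standard Grafana query-variable schema (field names from Grafana's dashboard model; the exact query string is illustrative):

```json
{
  "name": "github_runner",
  "type": "query",
  "includeAll": true,
  "multi": true,
  "allValue": ".*",
  "query": "label_values(system_cpu_time_seconds_total, github_runner)"
}
```

With `allValue: ".*"`, selecting "All" interpolates a match-everything regex, which is why the label matchers must use `=~` (e.g. `{github_runner=~"$github_runner"}`) rather than `=`.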
* fix(observability): correct multi-runner aggregations in hostmetrics dashboard
When the runner variable resolves to multiple series (multi-select or
"All"), several panels previously produced misleading values:
- CPU Cores stat / System Load "cores" reference: count(count by (cpu) ...)
collapses cpu indexes across runners, returning the max-cores-on-any-host
rather than fleet total. Group by github_runner so cpu indexes stay
distinct, then expose total cores in the stat panel and per-runner cores
on the load panel (so the reference aligns with the averaged load lines).
- System Load 1m/5m/15m: bare metric returns one series per runner with
identical legends ("1m"/"5m"/"15m"), making the chart unreadable. Wrap
in avg() to get one fleet-average line per period.
- Disk Busy %: sum by (device) of fractional busy time can exceed 1 with
multiple runners and gets silently clamped by max:1. Switch to
avg by (device) so the value stays a meaningful 0-1 fleet average.
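The CPU Cores fix above can be sketched in PromQL (illustrative; variable and label names as described in this commit):

```promql
# Before: cpu indexes collapse across runners, so this returns the
# number of distinct cpu indexes, i.e. max cores on any single host
count(
  count by (cpu) (
    system_cpu_time_seconds_total{github_runner=~"$github_runner"}
  )
)

# After: group by github_runner first so identical cpu indexes on
# different hosts stay distinct, then sum per-runner counts for the
# fleet total
sum(
  count by (github_runner) (
    count by (github_runner, cpu) (
      system_cpu_time_seconds_total{github_runner=~"$github_runner"}
    )
  )
)
```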
Also soften the README guidance on editable: false. cos-configuration-k8s
provisions dashboards from the filesystem, which makes them read-only in
Grafana regardless of the flag, so the explicit "must" requirement was
contradicted by existing dashboards in charms/planner-operator/.
* docs(observability): clarify hostmetrics dashboard variable usage
Expand the dashboard description to spell out the expected usage of the
github_runner variable: scope it to a flavor regex (e.g. flavor-x-.*)
when comparing fleets, or pick a single runner for per-host inspection.
Aggregating by device/mountpoint without grouping by github_runner is
intentional — it produces meaningful fleet totals/averages when the
matched runners share device semantics — but assumes operators don't
mix heterogeneous flavors under "All".
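A fleet-scoped query under that usage might look like this sketch (the metric name is assumed from the OTel hostmetrics prometheus naming; the flavor regex comes from the description above):

```promql
# Compare per-runner load across one flavor's fleet by setting the
# github_runner variable to a flavor-scoped regex
avg by (github_runner) (
  system_cpu_load_average_1m{github_runner=~"flavor-x-.*"}
)
```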
* fix(observability): align Root FS gauge and System Load cores override
- Root FS gauge: restrict the denominator to state=~"used|free" to match
the Filesystem Utilization bargauge and df semantics. Without this,
reserved blocks (e.g. ext4's 5% root reservation) inflate the
denominator and the gauge reads artificially low.
- System Load cores override: the field matcher still pointed at the
old "cores" legend after the per-runner rename, so the red dashed
styling never applied. Update the matcher to "cores (per runner)".
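The corrected Root FS gauge ratio can be sketched as follows (metric name assumed from the OTel hostmetrics prometheus convention; mountpoint value illustrative):

```promql
# used / (used + free): excluding other states (e.g. reserved blocks)
# from the denominator matches df semantics and the Filesystem
# Utilization bargauge
sum(
  system_filesystem_usage_bytes{github_runner=~"$github_runner",
                                mountpoint="/", state="used"}
)
/
sum(
  system_filesystem_usage_bytes{github_runner=~"$github_runner",
                                mountpoint="/", state=~"used|free"}
)
```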
* refactor(observability): switch hostmetrics dashboard to single-select
Drop multi-select on the GitHub-context variables (kept includeAll +
allValue: ".*" so picking "All" still widens the scope as a regex).
Single-select matches the upstream OpenTelemetry hostmetrics design and
makes per-host attribution work — multi-runner aggregations under
sum by (device) collapsed identically-named devices across hosts and
hid which runner was responsible for any given spike.
With single-select assured, simplify the dense per-device/per-mountpoint
panels back to bare metrics (drop sum by device on disk I/O throughput,
disk IOPS, disk busy %, memory usage, filesystem usage, network
throughput/packets/errors). Revert the multi-runner-defensive variants
of CPU Cores, System Load 1m/5m/15m and the cores reference series.
Aggregations are kept where they are inherent to the metric: overview
gauges (CPU/Memory/Root FS), Memory Utilization (sum/sum ratio),
Filesystem Utilization (sum by mountpoint ratio) and TCP Connections
(sum by state).
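As an example of an aggregation that is inherent to the metric, the Memory Utilization ratio keeps its sum/sum shape even under single-select (metric name assumed from the OTel hostmetrics prometheus convention):

```promql
# used memory / total memory; the sums collapse the per-state series
# for the single runner matched by the variable
sum(system_memory_usage_bytes{github_runner=~"$github_runner", state="used"})
/
sum(system_memory_usage_bytes{github_runner=~"$github_runner"})
```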
Drop the cross-host-aggregation note from the dashboard description
since the design no longer relies on it.
1 parent 3cc3822
2 files changed
Lines changed: 1135 additions & 3 deletions