From 2d1ea69eb6fa3f2a256f09313b83210e7d23ccc8 Mon Sep 17 00:00:00 2001 From: Prathamesh Lohakare Date: Wed, 3 Jun 2026 14:59:31 +0530 Subject: [PATCH 1/7] Add design spec for ovirt-engine-health codebundle Co-Authored-By: Claude Opus 4.8 (1M context) --- .../2026-06-03-ovirt-engine-health-design.md | 167 ++++++++++++++++++ 1 file changed, 167 insertions(+) create mode 100644 docs/superpowers/specs/2026-06-03-ovirt-engine-health-design.md diff --git a/docs/superpowers/specs/2026-06-03-ovirt-engine-health-design.md b/docs/superpowers/specs/2026-06-03-ovirt-engine-health-design.md new file mode 100644 index 00000000..ff6ad8fd --- /dev/null +++ b/docs/superpowers/specs/2026-06-03-ovirt-engine-health-design.md @@ -0,0 +1,167 @@ +# Design: `ovirt-engine-health` CodeBundle + +**Date:** 2026-06-03 +**Status:** Approved (pending spec review) +**Author:** Prathamesh Lohakare (with Claude) + +## Summary + +A new RunWhen CodeBundle, `codebundles/ovirt-engine-health/`, that monitors the +health of an oVirt virtualization environment (oVirt / Red Hat Virtualization / +Oracle Linux Virtualization Manager) through the oVirt engine REST API +(`/ovirt-engine/api`). + +It follows the existing `jenkins-health` pattern: an SLI that emits a composite +0–1 health score, a runbook (taskset) that raises actionable issues, bash +scripts that call the API with `curl`+`jq` and emit JSON, plus `README.md`, +`.test/`, and `.runwhen/` directories. + +## Goals + +- One comprehensive bundle covering hosts, VMs, storage domains, clusters, and + events — not one bundle per resource type. +- Auth via oVirt SSO bearer token. +- Optional CA cert for TLS verification (self-signed engine certs are common). +- SLI (score) and runbook (issues + report) parity, sharing the same checks. + +## Non-Goals + +- Auto-discovery via a cloud provider resource type (oVirt is a self-hosted + endpoint with no native RunWhen-discovered resource type — see "Discovery"). 
+- Provisioning a real oVirt environment in CI (too heavy). Tests point at a + user-supplied engine. +- Mutating/remediating oVirt state (read-only checks only). + +## Authentication & TLS + +A sourced helper `ovirt_auth.sh`, used by every check script: + +1. `POST {OVIRT_ENGINE_URL}/ovirt-engine/sso/oauth/token` with + `grant_type=password`, `scope=ovirt-app-api`, `username`, `password` + (form-encoded). Parse `.access_token` from the JSON response with `jq`. +2. Export `OVIRT_TOKEN`. All API calls send headers: + `Authorization: Bearer ${OVIRT_TOKEN}`, `Accept: application/json`, + `Version: 4`. +3. If the optional `OVIRT_CA_CERT` secret is provided, write it to a temp file + (trapped for cleanup) and pass `--cacert ` to curl; otherwise rely on + the system trust store. +4. On token-fetch failure, emit an error JSON (`{"error": "..."}`) and exit + non-zero so the calling task can surface an engine-reachability issue. + +## Checks + +Seven checks. Each is one task in **both** `sli.robot` and `runbook.robot`. +The SLI pushes a per-check metric (`sub_name=`, value 0 or 1) and the +runbook raises an issue + writes a report section. 
+ +| # | Check | Healthy when | Issue (severity) | +|---|---|---|---| +| 1 | **Engine reachability** | `/api` returns 200 + valid JSON, token obtained | engine unreachable / token fetch fails (sev 1) | +| 2 | **Host status** | all hosts `up` | any host `non_operational`, `connecting`, `error`, `install_failed` (sev 2); `maintenance` reported but not failed | +| 3 | **VM status** | no VMs `paused` or `unknown` | VMs paused (often storage I/O) or `unknown` (sev 2); count above `MAX_PAUSED_VMS` | +| 4 | **Storage domain capacity** | status `active` and free % ≥ `OVIRT_STORAGE_FREE_PCT` | domain `inactive`/`maintenance`, or free % below threshold (sev 2) | +| 5 | **Cluster health** | clusters reachable; no global maintenance / hosts-down condition | cluster has down hosts or is in global maintenance (sev 3) | +| 6 | **Recent critical events** | no `error`/`alert` severity events in `OVIRT_EVENT_LOOKBACK` | error/alert events present in window (sev 3) | +| 7 | **Stale VM snapshots** | no active snapshots older than `OVIRT_SNAPSHOT_MAX_AGE` | snapshots older than threshold (disk-bloat risk) (sev 4) | + +### Relevant API endpoints (Version 4) + +- `GET /ovirt-engine/api` — reachability +- `GET /ovirt-engine/api/hosts` — host status +- `GET /ovirt-engine/api/vms` — VM status +- `GET /ovirt-engine/api/storagedomains` — capacity/status (`available`, `used`) +- `GET /ovirt-engine/api/clusters` — cluster health +- `GET /ovirt-engine/api/events?search=severity>normal&max=...` — events +- `GET /ovirt-engine/api/vms/{id}/snapshots` (per VM) — snapshots + +## Configuration + +User Variables / Secrets imported in each robot's `Suite Initialization`: + +| Name | Kind | Default | Notes | +|---|---|---|---| +| `OVIRT_ENGINE_URL` | user var | — | e.g. `https://engine.example.com` (no trailing `/ovirt-engine`) | +| `OVIRT_USERNAME` | secret | — | e.g. 
`admin@internal` or `admin@ovirt@internalsso` | +| `OVIRT_PASSWORD` | secret | — | engine password | +| `OVIRT_CA_CERT` | secret | (optional) | PEM CA bundle; omitted → system trust store | +| `OVIRT_STORAGE_FREE_PCT` | user var | `10` | min free % per storage domain | +| `OVIRT_EVENT_LOOKBACK` | user var | `1h` | window for critical-event check | +| `OVIRT_SNAPSHOT_MAX_AGE` | user var | `7d` | stale-snapshot threshold | +| `MAX_PAUSED_VMS` | user var | `0` | max paused/unknown VMs considered healthy | +| `OVIRT_ENGINE_NAME` | user var | `ovirt-engine` | display name in task titles/SLX | + +## SLI Scoring + +Each check sets a global `*_score` (0 or 1) and calls +`RW.Core.Push Metric ${score} sub_name=`. A final +`Generate Health Score` task averages the seven sub-scores and pushes the +composite metric (rounded to 2 decimals), mirroring `jenkins-health/sli.robot`. + +## Runbook Behavior + +For each check the runbook: +- Runs the same bash script. +- Formats results into a `RW.Core.Add Pre To Report` table (`jq … | column -t`). +- Raises `RW.Core.Add Issue` per affected object with `severity`, `expected`, + `actual`, `title`, `reproduce_hint`, `details`, and `next_steps` + (concrete oVirt remediation guidance, e.g. "activate storage domain", + "investigate host in non-operational state"). 
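As a rough illustration of the composite score described under SLI Scoring, the arithmetic can be sketched in shell. This is only a sketch under assumptions: the spec places the averaging in the `Generate Health Score` robot task, and the helper name `composite_score` and the sample sub-scores below are hypothetical, not part of the bundle.

```bash
#!/bin/bash
# Hypothetical helper (not part of the bundle): average 0/1 sub-scores and
# round to two decimals, mirroring the arithmetic the spec describes.
composite_score() {
  local sum=0 count=0 s
  for s in "$@"; do
    sum=$((sum + s))
    count=$((count + 1))
  done
  # With no sub-scores there is nothing to average; report 0 instead of
  # dividing by zero.
  if [ "${count}" -eq 0 ]; then
    echo "0.00"
    return
  fi
  # LC_ALL=C keeps the decimal separator a dot regardless of locale.
  LC_ALL=C awk -v s="${sum}" -v n="${count}" 'BEGIN { printf "%.2f\n", s / n }'
}

# Assumed sample: seven checks, two of them failing (5/7).
score=$(composite_score 1 1 0 1 1 0 1)
echo "composite health score: ${score}"
```

Because every sub-score is 0 or 1, the composite is simply the fraction of passing checks, so a single failing check out of seven lowers the score by about 0.14.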
+ +## Files + +``` +codebundles/ovirt-engine-health/ + README.md + sli.robot + runbook.robot + ovirt_auth.sh # sourced token + curl helper + check_engine.sh # check 1 (reachability is largely inline in robot) + host_status.sh # check 2 + vm_status.sh # check 3 + storage_domains.sh # check 4 + cluster_health.sh # check 5 + recent_events.sh # check 6 + stale_snapshots.sh # check 7 + .runwhen/ + generation-rules/ovirt-engine-health.yaml + templates/ovirt-engine-health-slx.yaml + templates/ovirt-engine-health-sli.yaml + templates/ovirt-engine-health-taskset.yaml + .test/ + Taskfile.yaml + README.md +``` + +## Discovery (`.runwhen/`) + +oVirt has no RunWhen-discovered resource type, so a cloud-style generation rule +(matching e.g. an EC2 instance) cannot fire. We provide the SLX / SLI / taskset +**templates** so an SLX can be created via config/manually, and a +generation rule keyed on the workspace config index rather than a cloud +resource match. The README documents that SLX creation is config-driven, not +auto-discovered. This is the honest limitation and is called out explicitly. + +## Testing (`.test/`) + +Lightweight, no infra provisioning: +- `Taskfile.yaml` with tasks to run `sli.robot` / `runbook.robot` against a + user-supplied `OVIRT_ENGINE_URL` (env-driven), following the conventions of + other bundles' Taskfiles. +- `README.md` documenting required env vars and how to point at a real/lab + oVirt engine (or the upstream `ovirt-engine` appliance) for manual testing. +- No terraform (no cloud infra to provision, unlike `jenkins-health`). + +## Error Handling + +- Every script guards required env vars and exits with a clear message if unset. +- Robot tasks wrap `json.loads` of script stdout in `TRY/EXCEPT`, defaulting to + an empty list/dict and logging a `WARN` (matching `jenkins-health`), so a + single failing check never aborts the whole suite. +- Token-fetch failure surfaces as the sev-1 engine-reachability issue. 
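The env-var guard from the first Error Handling bullet could look roughly like the following. This is a sketch under assumptions: `require_env` is a hypothetical helper name, the `DEMO_*` variables exist only for illustration, and the error payload simply reuses the `{"error": "..."}` convention from the Authentication section.

```bash
#!/bin/bash
# Hypothetical helper: print an error JSON object and return non-zero when any
# of the named environment variables is unset or empty.
require_env() {
  local missing=() name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable whose
    # name is stored in $name.
    if [ -z "${!name:-}" ]; then
      missing+=("${name}")
    fi
  done
  if [ "${#missing[@]}" -ne 0 ]; then
    printf '{"error": "missing required environment variables: %s"}\n' "${missing[*]}"
    return 1
  fi
  return 0
}

# Illustration with placeholder variables: one set, one unset.
DEMO_URL="https://engine.example.com"
unset DEMO_USER
require_env DEMO_URL DEMO_USER || echo "guard tripped"
```

A check script would call it before touching the API, for example `require_env OVIRT_ENGINE_URL OVIRT_USERNAME OVIRT_PASSWORD || exit 1`, so the robot task receives a parseable error object instead of an empty stdout.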
+ +## Open Risks + +- oVirt API field/state names verified against the v4 REST schema during + implementation (host states, VM states, storagedomain `available`/`used`). +- Event `search` query syntax (`severity>normal`, date filters) confirmed + against a live engine or docs during implementation. From d91e036be0292d181c2f37765cae23b5da4902c6 Mon Sep 17 00:00:00 2001 From: Prathamesh Lohakare Date: Wed, 3 Jun 2026 15:10:17 +0530 Subject: [PATCH 2/7] Add ovirt-engine-health codebundle Comprehensive SLI + runbook for oVirt/RHV/OLVM environments via the engine REST API. Authenticates with an SSO bearer token and checks engine reachability, host status, VM status, storage domain capacity, cluster health, recent critical events, and stale VM snapshots. Optional CA cert for TLS verification. Includes .runwhen templates and a lightweight .test harness (no infra provisioning, since oVirt is self-hosted). Co-Authored-By: Claude Opus 4.8 (1M context) --- .../generation-rules/ovirt-engine-health.yaml | 30 ++ .../templates/ovirt-engine-health-sli.yaml | 42 +++ .../templates/ovirt-engine-health-slx.yaml | 19 + .../ovirt-engine-health-taskset.yaml | 32 ++ .../ovirt-engine-health/.test/.gitignore | 4 + .../ovirt-engine-health/.test/README.md | 49 +++ .../ovirt-engine-health/.test/Taskfile.yaml | 104 ++++++ codebundles/ovirt-engine-health/README.md | 64 ++++ .../ovirt-engine-health/cluster_health.sh | 34 ++ .../ovirt-engine-health/engine_health.sh | 18 + .../ovirt-engine-health/host_status.sh | 32 ++ codebundles/ovirt-engine-health/ovirt_auth.sh | 82 +++++ .../ovirt-engine-health/recent_events.sh | 34 ++ codebundles/ovirt-engine-health/runbook.robot | 327 ++++++++++++++++++ codebundles/ovirt-engine-health/sli.robot | 235 +++++++++++++ .../ovirt-engine-health/stale_snapshots.sh | 35 ++ .../ovirt-engine-health/storage_domains.sh | 39 +++ codebundles/ovirt-engine-health/vm_status.sh | 23 ++ 18 files changed, 1203 insertions(+) create mode 100644 
codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml create mode 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml create mode 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml create mode 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml create mode 100644 codebundles/ovirt-engine-health/.test/.gitignore create mode 100644 codebundles/ovirt-engine-health/.test/README.md create mode 100644 codebundles/ovirt-engine-health/.test/Taskfile.yaml create mode 100644 codebundles/ovirt-engine-health/README.md create mode 100755 codebundles/ovirt-engine-health/cluster_health.sh create mode 100755 codebundles/ovirt-engine-health/engine_health.sh create mode 100755 codebundles/ovirt-engine-health/host_status.sh create mode 100755 codebundles/ovirt-engine-health/ovirt_auth.sh create mode 100755 codebundles/ovirt-engine-health/recent_events.sh create mode 100644 codebundles/ovirt-engine-health/runbook.robot create mode 100644 codebundles/ovirt-engine-health/sli.robot create mode 100755 codebundles/ovirt-engine-health/stale_snapshots.sh create mode 100755 codebundles/ovirt-engine-health/storage_domains.sh create mode 100755 codebundles/ovirt-engine-health/vm_status.sh diff --git a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml new file mode 100644 index 00000000..9ce1f39d --- /dev/null +++ b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml @@ -0,0 +1,30 @@ +# oVirt is not a RunWhen-discovered platform type, so SLXs for this bundle are +# not auto-generated from a cloud resource match. Create the SLX from the +# templates in ../templates/ using config (OVIRT_ENGINE_URL) and workspace +# secrets (OVIRT_USERNAME, OVIRT_PASSWORD, optional OVIRT_CA_CERT). 
+# +# The block below is the template for how a generation rule would look if oVirt +# were ever added as a discoverable platform type. +# +# apiVersion: runwhen.com/v1 +# kind: GenerationRules +# spec: +# platform: ovirt +# generationRules: +# - resourceTypes: +# - ovirt_engine +# matchRules: +# - type: pattern +# pattern: ".+" +# properties: [name] +# mode: substring +# slxs: +# - baseName: ovirt-engine-health +# qualifiers: ["resource"] +# baseTemplateName: ovirt-engine-health +# levelOfDetail: detailed +# outputItems: +# - type: slx +# - type: sli +# - type: runbook +# templateName: ovirt-engine-health-taskset.yaml diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml new file mode 100644 index 00000000..fa781da8 --- /dev/null +++ b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml @@ -0,0 +1,42 @@ +apiVersion: runwhen.com/v1 +kind: ServiceLevelIndicator +metadata: + name: {{slx_name}} + labels: + {% include "common-labels.yaml" %} + annotations: + {% include "common-annotations.yaml" %} +spec: + displayUnitsLong: OK + displayUnitsShort: ok + locations: + - {{default_location}} + description: The composite health score of the oVirt engine environment. 
+ codeBundle: + {% if repo_url %} + repoUrl: {{repo_url}} + {% else %} + repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git + {% endif %} + {% if ref %} + ref: {{ref}} + {% else %} + ref: main + {% endif %} + pathToRobot: codebundles/ovirt-engine-health/sli.robot + intervalStrategy: intermezzo + intervalSeconds: 600 + configProvided: + - name: OVIRT_ENGINE_URL + value: {{custom.ovirt_engine_url}} + secretsProvided: + - name: OVIRT_USERNAME + workspaceKey: {{custom.ovirt_username}} + - name: OVIRT_PASSWORD + workspaceKey: {{custom.ovirt_password}} + - name: OVIRT_CA_CERT + workspaceKey: {{custom.ovirt_ca_cert}} + alertConfig: + tasks: + persona: eager-edgar + sessionTTL: 10m diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml new file mode 100644 index 00000000..efe038a0 --- /dev/null +++ b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml @@ -0,0 +1,19 @@ +apiVersion: runwhen.com/v1 +kind: ServiceLevelX +metadata: + name: {{slx_name}} + labels: + {% include "common-labels.yaml" %} + annotations: + {% include "common-annotations.yaml" %} +spec: + imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/ovirt.svg + alias: {{custom.ovirt_engine_name}} oVirt Engine Health + asMeasuredBy: The composite health score of the oVirt engine (hosts, VMs, storage, clusters, events, snapshots). + configProvided: + - name: OVIRT_ENGINE_URL + value: {{custom.ovirt_engine_url}} + owners: + - {{workspace.owner_email}} + statement: The oVirt engine and its hosts, VMs, and storage domains should be healthy. 
+ additionalContext: [] diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml new file mode 100644 index 00000000..20e50637 --- /dev/null +++ b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml @@ -0,0 +1,32 @@ +apiVersion: runwhen.com/v1 +kind: Runbook +metadata: + name: {{slx_name}} + labels: + {% include "common-labels.yaml" %} + annotations: + {% include "common-annotations.yaml" %} +spec: + location: {{default_location}} + codeBundle: + {% if repo_url %} + repoUrl: {{repo_url}} + {% else %} + repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git + {% endif %} + {% if ref %} + ref: {{ref}} + {% else %} + ref: main + {% endif %} + pathToRobot: codebundles/ovirt-engine-health/runbook.robot + configProvided: + - name: OVIRT_ENGINE_URL + value: {{custom.ovirt_engine_url}} + secretsProvided: + - name: OVIRT_USERNAME + workspaceKey: {{custom.ovirt_username}} + - name: OVIRT_PASSWORD + workspaceKey: {{custom.ovirt_password}} + - name: OVIRT_CA_CERT + workspaceKey: {{custom.ovirt_ca_cert}} diff --git a/codebundles/ovirt-engine-health/.test/.gitignore b/codebundles/ovirt-engine-health/.test/.gitignore new file mode 100644 index 00000000..1f09b766 --- /dev/null +++ b/codebundles/ovirt-engine-health/.test/.gitignore @@ -0,0 +1,4 @@ +.env +output-sli/ +output-runbook/ +*.pem diff --git a/codebundles/ovirt-engine-health/.test/README.md b/codebundles/ovirt-engine-health/.test/README.md new file mode 100644 index 00000000..c8cac5eb --- /dev/null +++ b/codebundles/ovirt-engine-health/.test/README.md @@ -0,0 +1,49 @@ +# Testing — ovirt-engine-health + +This bundle talks to a live oVirt engine REST API, so testing means pointing it +at a reachable engine. There is no cloud infrastructure to provision (and no +Terraform), because oVirt is self-hosted. + +## What you need + +A reachable oVirt engine. 
Any of these works: +- An existing oVirt / RHV / OLVM engine you have read access to. +- A lab/self-hosted-engine deployment. +- The upstream `ovirt-engine` appliance for a throwaway environment. + +A read-only user with the auth profile is sufficient (e.g. a user in the +`@internal` profile with `UserRole` / `ReadOnlyAdmin`). + +## Configure + +Create a `.test/.env` file (gitignored) or export the variables: + +```bash +OVIRT_ENGINE_URL=https://engine.example.com +OVIRT_USERNAME=admin@internal +OVIRT_PASSWORD=changeme +# Optional: +OVIRT_CA_CERT_FILE=/path/to/engine-ca.pem # for TLS verification +OVIRT_STORAGE_FREE_PCT=10 +OVIRT_EVENT_LOOKBACK=1h +OVIRT_SNAPSHOT_MAX_AGE=7d +MAX_PAUSED_VMS=0 +OVIRT_ENGINE_NAME=lab-ovirt +``` + +> Fetch the engine CA with (quote the URL so the shell does not treat `&` as a background operator): +> `curl -sk 'https://engine.example.com/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA' -o engine-ca.pem` + +## Run + +```bash +task check-config # validate required env vars +task smoke-scripts # run the raw check scripts and print their JSON +task run-sli # run sli.robot (pushes the composite health score) +task run-runbook # run runbook.robot (raises issues + writes a report) +task # check-config + run-sli + run-runbook +task clean # remove robot output dirs +``` + +`smoke-scripts` is the fastest way to confirm connectivity and that the engine's +JSON shape matches what the scripts expect, without Robot Framework. diff --git a/codebundles/ovirt-engine-health/.test/Taskfile.yaml b/codebundles/ovirt-engine-health/.test/Taskfile.yaml new file mode 100644 index 00000000..0a67f0fc --- /dev/null +++ b/codebundles/ovirt-engine-health/.test/Taskfile.yaml @@ -0,0 +1,104 @@ +version: "3" + +# Lightweight test harness for the ovirt-engine-health CodeBundle. 
+# +# Unlike the cloud bundles, there is no infrastructure to provision here: point +# the tasks at a reachable oVirt engine (a real engine, a lab, or the upstream +# ovirt-engine appliance) via the environment variables below, then run the +# robots locally. +# +# Required env: +# OVIRT_ENGINE_URL e.g. https://engine.example.com (no trailing /ovirt-engine) +# OVIRT_USERNAME e.g. admin@internal +# OVIRT_PASSWORD +# Optional env: +# OVIRT_CA_CERT_FILE path to a PEM CA bundle for TLS verification +# OVIRT_STORAGE_FREE_PCT, OVIRT_EVENT_LOOKBACK, OVIRT_SNAPSHOT_MAX_AGE, +# MAX_PAUSED_VMS, OVIRT_ENGINE_NAME + +tasks: + default: + desc: "Run both the SLI and the runbook against the configured engine." + cmds: + - task: check-config + - task: run-sli + - task: run-runbook + + check-config: + desc: "Verify the required environment variables are set." + cmds: + - | + missing=() + [ -z "${OVIRT_ENGINE_URL}" ] && missing+=("OVIRT_ENGINE_URL") + [ -z "${OVIRT_USERNAME}" ] && missing+=("OVIRT_USERNAME") + [ -z "${OVIRT_PASSWORD}" ] && missing+=("OVIRT_PASSWORD") + if [ ${#missing[@]} -ne 0 ]; then + echo "Missing required environment variables: ${missing[*]}" + exit 1 + fi + echo "Configuration looks good for ${OVIRT_ENGINE_URL}" + silent: true + + smoke-scripts: + desc: "Run the raw check scripts directly and print their JSON (no Robot Framework)." + dotenv: ['.env'] + cmds: + - task: check-config + - | + export OVIRT_CA_CERT="$( [ -n "${OVIRT_CA_CERT_FILE}" ] && cat "${OVIRT_CA_CERT_FILE}" || true )" + echo "== engine_health =="; ../engine_health.sh | jq . 
+ echo "== host_status =="; ../host_status.sh | jq '{total, unhealthy_hosts}' + echo "== vm_status =="; ../vm_status.sh | jq '{total, problem_vms}' + echo "== storage_domains =="; ../storage_domains.sh "${OVIRT_STORAGE_FREE_PCT:-10}" | jq '{problem_domains}' + echo "== cluster_health =="; ../cluster_health.sh | jq '{problem_clusters}' + echo "== recent_events =="; ../recent_events.sh "${OVIRT_EVENT_LOOKBACK:-1h}" | jq '{critical_events: (.critical_events|length)}' + echo "== stale_snapshots =="; ../stale_snapshots.sh "${OVIRT_SNAPSHOT_MAX_AGE:-7d}" | jq '{stale: (.stale_snapshots|length)}' + silent: true + + run-sli: + desc: "Run sli.robot against the configured engine." + dotenv: ['.env'] + cmds: + - task: check-config + - | + export OVIRT_CA_CERT="$( [ -n "${OVIRT_CA_CERT_FILE}" ] && cat "${OVIRT_CA_CERT_FILE}" || true )" + robot \ + --variable OVIRT_ENGINE_URL:"${OVIRT_ENGINE_URL}" \ + --variable OVIRT_USERNAME:"${OVIRT_USERNAME}" \ + --variable OVIRT_PASSWORD:"${OVIRT_PASSWORD}" \ + --variable OVIRT_CA_CERT:"${OVIRT_CA_CERT}" \ + --variable OVIRT_STORAGE_FREE_PCT:"${OVIRT_STORAGE_FREE_PCT:-10}" \ + --variable OVIRT_EVENT_LOOKBACK:"${OVIRT_EVENT_LOOKBACK:-1h}" \ + --variable OVIRT_SNAPSHOT_MAX_AGE:"${OVIRT_SNAPSHOT_MAX_AGE:-7d}" \ + --variable MAX_PAUSED_VMS:"${MAX_PAUSED_VMS:-0}" \ + --variable OVIRT_ENGINE_NAME:"${OVIRT_ENGINE_NAME:-ovirt-engine}" \ + --outputdir output-sli \ + ../sli.robot + silent: true + + run-runbook: + desc: "Run runbook.robot against the configured engine." 
+ dotenv: ['.env'] + cmds: + - task: check-config + - | + export OVIRT_CA_CERT="$( [ -n "${OVIRT_CA_CERT_FILE}" ] && cat "${OVIRT_CA_CERT_FILE}" || true )" + robot \ + --variable OVIRT_ENGINE_URL:"${OVIRT_ENGINE_URL}" \ + --variable OVIRT_USERNAME:"${OVIRT_USERNAME}" \ + --variable OVIRT_PASSWORD:"${OVIRT_PASSWORD}" \ + --variable OVIRT_CA_CERT:"${OVIRT_CA_CERT}" \ + --variable OVIRT_STORAGE_FREE_PCT:"${OVIRT_STORAGE_FREE_PCT:-10}" \ + --variable OVIRT_EVENT_LOOKBACK:"${OVIRT_EVENT_LOOKBACK:-1h}" \ + --variable OVIRT_SNAPSHOT_MAX_AGE:"${OVIRT_SNAPSHOT_MAX_AGE:-7d}" \ + --variable MAX_PAUSED_VMS:"${MAX_PAUSED_VMS:-0}" \ + --variable OVIRT_ENGINE_NAME:"${OVIRT_ENGINE_NAME:-ovirt-engine}" \ + --outputdir output-runbook \ + ../runbook.robot + silent: true + + clean: + desc: "Remove robot output directories." + cmds: + - rm -rf output-sli output-runbook + silent: true diff --git a/codebundles/ovirt-engine-health/README.md b/codebundles/ovirt-engine-health/README.md new file mode 100644 index 00000000..1a891128 --- /dev/null +++ b/codebundles/ovirt-engine-health/README.md @@ -0,0 +1,64 @@ +# oVirt Engine Health + +This CodeBundle monitors and evaluates the health of an oVirt virtualization +environment (oVirt / Red Hat Virtualization / Oracle Linux Virtualization +Manager) using the oVirt engine REST API (`/ovirt-engine/api`). + +## SLI +The SLI produces a score of 0 (bad), 1 (good), or a value in between. 
The score +is the average of the following checks: +- oVirt engine is reachable and an SSO token can be obtained +- No hypervisor hosts in an unhealthy (non-operational) state +- No VMs in a paused / unknown / not-responding state (within `MAX_PAUSED_VMS`) +- All storage domains active and above the free-space threshold +- No clusters with hosts down (non-up, non-maintenance) +- No error/alert severity engine events in the lookback window +- No VM snapshots older than the configured maximum age + +## TaskSet (Runbook) +Runs the same checks and raises an actionable issue for each problem found +(unreachable engine, non-operational hosts, paused VMs, low/inactive storage +domains, clusters with down hosts, critical events, and stale snapshots), with +oVirt-specific remediation guidance. + +## Authentication +The bundle authenticates with the engine's SSO endpoint +(`/ovirt-engine/sso/oauth/token`, `grant_type=password`, +`scope=ovirt-app-api`) to obtain a bearer token, which is then sent on every +API call. + +## Required Configuration + +``` +export OVIRT_ENGINE_URL="https://engine.example.com" # no trailing /ovirt-engine +export OVIRT_USERNAME="admin@internal" # include the auth profile +export OVIRT_PASSWORD="" +``` + +Optional: + +``` +export OVIRT_CA_CERT="" # PEM CA bundle; omit to use the system trust store +export OVIRT_STORAGE_FREE_PCT="10" # min free % per storage domain +export OVIRT_EVENT_LOOKBACK="1h" # window for critical events +export OVIRT_SNAPSHOT_MAX_AGE="7d" # stale snapshot threshold +export MAX_PAUSED_VMS="0" # max paused/unknown VMs considered healthy +export OVIRT_ENGINE_NAME="prod-ovirt" +``` + +> **TLS note:** oVirt engines typically present a self-signed CA. Provide the +> engine CA via `OVIRT_CA_CERT` to verify the connection. If omitted, the +> system trust store is used (the request will fail if the cert is not trusted). 
+ +## Discovery +oVirt has no RunWhen-discovered cloud resource type, so SLXs for this bundle are +created from the workspace config index rather than auto-discovered from a cloud +provider. The templates under `.runwhen/templates/` define the SLX, SLI, and +runbook; the generation rule under `.runwhen/generation-rules/` wires them to +the config index. Provide the `OVIRT_ENGINE_URL` config value and the +`OVIRT_USERNAME` / `OVIRT_PASSWORD` (and optional `OVIRT_CA_CERT`) workspace +secrets. + +## Testing +See the `.test` directory for how to run the SLI and runbook against a reachable +oVirt engine or lab environment. diff --git a/codebundles/ovirt-engine-health/cluster_health.sh b/codebundles/ovirt-engine-health/cluster_health.sh new file mode 100755 index 00000000..70e106f3 --- /dev/null +++ b/codebundles/ovirt-engine-health/cluster_health.sh @@ -0,0 +1,34 @@ +#!/bin/bash +# Assess oVirt cluster health by cross-referencing each cluster's hosts. A +# cluster is flagged when it has one or more hosts in a non-up, non-maintenance +# state (i.e. hosts that should be serving but are not). + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/ovirt_auth.sh" + +clusters_json=$(ovirt_get "/clusters") +hosts_json=$(ovirt_get "/hosts") + +echo "${clusters_json}" | jq --argjson hostsdoc "${hosts_json}" ' + ($hostsdoc.host // []) as $allhosts | + def down_in($cid): [ $allhosts[] + | select((.cluster.id // "") == $cid) + | select(((.status // "") | IN("up","maintenance","preparing_for_maintenance")) | not) ]; + { + clusters: [ .cluster[]? | .id as $cid | { + name: .name, + id: $cid, + total_hosts: ([ $allhosts[] | select((.cluster.id // "") == $cid) ] | length), + down_hosts: (down_in($cid) | length) + } ], + problem_clusters: [ .cluster[]? 
| .id as $cid + | (down_in($cid)) as $down + | select(($down | length) > 0) + | { + name: .name, + id: $cid, + down_hosts: ($down | length), + down_host_names: [ $down[] | .name ] + } + ] + }' diff --git a/codebundles/ovirt-engine-health/engine_health.sh b/codebundles/ovirt-engine-health/engine_health.sh new file mode 100755 index 00000000..a8537295 --- /dev/null +++ b/codebundles/ovirt-engine-health/engine_health.sh @@ -0,0 +1,18 @@ +#!/bin/bash +# Check that the oVirt engine API is reachable and responding with valid data. +# Emits a JSON summary; exits non-zero (with an {"error": ...} object) if the +# token cannot be obtained. + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/ovirt_auth.sh" + +api_json=$(ovirt_get "") + +echo "${api_json}" | jq '{ + reachable: (has("product_info") or has("summary")), + product: (.product_info.name // "oVirt"), + version: (.product_info.version.full_version // ""), + vms_total: (.summary.vms.total // null), + hosts_total: (.summary.hosts.total // null), + storage_domains_total: (.summary.storage_domains.total // null) +}' 2>/dev/null || echo '{"reachable": false}' diff --git a/codebundles/ovirt-engine-health/host_status.sh b/codebundles/ovirt-engine-health/host_status.sh new file mode 100755 index 00000000..850ae73a --- /dev/null +++ b/codebundles/ovirt-engine-health/host_status.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# List oVirt hypervisor hosts and flag any that are not in a healthy state. +# Hosts in 'maintenance'/'preparing_for_maintenance' are reported separately +# and are NOT treated as unhealthy (they are operator-intended states). + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/ovirt_auth.sh" + +hosts_json=$(ovirt_get "/hosts") + +echo "${hosts_json}" | jq ' + def unhealthy: IN("non_operational","non_responsive","error","install_failed","connecting","down","reboot"); + { + total: ([.host[]?] | length), + hosts: [ .host[]? 
| { + name: .name, + id: .id, + status: (.status // "unknown"), + cluster_id: (.cluster.id // ""), + address: (.address // "") + } ], + unhealthy_hosts: [ .host[]? | select((.status // "") | unhealthy) | { + name: .name, + id: .id, + status: .status, + address: (.address // "") + } ], + maintenance_hosts: [ .host[]? | select((.status // "") | IN("maintenance","preparing_for_maintenance")) | { + name: .name, + status: .status + } ] + }' diff --git a/codebundles/ovirt-engine-health/ovirt_auth.sh b/codebundles/ovirt-engine-health/ovirt_auth.sh new file mode 100755 index 00000000..5858aa75 --- /dev/null +++ b/codebundles/ovirt-engine-health/ovirt_auth.sh @@ -0,0 +1,82 @@ +#!/bin/bash +# --------------------------------------------------------------------------- +# Shared oVirt SSO authentication + REST API helper. +# +# Source this from each check script: +# SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# source "${SCRIPT_DIR}/ovirt_auth.sh" +# hosts=$(ovirt_get "/hosts") +# +# Requires env: OVIRT_ENGINE_URL, OVIRT_USERNAME, OVIRT_PASSWORD +# Optional env: OVIRT_CA_CERT (PEM CA bundle; if unset, the system trust +# store is used). +# +# On any failure to authenticate, prints an {"error": "..."} JSON object to +# stdout and exits non-zero so the calling task surfaces an engine issue. +# --------------------------------------------------------------------------- + +if [ -z "${OVIRT_ENGINE_URL}" ] || [ -z "${OVIRT_USERNAME}" ] || [ -z "${OVIRT_PASSWORD}" ]; then + echo '{"error": "OVIRT_ENGINE_URL, OVIRT_USERNAME and OVIRT_PASSWORD must be set."}' + exit 1 +fi + +# Strip any trailing slash so path concatenation is predictable. +OVIRT_ENGINE_URL="${OVIRT_ENGINE_URL%/}" + +# TLS handling: use a caller-supplied CA bundle when present, otherwise rely on +# the system trust store. 
+OVIRT_CURL_TLS_OPTS=() +if [ -n "${OVIRT_CA_CERT}" ]; then + OVIRT_CA_FILE="$(mktemp)" + printf '%s\n' "${OVIRT_CA_CERT}" > "${OVIRT_CA_FILE}" + OVIRT_CURL_TLS_OPTS=(--cacert "${OVIRT_CA_FILE}") + trap 'rm -f "${OVIRT_CA_FILE}"' EXIT +fi + +# Obtain an SSO bearer token (grant_type=password, scope=ovirt-app-api). +_ovirt_token_response=$(curl -s "${OVIRT_CURL_TLS_OPTS[@]}" \ + --request POST \ + --header "Accept: application/json" \ + --header "Content-Type: application/x-www-form-urlencoded" \ + --data-urlencode "grant_type=password" \ + --data-urlencode "scope=ovirt-app-api" \ + --data-urlencode "username=${OVIRT_USERNAME}" \ + --data-urlencode "password=${OVIRT_PASSWORD}" \ + "${OVIRT_ENGINE_URL}/ovirt-engine/sso/oauth/token") + +OVIRT_TOKEN=$(echo "${_ovirt_token_response}" | jq -r '.access_token // empty' 2>/dev/null) + +if [ -z "${OVIRT_TOKEN}" ]; then + err=$(echo "${_ovirt_token_response}" | jq -r '.error_description // .error // "unknown error (check OVIRT_ENGINE_URL, credentials and TLS)"' 2>/dev/null) + jq -nc --arg err "${err:-unknown error (check OVIRT_ENGINE_URL, credentials and TLS)}" '{error: ("Failed to obtain oVirt SSO token: " + $err)}' + exit 1 +fi + +# ovirt_get +# e.g. ovirt_get "/hosts" -> GET {engine}/ovirt-engine/api/hosts +ovirt_get() { + local path="$1" + curl -s "${OVIRT_CURL_TLS_OPTS[@]}" \ + --header "Authorization: Bearer ${OVIRT_TOKEN}" \ + --header "Accept: application/json" \ + --header "Version: 4" \ + "${OVIRT_ENGINE_URL}/ovirt-engine/api${path}" +} + +# ovirt_duration_to_seconds +# Accepts forms like 30s, 10m, 2h, 7d, 1w. Defaults the unit to hours. 
+ovirt_duration_to_seconds() {
+  local s num unit
+  s=$(echo "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
+  num=$(echo "$s" | grep -o '^[0-9]\+')
+  unit=$(echo "$s" | sed 's/^[0-9]\+//')
+  [ -z "$num" ] && { echo 0; return; }
+  case "$unit" in
+    s|sec|secs|second|seconds) echo "$num" ;;
+    m|min|mins|minute|minutes) echo $((num * 60)) ;;
+    h|hr|hrs|hour|hours) echo $((num * 3600)) ;;
+    d|day|days) echo $((num * 86400)) ;;
+    w|week|weeks) echo $((num * 604800)) ;;
+    *) echo $((num * 3600)) ;;
+  esac
+}
diff --git a/codebundles/ovirt-engine-health/recent_events.sh b/codebundles/ovirt-engine-health/recent_events.sh
new file mode 100755
index 00000000..fb64f41b
--- /dev/null
+++ b/codebundles/ovirt-engine-health/recent_events.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+# List recent critical (error/alert) oVirt engine events within a lookback
+# window (arg 1, e.g. 1h / 30m / 1d; default 1h). The engine search query
+# narrows to severity above warning; results are then filtered client-side by
+# event time so the window is honoured precisely.
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/ovirt_auth.sh"
+
+LOOKBACK="${1:-1h}"
+SECONDS_BACK=$(ovirt_duration_to_seconds "${LOOKBACK}")
+CUTOFF_MS=$(( ($(date +%s) - SECONDS_BACK) * 1000 ))
+
+# severity>warning returns error + alert events. max caps the payload size.
+events_json=$(ovirt_get "/events?search=severity%3Ewarning&max=200")
+
+echo "${events_json}" | jq --argjson cutoff "${CUTOFF_MS}" --arg lookback "${LOOKBACK}" '
+  {
+    lookback: $lookback,
+    critical_events: [ .event[]?
+ | select(((.time // 0) | tostring | (try tonumber catch 0)) >= $cutoff) + | select((.severity // "") | IN("error","alert")) + | { + id: .id, + severity: .severity, + time: (.time // ""), + code: (.code // ""), + description: (.description // ""), + host: (.host.name // ""), + vm: (.vm.name // ""), + storage_domain: (.storage_domain.name // "") + } + ] + }' diff --git a/codebundles/ovirt-engine-health/runbook.robot b/codebundles/ovirt-engine-health/runbook.robot new file mode 100644 index 00000000..34db083e --- /dev/null +++ b/codebundles/ovirt-engine-health/runbook.robot @@ -0,0 +1,327 @@ +*** Settings *** +Documentation Triage the health of an oVirt virtualization environment via the engine REST API. Inspects +... engine reachability, host status, VM status, storage domain capacity, cluster health, +... recent critical events, and stale VM snapshots, raising an issue for each problem found. +Metadata Author prathamesh +Metadata Display Name oVirt Engine Health +Metadata Supports oVirt RHV OLVM Virtualization + +Library BuiltIn +Library RW.Core +Library RW.CLI +Library RW.platform +Library String + +Suite Setup Suite Initialization + +*** Tasks *** +Check oVirt Engine `${OVIRT_ENGINE_NAME}` Reachability + [Documentation] Verify the oVirt engine API is reachable and an SSO token can be obtained. + [Tags] ovirt engine health data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/engine_health.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${reachable}= Set Variable ${data.get('reachable', False)} + IF not ${reachable} + ${err}= Set Variable ${data.get('error', 'Engine API did not return valid data.')} + RW.Core.Add Issue + ... 
severity=1 + ... expected=oVirt engine `${OVIRT_ENGINE_NAME}` should be reachable and return valid API data + ... actual=oVirt engine `${OVIRT_ENGINE_NAME}` is unreachable or authentication failed + ... title=oVirt Engine `${OVIRT_ENGINE_NAME}` Unreachable + ... reproduce_hint=Run engine_health.sh against ${OVIRT_ENGINE_URL} + ... details=${err} + ... next_steps=Verify OVIRT_ENGINE_URL is correct and the engine is running.\nConfirm OVIRT_USERNAME/OVIRT_PASSWORD and the auth profile (e.g. @internal) are valid.\nIf using a self-signed certificate, provide OVIRT_CA_CERT or verify the CA bundle.\nCheck network connectivity and that the ovirt-engine service is up. + ELSE + ${product}= Set Variable ${data.get('product', 'oVirt')} + ${version}= Set Variable ${data.get('version', 'unknown')} + RW.Core.Add Pre To Report oVirt Engine `${OVIRT_ENGINE_NAME}` reachable.\nProduct: ${product}\nVersion: ${version} + END + +Check oVirt Host Status in `${OVIRT_ENGINE_NAME}` + [Documentation] Identify hypervisor hosts that are not in a healthy state. + [Tags] ovirt hosts hypervisor data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/host_status.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${unhealthy}= Set Variable ${data.get('unhealthy_hosts', [])} + IF len(${unhealthy}) > 0 + ${json_str}= Evaluate json.dumps(${unhealthy}) json + ${table}= RW.CLI.Run Cli + ... 
cmd=echo '${json_str}' | jq -r '["Host", "Status", "Address"] as $h | $h, (.[] | [.name, .status, .address]) | @tsv' | column -t -s $'\t' + RW.Core.Add Pre To Report Unhealthy Hosts:\n=======================================\n${table.stdout} + FOR ${host} IN @{unhealthy} + RW.Core.Add Issue + ... severity=2 + ... expected=oVirt host `${host['name']}` should be in the 'up' state + ... actual=oVirt host `${host['name']}` is in state `${host['status']}` + ... title=oVirt Host `${host['name']}` Not Operational on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Inspect host `${host['name']}` in the oVirt administration portal + ... details=Host: ${host['name']}\nStatus: ${host['status']}\nAddress: ${host['address']} + ... next_steps=Open the host in the oVirt portal and review its events.\nCheck connectivity between the engine and the host (VDSM/ovirt-host service).\nFor non_operational hosts, verify required networks and storage domains are attached.\nIf the host is unresponsive, confirm power/management (fencing) and consider reinstalling or re-enrolling. + END + ELSE + RW.Core.Add Pre To Report No unhealthy oVirt hosts found. + END + +Check oVirt VM Status in `${OVIRT_ENGINE_NAME}` + [Documentation] Identify VMs in a paused, unknown, or not-responding state. + [Tags] ovirt vms data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/vm_status.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${problem_vms}= Set Variable ${data.get('problem_vms', [])} + IF len(${problem_vms}) > 0 + ${json_str}= Evaluate json.dumps(${problem_vms}) json + ${table}= RW.CLI.Run Cli + ... 
cmd=echo '${json_str}' | jq -r '["VM", "Status"] as $h | $h, (.[] | [.name, .status]) | @tsv' | column -t -s $'\t' + RW.Core.Add Pre To Report Problem VMs:\n=======================================\n${table.stdout} + FOR ${vm} IN @{problem_vms} + RW.Core.Add Issue + ... severity=2 + ... expected=oVirt VM `${vm['name']}` should be running or cleanly powered off + ... actual=oVirt VM `${vm['name']}` is in state `${vm['status']}` + ... title=oVirt VM `${vm['name']}` in Problematic State on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Inspect VM `${vm['name']}` in the oVirt administration portal + ... details=VM: ${vm['name']}\nStatus: ${vm['status']}\nHost ID: ${vm['host_id']} + ... next_steps=Review the VM's events in the oVirt portal.\nPaused VMs are commonly caused by storage I/O errors - check the underlying storage domain.\nFor 'unknown'/'not_responding' VMs, verify the host running the VM is healthy.\nOnce the underlying issue is resolved, resume or restart the VM. + END + ELSE + RW.Core.Add Pre To Report No VMs in a problematic state found. + END + +Check oVirt Storage Domain Capacity in `${OVIRT_ENGINE_NAME}` + [Documentation] Identify storage domains that are inactive or low on free space. + [Tags] ovirt storage capacity data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/storage_domains.sh + ... cmd_override=${CURDIR}/storage_domains.sh ${OVIRT_STORAGE_FREE_PCT} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${problem_domains}= Set Variable ${data.get('problem_domains', [])} + IF len(${problem_domains}) > 0 + ${json_str}= Evaluate json.dumps(${problem_domains}) json + ${table}= RW.CLI.Run Cli + ... 
cmd=echo '${json_str}' | jq -r '["Domain", "Type", "Status", "Free %"] as $h | $h, (.[] | [.name, .type, .external_status, (.free_pct|tostring)]) | @tsv' | column -t -s $'\t' + RW.Core.Add Pre To Report Problem Storage Domains:\n=======================================\n${table.stdout} + FOR ${sd} IN @{problem_domains} + RW.Core.Add Issue + ... severity=2 + ... expected=oVirt storage domain `${sd['name']}` should be active with at least ${OVIRT_STORAGE_FREE_PCT}% free + ... actual=oVirt storage domain `${sd['name']}` has status `${sd['external_status']}` with ${sd['free_pct']}% free + ... title=oVirt Storage Domain `${sd['name']}` Needs Attention on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Inspect storage domain `${sd['name']}` in the oVirt administration portal + ... details=Domain: ${sd['name']}\nType: ${sd['type']}\nExternal status: ${sd['external_status']}\nFree: ${sd['free_pct']}% + ... next_steps=If the domain is inactive/in error, reactivate it and check the underlying storage backend connectivity.\nIf free space is low, delete unused disks/snapshots/templates or extend the domain.\nReview storage-related engine events for I/O or connectivity errors. + END + ELSE + RW.Core.Add Pre To Report All oVirt storage domains are active and above the free-space threshold. + END + +Check oVirt Cluster Health in `${OVIRT_ENGINE_NAME}` + [Documentation] Identify clusters with hosts in a non-up, non-maintenance state. + [Tags] ovirt clusters data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/cluster_health.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. 
WARN + ${data}= Create Dictionary + END + ${problem_clusters}= Set Variable ${data.get('problem_clusters', [])} + IF len(${problem_clusters}) > 0 + FOR ${cluster} IN @{problem_clusters} + ${down_names}= Evaluate ", ".join(${cluster['down_host_names']}) + RW.Core.Add Issue + ... severity=3 + ... expected=oVirt cluster `${cluster['name']}` should have all hosts up or in maintenance + ... actual=oVirt cluster `${cluster['name']}` has ${cluster['down_hosts']} host(s) down + ... title=oVirt Cluster `${cluster['name']}` Has Down Hosts on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Inspect cluster `${cluster['name']}` in the oVirt administration portal + ... details=Cluster: ${cluster['name']}\nDown hosts: ${down_names} + ... next_steps=Investigate each down host listed above and restore it to service.\nVerify cluster capacity remains sufficient for the running workload.\nCheck for fencing/power-management issues on the affected hosts. + END + ELSE + RW.Core.Add Pre To Report All oVirt clusters have healthy hosts. + END + +Check oVirt Recent Critical Events in `${OVIRT_ENGINE_NAME}` + [Documentation] Surface error/alert severity engine events within the lookback window. + [Tags] ovirt events data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/recent_events.sh + ... cmd_override=${CURDIR}/recent_events.sh ${OVIRT_EVENT_LOOKBACK} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${events}= Set Variable ${data.get('critical_events', [])} + IF len(${events}) > 0 + ${json_str}= Evaluate json.dumps(${events}) json + ${table}= RW.CLI.Run Cli + ... 
cmd=echo '${json_str}' | jq -r '["Severity", "Code", "Host", "VM", "Description"] as $h | $h, (.[] | [.severity, (.code|tostring), .host, .vm, .description]) | @tsv' | column -t -s $'\t' + RW.Core.Add Pre To Report Critical Events (last ${OVIRT_EVENT_LOOKBACK}):\n=======================================\n${table.stdout} + ${count}= Get Length ${events} + RW.Core.Add Issue + ... severity=3 + ... expected=No error/alert severity oVirt events in the last ${OVIRT_EVENT_LOOKBACK} + ... actual=${count} error/alert oVirt event(s) in the last ${OVIRT_EVENT_LOOKBACK} + ... title=oVirt Critical Events Detected on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Review the Events tab in the oVirt administration portal + ... details=${count} critical event(s) found. See the report table for details. + ... next_steps=Review each critical event and correlate it with the affected host, VM, or storage domain.\nAddress the root cause indicated by the event descriptions.\nIf events are recurring, investigate the underlying subsystem (storage, networking, fencing). + ELSE + RW.Core.Add Pre To Report No critical oVirt events in the last ${OVIRT_EVENT_LOOKBACK}. + END + +Check oVirt Stale VM Snapshots in `${OVIRT_ENGINE_NAME}` + [Documentation] Identify VM snapshots older than the configured maximum age. + [Tags] ovirt snapshots vms data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/stale_snapshots.sh + ... cmd_override=${CURDIR}/stale_snapshots.sh ${OVIRT_SNAPSHOT_MAX_AGE} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. 
WARN + ${data}= Create Dictionary + END + ${snapshots}= Set Variable ${data.get('stale_snapshots', [])} + IF len(${snapshots}) > 0 + ${json_str}= Evaluate json.dumps(${snapshots}) json + ${table}= RW.CLI.Run Cli + ... cmd=echo '${json_str}' | jq -r '["VM", "Snapshot", "Description"] as $h | $h, (.[] | [.vm, .snapshot_id, .description]) | @tsv' | column -t -s $'\t' + RW.Core.Add Pre To Report Stale Snapshots (older than ${OVIRT_SNAPSHOT_MAX_AGE}):\n=======================================\n${table.stdout} + ${count}= Get Length ${snapshots} + RW.Core.Add Issue + ... severity=4 + ... expected=No oVirt VM snapshots older than ${OVIRT_SNAPSHOT_MAX_AGE} + ... actual=${count} VM snapshot(s) older than ${OVIRT_SNAPSHOT_MAX_AGE} + ... title=Stale oVirt VM Snapshots on `${OVIRT_ENGINE_NAME}` + ... reproduce_hint=Review snapshots per VM in the oVirt administration portal + ... details=${count} stale snapshot(s) found. See the report table for details. + ... next_steps=Review each stale snapshot and confirm it is no longer needed.\nDelete (merge) obsolete snapshots to reclaim space and avoid long merge times.\nConsider a retention policy so snapshots are not left indefinitely. + ELSE + RW.Core.Add Pre To Report No stale VM snapshots older than ${OVIRT_SNAPSHOT_MAX_AGE} found. + END + +*** Keywords *** +Suite Initialization + ${OVIRT_ENGINE_URL}= RW.Core.Import User Variable OVIRT_ENGINE_URL + ... type=string + ... description=Base URL of your oVirt engine (without the /ovirt-engine path). + ... pattern=\w* + ... example=https://engine.example.com + ${OVIRT_USERNAME}= RW.Core.Import Secret OVIRT_USERNAME + ... type=string + ... description=oVirt engine username, including the auth profile. + ... pattern=\w* + ... example=admin@internal + ${OVIRT_PASSWORD}= RW.Core.Import Secret OVIRT_PASSWORD + ... type=string + ... description=Password for the oVirt engine user. + ... pattern=\w* + ... example=changeme + ${OVIRT_CA_CERT}= RW.Core.Import Secret OVIRT_CA_CERT + ... 
type=string + ... description=Optional PEM CA bundle to verify the engine TLS certificate. Leave blank to use the system trust store. + ... pattern=\w* + ... example=-----BEGIN CERTIFICATE-----... + ... default= + ${OVIRT_STORAGE_FREE_PCT}= RW.Core.Import User Variable OVIRT_STORAGE_FREE_PCT + ... type=string + ... description=Minimum free space percentage per storage domain before it is considered unhealthy. + ... pattern=\d+ + ... example=10 + ... default=10 + ${OVIRT_EVENT_LOOKBACK}= RW.Core.Import User Variable OVIRT_EVENT_LOOKBACK + ... type=string + ... description=Lookback window for critical events, e.g. 30m, 1h, 1d. + ... pattern=\w* + ... example=1h + ... default=1h + ${OVIRT_SNAPSHOT_MAX_AGE}= RW.Core.Import User Variable OVIRT_SNAPSHOT_MAX_AGE + ... type=string + ... description=Maximum age before a VM snapshot is considered stale, e.g. 24h, 7d, 2w. + ... pattern=\w* + ... example=7d + ... default=7d + ${MAX_PAUSED_VMS}= RW.Core.Import User Variable MAX_PAUSED_VMS + ... type=string + ... description=Maximum number of paused/unknown VMs to still consider healthy. + ... pattern=\d+ + ... example=0 + ... default=0 + ${OVIRT_ENGINE_NAME}= RW.Core.Import User Variable OVIRT_ENGINE_NAME + ... type=string + ... description=A friendly name for this oVirt engine, used in task and report titles. + ... pattern=\w* + ... example=prod-ovirt + ... 
default=ovirt-engine + Set Suite Variable ${env} {"OVIRT_ENGINE_URL":"${OVIRT_ENGINE_URL}"} + Set Suite Variable ${OVIRT_ENGINE_URL} ${OVIRT_ENGINE_URL} + Set Suite Variable ${OVIRT_USERNAME} ${OVIRT_USERNAME} + Set Suite Variable ${OVIRT_PASSWORD} ${OVIRT_PASSWORD} + Set Suite Variable ${OVIRT_CA_CERT} ${OVIRT_CA_CERT} + Set Suite Variable ${OVIRT_STORAGE_FREE_PCT} ${OVIRT_STORAGE_FREE_PCT} + Set Suite Variable ${OVIRT_EVENT_LOOKBACK} ${OVIRT_EVENT_LOOKBACK} + Set Suite Variable ${OVIRT_SNAPSHOT_MAX_AGE} ${OVIRT_SNAPSHOT_MAX_AGE} + Set Suite Variable ${MAX_PAUSED_VMS} ${MAX_PAUSED_VMS} + Set Suite Variable ${OVIRT_ENGINE_NAME} ${OVIRT_ENGINE_NAME} diff --git a/codebundles/ovirt-engine-health/sli.robot b/codebundles/ovirt-engine-health/sli.robot new file mode 100644 index 00000000..a6337a28 --- /dev/null +++ b/codebundles/ovirt-engine-health/sli.robot @@ -0,0 +1,235 @@ +*** Settings *** +Documentation Measures the health of an oVirt virtualization environment via the engine REST API: +... engine reachability, host status, VM status, storage domain capacity, cluster health, +... recent critical events, and stale VM snapshots. Pushes a composite 0-1 health score. +Metadata Author prathamesh +Metadata Display Name oVirt Engine Health +Metadata Supports oVirt RHV OLVM Virtualization + +Library BuiltIn +Library RW.Core +Library RW.CLI +Library RW.platform + +Suite Setup Suite Initialization + +*** Tasks *** +Check oVirt Engine `${OVIRT_ENGINE_NAME}` Reachability + [Documentation] Verify the oVirt engine API is reachable and an SSO token can be obtained. + [Tags] ovirt engine health data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/engine_health.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... 
secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${reachable}= Set Variable ${data.get('reachable', False)} + ${engine_score}= Evaluate 1 if ${reachable} else 0 + Set Global Variable ${engine_score} + RW.Core.Push Metric ${engine_score} sub_name=engine_reachable + +Check oVirt Host Status in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when no hypervisor hosts are in an unhealthy (non-operational) state. + [Tags] ovirt hosts hypervisor data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/host_status.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${unhealthy}= Set Variable ${data.get('unhealthy_hosts', [])} + ${host_score}= Evaluate 1 if len($unhealthy) == 0 else 0 + Set Global Variable ${host_score} + RW.Core.Push Metric ${host_score} sub_name=host_status + +Check oVirt VM Status in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when paused/unknown/not-responding VMs are within the allowed maximum. + [Tags] ovirt vms data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/vm_status.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. 
WARN + ${data}= Create Dictionary + END + ${problem_vms}= Set Variable ${data.get('problem_vms', [])} + ${vm_score}= Evaluate 1 if len($problem_vms) <= int(${MAX_PAUSED_VMS}) else 0 + Set Global Variable ${vm_score} + RW.Core.Push Metric ${vm_score} sub_name=vm_status + +Check oVirt Storage Domain Capacity in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when all storage domains are active and above the free-space threshold. + [Tags] ovirt storage capacity data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/storage_domains.sh + ... cmd_override=${CURDIR}/storage_domains.sh ${OVIRT_STORAGE_FREE_PCT} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${problem_domains}= Set Variable ${data.get('problem_domains', [])} + ${storage_score}= Evaluate 1 if len($problem_domains) == 0 else 0 + Set Global Variable ${storage_score} + RW.Core.Push Metric ${storage_score} sub_name=storage_capacity + +Check oVirt Cluster Health in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when no cluster has hosts in a non-up, non-maintenance state. + [Tags] ovirt clusters data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/cluster_health.sh + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. 
WARN + ${data}= Create Dictionary + END + ${problem_clusters}= Set Variable ${data.get('problem_clusters', [])} + ${cluster_score}= Evaluate 1 if len($problem_clusters) == 0 else 0 + Set Global Variable ${cluster_score} + RW.Core.Push Metric ${cluster_score} sub_name=cluster_health + +Check oVirt Recent Critical Events in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when there are no error/alert severity events in the lookback window. + [Tags] ovirt events data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/recent_events.sh + ... cmd_override=${CURDIR}/recent_events.sh ${OVIRT_EVENT_LOOKBACK} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. WARN + ${data}= Create Dictionary + END + ${events}= Set Variable ${data.get('critical_events', [])} + ${events_score}= Evaluate 1 if len($events) == 0 else 0 + Set Global Variable ${events_score} + RW.Core.Push Metric ${events_score} sub_name=critical_events + +Check oVirt Stale VM Snapshots in `${OVIRT_ENGINE_NAME}` + [Documentation] Score 1 when there are no VM snapshots older than the configured maximum age. + [Tags] ovirt snapshots vms data:config + ${rsp}= RW.CLI.Run Bash File + ... bash_file=${CURDIR}/stale_snapshots.sh + ... cmd_override=${CURDIR}/stale_snapshots.sh ${OVIRT_SNAPSHOT_MAX_AGE} + ... env=${env} + ... include_in_history=False + ... secret__ovirt_username=${OVIRT_USERNAME} + ... secret__ovirt_password=${OVIRT_PASSWORD} + ... secret__ovirt_ca_cert=${OVIRT_CA_CERT} + TRY + ${data}= Evaluate json.loads(r'''${rsp.stdout}''') json + EXCEPT + Log Failed to load JSON payload, defaulting to empty object. 
WARN + ${data}= Create Dictionary + END + ${snapshots}= Set Variable ${data.get('stale_snapshots', [])} + ${snapshot_score}= Evaluate 1 if len($snapshots) == 0 else 0 + Set Global Variable ${snapshot_score} + RW.Core.Push Metric ${snapshot_score} sub_name=stale_snapshots + +Generate oVirt Engine `${OVIRT_ENGINE_NAME}` Health Score + [Documentation] Average the individual check scores into a composite health score. + ${health_score}= Evaluate (${engine_score} + ${host_score} + ${vm_score} + ${storage_score} + ${cluster_score} + ${events_score} + ${snapshot_score}) / 7 + ${health_score}= Convert To Number ${health_score} 2 + RW.Core.Push Metric ${health_score} + +*** Keywords *** +Suite Initialization + ${OVIRT_ENGINE_URL}= RW.Core.Import User Variable OVIRT_ENGINE_URL + ... type=string + ... description=Base URL of your oVirt engine (without the /ovirt-engine path). + ... pattern=\w* + ... example=https://engine.example.com + ${OVIRT_USERNAME}= RW.Core.Import Secret OVIRT_USERNAME + ... type=string + ... description=oVirt engine username, including the auth profile. + ... pattern=\w* + ... example=admin@internal + ${OVIRT_PASSWORD}= RW.Core.Import Secret OVIRT_PASSWORD + ... type=string + ... description=Password for the oVirt engine user. + ... pattern=\w* + ... example=changeme + ${OVIRT_CA_CERT}= RW.Core.Import Secret OVIRT_CA_CERT + ... type=string + ... description=Optional PEM CA bundle to verify the engine TLS certificate. Leave blank to use the system trust store. + ... pattern=\w* + ... example=-----BEGIN CERTIFICATE-----... + ... default= + ${OVIRT_STORAGE_FREE_PCT}= RW.Core.Import User Variable OVIRT_STORAGE_FREE_PCT + ... type=string + ... description=Minimum free space percentage per storage domain before it is considered unhealthy. + ... pattern=\d+ + ... example=10 + ... default=10 + ${OVIRT_EVENT_LOOKBACK}= RW.Core.Import User Variable OVIRT_EVENT_LOOKBACK + ... type=string + ... description=Lookback window for critical events, e.g. 30m, 1h, 1d. 
+ ... pattern=\w* + ... example=1h + ... default=1h + ${OVIRT_SNAPSHOT_MAX_AGE}= RW.Core.Import User Variable OVIRT_SNAPSHOT_MAX_AGE + ... type=string + ... description=Maximum age before a VM snapshot is considered stale, e.g. 24h, 7d, 2w. + ... pattern=\w* + ... example=7d + ... default=7d + ${MAX_PAUSED_VMS}= RW.Core.Import User Variable MAX_PAUSED_VMS + ... type=string + ... description=Maximum number of paused/unknown VMs to still consider healthy. + ... pattern=\d+ + ... example=0 + ... default=0 + ${OVIRT_ENGINE_NAME}= RW.Core.Import User Variable OVIRT_ENGINE_NAME + ... type=string + ... description=A friendly name for this oVirt engine, used in task and report titles. + ... pattern=\w* + ... example=prod-ovirt + ... default=ovirt-engine + Set Suite Variable ${env} {"OVIRT_ENGINE_URL":"${OVIRT_ENGINE_URL}"} + Set Suite Variable ${OVIRT_ENGINE_URL} ${OVIRT_ENGINE_URL} + Set Suite Variable ${OVIRT_USERNAME} ${OVIRT_USERNAME} + Set Suite Variable ${OVIRT_PASSWORD} ${OVIRT_PASSWORD} + Set Suite Variable ${OVIRT_CA_CERT} ${OVIRT_CA_CERT} + Set Suite Variable ${OVIRT_STORAGE_FREE_PCT} ${OVIRT_STORAGE_FREE_PCT} + Set Suite Variable ${OVIRT_EVENT_LOOKBACK} ${OVIRT_EVENT_LOOKBACK} + Set Suite Variable ${OVIRT_SNAPSHOT_MAX_AGE} ${OVIRT_SNAPSHOT_MAX_AGE} + Set Suite Variable ${MAX_PAUSED_VMS} ${MAX_PAUSED_VMS} + Set Suite Variable ${OVIRT_ENGINE_NAME} ${OVIRT_ENGINE_NAME} diff --git a/codebundles/ovirt-engine-health/stale_snapshots.sh b/codebundles/ovirt-engine-health/stale_snapshots.sh new file mode 100755 index 00000000..dddb4b47 --- /dev/null +++ b/codebundles/ovirt-engine-health/stale_snapshots.sh @@ -0,0 +1,35 @@ +#!/bin/bash +# Find oVirt VM snapshots older than a max-age threshold (arg 1, e.g. 7d / 24h; +# default 7d). The always-present "Active VM" snapshot (snapshot_type=active) +# is excluded. Old snapshots are a common cause of disk bloat and slow merges. 
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/ovirt_auth.sh"
+
+MAX_AGE="${1:-7d}"
+SECONDS_BACK=$(ovirt_duration_to_seconds "${MAX_AGE}")
+CUTOFF_MS=$(( ($(date +%s) - SECONDS_BACK) * 1000 ))
+
+vms_json=$(ovirt_get "/vms")
+
+stale="[]"
+while IFS=$'\t' read -r vid vname; do
+  [ -z "${vid}" ] && continue
+  snaps=$(ovirt_get "/vms/${vid}/snapshots")
+  vm_stale=$(echo "${snaps}" | jq --arg vm "${vname}" --argjson cutoff "${CUTOFF_MS}" '
+    [ .snapshot[]?
+      | select((.snapshot_type // "") != "active")
+      | ((.date // 0) | tostring | (try tonumber catch 0)) as $d
+      | select($d > 0 and $d < $cutoff)
+      | {
+          vm: $vm,
+          snapshot_id: .id,
+          description: (.description // ""),
+          date_ms: $d
+        }
+    ]' 2>/dev/null)
+  [ -z "${vm_stale}" ] && vm_stale="[]"
+  stale=$(echo "${stale} ${vm_stale}" | jq -s 'add')
+done < <(echo "${vms_json}" | jq -r '.vm[]? | [.id, .name] | @tsv')
+
+echo "{\"max_age\": \"${MAX_AGE}\", \"stale_snapshots\": ${stale}}"
diff --git a/codebundles/ovirt-engine-health/storage_domains.sh b/codebundles/ovirt-engine-health/storage_domains.sh
new file mode 100755
index 00000000..1b7f5457
--- /dev/null
+++ b/codebundles/ovirt-engine-health/storage_domains.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+# Report oVirt storage domain capacity and status. A domain is flagged when its
+# external_status is not 'ok', or when its free space falls below the supplied
+# percentage threshold (arg 1, default 10).
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/ovirt_auth.sh"
+
+THRESHOLD="${1:-10}"
+
+sd_json=$(ovirt_get "/storagedomains")
+
+echo "${sd_json}" | jq --argjson threshold "${THRESHOLD}" '
+  def free_pct: (.available // 0) as $a | (.used // 0) as $u
+    | (if ($a + $u) > 0 then ($a / ($a + $u) * 100) else 100 end);
+  {
+    threshold_pct: $threshold,
+    storage_domains: [ .storage_domain[]? | {
+      name: .name,
+      id: .id,
+      type: (.type // ""),
+      external_status: (.external_status // "n/a"),
+      available_bytes: (.available // 0),
+      used_bytes: (.used // 0),
+      free_pct: (free_pct | floor)
+    } ],
+    problem_domains: [ .storage_domain[]?
+      | (free_pct) as $fp
+      | select( ((.external_status // "ok") != "ok") or ($fp < $threshold) )
+      | {
+          name: .name,
+          id: .id,
+          type: (.type // ""),
+          external_status: (.external_status // "n/a"),
+          free_pct: ($fp | floor),
+          available_bytes: (.available // 0)
+        }
+    ]
+  }'
diff --git a/codebundles/ovirt-engine-health/vm_status.sh b/codebundles/ovirt-engine-health/vm_status.sh
new file mode 100755
index 00000000..79888064
--- /dev/null
+++ b/codebundles/ovirt-engine-health/vm_status.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# List oVirt virtual machines and flag those in a problematic runtime state.
+# 'down' VMs are intentionally NOT flagged (many VMs are deliberately powered
+# off); 'paused', 'unknown' and 'not_responding' indicate real problems
+# (commonly storage I/O errors or a lost host connection).
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/ovirt_auth.sh"
+
+vms_json=$(ovirt_get "/vms")
+
+echo "${vms_json}" | jq '
+  def problem: IN("paused","unknown","not_responding");
+  {
+    total: ([.vm[]?] | length),
+    problem_vms: [ .vm[]? | select((.status // "") | problem) | {
+      name: .name,
+      id: .id,
+      status: .status,
+      cluster_id: (.cluster.id // ""),
+      host_id: (.host.id // "")
+    } ]
+  }'

From e8325fbb714efd2ddde0359973acde771c695890 Mon Sep 17 00:00:00 2001
From: Prathamesh Lohakare
Date: Wed, 3 Jun 2026 15:25:30 +0530
Subject: [PATCH 3/7] Add dependency-free mock oVirt engine for testing

Stdlib-only mock REST server serving the SSO token endpoint and all 7 API
endpoints the bundle calls, with healthy/unhealthy scenarios and
now-relative timestamps so event/snapshot windows behave realistically.
Wired into the .test Taskfile (mock, test-mock, run-sli-mock) with a
Dockerfile and README.
Verified all check scripts end-to-end against both scenarios.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../ovirt-engine-health/.test/README.md       |  16 ++
 .../ovirt-engine-health/.test/Taskfile.yaml   |  48 +++++
 .../ovirt-engine-health/.test/mock/Dockerfile |   7 +
 .../ovirt-engine-health/.test/mock/README.md  |  64 ++++++
 .../.test/mock/ovirt_mock.py                  | 204 ++++++++++++++++++
 5 files changed, 339 insertions(+)
 create mode 100644 codebundles/ovirt-engine-health/.test/mock/Dockerfile
 create mode 100644 codebundles/ovirt-engine-health/.test/mock/README.md
 create mode 100644 codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py

diff --git a/codebundles/ovirt-engine-health/.test/README.md b/codebundles/ovirt-engine-health/.test/README.md
index c8cac5eb..a96fc966 100644
--- a/codebundles/ovirt-engine-health/.test/README.md
+++ b/codebundles/ovirt-engine-health/.test/README.md
@@ -47,3 +47,19 @@ task clean # remove robot output dirs
 
 `smoke-scripts` is the fastest way to confirm connectivity and that the engine's
 JSON shape matches what the scripts expect, without Robot Framework.
+
+## No engine handy? Use the mock
+
+`mock/` contains a dependency-free mock oVirt engine so you can exercise the
+full bundle flow with no real engine and no cloud cost:
+
+```bash
+task test-mock                        # start mock, run all check scripts, tear down
+task test-mock MOCK_SCENARIO=healthy  # nominal data (SLI score == 1, no issues)
+task mock                             # run mock in the foreground on :8080
+task run-sli-mock                     # run sli.robot against the mock (needs RW libs)
+```
+
+See `mock/README.md` for details and the scenarios it ships. The mock validates
+the bundle's wiring and parsing against the documented v4 API shape; it does not
+replace a one-time check against a real engine.
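The mock-based tasks in this patch start `ovirt_mock.py` in the background and then wait a fixed `sleep 1` before running the checks. A minimal sketch of a readiness probe that could replace the fixed wait is shown here; `wait_for_port` and its parameters are hypothetical names, not part of the bundle, and the approach (polling a TCP connect) is an assumption about what "ready" means for the mock.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Poll until a TCP connection to host:port succeeds or the timeout expires.

    Hypothetical helper: returns True once something accepts connections,
    False if nothing does before the deadline.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused)
            # while the server is not yet listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In a wrapper script this would be called as `wait_for_port("localhost", 8088)` after launching the mock, failing fast if it returns `False` rather than letting every check script report an unreachable engine.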
diff --git a/codebundles/ovirt-engine-health/.test/Taskfile.yaml b/codebundles/ovirt-engine-health/.test/Taskfile.yaml
index 0a67f0fc..5db7d646 100644
--- a/codebundles/ovirt-engine-health/.test/Taskfile.yaml
+++ b/codebundles/ovirt-engine-health/.test/Taskfile.yaml
@@ -97,6 +97,54 @@ tasks:
           ../runbook.robot
     silent: true
 
+  mock:
+    desc: "Run the mock oVirt engine in the foreground (Ctrl-C to stop). Set MOCK_SCENARIO=healthy|unhealthy."
+    cmds:
+      - |
+        MOCK_SCENARIO="${MOCK_SCENARIO:-unhealthy}" MOCK_PORT="${MOCK_PORT:-8080}" python3 mock/ovirt_mock.py
+
+  test-mock:
+    desc: "Start the mock, run every check script against it, then tear it down. Set MOCK_SCENARIO=healthy|unhealthy."
+    cmds:
+      - |
+        SCENARIO="${MOCK_SCENARIO:-unhealthy}"
+        PORT="${MOCK_PORT:-8088}"
+        MOCK_SCENARIO="$SCENARIO" MOCK_PORT="$PORT" python3 mock/ovirt_mock.py &
+        MOCK_PID=$!
+        trap 'kill $MOCK_PID 2>/dev/null' EXIT
+        sleep 1
+        export OVIRT_ENGINE_URL="http://localhost:${PORT}"
+        export OVIRT_USERNAME="admin@internal"
+        export OVIRT_PASSWORD="mock"
+        echo "===== scenario: ${SCENARIO} ====="
+        echo "== engine_health =="; ../engine_health.sh | jq .
+        echo "== host_status =="; ../host_status.sh | jq '{total, unhealthy_hosts, maintenance_hosts}'
+        echo "== vm_status =="; ../vm_status.sh | jq '{total, problem_vms}'
+        echo "== storage_domains =="; ../storage_domains.sh "${OVIRT_STORAGE_FREE_PCT:-10}" | jq '{problem_domains}'
+        echo "== cluster_health =="; ../cluster_health.sh | jq '{problem_clusters}'
+        echo "== recent_events =="; ../recent_events.sh "${OVIRT_EVENT_LOOKBACK:-1h}" | jq '{critical_events}'
+        echo "== stale_snapshots =="; ../stale_snapshots.sh "${OVIRT_SNAPSHOT_MAX_AGE:-7d}" | jq '{stale_snapshots}'
+    silent: true
+
+  run-sli-mock:
+    desc: "Run sli.robot against the mock engine (requires Robot Framework + RW libraries)."
+    cmds:
+      - |
+        SCENARIO="${MOCK_SCENARIO:-unhealthy}"
+        PORT="${MOCK_PORT:-8088}"
+        MOCK_SCENARIO="$SCENARIO" MOCK_PORT="$PORT" python3 mock/ovirt_mock.py &
+        MOCK_PID=$!
+        trap 'kill $MOCK_PID 2>/dev/null' EXIT
+        sleep 1
+        robot \
+          --variable OVIRT_ENGINE_URL:"http://localhost:${PORT}" \
+          --variable OVIRT_USERNAME:"admin@internal" \
+          --variable OVIRT_PASSWORD:"mock" \
+          --variable OVIRT_CA_CERT:"" \
+          --outputdir output-sli \
+          ../sli.robot
+    silent: true
+
   clean:
     desc: "Remove robot output directories."
     cmds:
diff --git a/codebundles/ovirt-engine-health/.test/mock/Dockerfile b/codebundles/ovirt-engine-health/.test/mock/Dockerfile
new file mode 100644
index 00000000..f4119d74
--- /dev/null
+++ b/codebundles/ovirt-engine-health/.test/mock/Dockerfile
@@ -0,0 +1,7 @@
+FROM python:3.12-slim
+WORKDIR /app
+COPY ovirt_mock.py .
+ENV MOCK_SCENARIO=unhealthy
+ENV MOCK_PORT=8080
+EXPOSE 8080
+CMD ["python3", "ovirt_mock.py"]
diff --git a/codebundles/ovirt-engine-health/.test/mock/README.md b/codebundles/ovirt-engine-health/.test/mock/README.md
new file mode 100644
index 00000000..dff49cd6
--- /dev/null
+++ b/codebundles/ovirt-engine-health/.test/mock/README.md
@@ -0,0 +1,64 @@
+# Mock oVirt engine
+
+A dependency-free (Python stdlib only) mock of the oVirt engine REST API,
+serving just the endpoints `ovirt-engine-health` calls. Use it to exercise the
+full bundle flow — SSO token → bearer auth → JSON parsing → scoring → issue
+raising — without a real engine.
+
+## What it serves
+
+| Endpoint | Returns |
+|---|---|
+| `POST /ovirt-engine/sso/oauth/token` | a static bearer token |
+| `GET /ovirt-engine/api` | product_info + summary |
+| `GET /ovirt-engine/api/hosts` | host list |
+| `GET /ovirt-engine/api/vms` | VM list |
+| `GET /ovirt-engine/api/storagedomains` | storage domains |
+| `GET /ovirt-engine/api/clusters` | clusters |
+| `GET /ovirt-engine/api/events` | error/alert events + a warning (client filters it) |
+| `GET /ovirt-engine/api/vms/{vm_id}/snapshots` | snapshots (active + optionally stale) |
+
+Event times and snapshot dates are generated relative to "now", so the bundle's
+lookback/age windows behave realistically.
+
+## Scenarios
+
+Set `MOCK_SCENARIO`:
+- **`unhealthy`** (default) — a non-operational host, a paused VM, a near-full
+  data domain + an errored ISO domain, a cluster with a down host, error/alert
+  events, and a stale snapshot. The runbook raises issues; the SLI score < 1.
+- **`healthy`** — everything nominal. SLI score == 1, no issues.
+
+## Run it
+
+Via Taskfile (from `.test/`):
+
+```bash
+task test-mock                        # start mock, run all check scripts, tear down
+task test-mock MOCK_SCENARIO=healthy  # healthy variant
+task mock                             # run mock in foreground on :8080
+task run-sli-mock                     # run sli.robot against the mock (needs RW libs)
+```
+
+Directly:
+
+```bash
+MOCK_SCENARIO=unhealthy MOCK_PORT=8080 python3 ovirt_mock.py
+# then, in another shell:
+export OVIRT_ENGINE_URL=http://localhost:8080 OVIRT_USERNAME=admin@internal OVIRT_PASSWORD=mock
+../../host_status.sh | jq .
+```
+
+Via Docker:
+
+```bash
+docker build -t ovirt-mock .
+docker run --rm -p 8080:8080 -e MOCK_SCENARIO=unhealthy ovirt-mock
+```
+
+## Limitation
+
+This mock reflects the **documented v4 API shape** the scripts assume — it does
+**not** prove a real engine returns identical field names/types (e.g. whether
+`event.time` is epoch-ms or ISO). For that, validate once against a real engine
+or capture fixtures from one.
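The limitation noted in the mock README (whether `event.time` arrives as epoch-ms or as an ISO string) could be absorbed by normalizing timestamps before comparing them to a cutoff. The following is only a sketch of that idea; `to_epoch_ms` is a hypothetical helper that does not exist in the bundle, and treating timezone-less ISO strings as UTC is an assumption, not documented engine behavior.

```python
from datetime import datetime, timezone


def to_epoch_ms(value):
    """Normalize a timestamp to epoch milliseconds, or return None.

    Hypothetical helper. Accepts a number or digit string (assumed to already
    be epoch-ms) or an ISO-8601 string such as "2023-11-14T22:13:20Z".
    """
    if isinstance(value, bool):
        return None  # bool is an int subclass; never a valid timestamp
    if isinstance(value, (int, float)):
        return int(value)
    if isinstance(value, str):
        text = value.strip()
        if text.lstrip("-").isdigit():
            return int(text)
        try:
            # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
            parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            return None
        if parsed.tzinfo is None:
            # Assumption: a naive timestamp is interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return int(parsed.timestamp() * 1000)
    return None
```

Returning `None` for unparseable input mirrors the scripts' existing habit of treating a bad date as "no date" (the jq filter falls back to `0` and then skips the snapshot).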
diff --git a/codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py b/codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py
new file mode 100644
index 00000000..41c66c1c
--- /dev/null
+++ b/codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+"""Minimal mock of the oVirt engine REST API for testing ovirt-engine-health.
+
+Serves just the endpoints the codebundle calls, with v4-shaped JSON:
+  POST /ovirt-engine/sso/oauth/token     -> bearer token
+  GET  /ovirt-engine/api                 -> product_info + summary
+  GET  /ovirt-engine/api/hosts
+  GET  /ovirt-engine/api/vms
+  GET  /ovirt-engine/api/storagedomains
+  GET  /ovirt-engine/api/clusters
+  GET  /ovirt-engine/api/events          -> error/alert + a filtered warning
+  GET  /ovirt-engine/api/vms/{vm_id}/snapshots
+
+No external dependencies (Python stdlib only). Timestamps are generated
+relative to "now" so the bundle's time-window filters (events lookback, stale
+snapshot age) behave realistically.
+
+Scenario is chosen with the MOCK_SCENARIO env var:
+  unhealthy (default) -> problems in every category, so the runbook raises
+                         issues and the SLI score drops below 1.
+  healthy             -> everything nominal, SLI score == 1, no issues.
+
+Run: MOCK_SCENARIO=unhealthy python3 ovirt_mock.py  # listens on :8080
+"""
+
+import json
+import os
+import time
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from urllib.parse import urlparse
+
+PORT = int(os.environ.get("MOCK_PORT", "8080"))
+SCENARIO = os.environ.get("MOCK_SCENARIO", "unhealthy").lower()
+
+MS = 1000
+
+
+def now_ms():
+    return int(time.time() * MS)
+
+
+def root_summary():
+    return {
+        "product_info": {"name": "oVirt", "version": {"full_version": "4.5.4-1.el8"}},
+        "summary": {
+            "vms": {"total": 3},
+            "hosts": {"total": 3},
+            "storage_domains": {"total": 2},
+        },
+    }
+
+
+def clusters():
+    return {"cluster": [{"name": "Default", "id": "cluster-1"}]}
+
+
+def hosts():
+    if SCENARIO == "healthy":
+        host_list = [
+            {"name": "host-01", "id": "host-1", "status": "up",
+             "cluster": {"id": "cluster-1"}, "address": "10.0.0.11"},
+            {"name": "host-02", "id": "host-2", "status": "up",
+             "cluster": {"id": "cluster-1"}, "address": "10.0.0.12"},
+        ]
+    else:
+        host_list = [
+            {"name": "host-01", "id": "host-1", "status": "up",
+             "cluster": {"id": "cluster-1"}, "address": "10.0.0.11"},
+            {"name": "host-02", "id": "host-2", "status": "non_operational",
+             "cluster": {"id": "cluster-1"}, "address": "10.0.0.12"},
+            {"name": "host-03", "id": "host-3", "status": "maintenance",
+             "cluster": {"id": "cluster-1"}, "address": "10.0.0.13"},
+        ]
+    return {"host": host_list}
+
+
+def vms():
+    if SCENARIO == "healthy":
+        vm_list = [
+            {"name": "web-01", "id": "vm-1", "status": "up",
+             "cluster": {"id": "cluster-1"}, "host": {"id": "host-1"}},
+            {"name": "batch-01", "id": "vm-2", "status": "down",
+             "cluster": {"id": "cluster-1"}},
+        ]
+    else:
+        vm_list = [
+            {"name": "web-01", "id": "vm-1", "status": "up",
+             "cluster": {"id": "cluster-1"}, "host": {"id": "host-1"}},
+            {"name": "db-01", "id": "vm-2", "status": "paused",
+             "cluster": {"id": "cluster-1"}, "host": {"id": "host-2"}},
+            {"name": "batch-01", "id": "vm-3", "status": "down",
+             "cluster": {"id": "cluster-1"}},
+        ]
+    return {"vm": vm_list}
+
+
+def storagedomains():
+    if SCENARIO == "healthy":
+        sd_list = [
+            {"name": "data", "id": "sd-1", "type": "data",
+             "external_status": "ok", "available": 900 * 10**9, "used": 100 * 10**9},
+            {"name": "iso", "id": "sd-2", "type": "iso",
+             "external_status": "ok", "available": 40 * 10**9, "used": 10 * 10**9},
+        ]
+    else:
+        sd_list = [
+            # ~5% free -> below default 10% threshold
+            {"name": "data", "id": "sd-1", "type": "data",
+             "external_status": "ok", "available": 5 * 10**9, "used": 95 * 10**9},
+            {"name": "iso", "id": "sd-2", "type": "iso",
+             "external_status": "error", "available": 40 * 10**9, "used": 10 * 10**9},
+        ]
+    return {"storage_domain": sd_list}
+
+
+def events():
+    if SCENARIO == "healthy":
+        ev_list = [
+            {"id": "ev-100", "severity": "normal", "time": now_ms() - 120 * MS,
+             "code": 30, "description": "VM web-01 started"},
+        ]
+    else:
+        ev_list = [
+            {"id": "ev-1", "severity": "error", "time": now_ms() - 300 * MS,
+             "code": 119, "description": "VM db-01 has paused due to a storage I/O error",
+             "vm": {"name": "db-01"}, "storage_domain": {"name": "data"}},
+            {"id": "ev-2", "severity": "alert", "time": now_ms() - 600 * MS,
+             "code": 9000, "description": "Host host-02 became non-operational",
+             "host": {"name": "host-02"}},
+            # a warning the client-side filter should drop
+            {"id": "ev-3", "severity": "warning", "time": now_ms() - 90 * MS,
+             "code": 50, "description": "High memory usage on host-01",
+             "host": {"name": "host-01"}},
+        ]
+    return {"event": ev_list}
+
+
+def snapshots(vm_id):
+    active = {"id": f"{vm_id}-active", "snapshot_type": "active",
+              "description": "Active VM", "date": now_ms()}
+    if SCENARIO != "healthy" and vm_id == "vm-1":
+        stale = {"id": f"{vm_id}-snap-1", "snapshot_type": "regular",
+                 "description": "pre-upgrade", "date": now_ms() - 30 * 86400 * MS}
+        return {"snapshot": [active, stale]}
+    return {"snapshot": [active]}
+
+
+class Handler(BaseHTTPRequestHandler):
+    def log_message(self, format, *args):  # quiet by default
+        pass
+
+    def _send(self, payload, code=200):
+        body = json.dumps(payload).encode("utf-8")
+        self.send_response(code)
+        self.send_header("Content-Type", "application/json")
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+
+    def do_POST(self):
+        path = urlparse(self.path).path
+        if path == "/ovirt-engine/sso/oauth/token":
+            self._send({
+                "access_token": "mock-token-abc123",
+                "token_type": "bearer",
+                "scope": "ovirt-app-api",
+                "exp": "9999999999999",
+            })
+            return
+        self._send({"error": f"unhandled POST {path}"}, 404)
+
+    def do_GET(self):
+        path = urlparse(self.path).path.rstrip("/")
+        api = "/ovirt-engine/api"
+        if path == api:
+            self._send(root_summary())
+        elif path == f"{api}/hosts":
+            self._send(hosts())
+        elif path == f"{api}/vms":
+            self._send(vms())
+        elif path == f"{api}/storagedomains":
+            self._send(storagedomains())
+        elif path == f"{api}/clusters":
+            self._send(clusters())
+        elif path == f"{api}/events":
+            self._send(events())
+        elif path.startswith(f"{api}/vms/") and path.endswith("/snapshots"):
+            vm_id = path[len(f"{api}/vms/"):-len("/snapshots")]
+            self._send(snapshots(vm_id))
+        else:
+            self._send({"error": f"unhandled GET {path}"}, 404)
+
+
+def main():
+    server = ThreadingHTTPServer(("0.0.0.0", PORT), Handler)
+    print(f"oVirt mock listening on :{PORT} (scenario={SCENARIO})", flush=True)
+    try:
+        server.serve_forever()
+    except KeyboardInterrupt:
+        server.shutdown()
+
+
+if __name__ == "__main__":
+    main()
From 032e5fa1959f828ef6df2752e141fd312a9dec39 Mon Sep 17 00:00:00 2001
From: Prathamesh Lohakare
Date: Wed, 3 Jun 2026 15:34:51 +0530
Subject: [PATCH 4/7] Make OVIRT_CA_CERT optional so the suite runs without a CA cert

Importing OVIRT_CA_CERT as a required secret failed the entire suite whenever
no CA cert was configured (the common system-trust-store case).
Mark the import optional in both robots; when unset, Run Bash File skips the
non-Secret value and ovirt_auth.sh falls back to the system trust store. Guard
the secret entry in the SLI/taskset templates with an
{% if custom.ovirt_ca_cert %} so it is only referenced when provided.

Verified end-to-end against the mock (with RW.Core/RW.platform installed):
- SLI healthy -> composite 1.0; unhealthy -> 0.14 with correct sub-scores
- Runbook unhealthy -> 7 issues (engine-reachable branch correctly skipped),
  healthy -> 0 issues
- Both robots run cleanly with OVIRT_CA_CERT entirely absent

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../.runwhen/generation-rules/ovirt-engine-health.yaml   | 0
 .../.runwhen/templates/ovirt-engine-health-sli.yaml      | 2 ++
 .../.runwhen/templates/ovirt-engine-health-slx.yaml      | 0
 .../.runwhen/templates/ovirt-engine-health-taskset.yaml  | 2 ++
 codebundles/ovirt-engine-health/.test/.gitignore         | 0
 codebundles/ovirt-engine-health/.test/README.md          | 0
 codebundles/ovirt-engine-health/.test/Taskfile.yaml      | 0
 codebundles/ovirt-engine-health/.test/mock/Dockerfile    | 0
 codebundles/ovirt-engine-health/.test/mock/README.md     | 0
 codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py | 0
 codebundles/ovirt-engine-health/README.md                | 0
 codebundles/ovirt-engine-health/runbook.robot            | 4 ++--
 codebundles/ovirt-engine-health/sli.robot                | 4 ++--
 13 files changed, 8 insertions(+), 4 deletions(-)
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/.gitignore
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/Taskfile.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/mock/Dockerfile
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/mock/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py
 mode change 100644 => 100755 codebundles/ovirt-engine-health/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/runbook.robot
 mode change 100644 => 100755 codebundles/ovirt-engine-health/sli.robot

diff --git a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
old mode 100644
new mode 100755
index fa781da8..6fcfbfa8
--- a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
+++ b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
@@ -34,8 +34,10 @@ spec:
         workspaceKey: {{custom.ovirt_username}}
       - name: OVIRT_PASSWORD
         workspaceKey: {{custom.ovirt_password}}
+      {% if custom.ovirt_ca_cert %}
       - name: OVIRT_CA_CERT
         workspaceKey: {{custom.ovirt_ca_cert}}
+      {% endif %}
   alertConfig:
     tasks:
       persona: eager-edgar
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
old mode 100644
new mode 100755
index 20e50637..6f751bae
--- a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
+++ b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
@@ -28,5 +28,7 @@ spec:
         workspaceKey: {{custom.ovirt_username}}
       - name: OVIRT_PASSWORD
         workspaceKey: {{custom.ovirt_password}}
+      {% if custom.ovirt_ca_cert %}
       - name: OVIRT_CA_CERT
         workspaceKey: {{custom.ovirt_ca_cert}}
+      {% endif %}
diff --git a/codebundles/ovirt-engine-health/.test/.gitignore b/codebundles/ovirt-engine-health/.test/.gitignore
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/README.md b/codebundles/ovirt-engine-health/.test/README.md
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/Taskfile.yaml b/codebundles/ovirt-engine-health/.test/Taskfile.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/mock/Dockerfile b/codebundles/ovirt-engine-health/.test/mock/Dockerfile
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/mock/README.md b/codebundles/ovirt-engine-health/.test/mock/README.md
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py b/codebundles/ovirt-engine-health/.test/mock/ovirt_mock.py
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/README.md b/codebundles/ovirt-engine-health/README.md
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/runbook.robot b/codebundles/ovirt-engine-health/runbook.robot
old mode 100644
new mode 100755
index 34db083e..2bf037e1
--- a/codebundles/ovirt-engine-health/runbook.robot
+++ b/codebundles/ovirt-engine-health/runbook.robot
@@ -281,10 +281,10 @@ Suite Initialization
     ...    example=changeme
     ${OVIRT_CA_CERT}=    RW.Core.Import Secret    OVIRT_CA_CERT
     ...    type=string
-    ...    description=Optional PEM CA bundle to verify the engine TLS certificate. Leave blank to use the system trust store.
+    ...    description=Optional PEM CA bundle to verify the engine TLS certificate. Leave unset to use the system trust store.
     ...    pattern=\w*
     ...    example=-----BEGIN CERTIFICATE-----...
-    ...    default=
+    ...    optional=${True}
     ${OVIRT_STORAGE_FREE_PCT}=    RW.Core.Import User Variable    OVIRT_STORAGE_FREE_PCT
     ...    type=string
     ...    description=Minimum free space percentage per storage domain before it is considered unhealthy.
diff --git a/codebundles/ovirt-engine-health/sli.robot b/codebundles/ovirt-engine-health/sli.robot
old mode 100644
new mode 100755
index a6337a28..778e0dd1
--- a/codebundles/ovirt-engine-health/sli.robot
+++ b/codebundles/ovirt-engine-health/sli.robot
@@ -189,10 +189,10 @@ Suite Initialization
     ...    example=changeme
     ${OVIRT_CA_CERT}=    RW.Core.Import Secret    OVIRT_CA_CERT
     ...    type=string
-    ...    description=Optional PEM CA bundle to verify the engine TLS certificate. Leave blank to use the system trust store.
+    ...    description=Optional PEM CA bundle to verify the engine TLS certificate. Leave unset to use the system trust store.
     ...    pattern=\w*
     ...    example=-----BEGIN CERTIFICATE-----...
-    ...    default=
+    ...    optional=${True}
     ${OVIRT_STORAGE_FREE_PCT}=    RW.Core.Import User Variable    OVIRT_STORAGE_FREE_PCT
     ...    type=string
     ...    description=Minimum free space percentage per storage domain before it is considered unhealthy.
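The 0.14 composite reported for the unhealthy mock in PATCH 4/7 is consistent with a plain mean over seven 0/1 sub-scores where only the engine-reachability check passes (1/7 ≈ 0.14). The sketch below illustrates that arithmetic only; the mean is an assumption about how `sli.robot` aggregates, and the sub-score names are hypothetical labels borrowed from the check script filenames.

```python
def composite_score(sub_scores):
    """Return the mean of per-check scores (each 0 or 1); 0.0 if there are none.

    Assumption: the SLI aggregates with an unweighted mean. The actual
    aggregation lives in sli.robot, which is not shown in this series.
    """
    values = list(sub_scores.values())
    if not values:
        return 0.0
    return sum(values) / len(values)


# Hypothetical sub-score names, taken from the check script filenames.
unhealthy = {
    "engine_health": 1,    # engine reachable, so this check passes
    "host_status": 0,
    "vm_status": 0,
    "storage_domains": 0,
    "cluster_health": 0,
    "recent_events": 0,
    "stale_snapshots": 0,
}

print(round(composite_score(unhealthy), 2))  # -> 0.14
```

With every sub-score at 1 the same function returns 1.0, matching the healthy-scenario result quoted in the commit message.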
From 1e4b7d2826819fbe2bf6d557f014796400deea38 Mon Sep 17 00:00:00 2001
From: Prathamesh Lohakare
Date: Wed, 3 Jun 2026 15:35:34 +0530
Subject: [PATCH 5/7] Normalize file permissions (644 for yaml/md/robot, 755 for scripts)

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../.runwhen/generation-rules/ovirt-engine-health.yaml  | 0
 .../.runwhen/templates/ovirt-engine-health-sli.yaml     | 0
 .../.runwhen/templates/ovirt-engine-health-slx.yaml     | 0
 .../.runwhen/templates/ovirt-engine-health-taskset.yaml | 0
 codebundles/ovirt-engine-health/.test/.gitignore        | 0
 codebundles/ovirt-engine-health/.test/README.md         | 0
 codebundles/ovirt-engine-health/.test/Taskfile.yaml     | 0
 codebundles/ovirt-engine-health/.test/mock/Dockerfile   | 0
 codebundles/ovirt-engine-health/.test/mock/README.md    | 0
 codebundles/ovirt-engine-health/README.md               | 0
 codebundles/ovirt-engine-health/runbook.robot           | 0
 codebundles/ovirt-engine-health/sli.robot               | 0
 12 files changed, 0 insertions(+), 0 deletions(-)
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.test/.gitignore
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.test/README.md
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.test/Taskfile.yaml
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.test/mock/Dockerfile
 mode change 100755 => 100644 codebundles/ovirt-engine-health/.test/mock/README.md
 mode change 100755 => 100644 codebundles/ovirt-engine-health/README.md
 mode change 100755 => 100644 codebundles/ovirt-engine-health/runbook.robot
 mode change 100755 => 100644 codebundles/ovirt-engine-health/sli.robot

diff --git a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.test/.gitignore b/codebundles/ovirt-engine-health/.test/.gitignore
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.test/README.md b/codebundles/ovirt-engine-health/.test/README.md
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.test/Taskfile.yaml b/codebundles/ovirt-engine-health/.test/Taskfile.yaml
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.test/mock/Dockerfile b/codebundles/ovirt-engine-health/.test/mock/Dockerfile
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/.test/mock/README.md b/codebundles/ovirt-engine-health/.test/mock/README.md
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/README.md b/codebundles/ovirt-engine-health/README.md
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/runbook.robot b/codebundles/ovirt-engine-health/runbook.robot
old mode 100755
new mode 100644
diff --git a/codebundles/ovirt-engine-health/sli.robot b/codebundles/ovirt-engine-health/sli.robot
old mode 100755
new mode 100644
From ffeecc215e3da64153881e5efccb4416ae2d3eaa Mon Sep 17 00:00:00 2001
From: Prathamesh Lohakare
Date: Wed, 3 Jun 2026 21:26:21 +0530
Subject: [PATCH 6/7] Remove empty oVirt generation-rules file (silences upload warning)

oVirt is not a RunWhen-discoverable platform type, so a generation rule has
nothing to match and the commented placeholder file only produced a
'generation rules file does not contain any data' warning during workspace
upload. Remove it; document that the SLX is created directly from the
templates (config + secrets) rather than auto-generated.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../generation-rules/ovirt-engine-health.yaml | 30 -------------------
 .../templates/ovirt-engine-health-sli.yaml    |  0
 .../templates/ovirt-engine-health-slx.yaml    |  0
 .../ovirt-engine-health-taskset.yaml          |  0
 .../ovirt-engine-health/.test/.gitignore      |  0
 .../ovirt-engine-health/.test/README.md       |  0
 .../ovirt-engine-health/.test/Taskfile.yaml   |  0
 .../ovirt-engine-health/.test/mock/Dockerfile |  0
 .../ovirt-engine-health/.test/mock/README.md  |  0
 codebundles/ovirt-engine-health/README.md     | 19 +++++++-----
 codebundles/ovirt-engine-health/runbook.robot |  0
 codebundles/ovirt-engine-health/sli.robot     |  0
 12 files changed, 12 insertions(+), 37 deletions(-)
 delete mode 100644 codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/.gitignore
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/Taskfile.yaml
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/mock/Dockerfile
 mode change 100644 => 100755 codebundles/ovirt-engine-health/.test/mock/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/README.md
 mode change 100644 => 100755 codebundles/ovirt-engine-health/runbook.robot
 mode change 100644 => 100755 codebundles/ovirt-engine-health/sli.robot

diff --git a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
deleted file mode 100644
index 9ce1f39d..00000000
--- a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# oVirt is not a RunWhen-discovered platform type, so SLXs for this bundle are
-# not auto-generated from a cloud resource match. Create the SLX from the
-# templates in ../templates/ using config (OVIRT_ENGINE_URL) and workspace
-# secrets (OVIRT_USERNAME, OVIRT_PASSWORD, optional OVIRT_CA_CERT).
-#
-# The block below is the template for how a generation rule would look if oVirt
-# were ever added as a discoverable platform type.
-#
-# apiVersion: runwhen.com/v1
-# kind: GenerationRules
-# spec:
-#   platform: ovirt
-#   generationRules:
-#     - resourceTypes:
-#         - ovirt_engine
-#       matchRules:
-#         - type: pattern
-#           pattern: ".+"
-#           properties: [name]
-#           mode: substring
-#       slxs:
-#         - baseName: ovirt-engine-health
-#           qualifiers: ["resource"]
-#           baseTemplateName: ovirt-engine-health
-#           levelOfDetail: detailed
-#           outputItems:
-#             - type: slx
-#             - type: sli
-#             - type: runbook
-#               templateName: ovirt-engine-health-taskset.yaml
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-sli.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-slx.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml b/codebundles/ovirt-engine-health/.runwhen/templates/ovirt-engine-health-taskset.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/.gitignore b/codebundles/ovirt-engine-health/.test/.gitignore
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/README.md b/codebundles/ovirt-engine-health/.test/README.md
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/Taskfile.yaml b/codebundles/ovirt-engine-health/.test/Taskfile.yaml
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/mock/Dockerfile b/codebundles/ovirt-engine-health/.test/mock/Dockerfile
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/.test/mock/README.md b/codebundles/ovirt-engine-health/.test/mock/README.md
old mode 100644
new mode 100755
diff --git a/codebundles/ovirt-engine-health/README.md b/codebundles/ovirt-engine-health/README.md
old mode 100644
new mode 100755
index 1a891128..3f078262
--- a/codebundles/ovirt-engine-health/README.md
+++ b/codebundles/ovirt-engine-health/README.md
@@ -51,13 +51,18 @@ export OVIRT_ENGINE_NAME="prod-ovirt"
 > system trust store is used (the request will fail if the cert is not trusted).
 
 ## Discovery
-oVirt has no RunWhen-discovered cloud resource type, so SLXs for this bundle are
-created from the workspace config index rather than auto-discovered from a cloud
-provider. The templates under `.runwhen/templates/` define the SLX, SLI, and
-runbook; the generation rule under `.runwhen/generation-rules/` wires them to
-the config index. Provide the `OVIRT_ENGINE_URL` config value and the
-`OVIRT_USERNAME` / `OVIRT_PASSWORD` (and optional `OVIRT_CA_CERT`) workspace
-secrets.
+oVirt is **not** a RunWhen-discoverable platform type, so there is no generation
+rule (a generation rule must match a cloud resource type, and oVirt has none).
+The SLX is therefore created directly rather than auto-generated during workspace
+discovery.
+
+The templates under `.runwhen/templates/` (SLX, SLI, runbook) are provided as the
+content to commit. Render them with:
+- config: `OVIRT_ENGINE_URL`, `OVIRT_ENGINE_NAME`
+- secrets: `OVIRT_USERNAME`, `OVIRT_PASSWORD`, and optionally `OVIRT_CA_CERT`
+
+and commit the resulting SLX/SLI/runbook to your workspace (e.g. via the RunWhen
+API or `commit_slx`).
## Testing See the `.test` directory for how to run the SLI and runbook against a reachable diff --git a/codebundles/ovirt-engine-health/runbook.robot b/codebundles/ovirt-engine-health/runbook.robot old mode 100644 new mode 100755 diff --git a/codebundles/ovirt-engine-health/sli.robot b/codebundles/ovirt-engine-health/sli.robot old mode 100644 new mode 100755 From 104cbf1e8459404e859367bafdc357fc33589423 Mon Sep 17 00:00:00 2001 From: Prathamesh Lohakare Date: Wed, 3 Jun 2026 21:42:47 +0530 Subject: [PATCH 7/7] Generate ovirt-engine-health SLX via Kubernetes cluster anchor oVirt has no discoverable resource of its own, so anchor the generation rule on the kubernetes 'cluster' resource purely as a trigger (qualifiers: [cluster]) -> one oVirt SLX per discovered cluster. All SLX/SLI/runbook content comes from workspaceInfo custom.* + workspace secrets, not the matched cluster. Mirrors the k8s-cluster-resource-health singleton pattern. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../generation-rules/ovirt-engine-health.yaml | 33 +++++++++++++++++++ codebundles/ovirt-engine-health/README.md | 21 ++++++------ 2 files changed, 44 insertions(+), 10 deletions(-) create mode 100644 codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml diff --git a/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml new file mode 100644 index 00000000..cba65601 --- /dev/null +++ b/codebundles/ovirt-engine-health/.runwhen/generation-rules/ovirt-engine-health.yaml @@ -0,0 +1,33 @@ +# oVirt is not itself a RunWhen-discoverable platform, so this rule anchors on the +# Kubernetes `cluster` resource purely as a generation trigger: it emits one +# oVirt engine-health SLX per discovered cluster. 
All SLX/SLI/runbook content +# comes from the workspaceInfo `custom.*` values (ovirt_engine_url, ovirt_engine_name) +# and workspace secrets (ovirt_username, ovirt_password, optional ovirt_ca_cert), +# not from the matched cluster. +# +# Requires the workspace to discover at least one Kubernetes cluster. If multiple +# clusters are discovered, one oVirt SLX is generated per cluster. +apiVersion: runwhen.com/v1 +kind: GenerationRules +spec: + platform: kubernetes + generationRules: + - resourceTypes: + - cluster + matchRules: + - type: and + matches: + - type: pattern + pattern: ".+" + properties: [name] + mode: substring + slxs: + - baseName: ovirt-engine-health + qualifiers: ["cluster"] + baseTemplateName: ovirt-engine-health + levelOfDetail: detailed + outputItems: + - type: slx + - type: sli + - type: runbook + templateName: ovirt-engine-health-taskset.yaml diff --git a/codebundles/ovirt-engine-health/README.md b/codebundles/ovirt-engine-health/README.md index 3f078262..e2dc6a98 100755 --- a/codebundles/ovirt-engine-health/README.md +++ b/codebundles/ovirt-engine-health/README.md @@ -51,18 +51,19 @@ export OVIRT_ENGINE_NAME="prod-ovirt" > system trust store is used (the request will fail if the cert is not trusted). ## Discovery -oVirt is **not** a RunWhen-discoverable platform type, so there is no generation -rule (a generation rule must match a cloud resource type, and oVirt has none). -The SLX is therefore created directly rather than auto-generated during workspace -discovery. +oVirt is **not** itself a RunWhen-discoverable platform type, so the generation +rule anchors on the Kubernetes `cluster` resource purely as a trigger: it emits +one oVirt engine-health SLX per discovered cluster. The SLX/SLI/runbook content +comes entirely from `workspaceInfo` `custom.*` values and workspace secrets, not +from the matched cluster. -The templates under `.runwhen/templates/` (SLX, SLI, runbook) are provided as the -content to commit. 
Render them with: -- config: `OVIRT_ENGINE_URL`, `OVIRT_ENGINE_NAME` -- secrets: `OVIRT_USERNAME`, `OVIRT_PASSWORD`, and optionally `OVIRT_CA_CERT` +Requirements in `workspaceInfo.yaml`: +- the workspace discovers at least one Kubernetes cluster (`kubeConfig`) +- `custom.ovirt_engine_url`, `custom.ovirt_engine_name` +- `custom.ovirt_username`, `custom.ovirt_password` (and optional + `custom.ovirt_ca_cert`) pointing at the corresponding workspace secrets -and commit the resulting SLX/SLI/runbook to your workspace (e.g. via the RunWhen -API or `commit_slx`). +If multiple clusters are discovered, one oVirt SLX is generated per cluster. ## Testing See the `.test` directory for how to run the SLI and runbook against a reachable
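
Reviewer note: the `custom.*` requirements listed in the README hunk above might look roughly like the fragment below in `workspaceInfo.yaml`. This is only a sketch. The key names come from the README text, but every value is a placeholder (the engine URL and the secret names are assumptions, not part of this patch), and the Kubernetes discovery configuration that the rule also depends on is omitted.

```yaml
# workspaceInfo.yaml (fragment): illustrative values only, not taken from the patch
custom:
  ovirt_engine_url: https://ovirt-engine.example.com  # hypothetical engine URL
  ovirt_engine_name: prod-ovirt                       # display name, as in the README's export example
  ovirt_username: ovirt-username   # assumed name of the workspace secret holding the username
  ovirt_password: ovirt-password   # assumed name of the workspace secret holding the password
  ovirt_ca_cert: ovirt-ca-cert     # optional; assumed name of the workspace secret holding the CA cert
```

Because the rule matches every discovered cluster (`pattern: ".+"` on `name`), a workspace that discovers several clusters would get one SLX per cluster, all rendered from these same values.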