Add ovirt-engine-health codebundle by theyashl · Pull Request #682 · runwhen-contrib/rw-cli-codecollection

theyashl · 2026-06-03T09:44:23Z

Summary

Adds a new ovirt-engine-health CodeBundle that monitors the health of an oVirt virtualization environment (oVirt / Red Hat Virtualization / Oracle Linux Virtualization Manager) via the oVirt engine REST API (/ovirt-engine/api).

Modeled on the existing jenkins-health bundle: an SLI that emits a composite 0–1 health score and a runbook that raises an actionable issue per problem, both backed by bash scripts using curl+jq.

Checks (7, shared across `sli.robot` and `runbook.robot`)

Engine reachability — API responds + SSO token obtainable (sev 1)
Host status — hypervisor hosts in non-operational states (sev 2; maintenance reported, not failed)
VM status — paused / unknown / not-responding VMs (sev 2)
Storage domain capacity — inactive domains or below free-space % threshold (sev 2)
Cluster health — clusters with hosts down (sev 3)
Recent critical events — error/alert events within a lookback window (sev 3)
Stale VM snapshots — snapshots older than a max-age threshold (sev 4)
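
The stale-snapshot check compares each snapshot's timestamp with a max-age threshold, so a timestamp that fails to parse must not be read as "very old". The following is a minimal sketch of that guard in plain POSIX shell. It is an analogue of a defensive numeric parse, not the bundle's actual jq filter, and the helper names `to_epoch_ms` and `is_stale` are hypothetical. It assumes timestamps arrive as epoch milliseconds, which the PR has not yet verified on a live engine.

```shell
#!/bin/sh
# to_epoch_ms VALUE
# Echo VALUE unchanged when it is a non-negative integer, otherwise echo 0.
# (Hypothetical helper; illustrates the "fall back to 0" idea only.)
to_epoch_ms() {
  case "${1:-}" in
    ''|*[!0-9]*) echo 0 ;;
    *) echo "$1" ;;
  esac
}

# is_stale NOW_MS SNAPSHOT_MS MAX_AGE_DAYS
# Exit 0 when the snapshot is older than MAX_AGE_DAYS, non-zero otherwise.
# The body runs in a subshell so its variables do not leak to the caller.
is_stale() (
  now_ms=$(to_epoch_ms "$1")
  snap_ms=$(to_epoch_ms "$2")
  max_age_ms=$(( $3 * 86400000 ))   # days -> milliseconds
  # A timestamp that parsed to 0 is "unknown", so it is never reported stale.
  [ "$snap_ms" -gt 0 ] && [ $(( now_ms - snap_ms )) -gt "$max_age_ms" ]
)
```

With a 30-day threshold, `is_stale 1700000000000 1690000000000 30` succeeds because the gap is roughly 115 days, while `is_stale 1700000000000 not-a-date 30` fails because the unparseable value is skipped rather than flagged.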

The SLI adds a Generate ... Health Score task that averages the 7 sub-scores.
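
As an illustration of the averaging step, here is a minimal shell sketch. The function name `composite_score` and the two-decimal output format are assumptions and are not taken from the bundle, whose Robot task may compute the mean differently.

```shell
#!/bin/sh
# composite_score SCORE...
# Print the arithmetic mean of the given sub-scores with two decimals.
# (Hypothetical helper; each sub-score is expected to lie in [0, 1].)
composite_score() {
  if [ "$#" -eq 0 ]; then
    echo "0.00"
    return 0
  fi
  # awk does the floating-point math; shell arithmetic is integer-only.
  printf '%s\n' "$@" | awk '{ sum += $1; n++ } END { printf "%.2f\n", sum / n }'
}

# One passing check out of seven:
composite_score 1 0 0 0 0 0 0   # prints 0.14
# All seven checks passing:
composite_score 1 1 1 1 1 1 1   # prints 1.00
```

An unweighted mean gives every check the same weight, so a sev 1 failure and a sev 4 failure lower the score by the same amount.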

Key decisions

Auth: SSO bearer token via /ovirt-engine/sso/oauth/token (grant_type=password, scope=ovirt-app-api), shared ovirt_auth.sh helper.
TLS: optional OVIRT_CA_CERT → curl --cacert, otherwise the system trust store (self-signed engine certs are common).
Discovery: oVirt is not a RunWhen-discoverable platform type, so — following the existing gh-actions-health precedent — the generation rule is a commented "how it would look" template and the README is explicit that SLXs are config-driven, not auto-discovered.
Timestamps: defensive parsing (`try tonumber catch 0`) so an unexpected date format can't crash a check or produce false-positive stale snapshots.
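
The auth and TLS decisions above can be sketched as a single token request. This is an illustration under stated assumptions, not the contents of ovirt_auth.sh. The function name `request_sso_token` and the variables `OVIRT_URL`, `OVIRT_USERNAME`, and `OVIRT_PASSWORD` are hypothetical. Only `OVIRT_CA_CERT`, the endpoint path, the grant type, and the scope come from this PR.

```shell
#!/bin/sh
# request_sso_token
# POST a password grant to the engine SSO endpoint and print the raw response.
# Assumed inputs (hypothetical names): OVIRT_URL, OVIRT_USERNAME, OVIRT_PASSWORD.
# Optional input (named in the PR): OVIRT_CA_CERT, a path to a CA bundle.
request_sso_token() {
  # Reuse the function's positional parameters to hold the optional TLS flags.
  set --
  if [ -n "${OVIRT_CA_CERT:-}" ]; then
    set -- --cacert "$OVIRT_CA_CERT"
  fi
  # With no --cacert flag, curl verifies against the system trust store.
  curl -sS "$@" \
    -H 'Accept: application/json' \
    --data-urlencode 'grant_type=password' \
    --data-urlencode 'scope=ovirt-app-api' \
    --data-urlencode "username=${OVIRT_USERNAME}" \
    --data-urlencode "password=${OVIRT_PASSWORD}" \
    "${OVIRT_URL}/ovirt-engine/sso/oauth/token"
}
```

A caller would typically pipe the response through `jq -r '.access_token'` and send the result as an `Authorization: Bearer` header on later `/ovirt-engine/api` requests. Building the flags conditionally keeps an unset `OVIRT_CA_CERT` from becoming an empty `--cacert` argument.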

Testing

.test/ ships a lightweight Taskfile (check-config, smoke-scripts, run-sli, run-runbook) + README — no infra provisioning, since oVirt is self-hosted.
Verified locally: bash -n clean on all 8 scripts; all 7 jq filters validated against representative payloads; Robot dry-run parsed every task with no syntax errors (only the RW platform libraries are unavailable locally, as expected).

⚠️ Not yet run against a live oVirt engine. Exact JSON field shapes (host/VM state strings, event.time/snapshot.date as epoch-ms vs ISO, the events search grammar) are based on the v4 API docs plus defensive parsing, and are noted as "verify on a live engine" in the design spec's Open Risks. Running `cd .test && task smoke-scripts` against a lab engine is the quickest confirmation.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Comprehensive SLI + runbook for oVirt/RHV/OLVM environments via the engine REST API. Authenticates with an SSO bearer token and checks engine reachability, host status, VM status, storage domain capacity, cluster health, recent critical events, and stale VM snapshots. Optional CA cert for TLS verification. Includes .runwhen templates and a lightweight .test harness (no infra provisioning, since oVirt is self-hosted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Stdlib-only mock REST server serving the SSO token endpoint and all 7 API endpoints the bundle calls, with healthy/unhealthy scenarios and now-relative timestamps so event/snapshot windows behave realistically. Wired into the .test Taskfile (mock, test-mock, run-sli-mock) with a Dockerfile and README. Verified all check scripts end-to-end against both scenarios.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Importing OVIRT_CA_CERT as a required secret failed the entire suite whenever no CA cert was configured (the common system-trust-store case). Mark the import optional in both robots; when unset, Run Bash File skips the non-Secret value and ovirt_auth.sh falls back to the system trust store. Guard the secret entry in the SLI/taskset templates with an {% if custom.ovirt_ca_cert %} so it is only referenced when provided.

Verified end-to-end against the mock (with RW.Core/RW.platform installed):
- SLI healthy -> composite 1.0; unhealthy -> 0.14 with correct sub-scores
- Runbook unhealthy -> 7 issues (engine-reachable branch correctly skipped), healthy -> 0 issues
- Both robots run cleanly with OVIRT_CA_CERT entirely absent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


oVirt is not a RunWhen-discoverable platform type, so a generation rule has nothing to match and the commented placeholder file only produced a 'generation rules file does not contain any data' warning during workspace upload. Remove it; document that the SLX is created directly from the templates (config + secrets) rather than auto-generated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

oVirt has no discoverable resource of its own, so anchor the generation rule on the kubernetes 'cluster' resource purely as a trigger (qualifiers: [cluster]) -> one oVirt SLX per discovered cluster. All SLX/SLI/runbook content comes from workspaceInfo custom.* + workspace secrets, not the matched cluster. Mirrors the k8s-cluster-resource-health singleton pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

theyashl and others added 2 commits June 3, 2026 14:59

Add design spec for ovirt-engine-health codebundle

2d1ea69

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

theyashl requested a review from a team as a code owner June 3, 2026 09:44

theyashl self-assigned this Jun 3, 2026

theyashl and others added 5 commits June 3, 2026 15:25

Normalize file permissions (644 for yaml/md/robot, 755 for scripts)

1e4b7d2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

theyashl wants to merge 7 commits into main from ovirt-engine-health-codebundle
