
Add ovirt-engine-health codebundle #682

Open: theyashl wants to merge 7 commits into main from ovirt-engine-health-codebundle


theyashl commented Jun 3, 2026


Summary

Adds a new ovirt-engine-health CodeBundle that monitors the health of an oVirt virtualization environment (oVirt / Red Hat Virtualization / Oracle Linux Virtualization Manager) via the oVirt engine REST API (/ovirt-engine/api).

Modeled on the existing jenkins-health bundle: an SLI that emits a composite 0–1 health score and a runbook that raises an actionable issue per problem, both backed by bash scripts using curl+jq.

Checks (7, shared across sli.robot and runbook.robot)

  1. Engine reachability — API responds + SSO token obtainable (sev 1)
  2. Host status — hypervisor hosts in non-operational states (sev 2; maintenance reported, not failed)
  3. VM status — paused / unknown / not-responding VMs (sev 2)
  4. Storage domain capacity — inactive domains or below free-space % threshold (sev 2)
  5. Cluster health — clusters with hosts down (sev 3)
  6. Recent critical events — error/alert events within a lookback window (sev 3)
  7. Stale VM snapshots — snapshots older than a max-age threshold (sev 4)
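For reviewers, a rough Python illustration of the kind of logic check 4 describes. The bundle itself uses bash with curl+jq. The field names (`status`, `available`, `used`), the 10% default threshold, and the function name are assumptions for illustration and have not been checked against a live engine.

```python
def storage_domain_problem(domain, min_free_percent=10.0):
    """Return a short problem description, or None if the domain looks healthy.

    `domain` is a dict shaped like one storage-domain entry from the API.
    The keys used here are assumed, not confirmed against a live engine.
    """
    status = domain.get("status", "active")
    if status != "active":
        return f"{domain['name']}: status is {status}"

    # The API may report sizes as strings, so coerce to int before the math.
    available = int(domain.get("available", 0))
    used = int(domain.get("used", 0))
    total = available + used
    if total == 0:
        return None  # no capacity data reported; do not guess

    free_percent = 100.0 * available / total
    if free_percent < min_free_percent:
        return (
            f"{domain['name']}: {free_percent:.1f}% free "
            f"(threshold {min_free_percent:.1f}%)"
        )
    return None
```

A domain is flagged either for being inactive or for dropping below the free-space threshold. A domain with no capacity data is skipped rather than reported as full.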

The SLI adds a Generate ... Health Score task that averages the 7 sub-scores.
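The averaging step can be sketched in Python as below. This only illustrates the arithmetic; the real task lives in sli.robot. The check keys and the rounding to two decimals are assumptions.

```python
# Hypothetical keys for the seven checks listed above.
CHECKS = (
    "engine_reachability",
    "host_status",
    "vm_status",
    "storage_domain_capacity",
    "cluster_health",
    "recent_critical_events",
    "stale_vm_snapshots",
)


def composite_score(sub_scores):
    """Average the seven 0-1 sub-scores into one 0-1 health score."""
    missing = [name for name in CHECKS if name not in sub_scores]
    if missing:
        raise ValueError(f"missing sub-scores: {', '.join(missing)}")
    total = sum(float(sub_scores[name]) for name in CHECKS)
    return round(total / len(CHECKS), 2)
```

Under this sketch, seven passing checks give 1.0, and a single passing check out of seven gives 0.14.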

Key decisions

  • Auth: SSO bearer token via /ovirt-engine/sso/oauth/token (grant_type=password, scope=ovirt-app-api), shared ovirt_auth.sh helper.
  • TLS: optional OVIRT_CA_CERT → curl --cacert, otherwise the system trust store (self-signed engine certs are common).
  • Discovery: oVirt is not a RunWhen-discoverable platform type, so — following the existing gh-actions-health precedent — the generation rule is a commented "how it would look" template and the README is explicit that SLXs are config-driven, not auto-discovered.
  • Robust timestamp parsing (try tonumber catch 0) so an unexpected date format can't crash a check or produce false-positive stale snapshots.
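To make the Auth and TLS bullets concrete, here is a minimal Python sketch of how the token request could be assembled. It is not the contents of ovirt_auth.sh. The function name, the `username`/`password` form fields, and the example hostname are assumptions; the endpoint path, `grant_type`, `scope`, and the `--cacert` flag come from the description above.

```python
from urllib.parse import urlencode


def token_request(engine_url, username, password, ca_cert=None):
    """Build the pieces of the SSO token call without sending anything.

    Returns (url, form_body, curl_tls_args). An unset ca_cert yields no
    extra curl arguments, so curl falls back to the system trust store.
    """
    url = engine_url.rstrip("/") + "/ovirt-engine/sso/oauth/token"
    form_body = urlencode(
        {
            "grant_type": "password",
            "scope": "ovirt-app-api",
            "username": username,  # assumed form field name
            "password": password,  # assumed form field name
        }
    )
    curl_tls_args = ["--cacert", ca_cert] if ca_cert else []
    return url, form_body, curl_tls_args
```

Keeping the CA certificate as an optional argument means the same code path serves both the custom-CA and system-trust-store cases.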

Testing

  • .test/ ships a lightweight Taskfile (check-config, smoke-scripts, run-sli, run-runbook) + README — no infra provisioning, since oVirt is self-hosted.
  • Verified locally: bash -n clean on all 8 scripts; all 7 jq filters validated against representative payloads; Robot dry-run parsed every task with no syntax errors (only the RW platform libraries are unavailable locally, as expected).

⚠️ Not yet run against a live oVirt engine. Exact JSON field shapes (host/VM state strings, event.time/snapshot.date epoch-ms vs ISO, the events search grammar) are based on the v4 API docs + defensive parsing — noted as "verify on a live engine" in the design spec's Open Risks. cd .test && task smoke-scripts against a lab engine is the quickest confirmation.
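Since the epoch-ms vs ISO question is still open, the defensive parsing idea can be illustrated in Python as follows. The bundle does this in jq with `try tonumber catch 0`. The ISO fallback, the function names, and the rule that a 0 sentinel is never treated as stale are assumptions about how such a guard might work, not a description of the shipped scripts.

```python
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000


def to_epoch_ms(value):
    """Convert an epoch-ms number/string or an ISO 8601 string to epoch ms.

    Returns 0 when the value cannot be parsed, mirroring the 0 sentinel.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return 0
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC if naive
    return int(parsed.timestamp() * 1000)


def is_stale(snapshot_ms, now_ms, max_age_days):
    """Flag a snapshot only when its date parsed and exceeds the max age."""
    if snapshot_ms <= 0:
        return False  # unparseable date: never report as stale
    return now_ms - snapshot_ms > max_age_days * MS_PER_DAY
```

Without the `snapshot_ms <= 0` guard, an unparseable date would look like a snapshot from 1970 and always be flagged, which is the false positive the description aims to avoid.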

🤖 Generated with Claude Code

theyashl and others added 2 commits June 3, 2026 14:59
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comprehensive SLI + runbook for oVirt/RHV/OLVM environments via the engine
REST API. Authenticates with an SSO bearer token and checks engine
reachability, host status, VM status, storage domain capacity, cluster
health, recent critical events, and stale VM snapshots. Optional CA cert
for TLS verification. Includes .runwhen templates and a lightweight .test
harness (no infra provisioning, since oVirt is self-hosted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@theyashl theyashl requested a review from a team as a code owner June 3, 2026 09:44
@theyashl theyashl self-assigned this Jun 3, 2026
theyashl and others added 5 commits June 3, 2026 15:25
Stdlib-only mock REST server serving the SSO token endpoint and all 7 API
endpoints the bundle calls, with healthy/unhealthy scenarios and now-relative
timestamps so event/snapshot windows behave realistically. Wired into the
.test Taskfile (mock, test-mock, run-sli-mock) with a Dockerfile and README.
Verified all check scripts end-to-end against both scenarios.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Importing OVIRT_CA_CERT as a required secret failed the entire suite
whenever no CA cert was configured (the common system-trust-store case).
Mark the import optional in both robots; when unset, Run Bash File skips
the non-Secret value and ovirt_auth.sh falls back to the system trust
store. Guard the secret entry in the SLI/taskset templates with an
{% if custom.ovirt_ca_cert %} so it is only referenced when provided.

Verified end-to-end against the mock (with RW.Core/RW.platform installed):
- SLI healthy -> composite 1.0; unhealthy -> 0.14 with correct sub-scores
- Runbook unhealthy -> 7 issues (engine-reachable branch correctly skipped),
  healthy -> 0 issues
- Both robots run cleanly with OVIRT_CA_CERT entirely absent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
oVirt is not a RunWhen-discoverable platform type, so a generation rule has
nothing to match and the commented placeholder file only produced a
'generation rules file does not contain any data' warning during workspace
upload. Remove it; document that the SLX is created directly from the
templates (config + secrets) rather than auto-generated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
oVirt has no discoverable resource of its own, so anchor the generation rule
on the kubernetes 'cluster' resource purely as a trigger (qualifiers:
[cluster]) -> one oVirt SLX per discovered cluster. All SLX/SLI/runbook
content comes from workspaceInfo custom.* + workspace secrets, not the matched
cluster. Mirrors the k8s-cluster-resource-health singleton pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>