Skip to content

GDC-ConsumerEdge/cluster-health-validator

Cluster Health Validator

The cluster health validator is a service that runs in-cluster and reports an aggregated signal of platform and workload health. The health is reported both as a status as a Kubernetes and as a prometheus metric. This can be used during cluster provisioning to signal to completion of the pre-staging process or as a continual sanity check of the state of a cluster.

Alternatively, the cluster health validator can run locally, useful for local troubleshooting or to use during the cluster provisioning process without requiring an in-cluster component.

Installation

This project uses a CRD and operator, and requires Cluster-Level access. The project can be deployed as a RootSync config-sync object with the following configuration. NOTE: Production use should clone the repo, make it private and use the token approach to authenticate to private repo.

# root-sync.yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: "cluster-health-validator"
  namespace: config-management-system
  annotations:
    configsync.gke.io/deletion-propagation-policy: Foreground # indicate that cascade delete is preferred
spec:
  sourceFormat: "unstructured"
  git:
    repo: "https://github.com/GDC-ConsumerEdge/cluster-health-validator.git"
    branch: "main"
    period: "24h"                                       # check for changes every day
    dir: "/config/default"
    auth: "none"                                        # Production use, use "token" after forking repo
    #auth: "token"
    #secretRef:
    #  name: "git-creds"

Configuration

Cluster Health Validator allows customization for which platform and workload health checks are performed. This is specified as part of the ConfigMap as part of the deployment.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: health-check-config
data:
  config.yaml: |
    platform_checks:
    - name: Node Health
      module: CheckNodes
    - name: Robin Cluster Health
      module: CheckRobinCluster
    - name: Root Sync Check
      module: CheckRootSyncs
    - name: VMRuntime Check
      module: CheckVMRuntime

    workload_checks:
    - name: VM Workloads Health
      module: CheckVirtualMachines
      parameters:
        namespace: vm-workloads
        count: 4
    - name: VM Data Volume Health
      module: CheckDataVolumes
      parameters:
        namespace: vm-workloads
        count: 4
    - name: HTTP Endpoints
      module: CheckHttpEndpoints
      parameters:
        endpoints:
        - name: Google
          url: https://www.google.com
        - name: Kubernetes API
          url: https://kubernetes.default.svc
          timeout: 5

Below details the health check modules available as part of the solution, with some requiring parameters:

Module Description Parameters
CheckNodes Checks Kubernetes Node Health
CheckGoogleGroupRBAC Checks that Google Group RBAC has been enabled
CheckRobinCluster Checks RobinCluster Health
CheckRootSyncs Checks that RootSyncs are synced and have completed reconciling
CheckVMRuntime Checks that VMRuntime is Ready, without any preflight failure
CheckVirtualMachines Checks that the expected # of VMs are in a Running State namespace: namespace to run check against
count: (Optional) expected # of VMs
CheckDataVolumes Checks that the expected # of Data Volumes are 100% imported and ready namespace: namespace to run check against
count: (Optional) expected # of DVs
CheckHttpEndpoints Checks that a list of HTTP endpoints are reachable and return a successful status code endpoints: A list of HTTP endpoints to check. Each endpoint has the following parameters:
  • name: The name of the endpoint
  • url: The URL of the endpoint
  • timeout: (Optional) The timeout in seconds for the request
  • method: (Optional) The HTTP method to use (e.g. 'GET', 'POST')

on_failure property

Each health check module supports an on_failure property that allows you to control the behavior of the health check when it fails. The on_failure property can be set to one of two values:

  • fail (default): If the health check fails, the entire group of checks (platform or workload) will be considered failed.
  • ignore: If the health check fails, the failure will be logged and tracked in metrics, but it will not affect the overall health status of the group.

This is useful for non-critical health checks that you want to monitor but not have affect the overall health status.

Example:

platform_checks:
- name: Node Health
  module: CheckNodes
- name: Robin Cluster Health
  module: CheckRobinCluster
  on_failure: ignore

workload_checks:
- name: VM Workloads Health
  module: CheckVirtualMachines
  parameters:
    namespace: vm-workloads
  on_failure: fail
- name: VM Disk Health
  module: CheckVirtualMachineDisks
  on_failure: ignore

Building the image

IMAGE_TAG=gcr.io/${PROJECT_ID}/cluster-health-validator:1.0.0
docker build -t ${IMAGE_TAG} .
docker push ${IMAGE_TAG}

Local Usage

python3 -m venv .venv
source .venv/bin/activate
pip install -r app/requirements.txt

python3 app --help
usage: app [-h] [--health-check HEALTH_CHECK [HEALTH_CHECK ...]] [-v | -q] [-w] [-i INTERVAL] [-t TIMEOUT]

options:
  -h, --help            show this help message and exit
  --health-check HEALTH_CHECK [HEALTH_CHECK ...]
                        Set a health check to perform. For health checks requiring parameters, pass them in a key=value format as additional arguments. Example: --health-check
                        checkvirtualmachines namespace=vm-workloads count=3
  -v, --verbose         increase output verbosity; -vv for max verbosity
  -q, --quiet           output errors only
  -w, --wait            wait for health checks to pass before exiting
  -i INTERVAL, --interval INTERVAL
                        interval to poll passing health checks
  -t TIMEOUT, --timeout TIMEOUT
                        Overall timeout for health checks to pass

Examples:

# Run the default health checks (CheckNodes, CheckRootSyncs, CheckRobinCluster)
python3 app

# Run customized health checks
python3 app --health-check checknodes \
            --health-check checkrobincluster \
            --health-check checkrootsyncs \
            --health-check checkgooglegrouprbac \
            --health-check checkvirtualmachines namespace=vm-workloads count=3 \
            --health-check checkdatavolumes namespace=vm-workloads count=3

# Run default health checks and wait until all health checks pass.
#   Timeout after 1 hour if health checks don't pass
python3 app --wait --interval 60 --timeout 3600

About

No description, website, or topics provided.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors 6