This project contains predefined dashboards and alerts for enterprise workloads running on GDC Connected Servers.
Dashboards and alerts can be deployed manually or with Terraform.

To deploy manually:

- `cd alerts`
- Run `./create-alerts.sh`. This deploys the alerts into your current context's project. Modify the script if notification channels are needed.
- Dashboards are stored in the `dashboards` folder and can be deployed manually.

To deploy with Terraform:

- `cd terraform`
- Copy `backend.tf.sample` to `backend.tf` and modify it to store the tfstate in the target Cloud Storage bucket.
- Run `terraform plan`, then `terraform apply`.

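The manual dashboard deployment can be scripted. The following is a minimal sketch, not part of this project: it assumes the `gcloud` CLI is installed and authenticated and that the `dashboards` folder holds one JSON definition per dashboard. The function name `deploy_dashboards` and the `DRY_RUN` variable are hypothetical; `gcloud monitoring dashboards create --config-from-file` is the standard command for creating a dashboard from a file.

```shell
#!/usr/bin/env bash
# Sketch: create one Cloud Monitoring dashboard per JSON file in a folder.
# deploy_dashboards and DRY_RUN are illustrative names, not part of this repo.
# With DRY_RUN=1 (the default) the commands are only printed, not executed.

deploy_dashboards() {
  local dir="${1:-dashboards}"   # folder containing dashboard JSON files
  local project="${2:-}"         # optional project ID; empty uses the current gcloud project
  local f

  for f in "$dir"/*.json; do
    # Skip the unexpanded glob when the folder has no JSON files.
    [ -e "$f" ] || continue

    local args=(monitoring dashboards create "--config-from-file=$f")
    if [ -n "$project" ]; then
      args+=("--project=$project")
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "gcloud ${args[*]}"
    else
      gcloud "${args[@]}"
    fi
  done
}
```

Running `deploy_dashboards dashboards` prints the commands for review; `DRY_RUN=0 deploy_dashboards dashboards` executes them against the current project.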
| Dashboard Name | Screenshot | Description | json |
|---|---|---|---|
| GDC Daily Report | ![]() | Dashboard showing node/VM availability and utilization based metrics | json |
| GDC Node View | ![]() | Dashboard showing GDC node information | json |
| GDC VM Status | ![]() | Dashboard showing GDC VM information | json |
| GDC Robin Status | ![]() | Dashboard to deep-dive into robin metrics. Note: this dashboard requires the use of the robin-health application | json |
| GDC External Secrets | ![]() | Dashboard showing External Secrets operational information | json |
| GDC VM Distribution | ![]() | Dashboard showing VM distribution by node | json |

| Alert | Category | Description | link |
|---|---|---|---|
| node-cpu-usage-high | Node | Alert when CPU usage of any node exceeds 80% | config |
| node-memory-usage-high | Node | Alert when memory usage of any node exceeds 80% | config |
| node-not-ready-30m | Node | Alert if any node is not ready for more than 30 minutes | config |
| multiple-nodes-not-ready-realtime | Node | Alert if multiple nodes are not ready at any time | config |
| api-server-error-ratio-5-percent | Control-plane | Alert if the API server has an error ratio exceeding 5% | config |
| apiserver-down | Control-plane | Alert if the API server is down | config |
| controller-manager-down | Control-plane | Alert if controller manager is down | config |
| scheduler-down | Control-plane | Alert if scheduler is down | config |
| pod-crash-looping | Pods | Alert if a pod is crash looping | config |
| pod-not-ready-1h | Pods | Alert if a pod is not ready for more than an hour | config |
| coredns-down | System | Alert if CoreDNS is down | config |
| coredns-servfail-ratio-1-percent | System | Alert if more than 1% of DNS requests return SERVFAIL | config |
| robin-master-down-10m | Storage | Alert if robin master is down for more than 10 minutes | config |
| robin-node-offline-30m | Storage | Alert if a robin node is offline for more than 30 minutes | config |
| robin-disk-inactive-10m | Storage | Alert if robin disk is inactive for more than 10 minutes | config |
| vmruntime-heartbeats-active-realtime | VMRuntime | Alert if VMRuntime heartbeats are missing | config |
| vmruntime-heartbeats-realtime | VMRuntime | Alert if VMRuntime heartbeats are 0 | config |
| vmruntime-vm-down-5m | VMRuntime | Alert if any VM is not active for more than 5 minutes | config |
| vmruntime-vm-missing-5m | VMRuntime | Alert if CPU activity for a VM is absent for more than 5 minutes | config |
| vmruntime-vm-no-network-traffic-5m | VMRuntime | Alert if there is no network activity from a VM for more than 5 minutes | config |
| externalsecrets-down-30m | ExternalSecrets | Alert if External Secrets is down for more than 30 minutes | config |
| externalsecrets-sync-error | ExternalSecrets | Alert if any ExternalSecret resources have sync errors | config |
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.





