GDC connected servers starter cloud monitoring dashboards and alerts

GDC-ConsumerEdge/gdc-connected-servers-observability


GDC Connected Servers Enterprise Observability

Overview

This project contains predefined dashboards and alerts for enterprise workloads running on GDC Connected Servers.

Deployment Quickstart

Dashboards and alerts can be deployed through the following methods:

Option 1 - Scripted deployment

  1. cd alerts
  2. Run ./create-alerts.sh. This deploys the alerts into your current context's project. Modify the script if notification channels are needed.

Dashboards are stored in the dashboards folder and can be manually deployed.
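The manual dashboard deployment mentioned above could be scripted along these lines. This is a minimal sketch, not part of the repository: it assumes the gcloud CLI is installed and authenticated, that each JSON file in the dashboards folder holds one dashboard definition, and that the target is the project of your current gcloud configuration. The function name deploy_dashboards and the DRY_RUN variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: create one Cloud Monitoring dashboard per JSON file in a folder.
# Assumptions (not from the README): gcloud is installed and authenticated,
# and every *.json file contains a single dashboard definition.

deploy_dashboards() {
  local dir="${1:-dashboards}"
  local file
  for file in "$dir"/*.json; do
    # Skip the unexpanded glob when the folder has no JSON files.
    [ -e "$file" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it.
      echo "gcloud monitoring dashboards create --config-from-file=$file"
    else
      gcloud monitoring dashboards create --config-from-file="$file"
    fi
  done
}
```

Running `DRY_RUN=1 deploy_dashboards dashboards` from the repository root prints the commands without calling gcloud, which is a cheap way to check the file list before creating anything.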

Option 2 - Terraform deployment

  1. cd terraform
  2. Copy backend.tf.sample to backend.tf and modify it to store tfstate in the target Cloud Storage bucket.
  3. Run terraform plan, then terraform apply.
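The Terraform steps above can be sketched as a small shell function. This is an illustration under stated assumptions rather than the project's own tooling: it assumes terraform is installed, adds a terraform init step (Terraform requires it after a backend is configured), and stops after copying the sample so that you can edit backend.tf by hand. The names terraform_deploy, run_or_echo, DRY_RUN, and the tfplan file are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the Terraform flow. Assumptions (not from the README):
# terraform is installed, and backend.tf is edited by hand after the copy.

# Echo the command when DRY_RUN=1, otherwise execute it.
run_or_echo() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

terraform_deploy() {
  local dir="${1:-terraform}"
  (
    cd "$dir" || exit 1
    if [ ! -f backend.tf ]; then
      # First run: create backend.tf from the sample and stop for editing.
      cp backend.tf.sample backend.tf || exit 1
      echo "Created backend.tf; set your state bucket in it, then rerun." >&2
      exit 2
    fi
    # Saving the plan to a file makes apply execute exactly what was reviewed.
    run_or_echo terraform init &&
      run_or_echo terraform plan -out=tfplan &&
      run_or_echo terraform apply tfplan
  )
}
```

The first call exits with status 2 after creating backend.tf. A second call, made once the bucket is set, runs init, plan, and apply in order. Setting DRY_RUN=1 prints the three commands instead of running them.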

Dashboards

| Dashboard Name | Screenshot | Description | json |
| --- | --- | --- | --- |
| GDC Daily Report | dashboard | Dashboard showing node/VM availability and utilization based metrics | json |
| GDC Node View | dashboard | Dashboard showing GDC node information | json |
| GDC VM Status | dashboard | Dashboard showing GDC VM information | json |
| GDC Robin Status | dashboard | Dashboard to deep-dive into robin metrics. Note: this dashboard requires the use of the robin-health application | json |
| GDC External Secrets | dashboard | Dashboard showing External Secrets operational information | json |
| GDC VM Distribution | dashboard | Dashboard showing VM distribution by node | json |

Alerts

| Alert | Category | Description | link |
| --- | --- | --- | --- |
| node-cpu-usage-high | Node | Alert when CPU usage of any node exceeds 80% | config |
| node-memory-usage-high | Node | Alert when memory usage of any node exceeds 80% | config |
| node-not-ready-30m | Node | Alert if any node is not ready for more than 30 minutes | config |
| multiple-nodes-not-ready-realtime | Node | Alert if multiple nodes are not ready at any time | config |
| api-server-error-ratio-5-percent | Control-plane | Alert if the API server has an error ratio exceeding 5% | config |
| apiserver-down | Control-plane | Alert if the API server is down | config |
| controller-manager-down | Control-plane | Alert if the controller manager is down | config |
| scheduler-down | Control-plane | Alert if the scheduler is down | config |
| pod-crash-looping | Pods | Alert if a pod is crash looping | config |
| pod-not-ready-1h | Pods | Alert if a pod is not ready for more than an hour | config |
| coredns-down | System | Alert if CoreDNS is down | config |
| coredns-servfail-ratio-1-percent | System | Alert if more than 1 percent of DNS requests are SERVFAILs | config |
| robin-master-down-10m | Storage | Alert if the robin master is down for more than 10 minutes | config |
| robin-node-offline-30m | Storage | Alert if a robin node is offline for more than 30 minutes | config |
| robin-disk-inactive-10m | Storage | Alert if a robin disk is inactive for more than 10 minutes | config |
| vmruntime-heartbeats-active-realtime | VMRuntime | Alert if VMRuntime heartbeats are missing | config |
| vmruntime-heartbeats-realtime | VMRuntime | Alert if VMRuntime heartbeats are 0 | config |
| vmruntime-vm-down-5m | VMRuntime | Alert if any VM is not active for more than 5 minutes | config |
| vmruntime-vm-missing-5m | VMRuntime | Alert if CPU activity for a VM is absent for more than 5 minutes | config |
| vmruntime-vm-no-network-traffic-5m | VMRuntime | Alert if there is no network activity from a VM | config |
| externalsecrets-down-30m | ExternalSecrets | Alert if External Secrets is down | config |
| externalsecrets-sync-error | ExternalSecrets | Alert if any ExternalSecret resources have sync errors | config |

Disclaimer

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
