Skip to content

digitalis-io/terraform-k8s-monitoring

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Digitalis.io

terraform-k8s-monitoring

Terraform modules for deploying a full observability stack on Kubernetes — by Digitalis.io

Metrics (Mimir), logs (Loki), traces (Tempo), collection (OpenTelemetry Collector), dashboards and alerts (Grafana via kube-prometheus-stack). Works on any Kubernetes cluster — EKS, GKE, AKS, or bare metal. Metrics (Mimir), logs (Loki), traces (Tempo), collection (OpenTelemetry Collector), dashboards and alerts (Grafana via kube-prometheus-stack). Works on any Kubernetes cluster — EKS, GKE, AKS, or bare metal.


Table of Contents


Prerequisites

  • A running Kubernetes cluster with a valid kubeconfig
  • kubectl configured and pointing at the target cluster
  • Terraform >= 1.4 or OpenTofu >= 1.4
  • The Terraform Helm and Kubernetes providers configured (see Quick Start)
  • Buckets or containers pre-created if using cloud storage backends — this module does not create them

Quick Start

This example deploys the full stack with local disk storage. No cloud credentials required. Data lives on the pod filesystem — suitable for development, evaluation, and blog-post walkthroughs.

providers.tf

terraform {
  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = ">= 2.12"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.27"
    }
  }
}

provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"
  }
}

provider "kubernetes" {
  config_path = "~/.kube/config"
}

main.tf

module "cert_manager" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/cert-manager"
}

module "mimir" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/mimir"
}

module "loki" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/loki"
  loki = {
    create_namespace = false
  }
}

module "tempo" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/tempo"
  tempo = {
    create_namespace = false
  }
}

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"
  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id
    loki_datasource_url    = module.loki.datasource_url
    tempo_datasource_url   = module.tempo.datasource_url
    grafana_ingress = {
      enabled    = true
      host       = "grafana.YOUR_DOMAIN"
      class_name = "nginx"
    }
  }
}

module "otel" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/otel-collector"
  otel = {
    create_namespace = false
    tempo_endpoint   = module.tempo.otlp_grpc_endpoint
    mimir_endpoint   = module.mimir.remote_write_endpoint
    loki_endpoint    = module.loki.datasource_url
  }
}

module "prometheus_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus-rules"
  prometheus_rules = {
    prometheus_release_id = module.prometheus.helm_release_id
  }
}

module "grafana_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/grafana-rules"
  grafana_rules = {}
}

Deploy:

terraform init
terraform apply

Grafana will be available at https://grafana.YOUR_DOMAIN. The default credentials are admin / prom-operator.


Module Reference

cert-manager

Installs cert-manager and creates a self-signed ClusterIssuer. Other modules reference this issuer in their ingress TLS annotations.

module "cert_manager" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/cert-manager"

  cert_manager = {
    chart_version       = "v1.19.1"
    namespace           = "cert-manager"
    create_namespace    = true
    cluster_issuer_name = "selfsigned-cluster-issuer"
  }
}
Variable Default Description
chart_version "v1.19.1" cert-manager Helm chart version
namespace "cert-manager" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
cluster_issuer_name "selfsigned-cluster-issuer" Name of the ClusterIssuer to create — must match the cert-manager.io/cluster-issuer annotation in other modules
kubeconfig_path "" Path to the kubeconfig file used by the kubectl local-exec provisioner. When empty, --kubeconfig is omitted and kubectl uses its standard resolution order (KUBECONFIG env var → ~/.kube/config). Set explicitly to pin to a specific file (see Troubleshooting)

No notable outputs.


mimir

Installs Grafana Mimir as the metrics storage and query backend. Prometheus writes metrics here via remote_write. Grafana queries here via a Prometheus-compatible datasource.

module "mimir" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/mimir"

  mimir = {
    namespace        = "monitoring"
    retention_period = "30d"
    tenant_id        = "anonymous"
  }
}
Variable Default Description
chart_version "5.6.0" Mimir distributed Helm chart version
namespace "monitoring" Namespace to deploy into
retention_period "30d" How long to keep metrics
tenant_id "anonymous" Value sent in X-Scope-OrgID header by Prometheus and Grafana
replicas 1 Number of replicas for each Mimir component
ingress_enabled false Expose Mimir via an Ingress
ingress_host "" Hostname for the Mimir ingress (required when ingress_enabled = true)
ingress_class_name "nginx" Ingress class
ingress_tls_secret "" TLS secret name
storage.backend "local" Storage backend: local, s3, gcs, or azure
storage.s3_blocks_prefix "" Object key prefix for blocks — allows sharing one S3 bucket across all three Mimir storage types
storage.s3_ruler_prefix "" Object key prefix for ruler data
storage.s3_alertmanager_prefix "" Object key prefix for Alertmanager data
storage.s3_credentials_secret null Reference a pre-existing Kubernetes Secret for S3 credentials (see S3 credentials secret)
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits

Default resources: 100m CPU / 512Mi memory request, 2 CPU / 4Gi memory limit.

Outputs:

Output Description
remote_write_endpoint Prometheus remote_write URL — wire into the prometheus module
query_frontend_endpoint Grafana datasource URL — wire into the prometheus module
tenant_id The configured tenant ID — wire into the prometheus module
namespace Namespace where Mimir is deployed

prometheus

Installs kube-prometheus-stack: Prometheus, Grafana, and Alertmanager. Configures remote_write to Mimir and adds Loki and Tempo as Grafana datasources automatically when their URLs are supplied.

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id
    loki_datasource_url    = module.loki.datasource_url
    tempo_datasource_url   = module.tempo.datasource_url
  }
}
Variable Default Description
chart_version "86.3.2" kube-prometheus-stack Helm chart version
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
namespace_labels {} Additional labels to apply to the namespace
namespace_annotations {} Additional annotations to apply to the namespace
grafana_enabled true Deploy Grafana
alertmanager_enabled true Deploy Alertmanager
mimir_remote_write_url "" Mimir remote_write URL — use module.mimir.remote_write_endpoint
mimir_datasource_url "" Mimir query URL — use module.mimir.query_frontend_endpoint
mimir_tenant_id "anonymous" Tenant ID for X-Scope-OrgID header
loki_datasource_url "" Loki URL — use module.loki.datasource_url
tempo_datasource_url "" Tempo URL — use module.tempo.datasource_url
pyroscope_datasource_url "" Pyroscope URL — use module.pyroscope.datasource_url
clickhouse_datasource null ClickHouse datasource config — see ClickHouse integration
storage_size "20Gi" PVC size for Prometheus TSDB
storage_class "" StorageClass name (cluster default if empty)
retention "24h" Local TSDB retention (metrics are in Mimir long-term)
grafana_dashboard_imports Node Exporter Full (1860) Grafana.com dashboard IDs to import
extra_dashboards {} Additional dashboard JSON — { "name.json" = file("...") }
grafana_plugins see below Grafana plugins to install
grafana_ingress disabled Grafana ingress config (see Enable ingress)
prometheus_ingress disabled Prometheus ingress config
alertmanager_ingress disabled Alertmanager ingress config
resources see below CPU/memory requests and limits

Default resources: 200m CPU / 512Mi memory request, 2 CPU / 2Gi memory limit.

Default Grafana plugins: digrich-bubblechart-panel, grafana-clock-panel, btplc-status-dot-panel, grafana-piechart-panel, grafana-llm-app, grafana-clickhouse-datasource.

Outputs:

Output Description
grafana_service In-cluster Grafana URL
helm_release_id Helm release ID — required by prometheus-rules module
namespace Namespace where kube-prometheus-stack is deployed

loki

Installs Grafana Loki for log aggregation. Supports single-binary (default) and scalable (SimpleScalable) deployment modes.

module "loki" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/loki"

  loki = {
    namespace        = "monitoring"
    create_namespace = false
    deployment_mode  = "single-binary"
    retention_period = "744h"
  }
}
Variable Default Description
chart_version "6.6.0" Loki Helm chart version
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
deployment_mode "single-binary" single-binary or scalable
replicas 1 Replica count (single-binary mode)
retention_period "744h" Log retention period (31 days)
storage.backend "local" Storage backend: local, s3, gcs, or azure
storage.s3_credentials_secret null Reference a pre-existing Kubernetes Secret for S3 credentials (see S3 credentials secret)
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits

Default resources: 100m CPU / 256Mi memory request, 2 CPU / 2Gi memory limit.

Outputs:

Output Description
datasource_url Loki URL for Grafana datasource and OTel Collector — http://loki.monitoring.svc.cluster.local:3100
namespace Namespace where Loki is deployed

tempo

Installs Grafana Tempo for distributed tracing. Supports monolithic (default) and distributed deployment modes.

module "tempo" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/tempo"

  tempo = {
    namespace        = "monitoring"
    create_namespace = false
    deployment_mode  = "monolithic"
    retention        = "720h"
  }
}
Variable Default Description
chart_version "1.40.0" Tempo Helm chart version
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
namespace_labels {} Additional labels to apply to the namespace
namespace_annotations {} Additional annotations to apply to the namespace
deployment_mode "monolithic" monolithic or distributed
replicas 1 Replica count (monolithic mode)
retention "720h" Trace retention period (30 days)
metrics_generator_remote_write_url "" Mimir (or Prometheus) remote_write URL to enable metrics-generator for TraceQL rate() and span metrics
storage.backend "local" Storage backend: local, s3, gcs, or azure
storage.s3_credentials_secret null Reference a pre-existing Kubernetes Secret for S3 credentials (see S3 credentials secret)
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits

Default resources: 100m CPU / 256Mi memory request, 2 CPU / 2Gi memory limit.

Outputs:

Output Description
datasource_url Tempo URL for Grafana datasource
otlp_grpc_endpoint OTLP gRPC endpoint for app instrumentation (port 4317)
otlp_http_endpoint OTLP HTTP endpoint for app instrumentation (port 4318)
namespace Namespace where Tempo is deployed

otel-collector

Installs the OpenTelemetry Collector (contrib image). Receives OTLP traces, metrics, and logs from your applications and forwards them to Tempo, Mimir, and Loki respectively. Optionally enables the OpenTelemetry Operator for workload instrumentation. Runs as a DaemonSet by default.

module "otel" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/otel-collector"

  otel = {
    namespace        = "monitoring"
    create_namespace = false
    mode             = "daemonset"
    tempo_endpoint   = module.tempo.otlp_grpc_endpoint
    mimir_endpoint   = module.mimir.remote_write_endpoint
    mimir_tenant_id  = module.mimir.tenant_id
    loki_endpoint    = module.loki.datasource_url
  }
}
Variable Default Description
chart_version "0.158.2" OpenTelemetry Collector Helm chart version
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
namespace_labels {} Additional labels to apply to the namespace
namespace_annotations {} Additional annotations to apply to the namespace
mode "daemonset" daemonset or deployment
tempo_endpoint "" OTLP gRPC endpoint for Tempo — use module.tempo.otlp_grpc_endpoint
mimir_endpoint "" Remote write URL for Mimir — use module.mimir.remote_write_endpoint
mimir_tenant_id "anonymous" Tenant ID for X-Scope-OrgID header sent to Mimir — use module.mimir.tenant_id
loki_endpoint "" Loki push URL — use module.loki.datasource_url
clickhouse_endpoint "" ClickHouse HTTP endpoint (:8123) for logs and traces
clickhouse_username "" ClickHouse username
clickhouse_password "" ClickHouse password
clickhouse_database "otel" ClickHouse database name for OTLP/ClickHouse exporter
clickhouse_create_schema true Auto-create database and tables on startup. Disable on memory-constrained ClickHouse instances and pre-create the schema manually
image.repository "otel/opentelemetry-collector-contrib" Collector image (contrib required for Loki and Mimir exporters)
image.tag "" Image tag (empty = chart appVersion)
image.pull_policy "IfNotPresent" Image pull policy
operator.enabled false Deploy the OpenTelemetry Operator for auto-instrumentation
operator.chart_version "0.116.0" Operator Helm chart version
operator.collector_image_repository "otel/opentelemetry-collector-k8s" Operator's default collector image repository
operator.cert_manager_enabled false Use cert-manager for webhook certificates
operator.auto_generate_cert_enabled true Auto-generate webhook certificates (incompatible with cert-manager)
operator.extra_args [] Additional arguments to pass to the operator
operator.go_instrumentation_enabled false Enable Go auto-instrumentation via eBPF (requires Linux kernel >=4.19)
operator.go_instrumentation_image "" Go instrumentation image (defaults to chart appVersion when empty)
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits

Default resources: 300m CPU / 256Mi memory request, 500m CPU / 512Mi memory limit.

Outputs:

Output Description
otlp_grpc_endpoint OTLP gRPC endpoint your apps send traces to (port 4317)
otlp_http_endpoint OTLP HTTP endpoint your apps send traces to (port 4318)
namespace Namespace where the collector is deployed
helm_release_name Helm release name
helm_release_version Deployed chart version

alloy

Installs Grafana Alloy — the OpenTelemetry-native successor to Grafana Agent. Receives OTLP traces, metrics, logs, and profiles from instrumented applications using a River/Alloy pipeline config, and forwards each signal to the configured backend. Runs as a DaemonSet by default (one pod per node), but supports Deployment and StatefulSet controller types.

module "alloy" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/alloy"

  alloy = {
    namespace        = "monitoring"
    create_namespace = false
    controller_type  = "daemonset"
    tempo_endpoint   = module.tempo.otlp_grpc_endpoint
    mimir_endpoint   = module.mimir.remote_write_endpoint
    mimir_tenant_id  = module.mimir.tenant_id
    loki_endpoint    = module.loki.datasource_url
  }
}
Variable Default Description
chart_version "0.12.5" Alloy Helm chart version — check ArtifactHub for the latest
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
namespace_labels {} Additional labels to apply to the namespace
namespace_annotations {} Additional annotations to apply to the namespace
controller_type "daemonset" Kubernetes workload kind: daemonset, deployment, or statefulset
replicas 1 Replica count (ignored when controller_type = "daemonset")
alloy_config "" Full River/Alloy pipeline config. When empty, a built-in default config is rendered using the non-empty sibling endpoints below
loki_endpoint "" Loki push URL — use module.loki.datasource_url
tempo_endpoint "" Tempo OTLP gRPC endpoint — use module.tempo.otlp_grpc_endpoint
mimir_endpoint "" Mimir remote write URL — use module.mimir.remote_write_endpoint
mimir_tenant_id "anonymous" Value sent in X-Scope-OrgID header to Mimir — use module.mimir.tenant_id
pyroscope_endpoint "" Pyroscope push URL — use module.pyroscope.push_url
otel_grpc_endpoint "" Upstream OTel Collector endpoint for chaining — use module.otel.otlp_grpc_endpoint
persistence.enabled false Mount a PVC for WAL state (only meaningful with controller_type = "statefulset")
persistence.size "10Gi" PVC size
persistence.storage_class "" StorageClass name (cluster default if empty)
ingress.enabled false Expose Alloy via an Ingress
ingress.host "" Ingress hostname (required when ingress.enabled = true)
ingress.class_name "nginx" Ingress class
ingress.tls_secret "" TLS secret name
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits
extra_values "" Extra Helm values merged last (highest precedence)

Default resources: 100m CPU / 128Mi memory request, 500m CPU / 512Mi memory limit.

Outputs:

Output Description
otlp_grpc_endpoint OTLP gRPC endpoint for app instrumentation — http://alloy.<namespace>.svc.cluster.local:4317
otlp_http_endpoint OTLP HTTP endpoint for app instrumentation — http://alloy.<namespace>.svc.cluster.local:4318
namespace Namespace where Alloy is deployed
helm_release_name Helm release name
helm_release_version Deployed chart version

pyroscope

Installs Grafana Pyroscope for continuous profiling. Collects CPU, memory, goroutine, and heap profiles from Go, Java, Python, Ruby, and other supported runtimes. Profiles are stored in Pyroscope and queried through a dedicated Grafana datasource.

module "pyroscope" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/pyroscope"

  pyroscope = {
    namespace        = "monitoring"
    create_namespace = false
  }
}
Variable Default Description
chart_version "1.20.3" Pyroscope Helm chart version
namespace "monitoring" Namespace to deploy into
create_namespace true Create the namespace if it does not exist
namespace_labels {} Additional labels to apply to the namespace
namespace_annotations {} Additional annotations to apply to the namespace
replicas 1 Number of Pyroscope replicas
storage.backend "local" Storage backend: local, s3, gcs, or azure
storage.s3_bucket "" S3 bucket name
storage.s3_region "" S3 region
storage.s3_endpoint "" S3-compatible endpoint hostname (scheme stripped automatically)
storage.s3_insecure false Use plain HTTP for the S3 endpoint
storage.s3_access_key "" S3 access key (leave empty for IRSA)
storage.s3_secret_key "" S3 secret key (leave empty for IRSA)
storage.s3_credentials_secret null Reference a pre-existing Kubernetes Secret for S3 credentials (see S3 credentials secret)
storage.gcs_bucket "" GCS bucket name
storage.gcs_service_account_key "" GCS service account JSON key (leave empty for Workload Identity)
storage.azure_storage_account "" Azure storage account name
storage.azure_container "" Azure blob container name
storage.azure_storage_account_key "" Azure storage account key
service_account_annotations {} Annotations for IRSA / Workload Identity
resources see below CPU/memory requests and limits

Default resources: 100m CPU / 256Mi memory request, 1 CPU / 1Gi memory limit.

S3 path-style not supported. Pyroscope's S3 client does not support bucket_lookup_type (path-style access). When using an S3-compatible service such as Hetzner Object Storage, Exoscale, or Cloudflare R2, use a bucket-specific endpoint instead of a shared endpoint with s3_path_style = true:

storage = {
  backend      = "s3"
  s3_bucket    = "mybucket"
  s3_region    = "ch-gva-2"
  s3_endpoint  = "mybucket.sos-ch-gva-2.exo.io"  # bucket-specific endpoint
  s3_access_key = "YOUR_ACCESS_KEY"
  s3_secret_key = "YOUR_SECRET_KEY"
}

Outputs:

Output Description
datasource_url Pyroscope URL for Grafana datasource — wire into the prometheus module as pyroscope_datasource_url
push_url Pyroscope push URL for profiling agents — http://pyroscope.<namespace>.svc.cluster.local:4040
namespace Namespace where Pyroscope is deployed
helm_release_name Helm release name
helm_release_version Deployed chart version

prometheus-rules

Applies Prometheus alert rules and configures Alertmanager receivers. Must be applied after the prometheus module — pass module.prometheus.helm_release_id to enforce ordering.

module "prometheus_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus-rules"

  prometheus_rules = {
    namespace             = "monitoring"
    prometheus_release_id = module.prometheus.helm_release_id
  }
}
Variable Default Description
namespace "monitoring" Must match the kube-prometheus-stack namespace
prometheus_release_id required Output from module.prometheus.helm_release_id
kubeconfig_path "" Path to the kubeconfig file used by kubectl local-exec. When empty, --kubeconfig is omitted and kubectl uses its standard resolution order (KUBECONFIG env var → ~/.kube/config). Set explicitly to pin to a specific file (see Troubleshooting)
extra_rules {} Additional rule YAML files — { "my-app.yaml" = file("...") }
slack.enabled false Send alerts to Slack
slack.webhook_url "" Slack incoming webhook URL (required when enabled)
slack.channel "#alerts" Slack channel
slack.min_severity "warning" Minimum severity to forward: info, warning, or critical
pagerduty.enabled false Send alerts to PagerDuty
pagerduty.routing_key "" PagerDuty routing key (required when enabled)
pagerduty.min_severity "critical" Minimum severity to page

No notable outputs.


grafana-rules

Applies Grafana-managed alert rules and configures Grafana contact points (Slack, PagerDuty, webhook, email).

module "grafana_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/grafana-rules"

  grafana_rules = {
    namespace = "monitoring"
  }
}
Variable Default Description
namespace "monitoring" Must match the kube-prometheus-stack namespace
extra_rules {} Additional rule YAML files — { "my-app.yaml" = file("...") }
slack.enabled false Send alerts to Slack
slack.webhook_url "" Slack incoming webhook URL (required when enabled)
slack.channel "#alerts" Slack channel
slack.min_severity "warning" Minimum severity: info, warning, or critical
pagerduty.enabled false Send alerts to PagerDuty
pagerduty.integration_key "" PagerDuty integration key (required when enabled)
pagerduty.min_severity "critical" Minimum severity to page
webhook.enabled false Send alerts to a generic webhook
webhook.url "" Webhook URL (required when enabled)
webhook.http_method "POST" HTTP method
webhook.min_severity "warning" Minimum severity
email.enabled false Send alerts by email
email.addresses [] List of recipient email addresses (required when enabled)
email.min_severity "critical" Minimum severity

No notable outputs.


Common Recipes

Complete, copy-paste examples are available in the examples/ directory:

Example Description
examples/minimal/ Full stack with local disk storage — no cloud credentials needed
examples/alloy-basic/ Alloy DaemonSet collector wired to Loki, Tempo, and Mimir
examples/aws/ S3 backend with IRSA authentication on EKS
examples/gcp/ GCS backend with Workload Identity on GKE

Use S3 for Mimir storage (with IRSA)

Pre-create three S3 buckets before running terraform apply. IRSA handles authentication — no access keys needed.

module "mimir" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/mimir"

  mimir = {
    namespace        = "monitoring"
    retention_period = "90d"

    storage = {
      backend                = "s3"
      s3_blocks_bucket       = "YOUR_BUCKET_NAME-mimir-blocks"
      s3_ruler_bucket        = "YOUR_BUCKET_NAME-mimir-ruler"
      s3_alertmanager_bucket = "YOUR_BUCKET_NAME-mimir-alertmanager"
      s3_region              = "eu-west-1"
      # s3_access_key and s3_secret_key left empty — IRSA is used instead
    }

    service_account_annotations = {
      "eks.amazonaws.com/role-arn" = "arn:aws:iam::123456789012:role/mimir"
    }
  }
}

Use S3-compatible storage (Hetzner, MinIO, Ceph)

Any S3-compatible service works. Set s3_endpoint to the service hostname or URL, s3_path_style = true (required by Hetzner and most non-AWS services), and provide access credentials.

The module strips https:// and http:// from s3_endpoint automatically, so both "https://fsn1.your-objectstorage.com" and "fsn1.your-objectstorage.com" are accepted.

module "mimir" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/mimir"

  mimir = {
    namespace        = "monitoring"
    retention_period = "30d"

    storage = {
      backend                = "s3"
      s3_blocks_bucket       = "mimir-blocks"
      s3_ruler_bucket        = "mimir-ruler"
      s3_alertmanager_bucket = "mimir-alertmanager"
      s3_region              = "eu-central"          # Hetzner region, or "us-east-1" for MinIO
      s3_endpoint            = "fsn1.your-objectstorage.com"  # Hetzner example — scheme optional
      s3_path_style          = true                  # required for Hetzner, MinIO, Ceph
      s3_insecure            = false                 # set true only for plain HTTP endpoints
      s3_access_key          = "YOUR_ACCESS_KEY"
      s3_secret_key          = "YOUR_SECRET_KEY"
    }
  }
}

The same s3_endpoint, s3_path_style, and s3_insecure variables are available on modules/loki and modules/tempo:

module "loki" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/loki"

  loki = {
    storage = {
      backend          = "s3"
      s3_chunks_bucket = "loki-chunks"
      s3_ruler_bucket  = "loki-ruler"
      s3_region        = "eu-central"
      s3_endpoint      = "fsn1.your-objectstorage.com"  # scheme optional
      s3_path_style    = true
      s3_access_key    = "YOUR_ACCESS_KEY"
      s3_secret_key    = "YOUR_SECRET_KEY"
    }
  }
}

Use GCS for Loki storage (with Workload Identity)

Pre-create two GCS buckets before running terraform apply.

module "loki" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/loki"

  loki = {
    namespace        = "monitoring"
    create_namespace = false
    retention_period = "744h"

    storage = {
      backend           = "gcs"
      gcs_chunks_bucket = "YOUR_PROJECT-loki-chunks"
      gcs_ruler_bucket  = "YOUR_PROJECT-loki-ruler"
      # gcs_service_account_key left empty — Workload Identity is used instead
    }

    service_account_annotations = {
      "iam.gke.io/gcp-service-account" = "loki@YOUR_GCP_PROJECT.iam.gserviceaccount.com"
    }
  }
}

Add a custom Grafana dashboard from a JSON file

Place your dashboard JSON anywhere in the repo, then pass it via extra_dashboards. The key is the filename that appears in Grafana.

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id

    extra_dashboards = {
      "my-app.json"      = file("${path.module}/dashboards/my-app.json")
      "another-app.json" = file("${path.module}/dashboards/another-app.json")
    }
  }
}

Add a custom Grafana dashboard by grafana.com ID

Find the dashboard on grafana.com/grafana/dashboards, note its ID and revision number.

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id

    grafana_dashboard_imports = [
      # Node Exporter Full — already included by default, shown here as example
      { gnet_id = 1860, revision = 37, datasource = "Mimir" },
      # Kubernetes / Compute Resources / Cluster
      { gnet_id = 15520, revision = 9, datasource = "Mimir" },
      # Loki dashboard
      { gnet_id = 13639, revision = 2, datasource = "Loki" },
    ]
  }
}

Add custom Prometheus alert rules from a YAML file

Write a standard PrometheusRule-compatible YAML file and pass it via extra_rules.

rules/my-app.yaml:

groups:
  - name: my-app
    rules:
      - alert: MyAppHighErrorRate
        expr: rate(http_requests_total{job="my-app",status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on my-app"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
module "prometheus_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus-rules"

  prometheus_rules = {
    namespace             = "monitoring"
    prometheus_release_id = module.prometheus.helm_release_id

    extra_rules = {
      "my-app.yaml" = file("${path.module}/rules/my-app.yaml")
    }
  }
}

Enable Slack alerts (prometheus-rules)

Alerts at warning severity or above are forwarded to Slack. Critical alerts also go to Slack unless you raise min_severity.

module "prometheus_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus-rules"

  prometheus_rules = {
    namespace             = "monitoring"
    prometheus_release_id = module.prometheus.helm_release_id

    slack = {
      enabled      = true
      webhook_url  = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
      channel      = "#platform-alerts"
      min_severity = "warning"
    }
  }
}

Enable PagerDuty alerts (grafana-rules)

Only critical alerts page by default. Lower min_severity to warning to increase coverage.

module "grafana_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/grafana-rules"

  grafana_rules = {
    namespace = "monitoring"

    pagerduty = {
      enabled         = true
      integration_key = "YOUR_PAGERDUTY_INTEGRATION_KEY"
      min_severity    = "critical"
    }
  }
}

Enable continuous profiling (Pyroscope)

Deploy Pyroscope and wire it into Grafana as a datasource. The pyroscope_datasource_url variable adds a grafana-pyroscope-datasource datasource with uid pyroscope to Grafana automatically.

module "pyroscope" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/pyroscope"

  pyroscope = {
    namespace        = "monitoring"
    create_namespace = false
  }
}

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id
    loki_datasource_url    = module.loki.datasource_url
    tempo_datasource_url   = module.tempo.datasource_url
    pyroscope_datasource_url = module.pyroscope.datasource_url
  }
}

Once deployed, push profiles from your applications to http://pyroscope.monitoring.svc.cluster.local:4040. Pyroscope uses port 4040 for both push ingestion and query.


Enable Tempo metrics generator with Mimir backend

Tempo's metrics-generator extracts RED metrics (Request, Error, Duration) and custom span metrics from traces, then writes them to Mimir for long-term storage. This enables TraceQL rate() queries and correlation between traces and metrics.

module "tempo" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/tempo"

  tempo = {
    namespace        = "monitoring"
    create_namespace = false
    # Enable metrics generation — write to the same Mimir endpoint as Prometheus
    metrics_generator_remote_write_url = module.mimir.remote_write_endpoint
  }
}

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace         = false
    mimir_remote_write_url   = module.mimir.remote_write_endpoint
    mimir_datasource_url     = module.mimir.query_frontend_endpoint
    mimir_tenant_id          = module.mimir.tenant_id
    tempo_datasource_url     = module.tempo.datasource_url
  }
}

ClickHouse integration for logs and traces

Use ClickHouse as an alternative backend for OTLP logs and traces. The OTel Collector exports directly to ClickHouse, and Grafana queries via the ClickHouse datasource plugin.

Deploy ClickHouse first (or use a managed instance), then wire the OTel Collector and Grafana datasource:

module "otel" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/otel-collector"

  otel = {
    namespace           = "monitoring"
    create_namespace    = false
    tempo_endpoint      = module.tempo.otlp_grpc_endpoint
    mimir_endpoint      = module.mimir.remote_write_endpoint
    loki_endpoint       = module.loki.datasource_url
    # ClickHouse exporter configuration
    clickhouse_endpoint  = "clickhouse.observability.svc.cluster.local:8123"
    clickhouse_username  = "default"
    clickhouse_password  = "your-password"
    clickhouse_database  = "otel"
    clickhouse_create_schema = true  # auto-create tables on startup
  }
}

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id

    # ClickHouse datasource for querying OTel logs/traces
    clickhouse_datasource = {
      host     = "clickhouse.observability.svc.cluster.local"
      port     = 9000
      database = "otel"
      username = "default"
      password = "your-password"
      secure   = false
      # OTel schema — matches tables created by the otel-collector ClickHouse exporter
      logs_otel_enabled    = true
      logs_default_table   = "otel_logs"
      traces_otel_enabled  = true
      traces_default_table = "otel_traces"
    }
  }
}

Enable OpenTelemetry Operator for auto-instrumentation

The OpenTelemetry Operator enables zero-code instrumentation of workloads via annotations. Deployed workloads are automatically patched with OTEL_JAVAAGENT, Go eBPF instrumentation, or Python auto-instrumentation.

module "otel" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/otel-collector"

  otel = {
    namespace        = "monitoring"
    create_namespace = false

    operator = {
      enabled           = true
      chart_version     = "0.116.0"
      cert_manager_enabled = false  # auto-generate webhook certs by default
      # Enable Go eBPF instrumentation (requires Linux kernel >=4.19)
      go_instrumentation_enabled = true
    }
  }
}

After deployment, annotate your workload to enable instrumentation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"  # or inject-python, inject-go
    spec:
      containers:
      - name: app
        image: my-app:latest

Enable ingress for Grafana with TLS via cert-manager

This requires the cert-manager module to be deployed first. The cluster_issuer_name in cert-manager must match the cert-manager.io/cluster-issuer annotation below.

module "cert_manager" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/cert-manager"

  cert_manager = {
    cluster_issuer_name = "selfsigned-cluster-issuer"
  }
}

module "prometheus" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus"

  prometheus = {
    create_namespace       = false
    mimir_remote_write_url = module.mimir.remote_write_endpoint
    mimir_datasource_url   = module.mimir.query_frontend_endpoint
    mimir_tenant_id        = module.mimir.tenant_id

    grafana_ingress = {
      enabled    = true
      host       = "grafana.YOUR_DOMAIN"
      class_name = "nginx"
      tls_secret = "grafana-tls"
      annotations = {
        "cert-manager.io/cluster-issuer" = "selfsigned-cluster-issuer"
      }
    }

    prometheus_ingress = {
      enabled    = true
      host       = "prometheus.YOUR_DOMAIN"
      class_name = "nginx"
      tls_secret = "prometheus-tls"
      annotations = {
        "cert-manager.io/cluster-issuer" = "selfsigned-cluster-issuer"
      }
    }

    alertmanager_ingress = {
      enabled    = true
      host       = "alertmanager.YOUR_DOMAIN"
      class_name = "nginx"
      tls_secret = "alertmanager-tls"
      annotations = {
        "cert-manager.io/cluster-issuer" = "selfsigned-cluster-issuer"
      }
    }
  }
}

Storage Backends

All modules default to local disk storage. For production, use an object storage backend. Buckets and containers must be created before running terraform apply — these modules do not create them.

Module local S3 GCS Azure
mimir yes yes yes yes
loki yes yes yes yes
tempo yes yes yes yes
pyroscope yes yes yes yes
otel-collector n/a n/a n/a n/a
prometheus n/a n/a n/a n/a
cert-manager n/a n/a n/a n/a

S3 bucket requirements per module:

Module Required buckets
mimir s3_blocks_bucket, s3_ruler_bucket, s3_alertmanager_bucket
loki s3_chunks_bucket, s3_ruler_bucket
tempo s3_bucket
pyroscope s3_bucket

GCS bucket requirements per module:

Module Required buckets
mimir gcs_blocks_bucket, gcs_ruler_bucket, gcs_alertmanager_bucket
loki gcs_chunks_bucket, gcs_ruler_bucket
tempo gcs_bucket
pyroscope gcs_bucket

Azure container requirements per module:

Module Required containers
mimir azure_storage_account, azure_blocks_container, azure_ruler_container, azure_alertmanager_container
loki azure_storage_account, azure_chunks_container, azure_ruler_container
tempo azure_storage_account, azure_container
pyroscope azure_storage_account, azure_container

For IRSA (AWS) or Workload Identity (GCP/Azure), leave the key fields empty and provide the IAM annotation via service_account_annotations. The module does not create IAM roles — pre-create the role and supply the annotation.

# IRSA (EKS)
service_account_annotations = {
  "eks.amazonaws.com/role-arn" = "arn:aws:iam::123456789012:role/mimir"
}

# GKE Workload Identity
service_account_annotations = {
  "iam.gke.io/gcp-service-account" = "mimir@YOUR_GCP_PROJECT.iam.gserviceaccount.com"
}

S3 credentials secret

Instead of passing s3_access_key and s3_secret_key as plain text, you can reference a pre-existing Kubernetes Secret. The module injects the credentials as environment variables rather than embedding them in Helm values.

storage = {
  backend                = "s3"
  s3_blocks_bucket       = "mimir-blocks"
  s3_ruler_bucket        = "mimir-ruler"
  s3_alertmanager_bucket = "mimir-alertmanager"
  s3_region              = "eu-west-1"

  s3_credentials_secret = {
    name             = "my-s3-secret"       # name of the pre-existing Secret
    access_key_field = "access-key"         # key inside the Secret (default: "access-key")
    secret_key_field = "secret-key"         # key inside the Secret (default: "secret-key")
  }
}

The same s3_credentials_secret variable is available on modules/loki and modules/tempo. To share one Secret across all three modules, pass the same name to each.

Three credential modes are supported — use whichever fits your environment:

Mode How to configure
IRSA / Workload Identity Leave s3_access_key, s3_secret_key, and s3_credentials_secret all unset; provide service_account_annotations
Plain-text keys Set s3_access_key and s3_secret_key directly; the module creates a Secret automatically
Pre-existing Secret Set s3_credentials_secret; leave s3_access_key and s3_secret_key unset

Sharing one S3 bucket across Mimir storage types (Mimir only)

By default Mimir requires three separate S3 buckets (blocks, ruler, alertmanager). If you prefer a single bucket, use the s3_blocks_prefix, s3_ruler_prefix, and s3_alertmanager_prefix variables to isolate each storage type under a distinct key prefix.

storage = {
  backend                = "s3"
  s3_blocks_bucket       = "mimir-shared"
  s3_ruler_bucket        = "mimir-shared"
  s3_alertmanager_bucket = "mimir-shared"
  s3_region              = "eu-west-1"
  s3_blocks_prefix       = "blocks"
  s3_ruler_prefix        = "ruler"
  s3_alertmanager_prefix = "alertmanager"
}

S3 endpoint format

All three modules (mimir, loki, tempo) strip https:// and http:// from s3_endpoint automatically before passing the value to the underlying Helm chart. Either format is accepted:

s3_endpoint = "fsn1.your-objectstorage.com"        # hostname only — preferred
s3_endpoint = "https://fsn1.your-objectstorage.com" # scheme stripped automatically

Architecture

                          ┌─────────────────────────────────────────┐
                          │            Kubernetes Cluster            │
                          │                                          │
  ┌──────────┐  OTLP      │  ┌─────────────────────────────────┐   │
  │   Your   │──gRPC/HTTP─┼─▶│      OpenTelemetry Collector     │   │
  │   Apps   │            │  └──────┬──────────┬───────────┬───┘   │
  └──────────┘            │         │          │           │        │
                          │   traces│    metrics│      logs│        │
                          │         ▼          ▼           ▼        │
                          │  ┌──────────┐ ┌───────┐ ┌──────────┐  │
                          │  │  Tempo   │ │ Mimir │ │   Loki   │  │
                          │  │ (traces) │ │(metrics)│ │  (logs)  │  │
                          │  └────┬─────┘ └───┬───┘ └────┬─────┘  │
                          │       │            │          │         │
                          │       └────────────┼──────────┘         │
                          │                    │ query               │
                          │                    ▼                     │
                          │           ┌─────────────────┐           │
                          │           │     Grafana      │           │
                          │           │  (dashboards +   │           │
                          │           │     alerts)      │           │
                          │           └────────┬─────────┘           │
                          └────────────────────┼─────────────────────┘
                                               │ HTTPS
                                               ▼
                                        Browser / User

  ┌──────────────────────────────────────────────────────┐
  │                    Alert routing                      │
  │                                                       │
  │  Prometheus ──▶ Alertmanager ──▶ Slack / PagerDuty  │
  │  Grafana rules ────────────────▶ Slack / PagerDuty  │
  └──────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────┐
  │                   Prometheus scraping                 │
  │                                                       │
  │  Kubernetes nodes, pods, services                     │
  │         │                                             │
  │         ▼                                             │
  │    Prometheus ──remote_write──▶ Mimir                 │
  └──────────────────────────────────────────────────────┘

Component roles at a glance:

Component Role
Mimir Long-term metrics storage and query backend
Loki Log aggregation and query
Tempo Distributed trace storage and query
Prometheus Cluster scraping and remote write to Mimir
Grafana Unified dashboards and alert management
OTel Collector OTLP receiver — forwards traces to Tempo, metrics to Mimir, logs to Loki
Alloy OTel-native collector (successor to Grafana Agent) — River/Alloy pipeline config
Pyroscope Continuous profiling storage and query — CPU, memory, goroutines, heap
cert-manager TLS certificate issuance for ingress
prometheus-rules Prometheus alert rules and Alertmanager receivers
grafana-rules Grafana-managed alert rules and contact points

Troubleshooting

"Endpoint url cannot have fully qualified paths"

This error is produced by the MinIO SDK when s3_endpoint is passed with a scheme (https:// or http://). All three modules strip the scheme automatically, so this error should not appear. If it does, verify that s3_endpoint contains only the hostname and optional port — no scheme prefix.

# Correct
s3_endpoint = "fsn1.your-objectstorage.com"

# Also accepted — scheme is stripped automatically
s3_endpoint = "https://fsn1.your-objectstorage.com"

Mimir bundled MinIO conflict

The mimir-distributed Helm chart ships with MinIO enabled by default upstream. This module disables it (minio.enabled: false) because the bundled MinIO injects its own S3 configuration that conflicts with external storage backends, producing the "fully qualified paths" error above. No action is required from callers — the module handles this automatically.

usage_stats disabled in Mimir

Mimir's anonymous telemetry (usage_stats) is disabled by this module. When usage_stats is enabled and an S3-compatible endpoint is configured, Mimir attempts to send telemetry to a fully-qualified S3 path that triggers the MinIO SDK path validation error. Disabling it has no effect on Mimir's functionality.

Wrong cluster targeted by kubectl

The cert-manager and prometheus-rules modules use kubectl via a local-exec provisioner. If the KUBECONFIG environment variable is set in your shell, it overrides the Terraform provider's config_path, causing kubectl to target a different cluster than the one Terraform is managing.

Set kubeconfig_path explicitly on both modules to pin them to the correct config file:

module "cert_manager" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/cert-manager"
  cert_manager = {
    kubeconfig_path = "/path/to/your/kubeconfig"
  }
}

module "prometheus_rules" {
  source = "github.com/digitalis-io/terraform-k8s-monitoring//modules/prometheus-rules"
  prometheus_rules = {
    prometheus_release_id = module.prometheus.helm_release_id
    kubeconfig_path       = "/path/to/your/kubeconfig"
  }
}

Support

This project is maintained by Digitalis.io. For support, visit digitalis.io/contact.

About

Deploy a monitoring stack to Kubernetes using Grafana and Prometheus

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors