Add integrated scylla-monitoring stack support #724

Open
dkropachev wants to merge 1 commit into master from dk/integrated-monitoring-stack
@dkropachev
Contributor

Summary

Add first-class monitoring integration to CCM: a Prometheus + Grafana + Alertmanager
stack that runs alongside any Scylla cluster as Docker containers with --net=host.

Two modes of operation:

  • Automatic (--monitoring flag or CCM_MONITORING=1 env var) — monitoring
    starts with the cluster and Prometheus scrape targets are kept in sync on every
    topology change (add, remove, start, stop, decommission).
  • Manual (ccm monitoring start/stop/sync) — on-demand control, no automatic
    target updates.
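
The split between the two modes can be pictured with a minimal sketch. It assumes that automatic mode is switched on by either the flag or `CCM_MONITORING=1`, and that topology changes go through a hook that re-syncs targets only in that mode. `monitoring_requested`, `notify_topology_change` and `sync_targets` are hypothetical names used for illustration, not necessarily the ones in this PR.

```python
import os


def monitoring_requested(flag=False, environ=None):
    """Return True if the --monitoring flag or CCM_MONITORING=1 asks for monitoring.

    Assumption: only the literal value "1" enables it, as written in the summary.
    """
    environ = os.environ if environ is None else environ
    return bool(flag) or environ.get("CCM_MONITORING") == "1"


def notify_topology_change(automatic, sync_targets):
    """Hypothetical hook called after add/remove/start/stop/decommission.

    In automatic mode it re-syncs the scrape targets; in manual mode it does
    nothing, and the user is expected to run `ccm monitoring sync` instead.
    """
    if not automatic:
        return False
    sync_targets()
    return True
```

Passing the environment as a parameter keeps the check easy to exercise in unit tests without touching `os.environ`.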

Key features:

  • Per-cluster port offsets (based on cluster ID) for running multiple monitored clusters
    simultaneously
  • Auto-clones the scylla-monitoring repo for real Grafana dashboards; falls back to a
    built-in overview dashboard when the repo is unavailable
  • Atomic target file writes so Prometheus never reads partial state
  • Monitoring failures in automatic mode are logged as warnings and never block cluster
    operations
  • Configuration persisted in cluster.conf and restored on load
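
Three of the features above (port offsets, atomic target writes, non-blocking failures) can be sketched as follows. This is an illustration under stated assumptions, not the code in `ccmlib/scylla_monitoring.py`: the offset formula, the stride, the file name and all function names are hypothetical. The base ports are the upstream defaults of Prometheus, Grafana and Alertmanager, and the payload follows Prometheus's file-based service discovery JSON format.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger("ccm.monitoring.sketch")

# Upstream default ports; the stride and the formula below are assumptions.
BASE_PORTS = {"prometheus": 9090, "grafana": 3000, "alertmanager": 9093}
PORT_STRIDE = 10  # hypothetical gap between clusters


def ports_for_cluster(cluster_id):
    """Derive per-cluster ports so several stacks can share one host."""
    return {name: base + cluster_id * PORT_STRIDE for name, base in BASE_PORTS.items()}


def write_targets_atomically(path, node_addresses, cluster_name):
    """Write a file_sd targets file via a temp file plus os.replace().

    A reader sees either the old file or the new one, never a partial write,
    because the rename is atomic when both paths are on the same filesystem.
    """
    payload = [{"targets": sorted(node_addresses), "labels": {"cluster": cluster_name}}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def sync_without_blocking(sync):
    """Run a sync callable; log a warning instead of raising on failure."""
    try:
        sync()
    except Exception as exc:
        logger.warning("monitoring sync failed, continuing: %s", exc)
        return False
    return True
```

Creating the temporary file in the destination directory matters here: `os.replace()` is only atomic within a single filesystem.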

New CLI surface:

ccm create ... --monitoring [--monitoring-dir=PATH]
ccm monitoring start|stop|enable|disable|sync|status

Changed files

  • ccmlib/scylla_monitoring.py — new MonitoringStack class
  • ccmlib/cmds/cluster_cmds.py — ClusterMonitoringCmd + --monitoring flag on create
  • ccmlib/cluster.py — monitoring fields, _notify_topology_change() hook
  • ccmlib/scylla_cluster.py — auto-start/stop monitoring, CCM_MONITORING env var
  • ccmlib/scylla_node.py, ccmlib/node.py — topology change notifications
  • ccmlib/cluster_factory.py — restore monitoring settings on load
  • docs/monitoring.md — full reference documentation
  • README.md — monitoring section, environment variables table
  • tests/test_scylla_monitoring.py — unit tests for MonitoringStack
  • tests/test_monitoring_integration.py — integration tests for CLI and hooks

@dkropachev dkropachev marked this pull request as draft February 12, 2026 15:05
@dkropachev dkropachev force-pushed the dk/integrated-monitoring-stack branch 2 times, most recently from 9392bad to 1226411 Compare February 12, 2026 15:16
@dkropachev dkropachev force-pushed the dk/integrated-monitoring-stack branch from 1226411 to 5c80573 Compare February 12, 2026 15:30
@dkropachev
Contributor Author

@fruch, it is now ready and tested. The only reason ccmlib/scylla_monitoring.py is so big is that the scripts from https://github.com/scylladb/scylla-monitoring cannot be reused: this is a shared environment where many stacks may be running at once, so I had to switch to host networking. As a result, provisioning is now manual, and the scripts have no CLI to help with it.
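
To illustrate why host networking pushes provisioning into the tool itself: with `--net=host` Docker does no port mapping, so each container binds directly on the host and every stack must be given its own listen port explicitly. The sketch below only builds a `docker run` argument list for Prometheus. The container naming scheme, the function name and the mounted directory layout are hypothetical; the Docker and Prometheus flags are standard ones.

```python
def prometheus_docker_argv(cluster_name, listen_port, config_dir, image="prom/prometheus"):
    """Build (but do not run) a `docker run` command for one cluster's Prometheus.

    Assumptions: config_dir holds a prometheus.yml, and containers are named
    per cluster so several stacks can coexist on the same host.
    """
    container = "ccm-{}-prometheus".format(cluster_name)  # hypothetical naming
    return [
        "docker", "run", "-d",
        "--name", container,
        "--net=host",  # share the host network namespace; -p mappings do not apply
        "-v", "{}:/etc/prometheus:ro".format(config_dir),
        image,
        "--config.file=/etc/prometheus/prometheus.yml",
        "--web.listen-address=:{}".format(listen_port),  # per-cluster port
    ]
```

The resulting list could be handed to `subprocess.run(argv, check=True)`; keeping construction separate from execution makes the command easy to unit test.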

@dkropachev dkropachev marked this pull request as ready for review February 12, 2026 15:47
Spin up a Prometheus + Grafana + Alertmanager stack alongside any CCM
cluster using Docker containers with --net=host.

Automatic mode (--monitoring or CCM_MONITORING=1) keeps Prometheus scrape
targets in sync on every topology change. Manual mode
(ccm monitoring start/stop/sync) gives on-demand control.

Multi-cluster setups are supported via port offsets based on cluster ID.
When the scylla-monitoring repo is available, real dashboards are generated;
otherwise a built-in fallback overview dashboard is used.

New CLI:
  ccm create ... --monitoring [--monitoring-dir=PATH]
  ccm monitoring start|stop|enable|disable|sync|status
@dkropachev dkropachev force-pushed the dk/integrated-monitoring-stack branch from 5c80573 to 4378b81 Compare February 19, 2026 13:37