Commit 9392bad

scylla: add integrated monitoring stack support

Spin up a Prometheus + Grafana + Alertmanager stack alongside any CCM cluster using Docker containers with --net=host.

Automatic mode (--monitoring or CCM_MONITORING=1) keeps Prometheus scrape targets in sync on every topology change. Manual mode (ccm monitoring start/stop/sync) gives on-demand control.

Multi-cluster setups are supported via port offsets based on cluster ID.

When scylla-monitoring repo is available, real dashboards are generated; otherwise a built-in fallback overview dashboard is used.

New CLI:
    ccm create ... --monitoring [--monitoring-dir=PATH]
    ccm monitoring start|stop|enable|disable|sync|status

1 parent 17e0101 commit 9392bad

14 files changed, +3481 -10 lines changed

README.md

Lines changed: 50 additions & 3 deletions
@@ -131,7 +131,7 @@ Requirements
 - **Java 8+** - only required for:
   - Using Cassandra clusters
   - Using older Scylla versions (< 6.0)
-- **Docker** (optional) - for Docker-based clusters
+- **Docker** (optional) - for Docker-based clusters and [integrated monitoring](docs/monitoring.md)
 - **Multiple loopback interfaces** - CCM runs nodes on 127.0.0.X addresses
   - On Linux: Usually available by default
   - On Mac OS X: Create aliases manually:
@@ -142,8 +142,18 @@ Requirements
 ```

 By default ccm will look for Java in `/usr/lib/jvm` directory. If you have a custom Java installation directory,
-you can provide `CUSTOM_JAVA_HOME` env variable. When this environment, ccm will assume Java binary is available
-under `$CUSTOM_JAVA_HOME/bin/java` path.
+you can provide `CUSTOM_JAVA_HOME` env variable. When this environment variable is set, ccm will assume Java binary is available
+under `$CUSTOM_JAVA_HOME/bin/java` path.
+
+### Environment Variables
+
+| Variable | Description |
+|---|---|
+| `CCM_MONITORING` | When non-empty and not `0`, enable automatic monitoring for all newly created clusters (same as `--monitoring`). |
+| `SCYLLA_MONITORING_DIR` | Path to the [scylla-monitoring](https://github.com/scylladb/scylla-monitoring) checkout. Used as default for `--monitoring-dir`. |
+| `CUSTOM_JAVA_HOME` | Custom Java installation directory. |
+| `SCYLLA_EXT_OPTS` | Extra Scylla command-line options passed to every node. |
+| `SCYLLA_EXT_ENV` | Extra environment variables for Scylla processes. |

 Known Issues
 ------------
@@ -237,6 +247,7 @@ ccm start # Start all nodes
 ccm stop # Stop all nodes
 ccm remove # Remove current cluster
 ccm clear # Clear cluster data but keep config
+ccm monitoring <subcmd> # Manage monitoring stack (start/stop/enable/disable/sync/status)
 ```

 ### Node Management
@@ -258,6 +269,42 @@ ccm add node4 -i 127.0.0.4 # Add a new node to cluster

 For complete command reference, run `ccm --help` or `ccm <command> --help`.

+Integrated Monitoring
+---------------------
+
+CCM can start and maintain a [scylla-monitoring](https://github.com/scylladb/scylla-monitoring) stack
+(Prometheus + Grafana) alongside your cluster. See [docs/monitoring.md](docs/monitoring.md) for full details.
+
+### Quick start — automatic mode
+```bash
+# Create a cluster with monitoring enabled
+ccm create mycluster --scylla -n 3 -v release:2024.2 --monitoring --monitoring-dir=/path/to/scylla-monitoring -s
+
+# Monitoring is now running and auto-updates on topology changes
+ccm monitoring status
+# Grafana: http://localhost:3000
+# Prometheus: http://localhost:9090
+
+ccm add node4 --scylla # targets updated automatically
+ccm stop # monitoring stops automatically
+```
+
+### Quick start — environment variable
+```bash
+# Enable monitoring for all new clusters without passing --monitoring every time
+export CCM_MONITORING=1
+ccm create mycluster --scylla -n 3 -v release:2024.2 -s
+```
+
+### Quick start — manual mode
+```bash
+ccm create mycluster --scylla -n 3 -v release:2024.2 -s
+ccm monitoring start --monitoring-dir=/path/to/scylla-monitoring
+ccm add node4 --scylla
+ccm monitoring sync # manually refresh targets
+ccm monitoring stop
+```
+
 Working with Cassandra
 ----------------------
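
Note on the `CCM_MONITORING` row added to the README table: the rule "non-empty and not `0`" can be sketched in Python as below. This is an illustration only. The helper name `monitoring_enabled_from_env` is hypothetical and does not appear in the commit, and the actual check in `ScyllaCluster.__init__` may differ, for example in how it treats surrounding whitespace.

```python
import os


def monitoring_enabled_from_env(environ=None):
    """Return True when CCM_MONITORING is non-empty and not "0".

    `environ` defaults to os.environ; passing a plain dict makes the
    rule easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    value = env.get("CCM_MONITORING", "")
    return value != "" and value != "0"
```

Under this reading, `CCM_MONITORING=1` and `CCM_MONITORING=yes` both enable monitoring, while an unset variable, an empty string, and `0` leave it disabled.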

ccmlib/cluster.py

Lines changed: 33 additions & 3 deletions
@@ -41,6 +41,18 @@ def __init__(self, path, name, partitioner=None, install_dir=None, create_direct
         self._debug = []
         self._trace = []

+        # Monitoring stack integration (Scylla-only).
+        # These fields are initialized here so they're always present on any
+        # Cluster instance (simplifies config persistence and CLI). Actual
+        # monitoring auto-start only happens in ScyllaCluster.
+        # CCM_MONITORING env var is applied in ScyllaCluster.__init__.
+        self.monitoring_enabled = False
+        self.monitoring_stack = None # runtime only, not persisted
+        self.monitoring_dir = None
+        self.grafana_port = 3000
+        self.prometheus_port = 9090
+        self.alertmanager_port = 9093
+
         if self.name.lower() == "current":
             raise RuntimeError("Cannot name a cluster 'current'.")

@@ -226,14 +238,25 @@ def version(self):
     def cassandra_version(self):
         return self.version()

+    def _notify_topology_change(self):
+        """Called after any operation that changes which nodes are UP.
+
+        Only updates targets if automatic monitoring is enabled AND running.
+        """
+        if self.monitoring_enabled and self.monitoring_stack and self.monitoring_stack.is_running():
+            try:
+                self.monitoring_stack.update_targets()
+            except Exception as e:
+                self.warning(f"Failed to update monitoring targets: {e}")
+
     def add(self, node: ScyllaNode, is_seed, data_center=None, rack=None):
         if node.name in self.nodes:
             raise common.ArgumentError(f'Cannot create existing node {node.name}')
         self.nodes[node.name] = node
         if is_seed:
             self.seeds.append(node)
         self._update_config()
-
+
         # If data_center is not specified, infer it from existing nodes
         if data_center is None and len(self.nodes) > 1:
             # Get datacenter from the first existing node (excluding the one we just added)
@@ -243,7 +266,7 @@ def add(self, node: ScyllaNode, is_seed, data_center=None, rack=None):
                     if rack is None and existing_node.rack is not None:
                         rack = existing_node.rack
                     break
-
+
         node.data_center = data_center
         node.rack = rack
         node.set_log_level(self.__log_level)
@@ -257,6 +280,7 @@ def add(self, node: ScyllaNode, is_seed, data_center=None, rack=None):
         self.debug(f"{node.name}: data_center={node.data_center} rack={node.rack} snitch={self.snitch}")
         self.__update_topology_files()
         node._save()
+        self._notify_topology_change()
         return self

     # nodes can be provided in multiple notations, determining the cluster topology:
@@ -420,6 +444,7 @@ def remove(self, node: ScyllaNode=None, wait_other_notice=False, other_nodes=Non
                 self.seeds.remove(node)
             self._update_config()
             node.stop(gently=False, wait_other_notice=wait_other_notice, other_nodes=other_nodes)
+            self._notify_topology_change()
         else:
             self.stop(gently=False, wait_other_notice=wait_other_notice, other_nodes=other_nodes)

@@ -700,7 +725,12 @@ def _update_config(self, install_dir=None):
             'log_level': self.__log_level,
             'use_vnodes': self.use_vnodes,
             'id': self.id,
-            'ipprefix': self.ipprefix
+            'ipprefix': self.ipprefix,
+            'monitoring_enabled': self.monitoring_enabled,
+            'monitoring_dir': self.monitoring_dir,
+            'grafana_port': self.grafana_port,
+            'prometheus_port': self.prometheus_port,
+            'alertmanager_port': self.alertmanager_port,
         }

         with open(filename, 'w') as f:
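
Note on `_notify_topology_change`: the guard added in this file only refreshes targets when monitoring is enabled, a stack object exists, and that stack reports it is running; a failed refresh is downgraded to a warning so that `add` and `remove` still succeed. The behavior can be exercised in isolation with the sketch below. `FakeStack` and `MiniCluster` are hypothetical stand-ins written for this note, not classes from ccmlib; only the body of `_notify_topology_change` mirrors the diff.

```python
class FakeStack:
    """Hypothetical stand-in for the monitoring stack object."""

    def __init__(self, running=True, fail=False):
        self.running = running
        self.fail = fail
        self.updates = 0  # number of successful target refreshes

    def is_running(self):
        return self.running

    def update_targets(self):
        if self.fail:
            raise RuntimeError("simulated refresh failure")
        self.updates += 1


class MiniCluster:
    """Hypothetical minimal cluster carrying only the monitoring fields."""

    def __init__(self, stack=None, enabled=False):
        self.monitoring_enabled = enabled
        self.monitoring_stack = stack
        self.warnings = []

    def warning(self, message):
        # Collect warnings instead of printing so they can be inspected.
        self.warnings.append(message)

    def _notify_topology_change(self):
        # Same guard and error handling as the method added in the diff.
        if self.monitoring_enabled and self.monitoring_stack and self.monitoring_stack.is_running():
            try:
                self.monitoring_stack.update_targets()
            except Exception as e:
                self.warning(f"Failed to update monitoring targets: {e}")
```

Calling `MiniCluster(FakeStack(), enabled=True)._notify_topology_change()` performs one refresh, while a disabled cluster, a missing stack, or a stopped stack performs none.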

ccmlib/cluster_factory.py

Lines changed: 11 additions & 0 deletions
@@ -55,6 +55,17 @@ def load(path, name):
             if 'ipprefix' in data:
                 cluster.ipprefix = data['ipprefix']

+            # Restore monitoring settings
+            if 'monitoring_enabled' in data:
+                cluster.monitoring_enabled = data['monitoring_enabled']
+            if 'monitoring_dir' in data:
+                cluster.monitoring_dir = data['monitoring_dir']
+            if 'grafana_port' in data:
+                cluster.grafana_port = data['grafana_port']
+            if 'prometheus_port' in data:
+                cluster.prometheus_port = data['prometheus_port']
+            if 'alertmanager_port' in data:
+                cluster.alertmanager_port = data['alertmanager_port']

         except KeyError as k:
             raise common.LoadError("Error Loading " + filename + ", missing property:" + str(k))
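
Note on the restore logic: each monitoring key is copied only when it is present in the saved config, so cluster configs written before this commit keep the defaults set in `Cluster.__init__`. One possible way to express the same behavior compactly is a loop over the key names, sketched below. The names `MONITORING_KEYS` and `restore_monitoring_settings` are hypothetical and not part of the commit, and `SimpleNamespace` stands in for a real cluster object.

```python
from types import SimpleNamespace

# The five settings persisted by _update_config in this commit.
MONITORING_KEYS = (
    "monitoring_enabled",
    "monitoring_dir",
    "grafana_port",
    "prometheus_port",
    "alertmanager_port",
)


def restore_monitoring_settings(cluster, data):
    """Copy monitoring settings from `data` onto `cluster` when present.

    Keys missing from `data` are skipped, which leaves the attribute at
    whatever default the cluster object already has.
    """
    for key in MONITORING_KEYS:
        if key in data:
            setattr(cluster, key, data[key])
    return cluster


# Usage: an older config without monitoring keys keeps the defaults.
defaults = SimpleNamespace(
    monitoring_enabled=False,
    monitoring_dir=None,
    grafana_port=3000,
    prometheus_port=9090,
    alertmanager_port=9093,
)
restored = restore_monitoring_settings(defaults, {"grafana_port": 3001})
```

The explicit `if` chain in the diff matches the style of the surrounding `load` code; the loop form is only a design alternative that avoids repeating each key name twice.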
