This document is a step by step procedure for upgrading Scylla Monitoring Stack from version 3.x to 4.y, for example, between 3.9 to 4.0.0.
We recommend installing the new release next to the old one. You can run both monitoring stacks in parallel, ensuring it is working as expected before uninstalling the old version.
Change to the directory you want to install the new Monitoring stack. Download the latest release in the .zip or .tar format.
wget -L https://github.com/scylladb/scylla-monitoring/archive/scylla-monitoring-4.y.zip
unzip scylla-monitoring-4.y.zip
cd scylla-monitoring-scylla-monitoring-4.y/
Replace “y” with the new minor and patch release number, for example, 4.0.0.zip
Copy the scylla_servers.yml
and scylla_manager_servers.yml
from the version that is already installed.
cp /path/to/monitoring/3.x/prometheus/scylla_servers.yml prometheus/
cp /path/to/monitoring/3.x/prometheus/scylla_manager_servers.yml.yml prometheus/
Run the following command to validate the Scylla-Monitoring version:
./start-all.sh --version
This section is optional. It shows you how to run two monitoring stacks side by side. You can skip this section entirely and move to switching to the new version section.
We need to use different ports to run two monitoring stacks in parallel (i.e., the older 3.x version and the new 4.x stack).
./start-all.sh -p 9091 -g 3001 -m 9095
Browse to http://{ip}:9091
and check the Grafana dashboard.
Note that we are using different port numbers for Grafana, Prometheus, and the Alertmanager.
Caution!
Important: Do not use the local dir flag when testing!
When you are satisfied with the data in the dashboard, you can shut down the containers.
Caution!
Important: Do not kill the 3.x version that is currently running!
Use the following command to kill the containers:
./kill-all.sh -p 9091 -g 3001 -m 9095
You can start and stop the new 4.y version while testing.
Note: Migrating will cause a few seconds of blackout in the system.
We assume that you are using external volume to store the metrics data.
We suggest to make a copy of Prometheus' external directory and use it as the data directory for the new monitoring stack. The new monitoring stack uses newer versions of Prometheus and keeping a backup would enable you to rollback to the previous version of Prometheus.
At this point you have two monitoring stacks installed with the older version running.
If you run the new version in testing mode kill it by following the instructions on how to Killing the new 4.y Monitoring stack in testing mode in the previous section.
Kill the older 3.x version containers by running:
./kill-all.sh
Start version 4.y in normal mode
From the new root of the scylla-monitoring-scylla-monitoring-4.y run
./start-all.sh -d /path/to/copy/data/dir
Point your browser to http://{ip}:3000
and see that the data is there.
To rollback during the testing mode, follow Killing the new 4.y Monitoring stack in testing mode as explained previously and the system will continue to operate normally.
To rollback to version 3.x, run the following commands after you have completed moving to version 4.y (as shown above):
./kill-all.sh
cd /path/to/scylla-grafana-3.x/
./start-all.sh -d /path/to/original/data/dir
Starting from Scylla Monitoring version 3.8, Scylla Monitoring uses Prometheus's recording rules for performance reasons. Recording rules perform some of the calculations when collecting the metrics instead of when showing the dashboards.
For example, this is a recording rule that calculates the p99 write latency:
- record: wlatencyp99
expr: histogram_quantile(0.99, sum(rate(scylla_storage_proxy_coordinator_write_latency_bucket{}[60s])) by (cluster, dc, instance, shard, scheduling_group_name, le))
labels:
by: "instance,shard"
For a transition period, Scylla Monitoring version 3.x has a fall-back mechanism that if those recording rules are not present data will still be shown.
Scylla Monitoring versions 4.0 and newer will rely only on recording rules.
Note
If you upgrade from a version older than 3.8 without back-filling, latency historical data will not be shown.
The following instructions are only relevant if you are upgrading from a version older than 3.8 to version 4.0 or higher, or if you are looking at historical data collected before you upgraded to 3.8.
For example, you keep your data (retention period) for a year, and you upgraded to version 3.8 three months ago.
In this example you have recording rules data only for the last three months, to be able to look at older latency information you will need to back-fill that missing period.
The following instructions are based on the recording rules backfilling section in the Prometheus documentation.
When you run the backfilling process you need to determine the start time and end time.
The start time is your Prometheus retention time, by default it is set to 15 days. If you are not sure what Prometheus retention time is, you can check by logging in to your Prometheus server: http://{ip}:9090/status.
If you are running Scylla Monitoring version 3.8 or newer for longer than the retention period, you are done! You can skip the rest of this section.
For the rest of this example, we will assume that your retention time is 360 days.
Typically, you need to back-fill the recording rules when you are using a long retention period, for example, you have a year of retention data, and you upgraded to Scylla Monitoring 3.8 about three months ago.
If you open the Overview dashboard and look at your entire retention time (in our example 1 year) you will see that while most of the graphs do display the data, the latency graphs are missing a period of time. The latency graph will only show the last three months in our example from the entire year. That nine months gap (12 months minus 3) is what we want to fill with back-filling.
The point in time that the graphs start will be your back-filling end time. Check in the graph for the exact time.
Backup the external directory containing Prometheus data; if something goes wrong, you can revert the changes.
To complete the process, you must restart Monitoring Stack at least once. You cannot complete the process without providing the path to the external directory with Prometheus data using the -d
command line option.
You need to stop the monitoring stack and run the start-all.sh
command with an additional flag (The 365d retention time used here is used only as an example):
start-all.sh -d data_dir -b "--storage.tsdb.allow-overlapping-blocks --storage.tsdb.retention.time=365d"
We will create the data files using the Promtool utility, which has been installed in the Docker container. To run the utility, you must pass the start time and end time in the epoch format. The following example shows one of the ways to convert the times to epoch when the start time is 360 and the end time is 90 days ago:
echo $((`date +%s` - 3600*24*360))
echo $((`date +%s` - 3600*24*90))
Log in to your Docker container and run the following (start
and end
should be in epoch format):
docker exec -it aprom sh
cd /prometheus/data/
promtool tsdb create-blocks-from rules \
--start $start \
--end $end \
--url http://localhost:9090 \
/etc/prometheus/prom_rules/back_fill/3.8/rules.1.yml
The previous bash script will create a data
directory in the directory where it is executed. The reason to run the bash script under the /prometheus/data/
is to ensure Prometheus has write privileges to the directory.
You should not start this section until all the previous sections have been completed. Copy the data files to the Prometheus directory using the following command:
cp data/* .
The rules will be evaluated next time Prometheus will perform compaction. You can force it by restarting the server using docker restart aprom
Follow the logs docker logs aprom
to see that the process works as expected. If there are no errors, you should now be able to
see the latency graphs over your entire retention time.