
Update upgrade-guide-from-monitoring-3.x-to-monitoring-4.y.rst #1773

Open · wants to merge 11 commits into master
@@ -36,13 +36,13 @@ Copy the ``scylla_servers.yml`` and ``scylla_manager_servers.yml`` from the vers
Validate the new version is running the correct version
-------------------------------------------------------

run:
Run the following command to validate the Scylla-Monitoring version:

.. code-block:: bash

./start-all.sh --version

To validate the Scylla-Monitoring version.


Running in test mode
====================
@@ -66,7 +66,7 @@ Note that we are using different port numbers for Grafana, Prometheus, and the A

.. caution::

Important: do not use the local dir flag when testing!
Important: Do not use the local dir flag when testing!

When you are satisfied with the data in the dashboard, you can shut down the containers.
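
For reference, a side-by-side test run could look roughly like the sketch below. It assumes the ``-g``, ``-p``, and ``-m`` flags of ``start-all.sh`` select the Grafana, Prometheus, and Alertmanager ports, as in recent Scylla Monitoring releases; the port numbers themselves are arbitrary examples, so use whatever free ports you prefer.

.. code-block:: bash

   # Start the new 4.y stack next to the running 3.x stack, on non-default
   # ports so the two stacks do not collide. Do not point the test stack at
   # the production data directory (see the caution above).
   ./start-all.sh -g 3001 -p 9091 -m 9095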

@@ -91,16 +91,15 @@ Migrating
Move to version 4.y (the new version)
-------------------------------------

Note: migrating will cause a few seconds of blackout in the system.
Note: Migrating will cause a few seconds of blackout in the system.

We assume that you are using an external volume to store the metrics data.


Backup
^^^^^^

We suggest to copy the Prometheus external directory first and use the copy as the data directory for the new monitoring stack.
Newer Monitoring stack uses newer Promethues versions, and keeping a backup of the prometheus dir would allow you to rollback.
We suggest making a copy of the Prometheus external directory and using the copy as the data directory for the new monitoring stack. The new monitoring stack uses a newer Prometheus version, and keeping a backup enables you to roll back to the previous version of Prometheus.
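
A minimal sketch of such a backup, assuming hypothetical paths (replace them with the location of your external Prometheus directory):

.. code-block:: bash

   # Copy the external Prometheus data directory, preserving ownership and
   # permissions; the copy is what you later pass to the new stack with -d.
   sudo cp -a /path/to/prometheus_data /path/to/prometheus_data_copy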

Kill all containers
^^^^^^^^^^^^^^^^^^^
@@ -110,7 +109,7 @@ At this point you have two monitoring stacks installed with the older version ru
If you ran the new version in testing mode, kill it by following the instructions in `Killing the new 4.y Monitoring stack in testing mode`_
in the previous section.

kill the older 3.x version containers by running:
Kill the older 3.x version containers by running:

.. code-block:: bash

@@ -135,7 +134,7 @@ Rollback to version 3.x
To roll back during testing mode, follow `Killing the new 4.y Monitoring stack in testing mode`_ as explained previously,
and the system will continue to operate normally.

To rollback to version 3.x after you completed moving to version 4.y (as shown above), run:
To roll back to version 3.x after you have completed moving to version 4.y (as shown above), run:
Collaborator: There are two options for rollback, depending on where you are in the process; this one is for after you have completed the upgrade. The new version doesn't say what it should say.

Contributor: @lujogre Here you need to restore the previous version because the meaning was changed (and I misunderstood the instructions after the change).

Author: Hello @annastuchlik, thanks for the clarification.

Collaborator: This is still wrong: you can roll back at any time, but the way you roll back is different if you have completed moving to 4.0 or if you are still testing. This is the case after you have completed the move.


.. code-block:: bash

@@ -189,7 +188,7 @@ When you run the backfilling process you need to determine the start time and en

Determine the start time
^^^^^^^^^^^^^^^^^^^^^^^^
The start time is your Prometheus retention time, by default it is set to 15 days. if you are not sure what Prometheus retention time is, you can check by
The start time is your Prometheus retention time; by default, it is set to 15 days. If you are not sure what your Prometheus retention time is, you can check it by
logging in to your Prometheus server: `http://{ip}:9090/status`.
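
If you prefer the command line, the retention time can also be read from the standard Prometheus flags API; a sketch, assuming ``curl`` and ``jq`` are available and ``{ip}`` is your Prometheus server:

.. code-block:: bash

   # Print the configured retention time (for example, "15d" or "365d").
   curl -s http://{ip}:9090/api/v1/status/flags | jq -r '.data."storage.tsdb.retention.time"'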

If you are running Scylla Monitoring version 3.8 or newer for longer than the retention period, you are done! You can skip the rest of this section.
@@ -202,37 +201,34 @@ Typically, you need to back-fill the recording rules when you are using a long r
and you upgraded to Scylla Monitoring 3.8 about three months ago.

If you open the Overview dashboard and look at your entire retention time (in our example 1 year) you will see that while most of the graphs do
show the data, the latency graphs have a missing period, in our example - from the entire year, the latency graph will only show the last three months.

That nine months gap (12 months minus 3) is what we want to fill with back-filling.
display the data, the latency graphs are missing a period of time. In our example, from the entire year, the latency graph will show only the last three months. That nine-month gap (12 months minus 3) is what we want to fill with back-filling.
Collaborator: I find this confusing; all those durations are a real-world example, and it's not always going to be that.

Contributor: @amnonh I would leave it as is, because the entire "Determine the end time" section is an example - and it is emphasized in the first paragraph of that section ("for example, you have a year of retention data") and in the following ones ("in our example 1 year"). For sure, it's clearer than the previous version. If we want to make it clearer, reorganizing the content would help:

Typically, you need to back-fill the recording rules when using a long retention period.

Example: You have a year of retention data, and you upgraded to ScyllaDB Monitoring 3.8 about three months ago. If you open the Overview dashboard and look at your entire retention time (one year), you will see that while most of the graphs do display the data, the latency graphs are missing a period of time. The latency graph will only show the last three months from the entire year. That nine-month gap (12 months minus 3) is what we want to fill with back-filling.


The point in time at which the graphs start will be your back-filling end time. Check the graph for the exact time.

Backfilling Process
-------------------
backup
Backup
^^^^^^
If you have a long retention period you are using an external directory that holds the Prometheus data, back it up, in case
If you have a long retention period, you are using an external directory that holds the Prometheus data back it up; if something goes wrong in the process, you can revert the process.
Back up the external directory containing the Prometheus data; if something goes wrong, you can revert the changes.

To complete the process, you must restart the Monitoring Stack at least once. You cannot complete the process without providing the path to the external directory with the Prometheus data using the ``-d`` command-line option.
Collaborator: You need to use an external data directory; the ``-d`` option is the way to do it.

Contributor: @amnonh I think this is what the line says. How would you rephrase it?

Collaborator: You cannot complete the process without using the Prometheus external directory. You can set the Prometheus external directory with the ``-d`` command-line option.


To complete the process you will need to restart the monitoring stack at least once. If you are not using an external directory (The ``-d``
command-line option) You cannot complete it.

Restart the monitoring stack
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to stop the monitoring stack and run the ``stat-all.sh`` command with an additional flag:
You need to stop the monitoring stack and run the ``start-all.sh`` command with an additional flag, ``--storage.tsdb.allow-overlapping-blocks``. In the following command, the 365d retention time is used only as an example:
Collaborator: The additional flag is ``--storage.tsdb.allow-overlapping-blocks``; it should be stated clearly.

Contributor: Let's add the flag name: "You need to stop the monitoring stack and run the start-all.sh command with an additional flag, --storage.tsdb.allow-overlapping-blocks. In the following command, the 365d retention time is used as an example:"


``-b "--storage.tsdb.allow-overlapping-blocks"``
``start-all.sh -d data_dir -b "--storage.tsdb.allow-overlapping-blocks --storage.tsdb.retention.time=365d"``
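
Put together, the restart could look roughly like the following sketch. It assumes ``kill-all.sh`` is the stop script shipped next to ``start-all.sh`` and that ``data_dir`` is your external Prometheus directory; the 365d retention time is only an example.

.. code-block:: bash

   # Stop the running stack, then start it again with overlapping
   # blocks allowed so back-filled data can be ingested.
   ./kill-all.sh
   ./start-all.sh -d data_dir \
       -b "--storage.tsdb.allow-overlapping-blocks --storage.tsdb.retention.time=365d"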

Create the data files
^^^^^^^^^^^^^^^^^^^^^^^^^
We will use the Promtool utility; it's already installed for you if you are using the docker container.
You will need the start time and end time for the process, in our example the start time is 360 days ago and the end time is 90 days ago.
We will create the data files using the Promtool utility, which is installed in the Prometheus Docker container. To run the utility, you must pass the start time and end time in the epoch format. The following example shows one of the ways to convert the times to epoch when the start time is 360 days ago and the end time is 90 days ago:
Collaborator: I would make it clearer; it's installed in the Prometheus docker container.

Contributor: "We will create the data files using the Promtool utility, which is installed in the Prometheus Docker container." - Is that what you mean?

Author: Hello @annastuchlik / @amnonh, the paragraph could be: "We will create the data files using the Promtool utility, which is installed in the Prometheus Docker container. To run the utility, you must pass the start time and end time in the epoch format. The following example shows one of the ways to convert the times to epoch when the start time is 360 and the end time is 90 days ago:"

Contributor: Sounds good to me.

Author: Hello @annastuchlik, great, thanks for the feedback.


The start and end times are in epoc, so you will need to translate the times to epoc. There are many ways to do this - for example, from the command line.
Run the following command to get the epoc time for 90 days ago: : ``echo $((`date +%s` - 3600*24*90))``
``echo $((`date +%s` - 3600*24*360))``

Log in to your docker container and run the following (``start`` and ``end`` should be the start and end in epoc time):
``echo $((`date +%s` - 3600*24*90))``

Log in to your Docker container and run the following (``start`` and ``end`` should be in epoch format):

.. code-block:: bash

@@ -244,20 +240,15 @@ Log in to your docker container and run the following (``start`` and ``end`` sho
--url http://localhost:9090 \
/etc/prometheus/prom_rules/back_fill/3.8/rules.1.yml

It will create a ``data`` directory in the directory where you run it.
The reason to run it under the ``/prometheus/data/`` is you can be sure Prometheus has write privileges there.
The previous command will create a ``data`` directory in the directory where it is executed. The reason to run the command under ``/prometheus/data/`` is to ensure Prometheus has write privileges to the directory.
Collaborator: This is not a bash script.


.. note::
Depending on the time range and the number of cores, the process can take a long time. During testing it took an hour for every week of data,
for a cluster with a total of 100 cores. Make sure that the creation process is not inerupt. You can split the time range to smaller durations
(e.g. instead of an entire year, do it a weeks at a time).
.. note::
   This process may take a long time, depending on the time range and the number of cores. For instance, for a cluster with a total of 100 cores, the process took an hour for every week of data during testing. Make sure that the creation process is not interrupted. Note that the time range can be split into smaller intervals (e.g., instead of an entire year, break it down into weeks).
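
For orientation, the truncated command in the block above most likely follows the standard ``promtool tsdb create-blocks-from rules`` form. The sketch below is an assumption based on stock Promtool, not a copy of the elided text; ``<start>`` and ``<end>`` are the epoch times computed earlier.

.. code-block:: bash

   # Likely shape of the full invocation (run from /prometheus/data inside
   # the Prometheus container, as described above).
   promtool tsdb create-blocks-from rules \
       --start <start> --end <end> \
       --url http://localhost:9090 \
       /etc/prometheus/prom_rules/back_fill/3.8/rules.1.yml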


Copy the data files
^^^^^^^^^^^^^^^^^^^
Make sure that the process is completed successfully - don't start this section before you complete the previous sections.

Copy the data files to the Prometheus directory:
You should not start this section until all the previous sections have been completed. Copy the data files to the Prometheus directory using the following command:

.. code-block:: bash
