This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Commit 8333c33

better docs
1 parent 8512666 commit 8333c33

3 files changed

+42
-15
lines changed

docs/cassandra.md

+9
@@ -61,6 +61,15 @@ If you need to run Cassandra 2.2, the backported [TimeWindowCompactionStrategy](
6161
See [issue cassandra-9666](https://issues.apache.org/jira/browse/CASSANDRA-9666) for more information.
6262
You may also need to lower the cql-protocol-version value in the config to 3 or 2.
6363

64+
65+
## Data persistence
66+
67+
Saving of chunks is initiated whenever the current time reaches a timestamp that divides without remainder by a chunkspan.
68+
Raw data has a certain chunkspan, and aggregated (rollup) data has chunkspans too (see [config](https://github.com/raintank/metrictank/blob/master/docs/config.md#data)), which is
69+
why periodically, e.g. on the hour and on every 6th hour, you'll see a burst of chunks being added to the write queue.
70+
The write queue is then gradually drained by the persistence workers.
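The boundary rule above can be illustrated with a minimal Go sketch. This is not metrictank's actual code: the function names and the chunkspan values (10 minutes for raw data, 1 hour and 6 hours for rollups) are assumptions chosen for illustration.

```go
package main

import "fmt"

// isChunkBoundary reports whether ts (unix seconds) divides without
// remainder by span (seconds), i.e. a chunk of that span ends at ts.
func isChunkBoundary(ts, span uint32) bool {
	return ts%span == 0
}

// dueSpans returns the chunkspans whose boundary is reached at ts.
func dueSpans(ts uint32, spans []uint32) []uint32 {
	var due []uint32
	for _, s := range spans {
		if isChunkBoundary(ts, s) {
			due = append(due, s)
		}
	}
	return due
}

func main() {
	// Hypothetical chunkspans: 10min raw, 1h and 6h rollups.
	spans := []uint32{600, 3600, 21600}
	for _, ts := range []uint32{1470010200, 1470013200, 1470009600} {
		fmt.Println(ts, dueSpans(ts, spans))
	}
}
```

With these assumed spans, a timestamp on a 6-hour boundary is also on the 1-hour and 10-minute boundaries, which is why the bursts are largest there.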
71+
72+
6473
## Write queues
6574

6675
Tuning the write queue is a bit tricky for now.

docs/clustering.md

+31-13
@@ -1,32 +1,50 @@
11
# Clustering
22

33
## Underlying storage
4-
for clustering and HA of the underlying storage, we of course can simply rely on Cassandra.
4+
For clustering and HA of the underlying storage, we can simply rely on Cassandra. It has built-in replication and data partitioning to ensure both HA and load balancing.
55

66
## Metrictank redundancy and fault tolerance
77

8-
for metrictank itself you can achieve redundancy and fault tolerance by running multiple instances.
9-
One of them needs to have the primary role, and the role can dynamically be reassigned (see http api docs)
10-
There's 2 transports for clustering (kafka and NSQ), and it's just used to send messages by the primary to the others,
11-
about which chunks have been saved to cassandra.
8+
For metrictank itself you can achieve redundancy and fault tolerance by running multiple instances which receive identical inputs.
9+
One of the instances needs to have the primary role, which means it saves data chunks to Cassandra. The other instances are secondaries.
10+
11+
Configuration of primary vs secondary:
12+
13+
* statically in the [cluster section of the config](https://github.com/raintank/metrictank/blob/master/docs/config.md#clustering) for each instance.
14+
* dynamically (see [http api docs](https://github.com/raintank/metrictank/blob/master/docs/http-api.md)) should your primary crash or you want to shut it down.
15+
16+
### Clustering transport and synchronisation
17+
18+
The primary sends out persistence messages when it saves chunks to Cassandra. These messages simply detail which chunks have been saved.
19+
If you want to be able to promote secondaries to primaries, it's important they have been ingesting and processing these messages, so that the moment they become primary,
20+
they don't start saving all the chunks they have in memory, which could put a significant sudden load on Cassandra.
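To make the role of persistence messages concrete, here is a minimal Go sketch of how a secondary might keep track of what the primary has already saved. All names here (`chunkID`, `tracker`, `handlePersistMsg`) are hypothetical and not taken from metrictank's code; the real message format and bookkeeping may differ.

```go
package main

import "fmt"

// chunkID identifies a chunk by metric key and chunk start time (unix seconds).
// The type and its fields are hypothetical.
type chunkID struct {
	Key string
	T0  uint32
}

// tracker records which in-memory chunks the primary has already persisted.
type tracker struct {
	inMemory []chunkID
	saved    map[chunkID]bool
}

func newTracker(chunks []chunkID) *tracker {
	return &tracker{inMemory: chunks, saved: make(map[chunkID]bool)}
}

// handlePersistMsg is called for each persistence message received
// over the clustering transport.
func (t *tracker) handlePersistMsg(id chunkID) {
	t.saved[id] = true
}

// unsaved lists the chunks this instance would still have to write
// to Cassandra if it were promoted to primary right now.
func (t *tracker) unsaved() []chunkID {
	var out []chunkID
	for _, id := range t.inMemory {
		if !t.saved[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	t := newTracker([]chunkID{{"a", 3600}, {"a", 7200}, {"b", 3600}})
	t.handlePersistMsg(chunkID{"a", 3600})
	t.handlePersistMsg(chunkID{"b", 3600})
	fmt.Println(t.unsaved())
}
```

A secondary that never processed these messages would report every in-memory chunk as unsaved, which is the sudden Cassandra load described above.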
21+
22+
Metrictank supports 2 transports for clustering (kafka and NSQ), configured in the [clustering transports section in the config](https://github.com/raintank/metrictank/blob/master/docs/config.md#clustering-transports).
23+
1224
Instances should not become primary when they have incomplete chunks (though in worst case scenario, you might
1325
have to do just that). So they expose metrics that describe when they are ready to be upgraded.
1426
Notes:
1527
* all instances receive the same data and have a full copy in this case.
16-
* primary role controls who writes chunks to cassandra. metadata is updated by all instances independently.
28+
* primary role controls who writes *data* chunks to cassandra. *metadata* is updated by all instances independently and redundantly, see [metadata](https://github.com/raintank/metrictank/blob/master/docs/metadata.md).
29+
30+
### Promoting a secondary to primary
31+
32+
If the primary crashed, or you want to take it down for maintenance, you will need to promote a secondary instance (the "candidate") to primary status.
33+
This procedure needs to be done carefully:
1734

18-
when one primary is down you need to be careful about when to promote a secondary to primary:
35+
1) ensure there is no primary running. Demote the primary to secondary if it's running. Make sure all the persistence messages made it through the clustering transport into the
36+
candidate. Synchronisation over the transport is usually near-instant, so waiting a few seconds is usually fine. But if you want to be careful, ensure that the primary stopped sending persistence messages (via the dashboard), and verify that the candidate caught up with the transport (by monitoring the consumption delay from the transport).
1937

20-
* after you see the "starting data consumption" log message for a primary, data consomuption starts. this timestamp is important.
21-
* look at your largest chunkSpan. secondary can only be promoted when a new interval starts for the largest chunkSpan. intervals start when clock unix timestamp divides without remainder by chunkSpan. How long you should wait is also shown (in seconds) via the `cluster.promotion_wait` metric.
22-
* of course there are other factors: any running primary should be depromoted and have saved its data to cassandra, all metricPersist message should have made it through NSQ into the about-to-be-promoted instance.
38+
2) pick a node that has `cluster.promotion_wait` at zero. This means the instance has been consuming data long enough to have full data for all the recent chunks that will have to be saved, assuming the primary was able to persist its older chunks to Cassandra. If that's not the case, just pick the instance that has been consuming data the longest or has a full working set in RAM (e.g. has been consuming for longer than [`chunkspan * numchunks`](https://github.com/raintank/metrictank/blob/master/docs/config.md#data)). (note: when a metrictank process starts, it first does a few maintenance tasks before data consumption starts, such as filling its index when needed)
39+
The `cluster.promotion_wait` value is automatically determined from the largest chunkSpan (typically your coarsest aggregation) and the current time, from which an instance can derive when it's ready.
2340

41+
3) open the Grafana dashboard and verify that the secondary is able to save chunks
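As a rough illustration of how a value like `cluster.promotion_wait` could be derived from the largest chunkSpan and the current time, consider the Go sketch below. It assumes an instance is ready once the first boundary of the largest chunkSpan after the start of data consumption has passed. The function name, its parameters, and this exact rule are assumptions for illustration, not metrictank's actual implementation.

```go
package main

import "fmt"

// promotionWait returns the number of seconds until an instance that started
// consuming data at start (unix seconds) has seen a full chunk of the largest
// chunkspan, evaluated at time now. It returns 0 once that point has passed.
// Hypothetical helper, not part of metrictank.
func promotionWait(start, now, maxChunkSpan uint32) uint32 {
	// First boundary of the largest chunkspan strictly after consumption started.
	readyAt := (start/maxChunkSpan + 1) * maxChunkSpan
	if now >= readyAt {
		return 0
	}
	return readyAt - now
}

func main() {
	const maxChunkSpan = 21600 // assumed: 6h chunks for the coarsest rollup
	start := uint32(1470010000)
	for _, now := range []uint32{1470010000, 1470020000, 1470031200} {
		fmt.Println(now, promotionWait(start, now, maxChunkSpan))
	}
}
```

In this sketch the wait only ever counts down to the next boundary, so an instance that started just after a boundary waits almost a full chunkSpan, while one that started just before it becomes eligible quickly.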
2442

25-
## Horizontal scaling
43+
## Metrictank: Horizontal scaling
2644

2745
As for load balancing / partitioning / scaling horizontally, metrictank has no mechanism built-in to make this easier.
2846
You should be able to run multiple instances and route a subset of the traffic to each, by using proper
2947
partitioning in kafka. Instances could either use the same metadata index and cassandra cluster, or different ones, should all work.
30-
However we have not tried this yet, simply because we haven't needed this yet: single instances can grow quite large.
31-
This will be an area of future work, and input is welcomed.
48+
The main problem is combining data together to serve read requests fully. This is a [work in progress](https://github.com/raintank/metrictank/issues/315).
49+
Any input is welcomed.
3250

docs/operations.md

+2-2
@@ -4,8 +4,8 @@
44

55
You should monitor the dependencies according to their best practices.
66
In particular, pay attention to delays in your kafka queue, if you use it.
7-
Especially for metric persistence messages: if those have issues, chunks may be saved multiple times
8-
when you move around the primary role.
7+
Especially for metric persistence messages which flow from primary to secondary nodes: if those have issues, chunks may be saved multiple times
8+
when you move around the primary role. (see [clustering transport](https://github.com/raintank/metrictank/blob/master/docs/clustering.md))
99

1010
Metrictank uses statsd to report metrics about itself. See [the list of documented metrics](https://github.com/raintank/metrictank/blob/master/docs/metrics.md)
1111
