Replication error handling or alerting

__Proposal:__
Enable users to set up a reporting system that raises alerts when replication streams fail. The remote connection is, for example, between two InfluxDB OSS instances running in Docker.

__Current behavior:__
Replication stream errors are displayed in the InfluxDB logs:
E.g.: ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 error="invalid response code 422, must be 204" retries=8
Other options to monitor the replication results seem to be unavailable.
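
For illustration, the log line above can be split into fields with a few lines of Python. This is only a sketch of what a log-based workaround has to do today, assuming the logfmt-style `key=value` layout shown in the example; `parse_log_line` and `is_replication_error` are hypothetical helper names, not part of InfluxDB.

```python
import shlex


def parse_log_line(line: str) -> dict[str, str]:
    """Split a logfmt-style line into a dict of key/value strings."""
    fields = {}
    # shlex.split keeps quoted values such as msg="..." together as one token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def is_replication_error(fields: dict[str, str]) -> bool:
    """True when the line is an error emitted by the replications service."""
    return fields.get("service") == "replications" and fields.get("lvl") == "error"


# The example line from this issue.
EXAMPLE = (
    'ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" '
    'log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 '
    'error="invalid response code 422, must be 204" retries=8'
)

if __name__ == "__main__":
    parsed = parse_log_line(EXAMPLE)
    print(is_replication_error(parsed), parsed["replication_id"], parsed["error"])
```

Even with the fields extracted, the only detail available is the response code; the body of the 422 response is not in the log.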

__Desired behavior:__
The ability to set up an alerting system that monitors replication streams and sends a message when replication fails.

__Alternatives considered:__
- Monitor InfluxDB metrics via Prometheus: the only replication-related metric we found is "influxdb_replications_total".
- InfluxDB’s own alerting system: the current alerting system does not seem to be useful in this case, as checks can only be set up to evaluate existing measurements, not replication processes.
- _monitoring system bucket: stores no replication-related data.
- _tasks system bucket: stores only data related to task executions.
- Monitor logs (e.g. grok exporter, fluentd): with InfluxDB instances running in Docker, log monitoring seems to be unstable, and the replication-related error logs contain no detailed information about the error response.
E.g.: ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 error="invalid response code 422, must be 204" retries=8
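
The Prometheus check from the first alternative can be reproduced roughly as follows. This is a minimal sketch, assuming the instance serves Prometheus text-format metrics at `http://localhost:8086/metrics` (host and port are assumptions) and that samples carry no trailing timestamp; `fetch_metrics` and `replication_metrics` are hypothetical names, and the sample text in the usage part is made up for illustration, not real output.

```python
import urllib.request


def fetch_metrics(url: str = "http://localhost:8086/metrics") -> str:
    """Download the Prometheus text exposition from an InfluxDB instance (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def replication_metrics(exposition_text: str, prefix: str = "influxdb_replication") -> dict[str, float]:
    """Return every sample whose metric name starts with the given prefix."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # The value is the last space-separated token (no timestamps assumed).
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples


# Made-up sample text, only to show the filtering.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
go_goroutines 42
influxdb_replications_total 1
"""

if __name__ == "__main__":
    print(replication_metrics(SAMPLE))
```

A count of configured replications says nothing about whether a stream is failing, which is why this alternative falls short.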

__Use case:__
We run an InfluxDB 2 service (v2.4.0) in a Docker container, with a remote connection to another InfluxDB OSS instance. We would like to set up an alerting system that sends messages if replication fails. To use replication as safely as possible, we need to be able to monitor replication streams and to receive system alerts with detailed information about the failed processes and the related records.
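
As a rough picture of the alerting we have in mind, the sketch below scans log lines and calls a notification callback once a replication stream has retried a given number of times. It is a workaround sketch, not a proposed implementation: the regular expression assumes the exact field order of the error log line quoted in this issue, and `watch`, `notify`, and `min_retries` are hypothetical names.

```python
import re
from typing import Callable, Iterable

# Matches the tail of the replication error line quoted in this issue.
PATTERN = re.compile(
    r'service=replications replication_id=(?P<id>\S+) '
    r'error="(?P<error>[^"]*)" retries=(?P<retries>\d+)'
)


def watch(lines: Iterable[str], notify: Callable[[str], None], min_retries: int = 5) -> int:
    """Call notify() for each error line at or above min_retries; return the alert count."""
    alerts = 0
    for line in lines:
        if "lvl=error" not in line:
            continue
        match = PATTERN.search(line)
        if match and int(match["retries"]) >= min_retries:
            notify(
                f"replication {match['id']} failing after "
                f"{match['retries']} retries: {match['error']}"
            )
            alerts += 1
    return alerts


LINE = (
    'ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" '
    'log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 '
    'error="invalid response code 422, must be 204" retries=8'
)

if __name__ == "__main__":
    # print stands in for a real notifier (mail, webhook, and so on).
    watch([LINE], notify=print)
```

What such a script cannot provide is the part we are really asking for: the detailed error response and the records that failed to replicate.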

Replication error handling or alerting #24034

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Replication error handling or alerting #24034

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions