Description
Proposal:
Enable users to set up a reporting system that raises alerts for failing replication streams, for example when the remote connection is between two InfluxDB OSS instances running in Docker.
Current behavior:
Replication stream errors are displayed in the InfluxDB logs:
E.g.: ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 error="invalid response code 422, must be 204" retries=8
Other options to monitor the replication results seem to be unavailable.
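To illustrate what log-based monitoring has to work with today, here is a minimal sketch that splits the log line above into fields. It assumes the lines are plain key=value pairs with optional double quotes, as in the example. The helper names `parse_logfmt` and `is_replication_error` are hypothetical and not part of InfluxDB.

```python
import shlex

# The log line quoted above, copied verbatim.
LINE = (
    'ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" '
    'log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 '
    'error="invalid response code 422, must be 204" retries=8'
)


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict (hypothetical helper).

    shlex handles the double-quoted values; a line with unbalanced
    quotes is treated as unparseable and yields an empty dict.
    """
    try:
        tokens = shlex.split(line)
    except ValueError:
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def is_replication_error(fields: dict[str, str]) -> bool:
    """True for error-level entries emitted by the replications service."""
    return fields.get("lvl") == "error" and fields.get("service") == "replications"


if __name__ == "__main__":
    fields = parse_logfmt(LINE)
    if is_replication_error(fields):
        print(fields["replication_id"], fields["error"], fields["retries"])
```

Even when parsed, the only details available are the replication ID, the response code inside the error text, and the retry count. The rejected records and the remote's response body are not in the log.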
Desired behavior:
The ability to set up an alerting system that monitors replication streams and sends a message when replication fails.
Alternatives considered:
- Monitor InfluxDB metrics via Prometheus: the only replication-related metric we found is "influxdb_replications_total".
- InfluxDB’s own alerting system: the current alerting system does not seem to be useful in this case, as checks can only be set up to evaluate existing measurements, not replication processes.
- _monitoring system bucket: stores no replication-related data.
- _tasks system bucket: stores data related only to task executions.
- Monitor logs (e.g. grok exporter, fluentd): with InfluxDB instances running in Docker, log monitoring seems to be unstable, and the replication-related error logs contain no detailed information about the error response.
E.g.: ts=2022-11-28T11:03:54.182970Z lvl=error msg="Error in replication stream" log_id=0eRUKx5l000 service=replications replication_id=0a5b73c0c1da8000 error="invalid response code 422, must be 204" retries=8
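As a rough illustration of the Prometheus alternative, the sketch below filters replication-related samples out of metrics text in the Prometheus exposition format. It assumes the text has already been fetched from the instance's metrics endpoint and that the lines carry no trailing timestamps. The function name `replication_metrics` is hypothetical, and the values in `SAMPLE` are made up for the example, not real output.

```python
# Hypothetical sample in Prometheus text exposition format.
# Only the metric name influxdb_replications_total comes from our findings;
# the values are illustrative.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
influxdb_replications_total 1
"""


def replication_metrics(exposition_text: str,
                        prefix: str = "influxdb_replications") -> dict[str, float]:
    """Return samples whose metric name starts with `prefix`.

    Assumes each sample line is "<name>[{labels}] <value>" with no
    trailing timestamp, so the value is whatever follows the last space.
    """
    samples = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples


if __name__ == "__main__":
    for name, value in replication_metrics(SAMPLE).items():
        print(name, value)
```

With only this one metric available, the filter can show a count but gives no indication of whether a given replication stream is failing, which is why this alternative falls short.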
Use case:
We have an InfluxDB 2 service (v2.4.0) running in a Docker container, with a remote connection to another InfluxDB OSS instance. We would like to set up an alerting system that sends a message if replication fails. To use replication as safely as possible, we need the ability to monitor replication streams and to receive system alerts with detailed information about the failed processes and the related records.