Ignore Prometheus metrics for service status when too many are flagged as down

Today, as far as I know, if something goes wrong with script_exporter and it reports that many, or even all, e2e tests against experiments are failing, Locate will remove those services from selection in client queries. This has happened to us before: script_exporter marked all ndt services as down when they were not, which caused a global outage because mlab-ns believed that none of the platform was healthy. To avoid this, we implemented a safety net in mlab-ns: if monitoring says that more than 25% of the platform is down, mlab-ns ignores the monitoring data and continues to use its cached service status data:

https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L371
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L498

Since problems with script_exporter or GCP networking have happened before, I believe we should implement a similar safety check in Locate: when script_exporter reports more than a certain percentage of the platform as down, Locate ignores the signal from script_exporter until the reported-down percentage drops back below the threshold. With Locate this should be even safer than with mlab-ns, since Locate still has the heartbeat signal to rely on, whereas mlab-ns had/has _only_ the monitoring signals.

Perhaps Locate could use logic like: if script_exporter thinks that more than 25% of all services are down _and_ heartbeat says they are healthy, then ignore the script_exporter monitoring data?
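
To make the proposal concrete, here is a minimal sketch of that check in Python. It is not Locate code: `ServiceStatus`, `monitoring_is_trustworthy`, `selectable`, and `MAX_DOWN_FRACTION` are hypothetical names, and the 25% default is only borrowed from the mlab-ns safety net. It reads the rule as "the share of all services that monitoring flags as down while heartbeat reports them healthy exceeds the threshold", which is one possible interpretation.

```python
from dataclasses import dataclass

# Assumed threshold, mirroring the 25% figure used by the mlab-ns safety net.
MAX_DOWN_FRACTION = 0.25


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical per-service view of the two signals; not a Locate type."""
    heartbeat_healthy: bool  # what the heartbeat signal says
    monitoring_up: bool      # what script_exporter (via Prometheus) says


def monitoring_is_trustworthy(statuses, max_down_fraction=MAX_DOWN_FRACTION):
    """Return False when monitoring contradicts heartbeat for too many services."""
    if not statuses:
        return True  # nothing to contradict
    contradicted = sum(
        1 for s in statuses.values()
        if s.heartbeat_healthy and not s.monitoring_up
    )
    return contradicted / len(statuses) <= max_down_fraction


def selectable(statuses, max_down_fraction=MAX_DOWN_FRACTION):
    """Names of services eligible for client queries, sorted for stable output."""
    trust = monitoring_is_trustworthy(statuses, max_down_fraction)
    return sorted(
        name for name, s in statuses.items()
        # Heartbeat always applies; monitoring applies only while it is trusted.
        if s.heartbeat_healthy and (s.monitoring_up or not trust)
    )
```

With this shape, a single service failing its e2e test is still removed from selection, while a platform-wide "everything is down" report from monitoring is ignored and selection falls back to heartbeat alone.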
