-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Today, as far as I know, if there is some problem with the script_exporter, and for any reason it reports that a lot, or even all, e2e tests to experiments are failing, then Locate will remove those services from possible selection in client queries. This has happened to us before, where script_exporter marked all ndt services as down when they were really not down, causing a global outage, since mlab-ns thought that none of the platform was healthy. To avoid this, in mlab-ns, we implemented a safety net where if monitoring says that more than 25% of the platform is down, then it ignores the monitoring data and just continues to use its cached service status data:
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L371
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L498
Since problems with script_exporter or GCP networking have happened before, I believe we should implement a similar safety check into Locate, such that when script_exporter reports more than a certain percentage of the platform as down, that Locate will ignore the signal from script_exporter until the percentage is back above the threshold. With Locate this should be even safer than with mlab-ns, since Locate still has the heartbeat signal to rely on whereas mlab-ns had/has only the monitoring signals.
Perhaps Locate could use logic like: if script_exporter thinks that more than 25% of all services are down and heartbeat says they are healthy, then ignore the script_exporter monitoring data?