Notes:
weights into the system to deal with heterogeneous capacity of the
servers. (This is harder, since the "weight" factor is admin-assigned,
and how exactly is it computed? 2x CPU? 4x memory? 6x network?)

Implementation

Initially, we will want a simple implementation. Passive health checks,
not active. Observe errors on existing traffic, mark unhealthy backends,
and skip them for some period of time. This will only apply to reverse
proxying, not forward proxying.

What types of errors should we watch for? How many errors before a backend
is unhealthy? How long to skip over unhealthy backends? What happens when
all backends are unhealthy?
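
One way these knobs could fit together, sketched in Python. The
`PassiveHealth` name, the default thresholds, and the fail-open choice
when every backend is unhealthy are all assumptions here, not decided
design:

```python
import time

class PassiveHealth:
    """Passive health tracking sketch: observe errors on live traffic,
    and skip a backend for cooldown_secs once it accumulates max_fails
    consecutive errors."""

    def __init__(self, max_fails=1, cooldown_secs=30.0):
        self.max_fails = max_fails
        self.cooldown_secs = cooldown_secs
        self.fails = {}        # backend -> consecutive error count
        self.skip_until = {}   # backend -> time when it may be retried

    def record_error(self, backend, now=None):
        now = time.monotonic() if now is None else now
        self.fails[backend] = self.fails.get(backend, 0) + 1
        if self.fails[backend] >= self.max_fails:
            self.skip_until[backend] = now + self.cooldown_secs

    def record_success(self, backend):
        # Any success clears the error count and the skip mark.
        self.fails.pop(backend, None)
        self.skip_until.pop(backend, None)

    def usable(self, backends, now=None):
        now = time.monotonic() if now is None else now
        up = [b for b in backends if self.skip_until.get(b, 0) <= now]
        # If every backend is marked unhealthy, fail open and try them
        # all rather than refusing service entirely.
        return up or list(backends)
```

Expiry here is lazy (checked at selection time), so no background timer
is needed; whether failing open is the right all-unhealthy behaviour is
exactly the open question above.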

Types of errors:
    TCP connect errors
    TLS handshake errors
    FTP connect/login errors (ignoring bad credentials!)

Specifically, we want to track errors that indicate that the server is
unavailable for service. Thus probably NOT TLS handshake errors, or
FTP data transfer errors. That leaves TCP connect errors, and FTP non-2xx
responses on connect.
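
That classification might look roughly like this; the Python stdlib
exception types stand in for whatever error types the proxy actually
surfaces, and the 530 check is one guess at how to ignore bad
credentials:

```python
import errno
import socket
import ssl
from ftplib import error_perm, error_temp

def counts_as_backend_failure(exc):
    """Sketch: does this observed error mean the backend is unavailable?"""
    # TLS handshake (and other TLS) errors do NOT mark the server down.
    if isinstance(exc, ssl.SSLError):
        return False
    # TCP connect refused / timed out: the server is unreachable.
    if isinstance(exc, (ConnectionRefusedError, socket.timeout, TimeoutError)):
        return True
    if isinstance(exc, OSError) and exc.errno in (errno.EHOSTUNREACH,
                                                  errno.ENETUNREACH):
        return True
    # FTP 4xx/5xx reply on connect/login -- but a 530 bad-credentials
    # reply means *our* config is wrong, not that the server is down.
    if isinstance(exc, (error_temp, error_perm)):
        return not str(exc).startswith("530")
    # Everything else (e.g. data transfer errors) is ignored.
    return False
```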

Consider also the case where a configured backend server is a DNS name/URL,
which resolves to multiple IP addresses/ports. These resolved addresses/ports
are not currently tracked in the SQLite database -- thus we currently have
no state for them persisted/shareable outside of that session process. If
we did persist these addresses in the db, how to clean them up? Consider
dynamic backends whose IPs change a lot over time; the cruft would build up
in the db. Hmm.
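
One possible answer to the cleanup question: persist resolved addresses
with a last_seen timestamp, refresh it on every resolution, and expire
rows not seen for some TTL, so cruft from dynamic backends ages out on
its own. The table name, schema, and TTL policy below are hypothetical:

```python
import sqlite3
import time

def upsert_resolved(db, backend, addr, now=None):
    # Refresh last_seen on every successful resolution of this address.
    now = time.time() if now is None else now
    db.execute(
        """INSERT INTO resolved_backend (backend, addr, last_seen)
           VALUES (?, ?, ?)
           ON CONFLICT (backend, addr)
           DO UPDATE SET last_seen = excluded.last_seen""",
        (backend, addr, now),
    )

def expire_stale(db, ttl_secs, now=None):
    # Drop addresses that DNS has not returned for ttl_secs.
    now = time.time() if now is None else now
    db.execute("DELETE FROM resolved_backend WHERE last_seen < ?",
               (now - ttl_secs,))

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE resolved_backend (
           backend TEXT NOT NULL,
           addr TEXT NOT NULL,
           last_seen REAL NOT NULL,
           PRIMARY KEY (backend, addr)
       )"""
)
```

Running expiry opportunistically (say, after each resolution) would avoid
needing a separate cleanup process, though the right TTL is still open.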

https://www.haproxy.com/blog/using-haproxy-as-an-api-gateway-part-3-health-checks/
https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/#passive-health-checks

> Note that if there is only a single server in a group, the fail_timeout
> and max_fails parameters are ignored and the server is never marked
> unavailable.

https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-health-check/#passive-tcp-health-checks

> If several health checks are configured for an upstream group, the failure
> of any check is enough to consider the corresponding server unhealthy.

https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-health-check/#fine-tuning-tcp-health-checks

Defaults: interval=5s, passes=1, fails=1