Commit d7b9899

Adding notes to myself for future healthcheck implementation.

1 parent 1e3f60b

1 file changed: doc/NOTES.health-checks (+45 -0 lines)
@@ -60,3 +60,48 @@ Notes:
weights into the system to deal with heterogeneous capacity of the
servers. (This is harder, since the "weight" factor is admin-assigned,
and how exactly is it computed? 2x CPU? 4x memory? 6x network?)

Implementation

Initially, we will want a simple implementation: passive health checks,
not active. Observe errors on existing traffic, mark unhealthy backends,
and skip them for some period of time. This will only apply to reverse
proxying, not forward proxying.

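The passive scheme above could be sketched roughly like this. This is only the bookkeeping, not real proxy code; the names (BackendHealth, RETRY_AFTER) and the 30-second skip period are invented here, not decisions from these notes.

```python
import time

RETRY_AFTER = 30.0  # seconds to skip an unhealthy backend (assumed tunable)

class BackendHealth:
    """Tracks passive per-backend health; no active probing."""

    def __init__(self):
        self._unhealthy_until = {}  # backend -> monotonic deadline

    def record_error(self, backend):
        # Observed an error on existing traffic: mark unhealthy and
        # skip this backend for RETRY_AFTER seconds.
        self._unhealthy_until[backend] = time.monotonic() + RETRY_AFTER

    def record_success(self, backend):
        # Any success clears the unhealthy mark immediately.
        self._unhealthy_until.pop(backend, None)

    def is_healthy(self, backend):
        deadline = self._unhealthy_until.get(backend)
        if deadline is None:
            return True
        if time.monotonic() >= deadline:
            # Skip period elapsed; let traffic try the backend again.
            del self._unhealthy_until[backend]
            return True
        return False
```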
What types of errors should we watch for? How many errors before a backend
is unhealthy? How long to skip over unhealthy backends? What happens when
all backends are unhealthy?

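One way to frame the open questions above is as a handful of tunables plus a fail-open rule for the all-unhealthy case. All names and default values here are assumptions for illustration, not answers the notes commit to.

```python
from dataclasses import dataclass, field

@dataclass
class PassiveCheckConfig:
    max_errors: int = 3         # errors before a backend is marked unhealthy
    skip_seconds: float = 30.0  # how long to skip an unhealthy backend
    fail_open: bool = True      # if all backends are unhealthy, try one anyway

def pick_backend(backends, unhealthy, cfg=PassiveCheckConfig()):
    """Prefer healthy backends; fail open (or refuse) when none are healthy."""
    healthy = [b for b in backends if b not in unhealthy]
    if healthy:
        return healthy[0]
    if cfg.fail_open:
        # All marked unhealthy: better to try one than to refuse all traffic.
        return backends[0]
    raise RuntimeError("no healthy backends")
```

The fail-open choice answers "what happens when all backends are unhealthy?" one particular way; refusing traffic instead is equally defensible.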
Types of errors:
  TCP connect errors
  TLS handshake errors
  FTP connect/login errors (ignoring bad credentials!)

Specifically, we want to track errors that indicate that the server is
unavailable for service. Thus probably NOT TLS handshake errors, or
FTP data transfer errors. That leaves TCP connect errors, and FTP non-200
responses on connect.

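The classification argued above could look something like this in Python: count only errors that say the server is unavailable, and ignore TLS handshake failures (which may be cert or config problems rather than availability). How the proxy actually surfaces these failures as exceptions is an assumption here.

```python
import socket
import ssl

def counts_against_backend(exc):
    """True if this error should count toward marking the backend unhealthy."""
    if isinstance(exc, ssl.SSLError):
        # TLS handshake problems: possibly cert/config issues, not
        # availability, so do NOT count them (per the reasoning above).
        # Checked first because SSLError is itself a subclass of OSError.
        return False
    if isinstance(exc, (ConnectionRefusedError, socket.timeout, OSError)):
        # TCP connect errors: server unreachable or not accepting connections.
        return True
    return False
```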
Consider also the case where a configured backend server is a DNS name/URL,
which resolves to multiple IP addresses/ports. These resolved addresses/ports
are not currently tracked in the SQLite database -- thus we currently have
no state for them persisted/shareable outside of that session process. If
we did persist these addresses in the db, how do we clean them up? Consider
dynamic backends whose IPs change a lot over time; the cruft would build up
in the db. Hmm.

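One possible way around the persistence question above: keep health state for resolved (ip, port) pairs in memory only, with a TTL sweep, so stale entries for churning DNS backends expire instead of accumulating in the SQLite db. This is a sketch under that assumption, not the project's actual design; the TTL value is invented.

```python
import time

STATE_TTL = 600.0  # forget any (ip, port) not seen for 10 minutes (assumed)

class ResolvedAddrHealth:
    def __init__(self):
        self._last_seen = {}    # (ip, port) -> last touch time
        self._error_count = {}  # (ip, port) -> consecutive error count

    def touch(self, addr):
        # Called whenever DNS resolution yields this address.
        self._last_seen[addr] = time.monotonic()

    def record_error(self, addr):
        self.touch(addr)
        self._error_count[addr] = self._error_count.get(addr, 0) + 1

    def sweep(self):
        # Periodic cleanup: drop addresses not resolved recently, so cruft
        # from dynamic backends cannot build up.
        cutoff = time.monotonic() - STATE_TTL
        for addr in [a for a, t in self._last_seen.items() if t < cutoff]:
            self._last_seen.pop(addr, None)
            self._error_count.pop(addr, None)
```

The tradeoff versus the db: state is lost on restart and is not shareable across session processes, which is exactly the limitation the notes describe.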
https://www.haproxy.com/blog/using-haproxy-as-an-api-gateway-part-3-health-checks/
https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/#passive-health-checks

> Note that if there is only a single server in a group, the fail_timeout
> and max_fails parameters are ignored and the server is never marked
> unavailable.

https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-health-check/#passive-tcp-health-checks

> If several health checks are configured for an upstream group, the failure
> of any check is enough to consider the corresponding server unhealthy.

https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-health-check/#fine-tuning-tcp-health-checks

Defaults: interval=5s, passes=1, fails=1
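For reference, the nginx passive semantics quoted above (max_fails failures within fail_timeout seconds marks the server unavailable, then it is skipped for fail_timeout) could be modeled like this. This models the documented behavior for comparison; it is not nginx code, and the interval/passes/fails defaults noted above belong to nginx's separate active health_check directive.

```python
import time

class NginxStylePassive:
    """Models nginx passive checks: max_fails within fail_timeout -> down."""

    def __init__(self, max_fails=1, fail_timeout=10.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self._fails = 0
        self._window_start = 0.0
        self._down_until = 0.0

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self._window_start > self.fail_timeout:
            # Failures outside the window don't accumulate: start fresh.
            self._window_start = now
            self._fails = 0
        self._fails += 1
        if self._fails >= self.max_fails:
            # Mark unavailable for fail_timeout seconds.
            self._down_until = now + self.fail_timeout

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self._down_until
```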
