Description
For a hostgroup with a single backend, if the backend server goes down and is marked as "SHUNNED", ProxySQL returns the error "Max connect timeout reached while reaching hostgroup" only after the "mysql-connect_timeout_server_max" duration elapses. In contrast, a client that connects directly, without ProxySQL, immediately gets a "connection refused" error (error code 111), which makes more sense.
After debugging the latest binary, it appears that ProxySQL keeps trying to get a usable connection, at intervals controlled by "mysql-connect_retries_delay", until "mysql-connect_timeout_server_max" is reached. This happens in the method MySQL_Session::handler_again___status_CONNECTING_SERVER.
proxysql/lib/MySQL_Session.cpp
Line 3177 in 27e71d2
Moreover, the retries are not controlled by "mysql-connect_retries_on_failure" (as they should be), because in the scenario described above execution always falls into this condition.
proxysql/lib/MySQL_Session.cpp
Line 3244 in 27e71d2
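To make the difference between the two bounds concrete, here is a minimal simulation. It is not ProxySQL code: the struct, the function names, and the values are hypothetical. It only assumes, as described above, that every attempt fails to obtain a connection and that attempts are spaced by the retry delay.

```cpp
// Hypothetical stand-ins for the three ProxySQL variables discussed above.
struct RetryConfig {
    int connect_timeout_server_max_ms; // overall time budget for connecting
    int connect_retries_delay_ms;      // pause between two attempts
    int connect_retries_on_failure;    // retry count expected to apply
};

// Observed behaviour (as described in this report): keep retrying until the
// time budget is exhausted. Returns -1 for a non-positive delay, which this
// simplified model cannot represent.
int attempts_until_timeout(const RetryConfig& c) {
    if (c.connect_retries_delay_ms <= 0) return -1;
    int elapsed_ms = 0;
    int attempts = 0;
    while (elapsed_ms < c.connect_timeout_server_max_ms) {
        ++attempts;                              // attempt fails: no usable backend
        elapsed_ms += c.connect_retries_delay_ms;
    }
    return attempts;
}

// Expected behaviour: one initial attempt plus a bounded number of retries.
int attempts_with_retry_limit(const RetryConfig& c) {
    return 1 + c.connect_retries_on_failure;
}
```

With illustrative values of a 10000 ms budget, a 100 ms delay, and 3 retries, the time-bounded loop makes 100 attempts while the count-bounded variant stops after 4, which is the gap this report is about.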
Debugging further shows that the method MyHGC::get_random_MySrvC returns NULL on every retry for this single "SHUNNED" backend.
Line 55 in 27e71d2
According to this method, ProxySQL should attempt to bring the "SHUNNED" backend back online, but it never succeeds because "mysrvc->time_last_detected_error" is always set in the future, by an amount derived from "mysql-monitor_ping_interval".
// if Monitor is enabled and mysql-monitor_ping_interval is
// set too high, ProxySQL will unshun hosts that are not
// available. For this reason time_last_detected_error will
// be tuned in the future
if (mysql_thread___monitor_enabled) {
    int a = mysql_thread___shun_recovery_time_sec;
    int b = mysql_thread___monitor_ping_interval;
    b = b/1000;
    if (b > a) {
        t = t + (b - a);
    }
}
mysrvc->time_last_detected_error = t;
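The effect of this adjustment can be worked through with a small sketch. The first function mirrors the arithmetic in the snippet above. The second is an assumption, not copied from ProxySQL: it supposes that a shunned server becomes eligible for recovery only once the current time is more than "mysql-shun_recovery_time_sec" seconds past "time_last_detected_error". All function names and values are hypothetical.

```cpp
#include <ctime>

// Mirrors the adjustment quoted above: when the monitor ping interval
// (milliseconds) exceeds the shun recovery time (seconds), the recorded
// error time is moved forward by the difference.
std::time_t adjusted_error_time(std::time_t now, bool monitor_enabled,
                                int shun_recovery_time_sec,
                                int monitor_ping_interval_ms) {
    std::time_t t = now;
    if (monitor_enabled) {
        int a = shun_recovery_time_sec;
        int b = monitor_ping_interval_ms / 1000;
        if (b > a) {
            t = t + (b - a);
        }
    }
    return t;
}

// ASSUMPTION: hypothetical recovery check, used only to illustrate why a
// timestamp in the future keeps the server shunned.
bool eligible_for_recovery(std::time_t now,
                           std::time_t time_last_detected_error,
                           int shun_recovery_time_sec) {
    return now > time_last_detected_error &&
           (now - time_last_detected_error) > shun_recovery_time_sec;
}
```

For example, with a recovery time of 10 s and a ping interval of 120000 ms, an error detected at t = 1000 is recorded as 1110. Under the assumed check, the server is still not eligible at t = 1010 and only becomes eligible after t = 1120, so a session with a shorter connect timeout gives up first.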
I'm not sure whether this logic is intentional. However, the retries should be controlled by 'mysql-connect_retries_on_failure' rather than by the current implementation. Additionally, I believe it would be more useful to return the actual failure error to the client.