Add last connection failure time diff and exception clarification to the pool timeout exception #2362
base: dev
Conversation
I do recognize that the shown timestamp might not exactly match the time the shown exception happened (in the case of a race condition where the atomic reference holding the throwable is updated before the AtomicLong, and this thread reads the new exception but the old timestamp). However, I think the message is still helpful enough as is, without needing to introduce an exception-timestamp tuple to explicitly pair them.
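For illustration, here's a minimal sketch of the idea; the field names, message wording, and exception type are my own placeholders, not necessarily what this PR uses. The pool records the last connection failure in an `AtomicReference` and the time it was recorded in a separate `AtomicLong`, and the timeout path appends the age of that failure while clarifying that the attached cause may be stale and from another thread. The race mentioned above is visible here: the two fields are updated separately, so a reader can observe a new throwable paired with an old timestamp.

```java
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class PoolTimeoutMessageSketch
{
   // Hypothetical fields mirroring the idea in this PR: the last connection-attempt
   // failure, plus a separate timestamp for when it was recorded.
   private final AtomicReference<Throwable> lastConnectionFailure = new AtomicReference<>();
   private final AtomicLong lastConnectionFailureTime = new AtomicLong();

   // Called from whichever thread fails to create a connection.
   void recordConnectionFailure(final Throwable t)
   {
      lastConnectionFailure.set(t);
      lastConnectionFailureTime.set(System.currentTimeMillis());   // may race with the line above
   }

   // Called from the thread that times out waiting for a connection.
   SQLException createTimeoutException(final long startTime)
   {
      final Throwable cause = lastConnectionFailure.getAndSet(null);
      final long failureTime = lastConnectionFailureTime.get();

      String message = "Connection is not available, request timed out after "
                       + (System.currentTimeMillis() - startTime) + "ms";
      if (cause != null && failureTime > 0) {
         // Clarify that the attached cause is the *last recorded* failure, possibly
         // from another thread and possibly old, not part of this thread's call stack.
         message += " (last connection failure occurred "
                    + (System.currentTimeMillis() - failureTime)
                    + "ms ago, possibly on another thread)";
      }
      return new SQLTransientConnectionException(message, cause);
   }
}
```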
This new message is displayed as below in the unit tests.
Oh, and why did I hit a loginTimeout issue? I use a driver-based connection (which spins up a DriverDataSource from the utils package), and apparently hit this 8-year-old bug where the Postgres JDBC driver doesn't respect the DriverManager loginTimeout: pgjdbc/pgjdbc#879
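One possible workaround, sketched below, is to pass the timeouts directly to pgjdbc as connection properties instead of relying on DriverManager.setLoginTimeout(). The URL, credentials, and timeout values here are placeholders; the property names come from pgjdbc's documented connection parameters (values in seconds).

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PgLoginTimeoutWorkaround
{
   public static void main(String[] args)
   {
      HikariConfig config = new HikariConfig();
      // Driver-based configuration, as described above; host and credentials are placeholders.
      config.setJdbcUrl("jdbc:postgresql://db.example.com:5432/mydb");
      config.setUsername("app");
      config.setPassword("secret");

      // Since the driver may ignore DriverManager.setLoginTimeout() (pgjdbc/pgjdbc#879),
      // set the timeouts directly as pgjdbc connection properties.
      config.addDataSourceProperty("connectTimeout", "10");   // TCP connect timeout (seconds)
      config.addDataSourceProperty("loginTimeout", "10");     // whole connect + auth handshake (seconds)
      config.addDataSourceProperty("socketTimeout", "30");    // guard against hung reads (seconds)

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // use ds.getConnection() as usual ...
      }
   }
}
```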
I just spent a week trying to diagnose why my services failed to reconnect to my database after an RDS failover for Postgres.
What happened was that the DNS cache was returning a mix of hits and misses, and I hit the race condition where the Hikari pool kept reaching out to the old, unhealthy host after the failover (even though it had already reached the new host once). As a result, my service displayed "The database is not currently accepting connections" as the final "Caused by" in the Hikari connection-pool-timeout stack trace for 1-2 hours after the failover had succeeded.
I didn't know that the last "Caused by" exception in the Hikari stack trace was actually the last connection failure, which had happened on a different thread; I thought it was part of the stack trace and had occurred on the same thread, and I certainly didn't think it could be more than an hour old.
Because of this I went down DNS rabbit holes, when the actual problem was that my Postgres JDBC driver got stuck trying to log in (after the startup packet was acknowledged, but before hearing back from auth). All of that could have been avoided if the exception message from Hikari had been a bit clearer that the failure shown in the stack trace might be stale and delayed.
I'm hoping we can add a clarification to Hikari that the failure shown happened at some earlier time, so some poor soul doesn't run into the same diagnostic issues I had.