Skip to content

Conversation

@jrobe
Copy link

@jrobe jrobe commented Oct 10, 2025

I just spent a week attempting to diagnose why my services failed to reconnect to my database after an RDS Failover for Postgres.

What happened was the DNS cache was getting hits and misses, and I hit the race condition where the hikari pool was reaching out to the old unhealthy host after failover (but also after it has reached out to the new host once) - so my service was displaying "The database is not currently accepting connections" as the final "Caused by" in the Hikari stack-trace for connection pool timeouts for 1-2 hours after the failover had succeeded.

I didn't know that the last "Caused by" exception in the Hikari stack trace was actually the last connection failure that happened on a different thread; I thought it was actually part of the stack trace and running on the same thread - and I certainly didn't think it could be delayed by 1+ hours.

Because of this I dove down DNS rabbit holes, when the actual problem was my Postgres JDBC driver got stuck trying to log in (after the startup packet was acknowledged, but before hearing back from auth), that all could have been avoided if the exception message from Hikari was a bit more clear that the Failure shown in the stack trace might be stale and delayed.

I'm hoping that we can add the clarification that the failure happened some other time to Hikari, so some poor soul doesn't have the diagnostic issues I had.

@jrobe jrobe changed the title Add last connection failure timestamp and exception clarification to the pool timeout Exception Add last connection failure timestamp and exception clarification to the pool timeout exception Oct 10, 2025
@jrobe
Copy link
Author

jrobe commented Oct 10, 2025

I do recognize that the shown timestamp might not actually match the time that the exception shown happened (in the case of a race condition where the atomic reference with the throwable is updated before the atomicLong, and this thread reaches out and gets the new exception but the old timestamp). However, the message I think would still be helpful enough as is without needing to introduce an exception-timestamp tuple to explicitly pair them.

@jrobe
Copy link
Author

jrobe commented Oct 10, 2025

This new message is displayed as below in the unit tests.

00:05:29,153456 [seq  107] [ari.pool.ExceptionTest:connection-closer] DEBUG PoolBase             - com.zaxxer.hikari.pool.ExceptionTest - Closing connection com.zaxxer.hikari.mocks.StubConnection@2f33eff8: (connection is dead)
java.sql.SQLTransientConnectionException: com.zaxxer.hikari.pool.ExceptionTest - Connection is not available, request timed out after 6006ms (total=1, active=1, idle=0, waiting=0) - The last connection failure (appended as the next exception below) happened 5ms999µs100ns ago.
	at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:719)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:184)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:142)
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:99)
	at com.zaxxer.hikari.pool.ExceptionTest.testLastErrorTimeout(ExceptionTest.java:150)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:377)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:284)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:248)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:167)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:456)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:169)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:595)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:581)
Caused by: java.sql.SQLException: Exception 1
	at com.zaxxer.hikari.pool.ExceptionTest.lambda$testLastErrorTimeout$0(ExceptionTest.java:147)
	at com.zaxxer.hikari.mocks.StubConnection.setNetworkTimeout(StubConnection.java:535)
	at com.zaxxer.hikari.pool.PoolBase.setNetworkTimeout(PoolBase.java:594)
	at com.zaxxer.hikari.pool.PoolBase.isConnectionDead(PoolBase.java:162)
	at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:166)
	... 33 more

@jrobe jrobe changed the title Add last connection failure timestamp and exception clarification to the pool timeout exception Add last connection failure time diff and exception clarification to the pool timeout exception Oct 10, 2025
@jrobe
Copy link
Author

jrobe commented Oct 13, 2025

Oh, and why did I hit a loginTimeout issue? I use driver-based connection (which spins up a DriverDataSource from the utils package) - and apparently hit this 8 year old bug where Postgres JDBC doesn't respect the DriverManager loginTimeout... pgjdbc/pgjdbc#879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant