@denyska denyska commented May 29, 2025

Avoid indefinite loops in case of non-login SQLException

Summary

There are some race conditions during failover (both regional and across AZs) that make the getVerifiedWriterConnection method loop indefinitely without any delay, which causes the PostgreSQL JDBC driver to spawn thousands of connection threads. This ultimately drives up CPU and memory consumption and forces the service to reboot.

Description

Added sleep to slow down the loop.
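
For readers without the diff at hand, the idea of pacing a retry loop can be sketched as follows. This is a minimal illustration, not the code in this PR: `ConnectAttempt`, `retryUntilDeadline`, and the timeout and delay values are hypothetical names and numbers chosen for the example, and the real plugin's login-exception handling is not modelled.

```java
import java.sql.SQLException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class RetryWithDelaySketch {

  /** Hypothetical stand-in for one connection attempt; not part of the wrapper's API. */
  @FunctionalInterface
  interface ConnectAttempt<T> {
    T tryConnect() throws SQLException;
  }

  /**
   * Retries the attempt until it succeeds or the deadline passes.
   * Sleeping after each failure keeps a persistent error from turning
   * the loop into a tight spin that opens connections back to back.
   */
  static <T> T retryUntilDeadline(ConnectAttempt<T> attempt, long timeoutMs, long delayMs)
      throws SQLException, InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    SQLException last;
    do {
      try {
        return attempt.tryConnect();
      } catch (SQLException e) {
        last = e;
        // Pause before the next attempt instead of retrying immediately.
        TimeUnit.MILLISECONDS.sleep(delayMs);
      }
    } while (System.nanoTime() < deadline);
    // Deadline reached: surface the most recent failure to the caller.
    throw last;
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated attempt: fails twice with a non-login error, then succeeds.
    String result = retryUntilDeadline(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new SQLException("simulated transient failure");
      }
      return "connected";
    }, 5_000, 100);
    System.out.println(result + " after " + calls.get() + " attempts");
  }
}
```

The deadline bounds the total time spent retrying, and the sleep bounds how many attempts can be made within it; both values here are illustrative only.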

Additional Reviewers

N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

yes

Avoid indefinite loops in case of non-login SQLException
@denyska denyska changed the title from "Update AuroraInitialConnectionStrategyPlugin.java" to "AuroraInitialConnectionStrategyPlugin: Avoid indefinite loops in case of non-login SQLException" May 29, 2025
@karenc-bq
Contributor

Hi @denyska, thank you for raising this and for taking the time to create a PR for this issue.

Could you please provide more information regarding the race condition that you're seeing?

  1. Could you please provide driver logs for when the race condition happens? See instructions to enable logs here: https://github.com/aws/aws-advanced-jdbc-wrapper/blob/main/docs/using-the-jdbc-driver/UsingTheJdbcDriver.md#logging
  2. Are you able to provide steps to reproduce this?

Thanks again for contributing to the project.

@denyska
Author

denyska commented May 30, 2025

Thank you for following up, @karenc-bq.

🔍 Logs

Unfortunately, obtaining detailed logs is challenging in our environment, as logging levels are restricted to WARN and above due to performance considerations. However, during a thread dump analysis, I observed that a significant portion of CPU resources was consumed by the AuroraInitialConnectionStrategyPlugin. Notably, there were approximately 9,000 threads named PostgreSQL JDBC driver connection thread. This behavior appears to originate from the PostgreSQL JDBC driver's connection handling mechanism, as seen in the source code: Driver.java#L301.
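
Where a full thread dump is impractical, a similar count can be taken from inside the JVM. The sketch below is an assumption about how one might do that, not the method used for the analysis above: it groups live threads by name using the standard `Thread.getAllStackTraces()` call, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class ThreadNameCensus {

  /** Counts live threads in this JVM, grouped by exact thread name. */
  static Map<String, Integer> countByName() {
    Map<String, Integer> counts = new TreeMap<>();
    // getAllStackTraces() returns one map entry per live thread.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      counts.merge(t.getName(), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Print "<count> <tab> <thread name>" for every distinct name.
    countByName().forEach((name, n) -> System.out.println(n + "\t" + name));
  }
}
```

A large count under a single name, such as the connection-thread name quoted above, would point at the component creating the threads.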

🧪 Reproduction Steps

Setup is as follows:

  • Database Configuration: Multi-AZ PostgreSQL Cluster with one writer and two readers.

  • Engine Version: PostgreSQL 14.15.

  • Extensions: The rds_tools extension is not enabled, resulting in the RdsMultiAzDbClusterPgDialect not being selected.

  • AWS Advanced JDBC Wrapper Version: 2.5.6.

  • PostgreSQL JDBC Driver Version: 42.7.5.

  • Enabled Connection Plugins:

    • ConnectTimeConnectionPlugin
    • ExecutionTimeConnectionPlugin
    • AuroraInitialConnectionStrategyPlugin
    • IamAuthConnectionPlugin
    • AuroraConnectionTrackerPlugin
    • software.amazon.jdbc.plugin.failover2.FailoverConnectionPlugin (also known as failover2)
    • ReadWriteSplittingPlugin
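
For anyone trying to reproduce this, the plugin list above might be expressed as connection properties along these lines. This is a sketch under assumptions: the `wrapperPlugins` property name and the plugin codes are taken from my reading of the wrapper's documentation and should be checked against the docs for version 2.5.6, and the order shown simply mirrors the list above rather than the reporter's actual configuration.

```java
import java.util.List;
import java.util.Properties;

class ReproPluginConfig {

  /** Builds connection properties that enable the plugins listed above. */
  static Properties buildProps() {
    // Assumed plugin codes, one per plugin class named in the list; verify against the wrapper docs.
    List<String> pluginCodes = List.of(
        "connectTime",             // ConnectTimeConnectionPlugin
        "executionTime",           // ExecutionTimeConnectionPlugin
        "initialConnection",       // AuroraInitialConnectionStrategyPlugin
        "iam",                     // IamAuthConnectionPlugin
        "auroraConnectionTracker", // AuroraConnectionTrackerPlugin
        "failover2",               // failover2 FailoverConnectionPlugin
        "readWriteSplitting");     // ReadWriteSplittingPlugin

    Properties props = new Properties();
    props.setProperty("wrapperPlugins", String.join(",", pluginCodes));
    return props;
  }

  public static void main(String[] args) {
    System.out.println(buildProps().getProperty("wrapperPlugins"));
  }
}
```

The resulting `Properties` object would be passed to `DriverManager.getConnection` together with the cluster's wrapper JDBC URL, which is omitted here because the endpoint is environment-specific.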

According to the AWS documentation, for the RdsMultiAzDbClusterPgDialect to function correctly, the database engine should be at least PostgreSQL 15.4 with the rds_tools extension version 1.4 or higher. In our environment, enabling rds_tools on PostgreSQL 14.15 leads to failures in the RdsMultiAzDbClusterPgDialect, specifically due to the absence of the multi_az_db_cluster_source_dbi_resource_id() function.

I hope this provides clarity on the issue. Please let me know if you need further information or assistance.


@karenc-bq
Contributor

Hi @denyska, thank you for the information. We will look into this issue.
