Skip to content

Update Psycopg3 example to include non-admin and connection refresh logic#113

Merged
danielfrankcom merged 3 commits intomainfrom
chore/update-psycopg3-samples
May 6, 2025
Merged

Update Psycopg3 example to include non-admin and connection refresh logic#113
danielfrankcom merged 3 commits intomainfrom
chore/update-psycopg3-samples

Conversation

@danielfrankcom
Copy link
Copy Markdown
Contributor

@danielfrankcom danielfrankcom commented May 2, 2025

This PR updates the Psycopg3 example. It includes the following changes:

  • Synchronize the example with the publicly available docs here
  • Clean up and improve the documentation/comments
  • Fix dependencies
  • Add non-admin connection example
  • Clarify token creation logic

By submitting this pull request, I confirm that my contribution is made under
the terms of the MIT-0 license.

@danielfrankcom danielfrankcom force-pushed the chore/update-psycopg3-samples branch 5 times, most recently from a618818 to 7d8b67d Compare May 3, 2025 00:38
Copy link
Copy Markdown
Collaborator

@trstephen-amazon trstephen-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add token refresh logic

Could you point this out to me?

Comment thread .github/workflows/python-psycopg3-integ-tests.yml
Comment thread python/psycopg/README.md Outdated
@danielfrankcom
Copy link
Copy Markdown
Contributor Author

The token refresh logic is separated into a function which provides a connection, and a comment was added to make it clear the token is per-connection. There isn't really a built-in mechanism for credential management for Psycopg. The PR description should probably say something like "clarified token creation".

@trstephen-amazon
Copy link
Copy Markdown
Collaborator

Token refresh

Thanks for the clarification! SGTM

@danielfrankcom danielfrankcom force-pushed the chore/update-psycopg3-samples branch 2 times, most recently from 3a0cf88 to 949ec91 Compare May 3, 2025 01:20
@danielfrankcom
Copy link
Copy Markdown
Contributor Author

Actions are fixed now, should be ready for final review.

I've had to add a FORCE_IPV4 environment variable to work around an issue with the action definition. It seems like the GitHub runner can resolve IPv6 addresses but can't reach out to the IPv6 internet. Psycopg prefers IPv6 so it's not able to reach the cluster.

I've modified the example to manually resolve an IPv4 address and use that for the connection when FORCE_IPV4 is set. Ideally we can update the workflow to disable IPv6 in a future change.

@danielfrankcom danielfrankcom force-pushed the chore/update-psycopg3-samples branch from 949ec91 to bb0347a Compare May 3, 2025 01:24
Comment thread python/psycopg/src/example.py Outdated
Comment on lines +42 to +48
if os.environ.get("FORCE_IPV4", False):
try:
# Get the IPv4 address for the host
conn_params["hostaddr"] = socket.gethostbyname(cluster_endpoint)
except socket.gaierror:
# If DNS resolution fails, continue without hostaddr
pass
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good find! It's exactly the type of behavior we want to highlight for anyone who wants to use psycopg3 with dsql.

Let's dig a bit deeper here. I see a few issues with psycopg and ipv6: maybe we've discovered another case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more testing this morning, and haven't been able to reproduce the issue again. I've reverted the PR to exclude the FORCE_IPV4 workaround, since based on my investigation below it shouldn't be needed.

I looking into the Psycopg3 source code a bit, and found it uses the following approximate algorithm:

  1. Determine the connection attempts for the provided conninfo: ref
    a. Check the conninfo. If it contains hostaddr then return that address. If the host parameter is an IP address, return that address.
    b. Otherwise, do a DNS lookup and get all IPs associated with the domain. Return all addresses from the lookup.
  2. For each address returned: ref
    a. Attempt to open a connection to the address/port described by the connection attempt.
    b. If the connection succeeds, save the connection and break from the loop.
    c. If the connection fails, record the exception and continue with other connection attempts.
  3. If no working connection was found, rethrow the last exception.
  4. Return the working connection if we reach this point.

Based on this algorithm, it shouldn't matter whether the host supports IPv6, or whether the host is in a broken state with partial IPv6 support. If an IPv4 address is returned from the DNS lookup then it should be attempted in the loop, and a successful connection should result regardless of whether the IPv6 connection comes before or after it in the connection attempt list.

Unfortunately with the level of information we get from the error and stacktrace here, it is not possible to tell what the other connection attempts in the list were, or if there were any at all. There is a debug log here which would provide more information in the event of a failure, but that level of logging isn't enabled for our runs at the moment.

Given I can't reproduce the issue any more, I'm inclined to think it was caused by either a DNS issue preventing IPv4 hosts from being returned, or a genuine connectivity issue on the host running the workflow. It's also possible it's an intermittent problem with a low % chance to occur, but it seemed pretty consistent when I was testing last week. For whatever reason, now the same code that was failing before is passing in the workflows.

Perhaps the best course of action would be to enable debug logging for this driver, and set up some kind of regular run of the workflow to see if we can catch it again. As of the moment, I don't think we have enough information to create a local reproducer for this, since we don't know the underlying cause without more visibility into the other connection attempts that were made here.

It would be good to regularly run these workflows regardless of this issue, to ensure the tests continue to work over time and with newer dependency versions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to enable Psycopg3 debug logs to help find this issue in the future, I accidentally triggered it again, and think I understand what is going on here now.

There are workflow runs here and here which clearly shows the issue.

DEBUG - psycopg - connection attempt failed: host: '***' port: '5432', hostaddr '54.158.101.233': connection failed: connection to server at "54.158.101.233", port 5432 failed: root certificate file "./root.pem" does not exist
Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification.

DEBUG - psycopg - connection attempt failed: host: '***' port: '5432', hostaddr '2600:1f18:692c:303:31c4:3b43:ed6d:d04d': connection is bad: connection to server at "2600:1f18:692c:303:31c4:3b43:ed6d:d04d", port 5432 failed: Network is unreachable
	Is the server running on that host and accepting TCP/IP connections?

Though the example code was the same during my testing, it seems the workflow wasn't exactly the same. Last week when it was failing, the root.pem file was not being downloaded yet, as I encountered this issue before I reached any problem with the certificates and wanted to move one step at a time. My perception was that this was a network connectivity problem, and would occur before the certificate was verified as there is no way to verify a certificate if we can't communicate with the cluster.

As it turns out this assumption was incorrect, as the way the connection logic works accounts for the certificate in the connection attempt, even though it doesn't show the problem without debug logging enabled. When I tested again this morning I was still downloading the root.pem certificate, and so didn't encounter this issue even after reverting the FORCE_IPV4 changes to the example.

What is happening in this scenario is the following:

  1. Psycopg does a DNS lookup which returns an IPv4 and IPv6 address.
  2. The loop iterates through the addresses.
  3. The IPv4 connection is established, but the certificate verification fails. In this run it fails because the certificate isn't downloaded and the file doesn't exist. In this run it fails because I changed the sslrootcert parameter to be system, and the certificate wasn't in the system trust store.
  4. The loop marks the IPv4 address as failed and stores its exception. The loop continues to the IPv6 address.
  5. The IPv6 connection fails as the host does not have functioning dual stack networking.
  6. The IPv6 exception overwrites the stored exception.
  7. All connection attempts have failed, and Psycopg shows the most recent exception to the user. This exception is the IPv6 one, which suppresses the actual issue which is with the certificates.

This problem doesn't occur with what's on the current main branch, since it is using sslmode=require instead of sslmode=verify-full.

This behavior would likely occur for any user of this example which (a) does not have functioning IPv6 on their host, and (b) does not properly set up the root certificate before running the example. Though the README.md file describes the steps for setting up the root certificate, it seems very likely people will miss it or skip it, and encounter this confusing error message which implies they have an IPv6 problem.

To prevent this issue I think we should do the following:

  1. Add a try/except around the connect method call, and check if the certificate file exists. If it doesn't print an additional message to help diagnose the problem.
  2. Raise a ticket with the Psycopg3 team around this issue. It seems the intention of the looping connection logic was to prevent duplicate failure messages, but in this case the logic is suppressing the actual problem. The resulting message is pretty confusing, and actually has nothing to do with the cause of the connection failure. Within our team alone we've run into this twice seemingly, and it is likely other users of Psycopg3 have experienced the same. I have not checked yet to see if there is an existing issue around this, but I will check first.

@danielfrankcom danielfrankcom force-pushed the chore/update-psycopg3-samples branch 2 times, most recently from 1b5d019 to 73266d0 Compare May 5, 2025 20:27
@danielfrankcom
Copy link
Copy Markdown
Contributor Author

Latest push fixes the license headers on source files which I missed before.

@danielfrankcom
Copy link
Copy Markdown
Contributor Author

Latest push adds an explicit check for the SSL certificate, which should work around the IPv6 error for now.

If the user forgets to download the SSL certificate as mentioned in the README.md file, they will receive an error message which indicates the actual problem, rather than the confusing IPv6 error message they would otherwise see.

Traceback (most recent call last):
  File "/home/runner/work/aurora-dsql-samples/aurora-dsql-samples/python/psycopg/src/example.py", line 127, in <module>
    main()
  File "/home/runner/work/aurora-dsql-samples/aurora-dsql-samples/python/psycopg/src/example.py", line 117, in main
    conn = create_connection(cluster_user, cluster_endpoint, region)
  File "/home/runner/work/aurora-dsql-samples/aurora-dsql-samples/python/psycopg/src/example.py", line 43, in create_connection
    raise FileNotFoundError(f"SSL certificate file not found: {ssl_cert_path}")
FileNotFoundError: SSL certificate file not found: ./root.pem

vs.

DEBUG - psycopg - connection attempt failed: host: '***' port: '5432', hostaddr '2600:1f18:692c:303:31c4:3b43:ed6d:d04d': connection is bad: connection to server at "2600:1f18:692c:303:31c4:3b43:ed6d:d04d", port 5432 failed: Network is unreachable
	Is the server running on that host and accepting TCP/IP connections?

Copy link
Copy Markdown
Collaborator

@trstephen-amazon trstephen-amazon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Comment thread python/psycopg/README.md Outdated
Comment thread python/psycopg/src/example.py
@danielfrankcom danielfrankcom force-pushed the chore/update-psycopg3-samples branch from 4479917 to 00807b9 Compare May 6, 2025 00:42
@danielfrankcom danielfrankcom merged commit db87703 into main May 6, 2025
3 checks passed
danielfrankcom pushed a commit to marcbowes/aurora-dsql-samples that referenced this pull request May 9, 2025
@danielfrankcom danielfrankcom deleted the chore/update-psycopg3-samples branch May 20, 2025 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants