Update Psycopg3 example to include non-admin and connection refresh logic#113
Update Psycopg3 example to include non-admin and connection refresh logic#113danielfrankcom merged 3 commits intomainfrom
Conversation
a618818 to
7d8b67d
Compare
trstephen-amazon
left a comment
There was a problem hiding this comment.
Add token refresh logic
Could you point this out to me?
7d8b67d to
9bf4f5a
Compare
|
The token refresh logic is separated into a function which provides a connection, and a comment was added to make it clear the token is per-connection. There isn't really a built-in mechanism for credential management for Psycopg. The PR description should probably say something like "clarified token creation". |
Thanks for the clarification! SGTM |
3a0cf88 to
949ec91
Compare
|
Actions are fixed now, should be ready for final review. I've had to add a I've modified the example to manually resolve an IPv4 address and use that for the connection when |
949ec91 to
bb0347a
Compare
| if os.environ.get("FORCE_IPV4", False): | ||
| try: | ||
| # Get the IPv4 address for the host | ||
| conn_params["hostaddr"] = socket.gethostbyname(cluster_endpoint) | ||
| except socket.gaierror: | ||
| # If DNS resolution fails, continue without hostaddr | ||
| pass |
There was a problem hiding this comment.
This is a good find! It's exactly the type of behavior we want to highlight for anyone who wants to use psycopg3 with dsql.
Let's dig a bit deeper here. I see a few issues with psycopg and ipv6: maybe we've discovered another case.
There was a problem hiding this comment.
I did some more testing this morning, and haven't been able to reproduce the issue again. I've reverted the PR to exclude the FORCE_IPV4 workaround, since based on my investigation below it shouldn't be needed.
I looking into the Psycopg3 source code a bit, and found it uses the following approximate algorithm:
- Determine the connection attempts for the provided
conninfo: ref
a. Check theconninfo. If it containshostaddrthen return that address. If thehostparameter is an IP address, return that address.
b. Otherwise, do a DNS lookup and get all IPs associated with the domain. Return all addresses from the lookup. - For each address returned: ref
a. Attempt to open a connection to the address/port described by the connection attempt.
b. If the connection succeeds, save the connection and break from the loop.
c. If the connection fails, record the exception and continue with other connection attempts. - If no working connection was found, rethrow the last exception.
- Return the working connection if we reach this point.
Based on this algorithm, it shouldn't matter whether the host supports IPv6, or whether the host is in a broken state with partial IPv6 support. If an IPv4 address is returned from the DNS lookup then it should be attempted in the loop, and a successful connection should result regardless of whether the IPv6 connection comes before or after it in the connection attempt list.
Unfortunately with the level of information we get from the error and stacktrace here, it is not possible to tell what the other connection attempts in the list were, or if there were any at all. There is a debug log here which would provide more information in the event of a failure, but that level of logging isn't enabled for our runs at the moment.
Given I can't reproduce the issue any more, I'm inclined to think it was caused by either a DNS issue preventing IPv4 hosts from being returned, or a genuine connectivity issue on the host running the workflow. It's also possible it's an intermittent problem with a low % chance to occur, but it seemed pretty consistent when I was testing last week. For whatever reason, now the same code that was failing before is passing in the workflows.
Perhaps the best course of action would be to enable debug logging for this driver, and set up some kind of regular run of the workflow to see if we can catch it again. As of the moment, I don't think we have enough information to create a local reproducer for this, since we don't know the underlying cause without more visibility into the other connection attempts that were made here.
It would be good to regularly run these workflows regardless of this issue, to ensure the tests continue to work over time and with newer dependency versions.
There was a problem hiding this comment.
While trying to enable Psycopg3 debug logs to help find this issue in the future, I accidentally triggered it again, and think I understand what is going on here now.
There are workflow runs here and here which clearly shows the issue.
DEBUG - psycopg - connection attempt failed: host: '***' port: '5432', hostaddr '54.158.101.233': connection failed: connection to server at "54.158.101.233", port 5432 failed: root certificate file "./root.pem" does not exist
Either provide the file, use the system's trusted roots with sslrootcert=system, or change sslmode to disable server certificate verification.
DEBUG - psycopg - connection attempt failed: host: '***' port: '5432', hostaddr '2600:1f18:692c:303:31c4:3b43:ed6d:d04d': connection is bad: connection to server at "2600:1f18:692c:303:31c4:3b43:ed6d:d04d", port 5432 failed: Network is unreachable
Is the server running on that host and accepting TCP/IP connections?
Though the example code was the same during my testing, it seems the workflow wasn't exactly the same. Last week when it was failing, the root.pem file was not being downloaded yet, as I encountered this issue before I reached any problem with the certificates and wanted to move one step at a time. My perception was that this was a network connectivity problem, and would occur before the certificate was verified as there is no way to verify a certificate if we can't communicate with the cluster.
As it turns out this assumption was incorrect, as the way the connection logic works accounts for the certificate in the connection attempt, even though it doesn't show the problem without debug logging enabled. When I tested again this morning I was still downloading the root.pem certificate, and so didn't encounter this issue even after reverting the FORCE_IPV4 changes to the example.
What is happening in this scenario is the following:
- Psycopg does a DNS lookup which returns an IPv4 and IPv6 address.
- The loop iterates through the addresses.
- The IPv4 connection is established, but the certificate verification fails. In this run it fails because the certificate isn't downloaded and the file doesn't exist. In this run it fails because I changed the
sslrootcertparameter to besystem, and the certificate wasn't in the system trust store. - The loop marks the IPv4 address as failed and stores its exception. The loop continues to the IPv6 address.
- The IPv6 connection fails as the host does not have functioning dual stack networking.
- The IPv6 exception overwrites the stored exception.
- All connection attempts have failed, and Psycopg shows the most recent exception to the user. This exception is the IPv6 one, which suppresses the actual issue which is with the certificates.
This problem doesn't occur with what's on the current main branch, since it is using sslmode=require instead of sslmode=verify-full.
This behavior would likely occur for any user of this example which (a) does not have functioning IPv6 on their host, and (b) does not properly set up the root certificate before running the example. Though the README.md file describes the steps for setting up the root certificate, it seems very likely people will miss it or skip it, and encounter this confusing error message which implies they have an IPv6 problem.
To prevent this issue I think we should do the following:
- Add a
try/exceptaround theconnectmethod call, and check if the certificate file exists. If it doesn't print an additional message to help diagnose the problem. - Raise a ticket with the Psycopg3 team around this issue. It seems the intention of the looping connection logic was to prevent duplicate failure messages, but in this case the logic is suppressing the actual problem. The resulting message is pretty confusing, and actually has nothing to do with the cause of the connection failure. Within our team alone we've run into this twice seemingly, and it is likely other users of Psycopg3 have experienced the same. I have not checked yet to see if there is an existing issue around this, but I will check first.
1b5d019 to
73266d0
Compare
|
Latest push fixes the license headers on source files which I missed before. |
|
Latest push adds an explicit check for the SSL certificate, which should work around the IPv6 error for now. If the user forgets to download the SSL certificate as mentioned in the vs. |
4479917 to
00807b9
Compare
This PR updates the Psycopg3 example. It includes the following changes:
By submitting this pull request, I confirm that my contribution is made under
the terms of the MIT-0 license.