Unguarded getsockname()[1] on async connection open → unretryable TypeError: 'NoneType' object is not subscriptable (deterministic repro; re: #730, #1057) #1310

## AI DISCLOSURE 
I used Claude Code to generate this repro. It's very hard to verify that the sockname really is `None` in the running system, but after I patched this code in my production deployment, the error did go away.

## The Issue

`AsyncBolt.__init__` (and several sibling sites in the async connect path) call `self.socket.getsockname()[1]` with no `None` guard. For the async driver, `BoltSocket.getsockname()` is `self._writer.transport.get_extra_info("sockname")`, which asyncio returns as **`None`** once the transport's socket is gone — i.e. when a load balancer drops a freshly-opened connection (we see this constantly on **Neo4j Aura**, most often on the connection opened for a **routing-table refresh**).
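You don't need a network to see why this surfaces as a `TypeError` rather than a socket error. `BaseTransport.get_extra_info(name, default=None)` is a plain lookup that falls back to `default`, so a transport with no recorded `sockname` yields `None`, and indexing that produces the exact message in the traceback. The sketch below is minimal. It uses a bare `asyncio.BaseTransport()` as a stand-in for a dropped transport. That is an assumption: real selector and SSL transports populate their extra info differently. `local_port` is a hypothetical helper:

```python
import asyncio


def local_port(transport: asyncio.BaseTransport) -> int:
    # Mirrors the unguarded driver pattern: no None check before indexing.
    return transport.get_extra_info("sockname")[1]


def main() -> None:
    # A bare BaseTransport carries no extra info; it stands in for a
    # transport whose socket is gone (assumption for illustration only).
    dead = asyncio.BaseTransport()
    print(dead.get_extra_info("sockname"))  # lookup falls back to default=None
    try:
        local_port(dead)
    except TypeError as exc:
        print(f"TypeError: {exc}")

    # With a recorded sockname the same helper works as expected.
    alive = asyncio.BaseTransport({"sockname": ("127.0.0.1", 54321)})
    print(local_port(alive))


if __name__ == "__main__":
    main()
```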

Because the result is a bare `TypeError` rather than `ServiceUnavailable`/`SessionExpired`, the driver's transaction-retry never catches it, so a *transient* connection drop becomes a hard, non-retryable crash with a misleading message.
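The following is a minimal sketch that makes the retry consequence concrete. It shows a retry loop that, like the behavior described above, only retries a fixed set of transient error types. It is *not* the driver's code. `TransientConnectionError`, `run_with_retry`, `flaky_open`, and the attempt count are hypothetical stand-ins for the driver's `ServiceUnavailable`/`SessionExpired` handling:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientConnectionError(Exception):
    """Hypothetical stand-in for ServiceUnavailable / SessionExpired."""


RETRYABLE = (TransientConnectionError,)


def run_with_retry(work: Callable[[], T], attempts: int = 3) -> T:
    # Retry only error types known to be transient; anything else
    # (e.g. a bare TypeError) propagates on the very first attempt.
    for attempt in range(attempts):
        try:
            return work()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the transient error
    raise ValueError("attempts must be >= 1")


def flaky_open(exc: Exception) -> Callable[[], int]:
    # Simulates one dropped connection (raising exc), then a healthy one.
    calls = {"n": 0}

    def work() -> int:
        calls["n"] += 1
        if calls["n"] == 1:
            raise exc
        return 1

    return work


if __name__ == "__main__":
    print(run_with_retry(flaky_open(TransientConnectionError("dropped"))))
    try:
        run_with_retry(flaky_open(TypeError("'NoneType' object is not subscriptable")))
    except TypeError as exc:
        print(f"escaped the retry loop: TypeError: {exc}")
```

The only difference between the two runs is the *type* of the first failure. This is why reclassifying the `None`-sockname case matters more than the message text.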

This is the same defect previously reported in **#730** (2022) and **#1057** (2024); both were closed because the condition **could not be reproduced** (#1057: *"I was not able to reproduce this condition locally"*, and *"feel free to reopen … while providing the additional information requested"*). **This report provides a deterministic, minimal reproduction** — the missing piece — plus a verification that guarding the `None` fixes it.

### Versions

- Driver: reproduced on `neo4j==5.28.4`; the unguarded line is **identical in the latest `6.2.0`** (`_bolt.py:159`).
- Server: Neo4j Aura in production; the repro below works against any local Neo4j.
- Python 3.12.

### Production traceback (matches #730 / #1057)

```
File ".../neo4j/_async/io/_pool.py", line 792, in fetch_routing_info
    cx = await self._acquire(address, auth, deadline, None)
File ".../neo4j/_async/io/_pool.py", line 711, in opener
    return await AsyncBolt.open(...)
File ".../neo4j/_async/io/_bolt.py", line 413, in open
    connection = bolt_cls(...)
File ".../neo4j/_async/io/_bolt.py", line 156, in __init__
    self.local_port = self.socket.getsockname()[1]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable
```

### Deterministic reproduction

The real-world trigger (an LB dropping a brand-new connection) is hard to race — presumably why the prior issues stalled. But the condition asyncio actually exposes is simple: `get_extra_info("sockname")` returns `None` once the transport's socket is gone. The script marks the *first* freshly-opened socket as "dropped" right after its handshake — exactly the production sequence — so the crash lands on the same line as #730/#1057. It then applies a one-line guard and shows the same drop become retryable and self-heal. Needs only `pip install neo4j`:

```bash
docker run --rm -d -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5
NEO4J_URI=bolt://localhost:7687 NEO4J_USER=neo4j NEO4J_PASSWORD=password python repro.py
```

```python
# repro.py
import asyncio
import os
import traceback

import neo4j._async.io._bolt as bolt_mod
import neo4j._async.io._bolt_socket as io_sock
import neo4j._async_compat.network._bolt_socket as base_sock
from neo4j import AsyncGraphDatabase
from neo4j.exceptions import ServiceUnavailable

URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
USER = os.environ.get("NEO4J_USER", "neo4j")
PASSWORD = os.environ.get("NEO4J_PASSWORD", "password")

# Make the sockets we choose report sockname == None -- the value asyncio yields
# for a transport whose socket is already gone.
_orig_getsockname = base_sock.AsyncBoltSocketBase.getsockname
_orig_connect = io_sock.AsyncBoltSocket.connect.__func__  # unwrap classmethod
_dead_sockets: set[int] = set()
_arm = {"on": False}


def _getsockname(self):
    return None if id(self) in _dead_sockets else _orig_getsockname(self)


async def _connect(cls, *args, **kwargs):
    sock, *rest = await _orig_connect(cls, *args, **kwargs)
    if _arm["on"]:
        _dead_sockets.add(id(sock))  # LB drops THIS freshly-opened connection
        _arm["on"] = False           # one-shot: only the first connection
    return (sock, *rest)


base_sock.AsyncBoltSocketBase.getsockname = _getsockname
io_sock.AsyncBoltSocket.connect = classmethod(_connect)


async def _run_query() -> int:
    driver = AsyncGraphDatabase.driver(URI, auth=(USER, PASSWORD))
    try:
        records, _, _ = await driver.execute_query("RETURN 1 AS n")
        return records[0]["n"]
    finally:
        await driver.close()


async def main() -> int:
    print(f"neo4j driver {__import__('neo4j').__version__}  |  target {URI}\n")

    print("SCENARIO 1 - stock driver, first connection's socket dropped:")
    _dead_sockets.clear()
    _arm["on"] = True
    try:
        await _run_query()
        print("  UNEXPECTED: query succeeded\n")
        s1 = False
    except TypeError as exc:
        line = next((l.strip() for l in traceback.format_exc().splitlines()
                     if "getsockname()[1]" in l), "?")
        print(f"  REPRODUCED -> TypeError: {exc}\n  at: {line}\n"
              "  (a bare TypeError -- the driver's retry never catches it)\n")
        s1 = True

    # The fix: reclassify the None-sockname subscript crash as retryable.
    _orig_init = bolt_mod.AsyncBolt.__init__

    def _guarded_init(self, *args, **kwargs):
        try:
            _orig_init(self, *args, **kwargs)
        except TypeError as exc:
            if "subscriptable" not in str(exc):
                raise
            raise ServiceUnavailable(
                "socket dropped before use (sockname unavailable); retrying"
            ) from exc

    bolt_mod.AsyncBolt.__init__ = _guarded_init

    print("SCENARIO 2 - same drop, with the None-guard applied:")
    _dead_sockets.clear()
    _arm["on"] = True
    try:
        result = await _run_query()
        print(f"  HEALED -> driver retried on a fresh connection, returned {result}\n")
        s2 = result == 1
    finally:
        bolt_mod.AsyncBolt.__init__ = _orig_init

    print("RESULT:", "PASS" if (s1 and s2) else "FAIL")
    return 0 if (s1 and s2) else 1


if __name__ == "__main__":
    raise SystemExit(asyncio.run(main()))
```

Output (driver 5.28.4, local Neo4j):

```
SCENARIO 1 - stock driver, first connection's socket dropped:
  REPRODUCED -> TypeError: 'NoneType' object is not subscriptable
  at: self.local_port = self.socket.getsockname()[1]
  (a bare TypeError -- the driver's retry never catches it)

SCENARIO 2 - same drop, with the None-guard applied:
Transaction failed and will be retried in 0.83s (socket dropped before use (sockname unavailable); retrying)
  HEALED -> driver retried on a fresh connection, returned 1

RESULT: PASS
```

### All unguarded `getsockname()[1]` sites (any can be hit, depending on when the socket dies)

In `5.28.4` (same pattern in `6.2.0`):

- `_async/io/_bolt.py:156` — `AsyncBolt.__init__` (the production crash site)
- `_async/io/_bolt.py:381, 399, 407` — `AsyncBolt.open` close / auth-failure debug logging
- `_async/io/_bolt_socket.py:225` — `_handshake` (if `getsockname` is patched to return `None` for *every* socket, e.g. `getsockname = lambda self: None`, the crash lands here instead of in `__init__`)
- `_async/io/_bolt_socket.py:343, 361` — connect
- `_async/io/_common.py:41` — `AsyncOutbox`
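The list above was compiled by hand for `5.28.4`. A rough scan like the following can help re-check whatever version is installed, for example when comparing against `6.2.0`. It is a plain text search, so it will miss sites that split the call across lines or bind `getsockname()` to a variable before indexing. `find_unguarded_sites` and `installed_neo4j_root` are hypothetical helpers, not part of the driver:

```python
import importlib.util
import re
from pathlib import Path

# Plain textual match for the unguarded pattern; not syntax-aware.
PATTERN = re.compile(r"getsockname\(\)\[1\]")


def find_unguarded_sites(root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for each match."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits


def installed_neo4j_root() -> Path:
    # Locate the installed package directory without importing neo4j.
    spec = importlib.util.find_spec("neo4j")
    if spec is None or spec.origin is None:
        raise SystemExit("neo4j is not installed")
    return Path(spec.origin).parent


if __name__ == "__main__":
    for rel, lineno, line in find_unguarded_sites(installed_neo4j_root()):
        # Restricted to the async tree to match the list above;
        # drop this filter to see every match in the package.
        if rel.startswith("_async/"):
            print(f"{rel}:{lineno}: {line}")
```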

### Why it matters

This is a **transient** condition (connection dropped mid/just-after open — normal during Aura leader elections / routing refresh; #1057 also noted it *"manifested itself under high load, and when leader elections were happening"*). The driver already retries transient connection failures, but a bare `TypeError` is not in the retryable set, so it escapes and kills the operation instead of re-opening.

Note: `get_extra_info("sockname")` returns `None` simply because the transport's socket is already `None`/closed — there isn't necessarily an `OSError` at the `getsockname` call itself (which may be why the `async-sockname` OSError-surfacing branch from #1057 didn't pan out).

### Suggested fix

Guard the `None` and raise a retryable driver error instead of subscripting, e.g. in `AsyncBolt.__init__`:

```python
sockname = self.socket.getsockname()
if sockname is None:
    raise ServiceUnavailable(
        "Connection's socket was closed before it could be used "
        "(sockname unavailable); will retry on a fresh connection"
    )
self.local_port = sockname[1]
```

and treat the debug-logging sites defensively (`local_port = sockname[1] if sockname else -1`). As Scenario 2 shows, this lets the existing retry transparently re-open — the behavior every reporter of this defect actually wants.
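Because the same pattern recurs at every site listed above, one option is to centralize both behaviors in two small helpers. The strict one is for the connection-setup path and the lenient one is for debug logging. The following is a minimal sketch. `require_local_port`, `local_port_for_logging`, and the local `ServiceUnavailable` class are hypothetical stand-ins; inside the driver you would raise `neo4j.exceptions.ServiceUnavailable` instead:

```python
from typing import Optional


class ServiceUnavailable(Exception):
    """Stand-in for neo4j.exceptions.ServiceUnavailable (assumption)."""


def require_local_port(sockname: Optional[tuple]) -> int:
    # Connection-setup path: a missing sockname means the socket is gone,
    # so raise a retryable error instead of letting indexing raise TypeError.
    if sockname is None:
        raise ServiceUnavailable(
            "Connection's socket was closed before it could be used "
            "(sockname unavailable); will retry on a fresh connection"
        )
    # IPv4 (host, port) and IPv6 (host, port, flowinfo, scope_id)
    # both keep the port at index 1.
    return sockname[1]


def local_port_for_logging(sockname: Optional[tuple]) -> int:
    # Debug-logging path: never fail just to format a port number.
    return sockname[1] if sockname else -1


if __name__ == "__main__":
    print(require_local_port(("127.0.0.1", 50123)))
    print(local_port_for_logging(None))
```

Keeping the strict and lenient variants separate preserves the distinction the suggestion above draws: setup code should fail in a retryable way, while logging should degrade quietly.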
