Unguarded getsockname()[1] on async connection open → unretryable TypeError: 'NoneType' object is not subscriptable (deterministic repro; re: #730, #1057) #1310

@riley-deliberately

AI DISCLOSURE

I used Claude Code to generate this repro. It's very hard to verify that the sockname is really None in the running system, but after I patched this code in my production deployment the error did go away...

The Issue

AsyncBolt.__init__ (and several sibling sites in the async connect path) call self.socket.getsockname()[1] with no None guard. For the async driver, BoltSocket.getsockname() is self._writer.transport.get_extra_info("sockname"), which asyncio returns as None once the transport's socket is gone — i.e. when a load balancer drops a freshly-opened connection (we see this constantly on Neo4j Aura, most often on the connection opened for a routing-table refresh).

Because the result is a bare TypeError rather than ServiceUnavailable/SessionExpired, the driver's transaction-retry never catches it, so a transient connection drop becomes a hard, non-retryable crash with a misleading message.
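To illustrate why the exception type matters, here is a minimal, driver-free sketch. TransientError, FakeSocket, open_unguarded, open_guarded and run_with_retry are all hypothetical stand-ins (none of them are neo4j APIs), and the loop only approximates the idea of "retry on a known set of transient errors"; the driver's actual retry logic is more involved.

```python
class TransientError(Exception):
    """Hypothetical stand-in for a retryable driver error."""


class FakeSocket:
    """Hypothetical socket whose first `dead_calls` lookups report no sockname."""

    def __init__(self, dead_calls: int) -> None:
        self.dead_calls = dead_calls

    def getsockname(self):
        if self.dead_calls > 0:
            self.dead_calls -= 1
            return None  # what a dropped connection is assumed to report
        return ("127.0.0.1", 54321)


def open_unguarded(sock: FakeSocket) -> int:
    # Mirrors the pattern in the issue: subscripting None raises TypeError.
    return sock.getsockname()[1]


def open_guarded(sock: FakeSocket) -> int:
    sockname = sock.getsockname()
    if sockname is None:
        raise TransientError("sockname unavailable")
    return sockname[1]


def run_with_retry(opener, sock: FakeSocket, attempts: int = 3) -> int:
    """Retry only on TransientError; any other exception propagates at once."""
    last_error = None
    for _ in range(attempts):
        try:
            return opener(sock)
        except TransientError as exc:
            last_error = exc
    raise last_error if last_error else ValueError("attempts must be >= 1")
```

With a socket that is "dead" for one lookup, run_with_retry(open_guarded, FakeSocket(dead_calls=1)) recovers on the second attempt, while the same call with open_unguarded fails on the first attempt with a TypeError that the retry loop never sees as transient.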

This is the same defect previously reported in #730 (2022) and #1057 (2024); both were closed because the condition could not be reproduced (#1057: "I was not able to reproduce this condition locally", and "feel free to reopen … while providing the additional information requested"). This report provides a deterministic, minimal reproduction — the missing piece — plus a verification that guarding the None fixes it.

Versions

  • Driver: reproduced on neo4j==5.28.4; the unguarded line is identical in the latest 6.2.0 (_bolt.py:159).
  • Server: Neo4j Aura in production; the repro below works against any local Neo4j.
  • Python 3.12.

Production traceback (matches #730 / #1057)

File ".../neo4j/_async/io/_pool.py", line 792, in fetch_routing_info
    cx = await self._acquire(address, auth, deadline, None)
File ".../neo4j/_async/io/_pool.py", line 711, in opener
    return await AsyncBolt.open(...)
File ".../neo4j/_async/io/_bolt.py", line 413, in open
    connection = bolt_cls(...)
File ".../neo4j/_async/io/_bolt.py", line 156, in __init__
    self.local_port = self.socket.getsockname()[1]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable

Deterministic reproduction

The real-world trigger (an LB dropping a brand-new connection) is hard to race — presumably why the prior issues stalled. But the condition asyncio actually exposes is simple: get_extra_info("sockname") returns None once the transport's socket is gone. The script marks the first freshly-opened socket as "dropped" right after its handshake — exactly the production sequence — so the crash lands on the same line as #730/#1057. It then applies a one-line guard and shows the same drop become retryable and self-heal. Needs only pip install neo4j:

docker run --rm -d -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5
NEO4J_URI=bolt://localhost:7687 NEO4J_USER=neo4j NEO4J_PASSWORD=password python repro.py

# repro.py
import asyncio
import os
import traceback

import neo4j._async.io._bolt as bolt_mod
import neo4j._async.io._bolt_socket as io_sock
import neo4j._async_compat.network._bolt_socket as base_sock
from neo4j import AsyncGraphDatabase
from neo4j.exceptions import ServiceUnavailable

URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
USER = os.environ.get("NEO4J_USER", "neo4j")
PASSWORD = os.environ.get("NEO4J_PASSWORD", "password")

# Make the sockets we choose report sockname == None -- the value asyncio yields
# for a transport whose socket is already gone.
_orig_getsockname = base_sock.AsyncBoltSocketBase.getsockname
_orig_connect = io_sock.AsyncBoltSocket.connect.__func__  # unwrap classmethod
_dead_sockets: set[int] = set()
_arm = {"on": False}


def _getsockname(self):
    return None if id(self) in _dead_sockets else _orig_getsockname(self)


async def _connect(cls, *args, **kwargs):
    sock, *rest = await _orig_connect(cls, *args, **kwargs)
    if _arm["on"]:
        _dead_sockets.add(id(sock))  # LB drops THIS freshly-opened connection
        _arm["on"] = False           # one-shot: only the first connection
    return (sock, *rest)


base_sock.AsyncBoltSocketBase.getsockname = _getsockname
io_sock.AsyncBoltSocket.connect = classmethod(_connect)


async def _run_query() -> int:
    driver = AsyncGraphDatabase.driver(URI, auth=(USER, PASSWORD))
    try:
        records, _, _ = await driver.execute_query("RETURN 1 AS n")
        return records[0]["n"]
    finally:
        await driver.close()


async def main() -> int:
    print(f"neo4j driver {__import__('neo4j').__version__}  |  target {URI}\n")

    print("SCENARIO 1 - stock driver, first connection's socket dropped:")
    _dead_sockets.clear(); _arm["on"] = True
    try:
        await _run_query()
        print("  UNEXPECTED: query succeeded\n"); s1 = False
    except TypeError as exc:
        line = next((l.strip() for l in traceback.format_exc().splitlines()
                     if "getsockname()[1]" in l), "?")
        print(f"  REPRODUCED -> TypeError: {exc}\n  at: {line}\n"
              "  (a bare TypeError -- the driver's retry never catches it)\n")
        s1 = True

    # The fix: reclassify the None-sockname subscript crash as retryable.
    _orig_init = bolt_mod.AsyncBolt.__init__

    def _guarded_init(self, *args, **kwargs):
        try:
            _orig_init(self, *args, **kwargs)
        except TypeError as exc:
            if "subscriptable" not in str(exc):
                raise
            raise ServiceUnavailable(
                "socket dropped before use (sockname unavailable); retrying"
            ) from exc

    bolt_mod.AsyncBolt.__init__ = _guarded_init

    print("SCENARIO 2 - same drop, with the None-guard applied:")
    _dead_sockets.clear(); _arm["on"] = True
    try:
        result = await _run_query()
        print(f"  HEALED -> driver retried on a fresh connection, returned {result}\n")
        s2 = result == 1
    finally:
        bolt_mod.AsyncBolt.__init__ = _orig_init

    print("RESULT:", "PASS" if (s1 and s2) else "FAIL")
    return 0 if (s1 and s2) else 1


if __name__ == "__main__":
    raise SystemExit(asyncio.run(main()))

Output (driver 5.28.4, local Neo4j):

SCENARIO 1 - stock driver, first connection's socket dropped:
  REPRODUCED -> TypeError: 'NoneType' object is not subscriptable
  at: self.local_port = self.socket.getsockname()[1]
  (a bare TypeError -- the driver's retry never catches it)

SCENARIO 2 - same drop, with the None-guard applied:
Transaction failed and will be retried in 0.83s (socket dropped before use (sockname unavailable); retrying)
  HEALED -> driver retried on a fresh connection, returned 1

RESULT: PASS

All unguarded getsockname()[1] sites (any can be hit, depending on when the socket dies)

In 5.28.4 (same pattern in 6.2.0):

  • _async/io/_bolt.py:156 (AsyncBolt.__init__; the production crash site)
  • _async/io/_bolt.py:381, 399, 407 (AsyncBolt.open; close / auth-failure debug logging)
  • _async/io/_bolt_socket.py:225 (_handshake; a one-liner getsockname = lambda self: None lands here instead)
  • _async/io/_bolt_socket.py:343, 361 (connect)
  • _async/io/_common.py:41 (AsyncOutbox)

Why it matters

This is a transient condition (connection dropped mid/just-after open — normal during Aura leader elections / routing refresh; #1057 also noted it "manifested itself under high load, and when leader elections were happening"). The driver already retries transient connection failures, but a bare TypeError is not in the retryable set, so it escapes and kills the operation instead of re-opening.

Note: get_extra_info("sockname") returns None simply because the transport's socket is already None/closed — there isn't necessarily an OSError at the getsockname call itself (which may be why the async-sockname OSError-surfacing branch from #1057 didn't pan out).
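A small standard-library sketch of that point: get_extra_info falls back to its default (None) when nothing is recorded under the requested key, without raising. This uses a bare asyncio.BaseTransport only to show the documented contract; which concrete transport the driver sees (TLS, uvloop, ...) and exactly when it stops reporting sockname is not demonstrated here. local_port is a hypothetical helper that mirrors the unguarded subscript.

```python
import asyncio

# A bare BaseTransport has no extra info recorded at all.
transport = asyncio.BaseTransport()

# Missing key: the default (None) comes back; no OSError is raised.
sockname = transport.get_extra_info("sockname")

# A caller can also pass an explicit default instead of relying on None.
fallback = transport.get_extra_info("sockname", ("?", -1))


def local_port(t: asyncio.BaseTransport) -> int:
    """Hypothetical helper mirroring the unguarded getsockname()[1] pattern."""
    return t.get_extra_info("sockname")[1]  # TypeError if sockname is None
```

So the first exception in this path is the TypeError from the subscript, not anything raised by the lookup itself.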

Suggested fix

Guard the None and raise a retryable driver error instead of subscripting, e.g. in AsyncBolt.__init__:

sockname = self.socket.getsockname()
if sockname is None:
    raise ServiceUnavailable(
        "Connection's socket was closed before it could be used "
        "(sockname unavailable); will retry on a fresh connection"
    )
self.local_port = sockname[1]

and treat the debug-logging sites defensively (local_port = sockname[1] if sockname else -1). As Scenario 2 shows, this lets the existing retry transparently re-open — the behavior every reporter of this defect actually wants.
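Since the same subscript appears at several sites, one possible shape is a pair of small helpers, sketched here under assumptions: require_local_port and port_for_logging are hypothetical names, and ServiceUnavailableStandIn is a placeholder for neo4j.exceptions.ServiceUnavailable so the snippet runs without the driver installed. This is not how the driver is structured today.

```python
from typing import Optional


class ServiceUnavailableStandIn(Exception):
    """Placeholder for neo4j.exceptions.ServiceUnavailable (assumption)."""


def require_local_port(sockname: Optional[tuple]) -> int:
    """For sites that need the port: fail with a retryable error on None."""
    if sockname is None:
        raise ServiceUnavailableStandIn(
            "Connection's socket was closed before it could be used "
            "(sockname unavailable)"
        )
    return sockname[1]


def port_for_logging(sockname: Optional[tuple]) -> int:
    """For debug-logging sites: never raise, use -1 as a sentinel port."""
    return sockname[1] if sockname else -1
```

Both helpers take the already-fetched sockname rather than the socket, so they can be unit-tested without opening a connection. The -1 sentinel is the one suggested above.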
