
Intermittent timeouts and broken sockets in container with custom network attached (likely pasta related) #28480

@maybephilipp

Issue Description

Describe your issue

  1. Connecting to sockets from a container with a custom network attached intermittently times out (repro steps below).
  2. Probably related: TCP keepalive socket options are not handled correctly.

I'm ready to provide any details.

Steps to reproduce the issue

Timeouts

  1. Run a Valkey instance on the host (I originally hit the connectivity issues with Valkey, but I verified that Valkey itself is not the cause here)
  2. Run the following (replace 11.11.11.6:6379 with your Valkey address):
Code
ubuntu@host> podman network create --ignore --driver bridge --subnet 173.26.0.0/16 --gateway 173.26.0.1 test-net
ubuntu@host> podman run --network=test-net -it --rm docker.io/python:3.11 bash

root@container> pip install gevent==24.11.1
root@container> cat > check_conn.py <<EOF
from gevent import monkey
monkey.patch_all()  # Must be first, before all other imports

import socket
from datetime import datetime
import time
import gevent
from gevent.pool import Pool

REDIS_SOCKET_KEEPALIVE = True

FAILED = 0
ITERSLEEP = 1


def ping_job(job_id: int, iteration: int):
    """Single ping attempt, run inside a greenlet."""
    s = None
    t0 = time.time()
    try:
        s = socket.create_connection(("11.11.11.5", 6379), timeout=2)
        s.sendall(b"*1\r\n$4\r\nPING\r\n")
        s.recv(1024)
        print(datetime.fromtimestamp(t0), f"job={job_id} iter={iteration}", "ok", round(time.time() - t0, 3))
    except Exception as e:
        print(datetime.fromtimestamp(t0), f"job={job_id} iter={iteration}", "fail", round(time.time() - t0, 3), repr(e))
        global FAILED
        FAILED += 1
    finally:
        if s:
            s.close()


def worker(job_id: int, total_iterations: int = 200):
    """One long-running worker: loops, pings, sleeps."""
    for i in range(total_iterations):
        ping_job(job_id, i)
        gevent.sleep(ITERSLEEP)  # yield to other greenlets; never use time.sleep() here


def main(num_workers: int = 5, total_iterations: int = 200):
    pool = Pool(num_workers)

    def _spawn(i):
        time.sleep(ITERSLEEP / num_workers)  # stagger worker start times (gevent-patched sleep)
        return pool.spawn(worker, job_id=i, total_iterations=total_iterations)

    greenlets = [_spawn(i) for i in range(num_workers)]
    gevent.joinall(greenlets)

    print('FAILED JOBS:', FAILED)


if __name__ == "__main__":
    main(num_workers=10)

EOF
root@container> python check_conn.py
  3. You will see:
Logs
2026-04-09 18:35:39.844833 job=0 iter=0 ok 0.007
2026-04-09 18:35:39.945939 job=1 iter=0 ok 0.002
...
2026-04-09 18:36:19.329535 job=4 iter=39 ok 0.002
2026-04-09 18:36:19.436411 job=5 iter=39 ok 0.001
2026-04-09 18:36:19.536003 job=6 iter=39 ok 0.001
2026-04-09 18:36:19.635141 job=7 iter=39 ok 0.001
2026-04-09 18:36:19.741808 job=8 iter=39 ok 0.001
2026-04-09 18:36:19.834467 job=9 iter=39 ok 0.001
2026-04-09 18:36:17.934258 job=0 iter=38 fail 2.002 TimeoutError('timed out')
2026-04-09 18:36:20.033132 job=1 iter=40 ok 0.001
...
2026-04-09 18:36:21.539970 job=6 iter=41 ok 0.002
...

I tried:

  1. Running on the VM host itself works well – no timeouts over 2 runs of 2000 iterations each
  2. Running in a podman container without the network attached – same – works well for 2 runs
  3. Running in a podman container with the network attached (16 other network-active containers on it) – fails with a timeout at least once per 2000-iteration run, and usually far more often
  4. Running in a podman container on a fresh network with only this container attached – same – fails at least once per run, usually far more often
  5. Running on the VM and in a container with the network at the same time – the container run fails 10 times, the VM run 0 times
  6. Running in a container without the network and in a container with the network at the same time – the with-network run fails 6 times, the without-network run 0 times

tcpdump does not seem to capture these failures (at least I couldn't find them in the dump).

TCP_KEEPALIVE parameters are not treated correctly

The code below fails with error:

Error
root@7b45ba3c3aa7:/# python check_broken_socket.py
True
b'123'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 534, in send_packed_command
    self._sock.sendall(item)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//check_broken_socket.py", line 38, in <module>
    print(r.get("LOL"))
          ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/commands/core.py", line 1822, in get
    return self.execute_command("GET", name, keys=[name])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 559, in execute_command
    return self._execute_command(*args, **options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 567, in _execute_command
    return conn.retry.call_with_retry(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/retry.py", line 65, in call_with_retry
    fail(error)
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 571, in <lambda>
    lambda error: self._disconnect_raise(conn, error),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 555, in _disconnect_raise
    raise error
  File "/usr/local/lib/python3.11/site-packages/redis/retry.py", line 62, in call_with_retry
    return do()
           ^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 568, in <lambda>
    lambda: self._send_command_parse_response(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redis/client.py", line 541, in _send_command_parse_response
    conn.send_command(*args, **options)
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 556, in send_command
    self.send_packed_command(
  File "/usr/local/lib/python3.11/site-packages/redis/connection.py", line 545, in send_packed_command
    raise ConnectionError(f"Error {errno} while writing to socket. {errmsg}.")
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe

Removing the TCP keepalive options, or relaxing them to 300/30/3, fixes the short-term issue, but long-running code still fails randomly (hard to prove – too long to wait :D – there was 1 error over 3 hours of running code).
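To rule out the client library, the same keepalive parameters can be set on a plain socket and read back with `getsockopt`, confirming the kernel accepted them before the connection ever reaches pasta. A Linux-only sketch of my own (the 10/3/5 values mirror the configuration below; `apply_keepalive`/`read_keepalive` are hypothetical helper names):

```python
import socket


def apply_keepalive(sock, idle=10, interval=3, count=5):
    """Enable TCP keepalive with the same parameters as the redis-py snippet."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # Linux
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


def read_keepalive(sock):
    """Read the values back so you can verify what the kernel actually holds."""
    return {
        "keepalive": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE),
        "idle": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE),
        "intvl": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL),
        "cnt": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT),
    }


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    apply_keepalive(s)
    print(read_keepalive(s))
    s.close()
```

If the values read back correctly here but connections still die, the dropped keepalives would point at the pasta data path rather than at the socket setup.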

Code

Prerequisites: pip install redis==5.2.1

import socket
from redis import Redis
import time

REDIS_SOCKET_KEEPALIVE = True
REDIS_SOCKET_KEEPALIVE_OPTS = dict()

# Linux
if hasattr(socket, "TCP_KEEPIDLE"):
    # Start probing after 10s idle
    REDIS_SOCKET_KEEPALIVE_OPTS[socket.TCP_KEEPIDLE] = 10  # pyright: ignore[reportAttributeAccessIssue]

# macOS equivalent of KEEPIDLE
if hasattr(socket, "TCP_KEEPALIVE"):
    REDIS_SOCKET_KEEPALIVE_OPTS[socket.TCP_KEEPALIVE] = 10

# Both Linux and macOS support these
if hasattr(socket, "TCP_KEEPINTVL"):
    # Probe every 3s
    REDIS_SOCKET_KEEPALIVE_OPTS[socket.TCP_KEEPINTVL] = 3

if hasattr(socket, "TCP_KEEPCNT"):
    # Drop after 5 failed probes (~15s total)
    REDIS_SOCKET_KEEPALIVE_OPTS[socket.TCP_KEEPCNT] = 5

r = Redis.from_url(
    "redis://11.11.11.6/0",
    socket_connect_timeout=2,
    socket_timeout=3,
    health_check_interval=0,
    socket_keepalive=True,
    socket_keepalive_options=REDIS_SOCKET_KEEPALIVE_OPTS,
)

print(r.set("LOL", "123"))
print(r.get("LOL"))
time.sleep(150)
print(r.get("LOL"))
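The broken-pipe behaviour can also be checked without redis-py: open a raw socket with keepalive enabled, stay idle, then write. A hedged sketch (`idle_then_send` is my naming; against the real Valkey you would pass its address and an idle of ~150s, matching the `time.sleep(150)` above):

```python
import socket
import time


def idle_then_send(addr, idle_seconds, payload=b"*1\r\n$4\r\nPING\r\n"):
    """Connect with TCP keepalive enabled, stay idle, then try to write.
    Returns True if the write succeeds, False on a broken/reset socket."""
    with socket.create_connection(addr, timeout=2) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        if hasattr(socket, "TCP_KEEPIDLE"):  # Linux
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
        time.sleep(idle_seconds)
        try:
            s.sendall(payload)
            return True
        except (BrokenPipeError, ConnectionResetError):
            return False


if __name__ == "__main__":
    # Local stand-in listener so the sketch runs anywhere; against the real
    # instance you would use ("11.11.11.6", 6379) and idle_seconds=150.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    print(idle_then_send(srv.getsockname(), idle_seconds=0.1))
```

If this fails across pasta but succeeds locally with identical parameters, it narrows the bug to how pasta handles idle keepalive-probed connections.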

Describe the results you received

Timeouts connecting to socket and broken sockets mid-connection.

Describe the results you expected

No timeouts and no broken sockets :)

podman info output

ubuntu@host:~/builds$ podman version
Client:       Podman Engine
Version:      5.6.2
API Version:  5.6.2
Go Version:   go1.23.3
Git Commit:   9dd5e1ed33830612bc200d7a13db00af6ab865a4
Built:        Sun Mar  1 13:52:35 2026
OS/Arch:      linux/amd64
ubuntu@simulations:~/builds$ podman info
host:
  arch: amd64
  buildahVersion: 1.41.5
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.10+ds1-1build2_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: unknown'
  cpuUtilization:
    idlePercent: 99.06
    systemPercent: 0.19
    userPercent: 0.75
  cpus: 16
  databaseBackend: sqlite
  distribution:
    codename: noble
    distribution: ubuntu
    version: "24.04"
  eventLogger: journald
  freeLocks: 2024
  hostname: simulations
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.8.0-106-generic
  linkmode: dynamic
  logDriver: journald
  memFree: 1806901248
  memTotal: 67424378880
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns_1.4.0-5_amd64
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.4.0
    package: netavark_1.4.0-4_amd64
    path: /usr/lib/podman/netavark
    version: netavark 1.4.0
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/local/bin/crun
    version: |-
      crun version 1.24
      commit: 54693209039e5e04cbe3c8b1cd5fe2301219f0a1
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt_0.0~git20240220.1e6f92b-1_amd64
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 311h 18m 37.00s (Approximately 12.96 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/ubuntu/.config/containers/storage.conf
  containerStore:
    number: 23
    paused: 0
    running: 23
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/ubuntu/.local/share/containers/storage
  graphRootAllocated: 155414249472
  graphRootUsed: 122991693824
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 881
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/ubuntu/.local/share/containers/storage/volumes
version:
  APIVersion: 5.6.2
  Built: 1772373155
  BuiltTime: Sun Mar  1 13:52:35 2026
  GitCommit: 9dd5e1ed33830612bc200d7a13db00af6ab865a4
  GoVersion: go1.23.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.6.2

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

OS: Ubuntu 24.04.4 LTS

VM inside Proxmox VE.

Additional information

Running in podman container with a custom network attached.

Metadata

Assignees: No one assigned

Labels: kind/bug (Categorizes issue or PR as related to a bug.)