Slack Socket Mode never recovers from a half-open WebSocket (no TCP keepalive, no read/idle timeout, reconnect is read-error-driven only) #1101

@myliu1999

Summary

When the underlying TCP connection of the Slack Socket Mode WebSocket dies silently (no FIN/RST — e.g. a NAT/firewall idle-timeout drops the flow), openab never detects it and the bot goes permanently silent until the process is restarted. The read loop blocks on read.next() forever, so the existing "reconnect on error" path is never triggered.

Environment

  • openab 0.8.4 (also present in main as of commit pushed 2026-06-13, and v0.8.5-beta.9)
  • Slack adapter, Socket Mode
  • Running in Docker behind a home router doing NAT (very common deployment)

Observed behavior

Normally Slack cycles the Socket Mode connection roughly every 5h, which logs cleanly and reconnects:

01:54:03 ERROR openab::slack: Socket Mode read error: WebSocket protocol error: Connection reset without closing handshake
01:54:03  WARN openab::slack: reconnecting to Slack Socket Mode in 5s...
01:54:08  INFO openab::slack: connecting to Slack Socket Mode url=wss://wss-primary.slack.com/link/?ticket=...
01:54:09  INFO openab::slack: Slack Socket Mode connected

But once, after a normal connected line, the ~5h cadence simply stopped and there were zero openab::slack log lines for ~23 hours — no read error, no reconnect. The bot silently stopped receiving any Slack events. docker inspect still showed the container "healthy" and the process alive.

Inspecting the socket from the host confirmed a classic half-open connection:

  • Exactly one ESTABLISHED TCP connection to a Slack WSS IP on :443.
  • /proc/<pid>/net/tcp showed tr=00, tm->when=0, retrnsmt=0 for that socket — i.e. no keepalive timer and no retransmit timer were running. The kernel had no mechanism to ever discover the peer was gone.
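
The timer check above can be scripted. The following is a minimal sketch, assuming the standard Linux /proc/net/tcp column layout (sl, local_address, rem_address, st, tx_queue:rx_queue, tr:tm->when, retrnsmt, ...). The names TcpRow, parse_row and looks_unmonitored are hypothetical, and the sample row is made up for illustration, not taken from the affected host. An idle but healthy socket without keepalive shows the same values, so this only confirms that no kernel timer exists; it does not prove the peer is gone.

```rust
/// Timer-related fields from one row of /proc/net/tcp (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct TcpRow {
    remote_port: u16,
    state: u8,        // `st` column: 0x01 = ESTABLISHED
    timer_active: u8, // `tr` column: 0 = no timer pending
    timer_when: u32,  // `tm->when` column, in jiffies
    retransmits: u32, // `retrnsmt` column
}

/// Parses one data row; returns None for the header row or malformed input.
fn parse_row(line: &str) -> Option<TcpRow> {
    let f: Vec<&str> = line.split_whitespace().collect();
    if f.len() < 7 {
        return None;
    }
    // rem_address is "HEXADDR:HEXPORT"; tr and tm->when share one "tr:when" field.
    let (_, port_hex) = f[2].split_once(':')?;
    let (tr, when) = f[5].split_once(':')?;
    Some(TcpRow {
        remote_port: u16::from_str_radix(port_hex, 16).ok()?,
        state: u8::from_str_radix(f[3], 16).ok()?,
        timer_active: u8::from_str_radix(tr, 16).ok()?,
        timer_when: u32::from_str_radix(when, 16).ok()?,
        retransmits: u32::from_str_radix(f[6], 16).ok()?,
    })
}

/// True when the socket is ESTABLISHED with no keepalive or retransmit timer armed.
fn looks_unmonitored(row: &TcpRow) -> bool {
    row.state == 0x01 && row.timer_active == 0 && row.timer_when == 0 && row.retransmits == 0
}

fn main() {
    // Made-up sample row: remote port 0x01BB (443), st=01, tr=00, tm->when=0, retrnsmt=0.
    let sample = "   3: 0201A8C0:C350 0A0B0C0D:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 123456";
    let row = parse_row(sample).expect("sample row should parse");
    println!("{row:?} unmonitored={}", looks_unmonitored(&row));
}
```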

Root cause (in src/slack.rs)

The Socket Mode connection is opened with tokio_tungstenite::connect_async() and the read loop is purely reactive:

match tokio_tungstenite::connect_async(&ws_url).await {
    Ok((ws_stream, _)) => {
        let (mut write, mut read) = ws_stream.split();
        loop {
            tokio::select! {
                msg_result = read.next() => {
                    let Some(msg_result) = msg_result else { break }; // None -> reconnect
                    // Err(..) -> reconnect
                    ...
                    Ok(tungstenite::Message::Ping(data)) => {
                        let _ = write.send(tungstenite::Message::Pong(data)).await; // replies to pings
                    }
                    ...
                }
            }
        }
    }
}

Three gaps combine to make recovery impossible for a half-open socket:

  1. No SO_KEEPALIVE is set on the TCP stream, so the OS never probes a silently-dead peer.
  2. No read/idle timeout: read.next() can block indefinitely; nothing breaks the loop when frames simply stop arriving.
  3. No proactive WebSocket Ping — openab replies to Slack pings with pongs, but never sends its own pings, so it has no application-level liveness check either.

Result: reconnect only fires on Err/None from read.next(), which never happens on a half-open socket.
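
The blocking behaviour can be seen in miniature without Slack or tokio. This is a sketch using std blocking sockets as a stand-in for the async read loop: a local peer that accepts the connection and then stays silent plays the role of the half-open socket. The names ReadOutcome, read_with_deadline and silent_peer_demo are hypothetical. Without the deadline the read would never return; with it, the caller can tell "idle" apart from an orderly close.

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Outcome of one read attempt against a possibly dead peer.
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    Data(usize), // bytes arrived
    Closed,      // orderly EOF: the peer's FIN was received
    Idle,        // nothing arrived before the deadline
}

/// Reads once, giving up after `deadline` instead of blocking forever.
fn read_with_deadline(stream: &mut TcpStream, deadline: Duration) -> std::io::Result<ReadOutcome> {
    stream.set_read_timeout(Some(deadline))?;
    let mut buf = [0u8; 1024];
    match stream.read(&mut buf) {
        Ok(0) => Ok(ReadOutcome::Closed),
        Ok(n) => Ok(ReadOutcome::Data(n)),
        // Unix reports WouldBlock on timeout, Windows reports TimedOut.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
            Ok(ReadOutcome::Idle)
        }
        Err(e) => Err(e),
    }
}

/// Connects to a local peer that never writes and never closes.
fn silent_peer_demo() -> std::io::Result<ReadOutcome> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let (_server_side, _) = listener.accept()?; // kept alive, stays silent
    read_with_deadline(&mut client, Duration::from_millis(200))
}

fn main() -> std::io::Result<()> {
    println!("{:?}", silent_peer_demo()?);
    Ok(())
}
```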

Reproduction

  1. Establish a Slack Socket Mode connection.
  2. Silently break the underlying TCP flow without sending FIN/RST (drop the conntrack/NAT entry on a middlebox, add an iptables DROP rule matching the established 5-tuple, or pull the upstream link long enough for the NAT entry to expire).
  3. openab keeps ESTABLISHED on its side and never reconnects; the bot stops responding indefinitely.

Impact

Silent, indefinite outage of the Slack integration with no error logged and a "healthy" container — only a manual restart recovers it. Especially likely behind consumer NAT where idle WSS flows get reaped.

Suggested fixes (any one helps; ideally 1 + 2)

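  1. Enable TCP keepalive on the stream with a short interval (e.g. via socket2::SockRef::from(&tcp).set_tcp_keepalive(&TcpKeepalive::new().with_time(Duration::from_secs(30)).with_interval(Duration::from_secs(10)))). This lets the kernel detect a dead peer and surface an error to read.next(). With Linux's default of 9 probes, detection takes about two minutes (30s idle + 9 × 10s); lowering the probe count (e.g. with_retries(3), where socket2 supports it) brings that down to about a minute.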
  2. Add an idle read timeout to the select loop, e.g.:
    match tokio::time::timeout(Duration::from_secs(60), read.next()).await {
        Err(_) => { warn!("no Slack frame in 60s; reconnecting"); break; }
        Ok(Some(msg)) => { /* handle */ }
        Ok(None) => break,
    }
    Slack sends regular traffic/pings on a healthy connection, so an idle window reliably indicates a dead socket.
  3. Send periodic WebSocket Ping frames and reconnect if no Pong is received within a deadline.
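
Fix 3 comes down to a small piece of bookkeeping that can be kept separate from the socket code. The following is a sketch of that logic only, not openab's implementation: Liveness, Action, on_frame and poll are hypothetical names, and the 30s/10s values are illustrative. It assumes the caller invokes poll() from a timer branch in the select loop and on_frame() whenever any frame arrives. Treating every inbound frame (not only Pong) as proof of life is a design choice made here, not something the issue prescribes.

```rust
use std::time::{Duration, Instant};

/// What the read loop should do on this timer tick.
#[derive(Debug, PartialEq)]
enum Action {
    Wait,
    SendPing,
    Reconnect,
}

/// Tracks inbound traffic and one outstanding ping (hypothetical helper).
struct Liveness {
    ping_interval: Duration,       // idle time before a ping is sent
    pong_deadline: Duration,       // how long to wait for a reply
    last_frame: Instant,           // last time any frame arrived
    ping_sent_at: Option<Instant>, // set while a ping is unanswered
}

impl Liveness {
    fn new(now: Instant, ping_interval: Duration, pong_deadline: Duration) -> Self {
        Self { ping_interval, pong_deadline, last_frame: now, ping_sent_at: None }
    }

    /// Call whenever a frame arrives; any frame clears the outstanding ping.
    fn on_frame(&mut self, now: Instant) {
        self.last_frame = now;
        self.ping_sent_at = None;
    }

    /// Call on each timer tick.
    fn poll(&mut self, now: Instant) -> Action {
        if let Some(sent) = self.ping_sent_at {
            // A ping is outstanding: either keep waiting or give up.
            if now.duration_since(sent) >= self.pong_deadline {
                return Action::Reconnect;
            }
            return Action::Wait;
        }
        if now.duration_since(self.last_frame) >= self.ping_interval {
            self.ping_sent_at = Some(now);
            return Action::SendPing;
        }
        Action::Wait
    }
}

fn main() {
    let t0 = Instant::now();
    let mut live = Liveness::new(t0, Duration::from_secs(30), Duration::from_secs(10));
    // Simulated clock: 30s of silence triggers a ping, 10 more seconds a reconnect.
    println!("{:?}", live.poll(t0 + Duration::from_secs(30)));
    println!("{:?}", live.poll(t0 + Duration::from_secs(40)));
}
```

Passing the clock in as a parameter keeps the logic testable without sleeping; in the real loop the SendPing action would map to write.send(tungstenite::Message::Ping(..)) and Reconnect to breaking out of the loop.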

Workaround (no patch / no restart required)

For anyone hitting this before a fix lands: an external watchdog can force the dead socket closed so openab's own 5s-reconnect fires, keeping the process and all in-memory state intact:

  • Locate the openab process and its ESTABLISHED :443 socket to the Slack WSS host.
  • pidfd_open(pid) + pidfd_getfd(pidfd, fd) (Linux ≥5.6) to dup the socket, then shutdown(dup, SHUT_RDWR). read.next() returns EOF and openab reconnects.
  • Note: ss -K does not work on kernels built without CONFIG_INET_DIAG_DESTROY (e.g. Raspberry Pi OS), which is why the pidfd approach is used.
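
The mechanism the watchdog relies on is that shutdown() acts on the socket itself, not on one descriptor, so calling it on a duplicate wakes a reader blocked on the original. The sketch below shows this inside a single process, with TcpStream::try_clone() standing in for the cross-process pidfd_getfd() duplication; it is an illustration of the principle, not the watchdog. The function name shutdown_wakes_reader is hypothetical. A real external watchdog would additionally need the pidfd syscalls and ptrace-level permission over the openab process.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Returns the byte count the blocked reader saw; 0 means EOF.
fn shutdown_wakes_reader() -> std::io::Result<usize> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut reader = TcpStream::connect(listener.local_addr()?)?;
    let (_silent_peer, _) = listener.accept()?; // never writes, never closes

    // try_clone() duplicates the descriptor; both handles refer to one socket.
    let watchdog_handle = reader.try_clone()?;

    let blocked = thread::spawn(move || {
        let mut buf = [0u8; 64];
        reader.read(&mut buf) // no timeout set: blocks while the peer is silent
    });

    thread::sleep(Duration::from_millis(200)); // give the reader time to block
    watchdog_handle.shutdown(Shutdown::Both)?; // equivalent of shutdown(dup, SHUT_RDWR)

    blocked.join().expect("reader thread panicked")
}

fn main() -> std::io::Result<()> {
    println!("reader returned {} bytes", shutdown_wakes_reader()?);
    Ok(())
}
```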
