Slack Socket Mode never recovers from a half-open WebSocket (no TCP keepalive, no read/idle timeout, reconnect is read-error-driven only) #1101

## Summary

When the underlying TCP connection of the Slack Socket Mode WebSocket dies *silently* (no FIN/RST — e.g. a NAT/firewall idle-timeout drops the flow), openab never detects it and the bot goes permanently silent until the process is restarted. The read loop blocks on `read.next()` forever, so the existing "reconnect on error" path is never triggered.

## Environment

- openab `0.8.4` (also present in `main` as of the commit pushed 2026-06-13, and in `v0.8.5-beta.9`)
- Slack adapter, Socket Mode
- Running in Docker behind a home router doing NAT (very common deployment)

## Observed behavior

Normally Slack cycles the Socket Mode connection roughly every 5h; openab logs the reset and reconnects cleanly:

```
01:54:03 ERROR openab::slack: Socket Mode read error: WebSocket protocol error: Connection reset without closing handshake
01:54:03  WARN openab::slack: reconnecting to Slack Socket Mode in 5s...
01:54:08  INFO openab::slack: connecting to Slack Socket Mode url=wss://wss-primary.slack.com/link/?ticket=...
01:54:09  INFO openab::slack: Slack Socket Mode connected
```

But once, after a normal `connected` line, the ~5h cadence simply **stopped** and there were **zero** `openab::slack` log lines for ~23 hours — no read error, no reconnect. The bot silently stopped receiving any Slack events. `docker inspect` still showed the container "healthy" and the process alive.

Inspecting the socket from the host confirmed a classic half-open connection:

- Exactly one `ESTABLISHED` TCP connection to a Slack WSS IP on :443.
- `/proc/<pid>/net/tcp` showed `tr=00`, `tm->when=0`, `retrnsmt=0` for that socket — i.e. **no keepalive timer and no retransmit timer were running**. The kernel had no mechanism to ever discover the peer was gone.
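
The timer check above can also be done programmatically. The following is a minimal sketch, assuming the usual whitespace-separated column layout of `/proc/net/tcp` (`sl local_address rem_address st tx_queue:rx_queue tr:tm->when retrnsmt ...`, all values in hex). The `TcpTimers` struct and both function names are hypothetical helpers, not part of openab.

```rust
/// Timer-related fields of one row of /proc/net/tcp (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct TcpTimers {
    pub remote_port: u16,
    pub established: bool,  // st == 01
    pub timer_active: u8,   // `tr`: 0 = none, 1 = retransmit, 2 = keepalive, 4 = zero-window probe
    pub timer_expires: u64, // `tm->when`, in jiffies
    pub retransmits: u64,   // `retrnsmt`
}

/// Parses one data row; returns None for the header row or malformed input.
pub fn parse_proc_net_tcp_line(line: &str) -> Option<TcpTimers> {
    let f: Vec<&str> = line.split_whitespace().collect();
    if f.len() < 7 {
        return None;
    }
    // rem_address is HEXIP:HEXPORT; only the port is needed here.
    let remote_port = u16::from_str_radix(f[2].rsplit(':').next()?, 16).ok()?;
    let state = u8::from_str_radix(f[3], 16).ok()?;
    // The timer column is "tr:tm->when".
    let (tr, when) = f[5].split_once(':')?;
    Some(TcpTimers {
        remote_port,
        established: state == 0x01,
        timer_active: u8::from_str_radix(tr, 16).ok()?,
        timer_expires: u64::from_str_radix(when, 16).ok()?,
        retransmits: u64::from_str_radix(f[6], 16).ok()?,
    })
}

/// True when the socket is ESTABLISHED but the kernel has no timer armed for it.
/// This does not prove the peer is dead; it only shows that the kernel
/// will never find out on its own.
pub fn no_kernel_timer_armed(t: &TcpTimers) -> bool {
    t.established && t.timer_active == 0
}
```

Running each row of `/proc/<pid>/net/tcp` through `parse_proc_net_tcp_line` and flagging rows where `remote_port` is 443 and `no_kernel_timer_armed` returns true would reproduce the manual inspection described above.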

## Root cause (in `src/slack.rs`)

The Socket Mode connection is opened with `tokio_tungstenite::connect_async()` and the read loop is purely reactive:

```rust
match tokio_tungstenite::connect_async(&ws_url).await {
    Ok((ws_stream, _)) => {
        let (mut write, mut read) = ws_stream.split();
        loop {
            tokio::select! {
                msg_result = read.next() => {
                    let Some(msg_result) = msg_result else { break }; // None -> reconnect
                    // Err(..) -> reconnect
                    ...
                    Ok(tungstenite::Message::Ping(data)) => {
                        let _ = write.send(tungstenite::Message::Pong(data)).await; // replies to pings
                    }
                    ...
                }
            }
        }
    }
}
```

Three gaps combine to make recovery impossible for a half-open socket:

1. **No `SO_KEEPALIVE`** is set on the TCP stream, so the OS never probes a silently-dead peer.
2. **No read/idle timeout** — `read.next()` can block indefinitely; nothing breaks the loop when frames simply stop arriving.
3. **No proactive WebSocket Ping** — openab *replies* to Slack pings with pongs, but never *sends* its own pings, so it has no application-level liveness check either.

Result: reconnect only fires on `Err`/`None` from `read.next()`, which never happens on a half-open socket.
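
The core of gap 2 can be seen in isolation with a blocking, std-only analog (not the async code openab uses). This is a sketch under that assumption: a local peer accepts the connection and then stays silent, which from the reader's side looks the same as a half-open socket. The function name `read_from_silent_peer` is hypothetical.

```rust
use std::io::{self, Read};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Connects to a local peer that accepts and then never writes, and attempts
/// a single read bounded by `idle_timeout`.
pub fn read_from_silent_peer(idle_timeout: Duration) -> io::Result<usize> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    // Keep the accepted side alive but silent for the duration of the read.
    let (_silent_peer, _) = listener.accept()?;

    // Without this line the read below would block indefinitely: a silent
    // peer produces neither data, EOF, nor an error.
    client.set_read_timeout(Some(idle_timeout))?;

    let mut buf = [0u8; 16];
    client.read(&mut buf)
}
```

With a timeout set, the read fails with `ErrorKind::WouldBlock` or `ErrorKind::TimedOut` (which one is platform-dependent), giving the caller a point at which to reconnect. The only thing that ends the wait is the timer, not anything the socket reports.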

## Reproduction

1. Establish a Slack Socket Mode connection.
2. Silently break the underlying TCP flow without sending FIN/RST (drop the conntrack/NAT entry on a middlebox, or `iptables -A` a DROP rule on the established 5-tuple, or pull the upstream link briefly so the NAT entry expires).
3. openab keeps `ESTABLISHED` on its side and never reconnects; the bot stops responding indefinitely.

## Impact

Silent, indefinite outage of the Slack integration with no error logged and a "healthy" container — only a manual restart recovers it. Especially likely behind consumer NAT where idle WSS flows get reaped.

## Suggested fixes (any one helps; ideally 1 + 2)

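1. **Enable TCP keepalive** on the stream with a short interval (e.g. via `socket2::SockRef::from(&tcp).set_tcp_keepalive(&TcpKeepalive::new().with_time(Duration::from_secs(30)).with_interval(Duration::from_secs(10)))`). With the Linux default of 9 probes this lets the kernel detect a dead peer in about two minutes (30s idle + 9 × 10s), or in about a minute if the probe count is lowered to 3, and surface an error to `read.next()`.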
2. **Add an idle read timeout** to the select loop, e.g.:
   ```rust
   match tokio::time::timeout(Duration::from_secs(60), read.next()).await {
       Err(_) => { warn!("no Slack frame in 60s; reconnecting"); break; }
       Ok(Some(msg)) => { /* handle */ }
       Ok(None) => break,
   }
   ```
   Slack sends regular traffic/pings on a healthy connection, so an idle window reliably indicates a dead socket.
3. **Send periodic WebSocket Ping frames** and reconnect if no Pong is received within a deadline.
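
The ping-and-deadline bookkeeping of fix 3 can be sketched as a small state machine, kept free of async and WebSocket dependencies so the logic can be checked on its own. `Liveness`, `Action`, and the method names are hypothetical, not openab code. The sketch also assumes that any inbound frame, not only a Pong, counts as proof that the connection is alive.

```rust
use std::time::{Duration, Instant};

/// What the read loop should do next (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Wait,
    SendPing,
    Reconnect,
}

/// Tracks inbound activity and one outstanding ping (hypothetical type).
pub struct Liveness {
    ping_interval: Duration,       // idle time before a ping is sent
    pong_deadline: Duration,       // how long to wait for a reply to that ping
    last_seen: Instant,            // when the last inbound frame arrived
    ping_sent_at: Option<Instant>, // set while a ping is unanswered
}

impl Liveness {
    pub fn new(now: Instant, ping_interval: Duration, pong_deadline: Duration) -> Self {
        Liveness { ping_interval, pong_deadline, last_seen: now, ping_sent_at: None }
    }

    /// Call for every inbound frame (Pong, Ping, Text, ...).
    pub fn on_frame(&mut self, now: Instant) {
        self.last_seen = now;
        self.ping_sent_at = None;
    }

    /// Call on every timer tick to decide the next step.
    pub fn poll(&mut self, now: Instant) -> Action {
        if let Some(sent) = self.ping_sent_at {
            // A ping is outstanding: give up once the deadline has passed.
            if now.duration_since(sent) >= self.pong_deadline {
                return Action::Reconnect;
            }
            return Action::Wait;
        }
        if now.duration_since(self.last_seen) >= self.ping_interval {
            self.ping_sent_at = Some(now);
            return Action::SendPing;
        }
        Action::Wait
    }
}
```

In the read loop, an extra `tokio::time::interval` branch in the existing `select!` could call `poll(Instant::now())`, send a `tungstenite::Message::Ping` on `SendPing`, and break out to the reconnect path on `Reconnect`. The inbound branch would call `on_frame` for each message it receives.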

## Workaround (no patch / no restart required)

For anyone hitting this before a fix lands: an external watchdog can force the dead socket closed so openab's own 5s-reconnect fires, keeping the process and all in-memory state intact:

- Locate the openab process and its `ESTABLISHED` :443 socket to the Slack WSS host.
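- `pidfd_open(pid)` + `pidfd_getfd(pidfd, fd)` (Linux ≥5.6) to dup the socket, then `shutdown(dup, SHUT_RDWR)`. `read.next()` then returns end-of-stream or an error and openab reconnects. `pidfd_getfd` requires ptrace-attach permission over the target process, so in practice the watchdog runs as root or with `CAP_SYS_PTRACE`.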
- Note: `ss -K` does **not** work on kernels built without `CONFIG_INET_DIAG_DESTROY` (e.g. Raspberry Pi OS), which is why the `pidfd` approach is used.
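
The steps above can be sketched in std-only Rust by calling the syscalls through libc's `syscall` wrapper. This is a rough illustration under stated assumptions, not a finished watchdog: it assumes Linux ≥5.6 on an architecture that uses the unified syscall numbers (434 and 438 on x86_64, aarch64, and 32-bit ARM), and `shutdown_remote_socket` is a hypothetical name. Finding the right `target_fd` (for example by matching the `socket:[inode]` links under `/proc/<pid>/fd` against the inode column of `/proc/<pid>/net/tcp`) is left out.

```rust
use std::io;
use std::os::raw::{c_int, c_long};
use std::os::unix::io::{AsRawFd, FromRawFd, OwnedFd};

// Assumed syscall numbers; verify them for your target architecture.
const SYS_PIDFD_OPEN: c_long = 434;
const SYS_PIDFD_GETFD: c_long = 438;
const SHUT_RDWR: c_int = 2;

extern "C" {
    fn syscall(number: c_long, ...) -> c_long;
    fn shutdown(fd: c_int, how: c_int) -> c_int;
}

/// Converts a negative libc return value into the current errno.
fn check(ret: c_long) -> io::Result<c_long> {
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret)
    }
}

/// Duplicates `target_fd` out of process `pid` and shuts the socket down in
/// both directions. The target process keeps its descriptor, but its next
/// read on that socket sees end-of-stream.
pub fn shutdown_remote_socket(pid: u32, target_fd: c_int) -> io::Result<()> {
    // SAFETY: the syscalls take plain integers, and each returned descriptor
    // is wrapped in an OwnedFd right away so it is closed on every path.
    unsafe {
        let raw = check(syscall(SYS_PIDFD_OPEN, pid as c_long, 0 as c_long))?;
        let pidfd = OwnedFd::from_raw_fd(raw as c_int);

        let raw = check(syscall(
            SYS_PIDFD_GETFD,
            pidfd.as_raw_fd() as c_long,
            target_fd as c_long,
            0 as c_long,
        ))?;
        let dup = OwnedFd::from_raw_fd(raw as c_int);

        check(shutdown(dup.as_raw_fd(), SHUT_RDWR) as c_long)?;
    }
    Ok(())
}
```

Because the duplicate refers to the same kernel socket, shutting it down affects the descriptor held by openab as well, while closing the duplicate afterwards leaves openab's descriptor open. If either syscall is unavailable or not permitted, the function returns the corresponding OS error instead of touching the socket.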

