Summary
When the underlying TCP connection of the Slack Socket Mode WebSocket dies silently (no FIN/RST — e.g. a NAT/firewall idle-timeout drops the flow), openab never detects it and the bot goes permanently silent until the process is restarted. The read loop blocks on read.next() forever, so the existing "reconnect on error" path is never triggered.
Environment
- openab 0.8.4 (also present in main as of commit pushed 2026-06-13, and v0.8.5-beta.9)
- Slack adapter, Socket Mode
- Running in Docker behind a home router doing NAT (very common deployment)
Observed behavior
Normally Slack cycles the Socket Mode connection roughly every 5h; openab logs the drop and reconnects cleanly:
01:54:03 ERROR openab::slack: Socket Mode read error: WebSocket protocol error: Connection reset without closing handshake
01:54:03 WARN openab::slack: reconnecting to Slack Socket Mode in 5s...
01:54:08 INFO openab::slack: connecting to Slack Socket Mode url=wss://wss-primary.slack.com/link/?ticket=...
01:54:09 INFO openab::slack: Slack Socket Mode connected
But once, after a normal connected line, the ~5h cadence simply stopped and there were zero openab::slack log lines for ~23 hours — no read error, no reconnect. The bot silently stopped receiving any Slack events. docker inspect still showed the container "healthy" and the process alive.
Inspecting the socket from the host confirmed a classic half-open connection:
- Exactly one ESTABLISHED TCP connection to a Slack WSS IP on :443.
- /proc/<pid>/net/tcp showed tr=00, tm->when=0, retrnsmt=0 for that socket, i.e. no keepalive timer and no retransmit timer were running. The kernel had no mechanism to ever discover the peer was gone.
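For anyone who wants to check the same fields on their own host, here is a minimal Rust sketch (standard library only) that decodes the timer columns of one /proc/<pid>/net/tcp row. The names TcpTimers, parse_tcp_row and looks_unmonitored are hypothetical and not part of openab. The tr codes are taken from the kernel's proc_net_tcp documentation as I understand it (0 = no timer pending, 1 = retransmit, 2 = another timer such as keepalive, 3 = TIME_WAIT, 4 = zero-window probe).

```rust
/// Timer-related columns of one data row of /proc/<pid>/net/tcp.
#[derive(Debug, PartialEq)]
pub struct TcpTimers {
    /// `st` column; 0x01 means ESTABLISHED.
    pub state: u8,
    /// `tr` column; 0 means no timer is pending for this socket.
    pub timer_active: u8,
    /// `tm->when` column; jiffies until the pending timer fires.
    pub timer_when: u32,
    /// `retrnsmt` column; number of unrecovered retransmit timeouts.
    pub retransmits: u32,
}

/// Parses one data row (not the header line). All numeric columns used
/// here are printed by the kernel in hexadecimal.
pub fn parse_tcp_row(row: &str) -> Option<TcpTimers> {
    // Columns: 0=sl 1=local 2=remote 3=st 4=tx:rx 5=tr:tm->when 6=retrnsmt ...
    let cols: Vec<&str> = row.split_whitespace().collect();
    let state = u8::from_str_radix(cols.get(3)?, 16).ok()?;
    let (tr, when) = cols.get(5)?.split_once(':')?;
    Some(TcpTimers {
        state,
        timer_active: u8::from_str_radix(tr, 16).ok()?,
        timer_when: u32::from_str_radix(when, 16).ok()?,
        retransmits: u32::from_str_radix(cols.get(6)?, 16).ok()?,
    })
}

impl TcpTimers {
    /// ESTABLISHED with no timer armed and nothing being retransmitted:
    /// the kernel is not going to notice a vanished peer on its own.
    pub fn looks_unmonitored(&self) -> bool {
        self.state == 0x01 && self.timer_active == 0 && self.retransmits == 0
    }
}
```

Fed a row whose timer column reads 00:00000000, looks_unmonitored() returns true. A healthy idle socket without keepalive looks the same, so this confirms that nothing is probing the peer rather than proving the peer is dead.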
Root cause (in src/slack.rs)
The Socket Mode connection is opened with tokio_tungstenite::connect_async() and the read loop is purely reactive:
    match tokio_tungstenite::connect_async(&ws_url).await {
        Ok((ws_stream, _)) => {
            let (mut write, mut read) = ws_stream.split();
            loop {
                tokio::select! {
                    msg_result = read.next() => {
                        let Some(msg_result) = msg_result else { break }; // None -> reconnect
                        // Err(..) -> reconnect
                        ...
                        Ok(tungstenite::Message::Ping(data)) => {
                            let _ = write.send(tungstenite::Message::Pong(data)).await; // replies to pings
                        }
                        ...
                    }
                }
            }
        }
    }
Three gaps combine to make recovery impossible for a half-open socket:
- No SO_KEEPALIVE is set on the TCP stream, so the OS never probes a silently-dead peer.
- No read/idle timeout: read.next() can block indefinitely; nothing breaks the loop when frames simply stop arriving.
- No proactive WebSocket Ping: openab replies to Slack pings with pongs, but never sends its own pings, so it has no application-level liveness check either.
Result: reconnect only fires on Err/None from read.next(), which never happens on a half-open socket.
Reproduction
- Establish a Slack Socket Mode connection.
- Silently break the underlying TCP flow without sending FIN/RST (drop the conntrack/NAT entry on a middlebox, add a DROP rule for the established 5-tuple with iptables -A, or pull the upstream link briefly so the NAT entry expires).
- openab keeps the socket ESTABLISHED on its side and never reconnects; the bot stops responding indefinitely.
Impact
Silent, indefinite outage of the Slack integration with no error logged and a "healthy" container — only a manual restart recovers it. Especially likely behind consumer NAT where idle WSS flows get reaped.
Suggested fixes (any one helps; ideally 1 + 2)
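- Enable TCP keepalive on the stream with a short interval, e.g. via socket2::SockRef::from(&tcp).set_tcp_keepalive(&TcpKeepalive::new().with_time(Duration::from_secs(30)).with_interval(Duration::from_secs(10))). This lets the kernel detect a dead peer and surface an error to read.next(). With Linux's default of 9 probes that takes roughly two minutes of silence (30s idle + 9 × 10s probes); lowering the probe count with with_retries shortens it.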
- Add an idle read timeout to the select loop, e.g.:
    match tokio::time::timeout(Duration::from_secs(60), read.next()).await {
        Err(_) => { warn!("no Slack frame in 60s; reconnecting"); break; }
        Ok(Some(msg)) => { /* handle */ }
        Ok(None) => break,
    }
  Slack sends regular traffic/pings on a healthy connection, so an idle window reliably indicates a dead socket.
- Send periodic WebSocket Ping frames and reconnect if no Pong is received within a deadline.
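The idle-timeout and ping-deadline ideas can be combined in one small piece of bookkeeping. The following is a rough, standard-library-only sketch of that decision logic, not openab code: Liveness, Action and the two thresholds are hypothetical names, and the assumption is that the read loop would call on_frame() for every frame it receives and poll() from a periodic timer arm (for example a tokio::time::interval branch in the existing select!).

```rust
use std::time::{Duration, Instant};

/// What the read loop should do on a timer tick.
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Traffic was seen recently (or a ping is already in flight); keep going.
    Wait,
    /// The link has been quiet for `ping_after`; send a WebSocket Ping.
    SendPing,
    /// Nothing arrived for `dead_after`; drop the socket and reconnect.
    Reconnect,
}

/// Tracks when the last frame arrived and whether a probe ping is outstanding.
pub struct Liveness {
    last_frame: Instant,
    ping_sent: bool,
    ping_after: Duration,
    dead_after: Duration,
}

impl Liveness {
    pub fn new(now: Instant, ping_after: Duration, dead_after: Duration) -> Self {
        Self { last_frame: now, ping_sent: false, ping_after, dead_after }
    }

    /// Call for every frame received (data, Ping or Pong).
    pub fn on_frame(&mut self, now: Instant) {
        self.last_frame = now;
        self.ping_sent = false;
    }

    /// Call on every timer tick to decide the next step.
    pub fn poll(&mut self, now: Instant) -> Action {
        let idle = now.saturating_duration_since(self.last_frame);
        if idle >= self.dead_after {
            Action::Reconnect
        } else if idle >= self.ping_after && !self.ping_sent {
            // Probe once per quiet period; a Pong resets the clock via on_frame().
            self.ping_sent = true;
            Action::SendPing
        } else {
            Action::Wait
        }
    }
}
```

With ping_after = 30s and dead_after = 60s (both assumed values, the latter taken from the timeout example above), a half-open socket would be abandoned about a minute after the last frame, independent of any kernel keepalive settings.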
Workaround (no patch / no restart required)
For anyone hitting this before a fix lands: an external watchdog can force the dead socket closed so openab's own 5s-reconnect fires, keeping the process and all in-memory state intact:
- Locate the openab process and its ESTABLISHED :443 socket to the Slack WSS host.
- Use pidfd_open(pid) + pidfd_getfd(pidfd, fd) (Linux ≥5.6) to dup the socket, then call shutdown(dup, SHUT_RDWR). read.next() then sees EOF and openab reconnects.
- Note: ss -K does not work on kernels built without CONFIG_INET_DIAG_DESTROY (e.g. Raspberry Pi OS), which is why the pidfd approach is used.
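The workaround relies on shutdown() acting on the socket itself rather than on one file descriptor, so shutting down a duplicate also unblocks the original reader. The sketch below illustrates only that mechanism, using the Rust standard library inside a single process: try_clone() stands in for the cross-process dup that pidfd_getfd provides, and the loopback listener stands in for a Slack endpoint that has gone silent. The function name is hypothetical.

```rust
use std::io::Read;
use std::net::{Shutdown, TcpListener, TcpStream};

/// Shuts down a duplicated handle of a connected socket and returns how many
/// bytes the original handle reads afterwards (0 means EOF).
pub fn read_after_shutdown_on_dup() -> std::io::Result<usize> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    // Keep the peer side open and silent, like a half-open connection
    // where no FIN or RST ever arrives.
    let (_silent_peer, _) = listener.accept()?;

    // A second descriptor for the same underlying socket.
    let dup = client.try_clone()?;
    // Equivalent of shutdown(dup, SHUT_RDWR) in the watchdog.
    dup.shutdown(Shutdown::Both)?;

    // The original handle no longer blocks: the read returns immediately.
    let mut buf = [0u8; 16];
    client.read(&mut buf)
}
```

On Linux the read on the original handle returns 0 bytes, which is the end-of-stream condition that lets openab's existing reconnect path run. A real watchdog would additionally need permission to access the target process, since pidfd_getfd is subject to ptrace access checks.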