Skip to content

fix(slack): reconnect on silent (half-open) Socket Mode disconnect via read idle timeout#1054

Open
samuel3132001 wants to merge 5 commits into
openabdev:mainfrom
samuel3132001:patch-1
Open

fix(slack): reconnect on silent (half-open) Socket Mode disconnect via read idle timeout#1054
samuel3132001 wants to merge 5 commits into
openabdev:mainfrom
samuel3132001:patch-1

Conversation

@samuel3132001

@samuel3132001 samuel3132001 commented Jun 9, 2026

Copy link
Copy Markdown

Implement timeout for Slack WebSocket read to detect half-open connections.

What problem does this solve?

The Slack Socket Mode read loop in src/slack.rs awaits read.next() inside tokio::select! with no timeout. On a half-open connection — a NAT/firewall silently drops the idle TCP socket without sending a Close frame or TCP RST — read.next() blocks forever. The existing reconnect loop is therefore never reached, and the bot silently stops receiving events while the pod stays Running (last log line: Slack Socket Mode connected). Only a manual restart recovers it.

Closes #1050

Discord Discussion URL:https://discord.com/channels/1491295327620169908/1491969620754567270/1514113993063792660

At a Glance

 Slack server ──ping ~every 30s──▶ ┌──────────────────────────────┐
                                   │ slack.rs Socket Mode loop    │
 (before)                          │   tokio::select! {           │
 NAT/firewall silently drops the   │     read.next()  ⚠ NO TIMEOUT│
 idle TCP socket (no Close, no RST)│   }                          │
            │                      └──────────────────────────────┘
            ▼                                    │ read.next() never returns
     no frames ever arrive  ───────────▶  reconnect loop (already exists) UNREACHABLE
                                                 │
                                                 ▼   bot silently goes dark

 (after)  tokio::select! { timeout(60s, read.next()) }
          └─ 60s with no frame ⇒ Elapsed ⇒ break ⇒ existing reconnect loop runs

Prior Art & Industry Research

OpenClaw: Uses the official Slack SDK with autoReconnect enabled plus an application-level health monitor that detects stale sockets and restarts with bounded backoff (~2s→30s, capped attempts). It exposes clientPingTimeout (default 15000ms) / serverPingTimeout and fails fast on non-recoverable auth errors. The exact class of bug — and their stale-socket health-monitor mitigation — is tracked in openclaw/openclaw#61072. Docs: OpenClaw Slack channel.

Hermes Agent: Not applicable — Hermes is an ACP coding backend, not a chat transport. Slack connection handling lives entirely in OpenAB's shared src/slack.rs, independent of which agent is running.

Other references: The official Slack SDKs implement a proactive ping/pong heartbeat to detect dead connections: the client expects a pong within clientPingTimeout (default 5000ms) and a server ping within serverPingTimeout (default 30000ms); on timeout it reconnects when auto-reconnect is enabled (the Python SDK flags a stale connection when idle exceeds ping_interval × 4). See Node SDK SocketModeOptions and python-slack-sdk Socket Mode. OpenAB's adapter is a custom tokio-tungstenite client with no such heartbeat, hence this gap.

Proposed Solution

Wrap the inbound read in a 60-second idle timeout inside the existing tokio::select!:

msg_result = tokio::time::timeout(std::time::Duration::from_secs(60), read.next()) => {
    let msg_result = match msg_result {
        Ok(inner) => inner,
        Err(_elapsed) => {
            warn!("no frame from Slack for 60s; assuming dead connection, reconnecting");
            break;
        }
    };
    let Some(msg_result) = msg_result else { break };
    // ... existing match on msg_result unchanged ...
}

Slack's server sends a ping roughly every 30s, so 60s with no frame at all is a reliable dead-connection signal. On timeout we break, which lets the already-existing reconnect loop run.

Why this approach?

OpenAB's Slack adapter is a hand-rolled tokio-tungstenite client with no SDK-level heartbeat, and it already has a reconnect loop — the only defect is that the loop is unreachable while read.next() blocks forever. A read/idle timeout is the smallest, most localized fix that closes the gap and reuses existing recovery, with no new tasks, channels, or config. Tradeoff: up to ~60s detection latency (acceptable; easily tunable). Known limitation: it relies on Slack's server pings (~30s) as the liveness signal rather than sending client-side pings; if a future need arises, a full client ping/pong (SDK-style) can be layered on top.

Alternatives Considered

  • Client-side ping + pong-timeout (Slack-SDK style): most robust, but more code and unnecessary here since the server already pings ~every 30s. Deferred.
  • Outer retry loop like the Discord adapter (fix: reconnect Discord gateway on silent WS disconnect #791, fix(discord): reconnect loop with exponential backoff on silent WS disconnect #919): doesn't apply — Slack already has a reconnect loop; the bug is reachability, which the timeout solves directly.
  • TCP keepalive (SO_KEEPALIVE): OS-level, not exposed in the current socket setup and less portable; doesn't cover application-layer staleness.
  • Dedicated stale-socket health-monitor task (OpenClaw-style): heavier; overkill for this single-connection adapter.

Validation

  • Manual testing — Reproduced in production (self-hosted k3s behind NAT): the bot went dark every few hours with Slack Socket Mode connected as the last log line and no further events; a manual kubectl rollout restart always recovered it. With the idle timeout in place, the Elapsed branch fires and the existing reconnect path runs.
  • cargo check — not run locally (no Rust toolchain on the reporter's machine); change is minimal and self-contained. Please verify via CI.
  • cargo test
  • cargo clippy

Implement timeout for Slack WebSocket read to detect half-open connections.
@samuel3132001 samuel3132001 requested a review from thepagent as a code owner June 9, 2026 23:06
@github-actions github-actions Bot added closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. pending-maintainer and removed closing-soon PR missing Discord Discussion URL — will auto-close in 24 hours. labels Jun 9, 2026

@doxagent doxagent left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecture Review — ClaudeArchitect

Overall: 診斷正確,修法最小精準。兩個 minor 改動後可 approve。


✅ 優點

  • Half-open TCP 靜默斷線是 Slack Socket Mode 已知陷阱,此修法正確切入問題核心
  • +16/-1 行,完全複用現有 reconnect loop,無新 task / channel / 全域狀態
  • PR 描述品質高(ASCII 圖、alternatives considered、prior art)

🔴 Request Changes(輕量,改完即 approve)

1. 提取具名常數(必要)

const SLACK_READ_IDLE_TIMEOUT_SECS: u64 = 60;
// ...
msg_result = tokio::time::timeout(
    std::time::Duration::from_secs(SLACK_READ_IDLE_TIMEOUT_SECS),
    read.next(),
) => {

hardcoded 60 對未來維護者不可讀,且難以搜尋調整。

2. Reconnect counter / metric(建議)

目前只有 warn!() log。若此路徑每小時觸發 N 次,現有 alerting 無感知。建議加一個 counter(即使是簡單的 AtomicU64 或 tracing span attribute),方便 oncall 在 Grafana 上觀察。


🔵 Nit

  • Err(_elapsed)Err(_) — Rust 慣例,已忽略的 binding 不需具名

📝 Architecture Note(不擋 merge,建議在 issue 追蹤)

Discord adapter 使用 outer retry loop(#791, #919),Slack 現在使用 read idle timeout——兩種策略各有合理性,但長期維護時若有人試圖統一,需要知道這個分歧存在。建議在 ARCHITECTURE.md 或模組 comment 簡單標注。


改完以上兩點後 LGTM from architecture side。

Comment thread src/slack.rs Outdated
loop {
tokio::select! {
msg_result = read.next() => {
msg_result = tokio::time::timeout(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建議提成具名常數,方便未來調整與搜尋:

const SLACK_READ_IDLE_TIMEOUT_SECS: u64 = 60;
// ...
tokio::time::timeout(
    std::time::Duration::from_secs(SLACK_READ_IDLE_TIMEOUT_SECS),
    read.next(),
)

Comment thread src/slack.rs Outdated
// outer reconnect loop runs; otherwise read.next() blocks forever here
// and the bot silently stops receiving events.
let msg_result = match msg_result {
Ok(inner) => inner,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:Rust 慣例,已忽略的 binding 不需命名,改為 Err(_) 即可。

Added a constant for Slack read idle timeout to improve connection reliability.
@samuel3132001

Copy link
Copy Markdown
Author

感謝 review 🙏 已照建議修正:

  1. ✅ 抽成具名常數 SLACK_READ_IDLE_TIMEOUT_SECS(放在檔案頂端常數區)
  2. ✅ Err(elapsed) → Err()

關於 reconnect 計數 metrics:目前 slack.rs 沒有既有的 metrics/observability 機制(只有 log),為了讓這支修復維持最小、聚焦在 bug 本身,我傾向把可觀測性留成獨立 follow-up。不過如果你們希望我在這支 PR 直接加一個「log 形式的 reconnect 計數」,我很樂意跟進——告訴我偏好即可。另外「在文件記錄 Discord retry vs Slack idle-timeout 差異」的建議也很合理,可以一起放到 follow-up。

@howie

howie commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review summary

Reviewed via a multi-model pass (Claude + Codex, independent + cross-debate). Overall this is a correct, net-positive safety fix — previously read.next() could block forever on a silently-dropped TCP connection and the bot would go dark with no logs. CI is green. No correctness/security blockers found. The points below are improvements, ordered by value.

Suggestions

1. Reconnect immediately on a failed ack/pong write (highest-value)
src/slack.rs:735-737 (envelope ack) and :1047-1048 (pong) use let _ = write.send(...), discarding the error. On a half-open socket — exactly the case this PR targets — a send error is the strongest, fastest "socket is dead" signal available, yet it's currently swallowed, so detection has to wait out the full 60s idle timeout. A failed ack can also lead to Slack redelivering the envelope.
Suggested: on write.send(...) returning Err, warn! and break the read loop so the outer loop reconnects right away. This directly complements the idle-timeout mechanism. (These swallows are pre-existing, so feel free to do it here or as a quick follow-up — both reviewers independently flagged it as the most worthwhile change.)

2. Don't state Slack's ping cadence as a fact
The const doc (:58-60) and inline comment (:709) assert "Slack server-pings ~every 30s." Slack does not document a guaranteed ~30s server→client ping cadence, so this is an unverified number presented as the justification for the 60s value. The logic doesn't actually depend on 30s — any inbound frame (ping, ack, event, close) resets the timer, so the real invariant is simply "a healthy connection is never frame-silent for 60s." Stating a specific number risks a future maintainer tightening the timeout below the true keepalive interval and causing spurious reconnects.
Suggested wording:

/// Reconnect the Slack Socket Mode socket if no frame (ping, ack, or event)
/// arrives within this window. A healthy Socket Mode connection is never silent
/// this long, so 60s of silence means the connection is half-open (dead) —
/// detects silent NAT/firewall drops.

(If the 30s figure is sourced, a doc link / SDK constant reference would resolve this.)

3. Minor wording / readability (nits)

  • :709-713 comment: "60s with no frame means the socket is silently dead" → "...is treated as dead"; "read.next() blocks forever" → "can block indefinitely" (a half-open TCP connection may eventually surface an OS-level error, so not strictly forever).
  • :714-723: msg_result is bound three times in a row across three different types (the select!/timeout result, the unwrapped Option, then the Result<Message,_>). Consider naming the outer binding timeout_result and keeping msg_result for the unwrapped frame.

Optional follow-ups (not needed for this PR)

  • Observability / active liveness: if the ping-cadence assumption is ever wrong (or there's a legitimate >60s quiet window), a healthy connection would reconnect every 60s with only a per-cycle warn!. A consecutive-idle-reconnect counter that escalates to error!, or sending a client Ping at ~40s and only breaking if no frame follows, would distinguish "dead socket" from "quiet socket." Worthwhile but not blocking.
  • Test: the timeout → break → reconnect decision has no coverage and the loop isn't unit-testable as-is. A clean way (for a later PR) is to extract the per-frame decision into a pure fn, e.g. fn classify_read(outcome: Result<Option<...>, Elapsed>) -> LoopAction { Continue | Reconnect | Process(msg) }, called from the select! arm, then a synchronous #[test] asserting Err(Elapsed) → Reconnect, Ok(None) → Reconnect, Ok(Some(text)) → Process (auto-wired via the existing cargo test CI). A spin-up-the-adapter-and-wait test would be slow/flaky, so please avoid that shape.

Verified along the way (no change needed): break exits the correct loop and falls through to the existing 5s reconnect backoff; the timeout-wrapped read.next() is cancel-safe inside select!; spawned message handlers reply via the HTTP API so they survive a reconnect; and the Ping handler at :1047 confirms server pings are surfaced as frames (so they do reset the idle timer).

Thanks for the fix! 🙏

@samuel3132001

Copy link
Copy Markdown
Author

Thanks for the thorough multi-model review — really appreciate it 🙏 Pushed a commit addressing all three:

  1. Fail fast on a failed ack/pong write — both write.send(...) sites (envelope ack + pong) now warn! and break to reconnect immediately instead of swallowing the error, so a dead half-open socket is caught on the next write rather than waiting out the full idle timeout (and avoids the duplicate-envelope redelivery you flagged).
  2. Dropped the "~30s" ping-cadence claim — reworded the const doc + inline comment to state the real invariant ("a healthy connection is never frame-silent this long") rather than an unverified server-ping interval.
  3. Wording + binding nits — "treated as dead" / "can block indefinitely", and collapsed the triple msg_result rebind into a single match timeout_result { Ok(Some) / Ok(None) / Err }.

Left the two optional follow-ups (a consecutive-idle escalation counter / active client-ping, and the classify_read pure-fn unit test) out of this PR as suggested — happy to do them in a follow-up. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Slack] Socket Mode adapter hangs forever on half-open connection (no read/idle timeout)

4 participants