fix(telegram): retry polling on all errors, not just 409 Conflict#1377

Closed
3koozy wants to merge 1 commit into anthropics:main from 3koozy:fix/telegram-poll-loop-retry

3koozy commented Apr 12, 2026

Summary

The poll-loop startup in server.ts only retries 409 Conflict errors. Any other transient error (connection timeout, TLS reset, DNS hiccup) causes bot.start() to reject, the catch block to return, and polling to stop permanently. The plugin process stays alive because MCP stdin keeps it running — so outbound tools (reply, react, edit_message) continue to work while the bot is completely deaf to inbound Telegram messages.

This is one of the root causes behind the widespread "outbound works, inbound doesn't" reports across 65+ open issues.

Root cause analysis

The vulnerable code path (lines 988–1035):

} catch (err) {
  // ...
  if (err instanceof GrammyError && err.error_code === 409) {
    // retry with backoff
    continue
  }
  // ...
  process.stderr.write(`telegram channel: polling failed: ${err}\n`)
  return  // ← ANY non-409 error exits the loop permanently
}

After return, the async IIFE exits. The process stays alive because the MCP server's stdin reader keeps the event loop running. Outbound tool calls work (they create fresh HTTP requests), but grammy's poll loop is dead — no getUpdates calls, no inbound messages.
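The effect of that control flow can be reproduced in miniature. The sketch below is not the plugin's code: `isConflict` stands in for the `err instanceof GrammyError && err.error_code === 409` check, and `runOriginalLoop`, `Attempt`, and the scripted error objects are hypothetical names used only for illustration. It replays a sequence of startup outcomes through the original "retry only on 409" rule and reports where the loop stops.

```typescript
// One scripted outcome of a start attempt: an error object, or null for success.
type Attempt = { error_code?: number; message: string } | null

// Stand-in for `err instanceof GrammyError && err.error_code === 409`.
function isConflict(err: { error_code?: number }): boolean {
  return err.error_code === 409
}

// Replays the scripted attempts through the original rule:
// retry on 409, give up permanently on anything else.
function runOriginalLoop(script: Attempt[]): { attempts: number; polling: boolean } {
  let attempts = 0
  for (const outcome of script) {
    attempts++
    if (outcome === null) return { attempts, polling: true } // start succeeded
    if (isConflict(outcome)) continue // 409: retry
    return { attempts, polling: false } // any other error: loop exits for good
  }
  return { attempts, polling: false } // script ran out without a successful start
}

// A 409, then a transient timeout, then a start that would have succeeded.
const result = runOriginalLoop([
  { error_code: 409, message: "Conflict" },
  { message: "connection timeout" },
  null,
])
console.log(result) // { attempts: 2, polling: false }
```

The third scripted attempt is never reached: the single non-409 error ends the loop, which matches the "alive but deaf" state described above.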

Evidence (strace on a stalled plugin)

$ timeout 10 strace -p <plugin_pid> -e trace=network,write,read -f
strace: Process <pid> attached
strace: Process <pid> detached
(zero output — zero syscalls in 10 seconds)

FD inspection during the stall:

$ for fd in /proc/<pid>/fd/*; do echo "$(basename $fd) -> $(readlink $fd)"; done
0 -> socket:[...]    # MCP stdin (alive)
1 -> socket:[...]    # MCP stdout (alive)
2 -> socket:[...]    # stderr
3 -> /dev/urandom
4 -> anon_inode:[eventpoll]
# ... all internal FDs, ZERO TCP sockets to api.telegram.org

After applying this fix, the plugin maintained 2 ESTABLISHED TCP connections to 149.154.166.110:443 for 3+ hours and recovered from transient errors automatically.

What this PR changes

  • Catch ALL errors in the poll-loop, not just 409 — retry with exponential backoff (up to 15s)
  • Track consecutive 409s separately so the "another poller is active" give-up logic (8 attempts) is preserved
  • Reset counters on successful bot.start() via the onStart callback
  • Log all retry errors with the error message so operators can see what's happening
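The changes listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the 15-second cap and the 8-attempt 409 limit come from the description, while the 1-second base delay, the reset of the 409 streak on a non-409 error, and the names `backoffMs`, `onStartFailure`, `pollForever`, `start`, and `sleep` are hypothetical. Graceful-shutdown handling is omitted for brevity.

```typescript
const MAX_DELAY_MS = 15000 // backoff cap stated in the PR description
const MAX_CONSECUTIVE_409 = 8 // 409 give-up limit stated in the PR description
const BASE_DELAY_MS = 1000 // assumption: the base delay is not stated in the PR

// Exponential backoff: base * 2^(attempt - 1), capped at MAX_DELAY_MS.
function backoffMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS)
}

// Structural stand-in for `err instanceof GrammyError && err.error_code === 409`.
function isConflict(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { error_code?: unknown }).error_code === 409
}

interface RetryState { attempt: number; consecutive409: number }
type Decision = { action: "retry"; delayMs: number } | { action: "give-up" }

// Decides what to do after a failed start. Mutates the counters in `state`.
function onStartFailure(err: unknown, state: RetryState): Decision {
  state.attempt++
  if (isConflict(err)) {
    state.consecutive409++
    // Assumption: give up once the streak reaches the limit.
    if (state.consecutive409 >= MAX_CONSECUTIVE_409) return { action: "give-up" }
  } else {
    state.consecutive409 = 0 // assumption: a non-409 error breaks the streak
  }
  return { action: "retry", delayMs: backoffMs(state.attempt) }
}

// Retry loop. `start` is a hypothetical wrapper around bot.start() that invokes
// `onStarted` once polling is up and resolves when polling ends cleanly.
async function pollForever(
  start: (onStarted: () => void) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  const state: RetryState = { attempt: 0, consecutive409: 0 }
  for (;;) {
    try {
      // Reset both counters on a successful start.
      await start(() => { state.attempt = 0; state.consecutive409 = 0 })
      return // polling stopped cleanly
    } catch (err) {
      const decision = onStartFailure(err, state)
      if (decision.action === "give-up") {
        console.error(`polling: giving up after ${state.consecutive409} consecutive 409s`)
        return
      }
      console.error(`polling failed (${String(err)}); retrying in ${decision.delayMs}ms`)
      await sleep(decision.delayMs)
    }
  }
}
```

With these assumed constants, successive failures wait 1000, 2000, 4000, 8000, and then 15000 ms. Keeping the decision logic in a pure function (`onStartFailure`) makes the 409 limit and the backoff schedule checkable without a live Telegram connection.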

What this PR does NOT fix

There is a second, separate bug in Claude Code itself: after ~2–3 hours, Claude Code's internal handler for notifications/claude/channel events stops surfacing inbound messages, even when the plugin is alive and actively polling (confirmed via strace showing active write syscalls and ESTABLISHED TCP sockets). This is tracked in anthropics/claude-code#46744 and others. This PR fixes the plugin-side crash; the Claude Code notification handler degradation is a separate issue.

Related issues

Test plan

  • Verified in Docker container (Debian bookworm, Bun 1.x) for 3+ hours
  • Confirmed via strace that plugin maintains TCP connections after fix
  • 409 Conflict handling still works (consecutive409 counter, 8-attempt limit)
  • Graceful shutdown (bot.stop() / "Aborted delay") still exits cleanly
  • Needs testing on macOS and Windows native

🤖 Generated with Claude Code

The poll-loop startup in server.ts only retried 409 Conflict errors.
Any other transient error (connection timeout, TLS reset, DNS hiccup)
caused bot.start() to reject, the catch block to log and return, and
polling to stop permanently. The plugin process stayed alive because
MCP stdin kept it running — so outbound tools (reply, react,
edit_message) continued to work while the bot was completely deaf to
inbound Telegram messages.

This was confirmed via strace on a running container: the plugin
process had zero TCP sockets, zero syscalls in a 10-second window,
and zero network activity — despite being alive with 18 open FDs
(all internal: epoll, timerfd, eventfd, /dev/urandom, MCP stdio).

Fix: catch ALL errors (not just 409) and retry with exponential
backoff up to 15 seconds. Track consecutive 409s separately so the
"another poller is active" give-up logic is preserved. Reset both
counters on successful bot.start() via onStart callback.

Addresses the plugin-side root cause of:
- anthropics/claude-code#46744
- anthropics/claude-code#46016
- anthropics/claude-code#46356
- anthropics#870
- anthropics#1345
@github-actions

Thanks for your interest! This repo only accepts contributions from Anthropic team members. If you'd like to submit a plugin to the marketplace, please submit your plugin here.
