fix(channels): prevent Matrix /sync from timing out at exactly 30 seconds #7404
The matrix-sdk defaults to a 30-second per-request timeout while `SyncSettings::default()` sends no `?timeout=` parameter, so the homeserver returns immediately and the SDK busy-polls, and every 30-second window races the HTTP deadline. Fix by (a) setting an explicit 60-second `RequestConfig` timeout on the client so the HTTP layer doesn't fire before a long-poll can complete, and (b) passing a 30-second long-poll timeout to both `sync_once` and `sync` so the homeserver holds idle requests open and returns before the HTTP deadline.
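The failure modes described in this commit message can be summarized with a small toy model. This is only a sketch of the reasoning, not code from the PR: `IdleSync` and `classify_idle_sync` are hypothetical names, and the model ignores network latency and any homeserver-specific behavior.

```rust
use std::time::Duration;

/// Possible outcomes for one idle `/sync` round-trip in this toy model.
#[derive(Debug, PartialEq, Eq)]
enum IdleSync {
    /// No `?timeout=` was sent: the homeserver answers at once and the caller loops.
    BusyPoll,
    /// The long-poll window is not below the HTTP deadline: the client may give up first.
    RacesHttpDeadline,
    /// The long-poll window ends before the HTTP deadline: the request returns normally.
    LongPollCompletes,
}

/// Classifies an idle sync from the client-side HTTP deadline and the
/// optional long-poll window sent to the homeserver (hypothetical helper).
fn classify_idle_sync(http_deadline: Duration, long_poll: Option<Duration>) -> IdleSync {
    match long_poll {
        None => IdleSync::BusyPoll,
        Some(window) if window >= http_deadline => IdleSync::RacesHttpDeadline,
        Some(_) => IdleSync::LongPollCompletes,
    }
}
```

With the values from this PR, `classify_idle_sync(Duration::from_secs(60), Some(Duration::from_secs(30)))` falls into the `LongPollCompletes` case, which is the behavior the fix aims for.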
Audacity88
left a comment
I reviewed current head 8215ed7, the PR body, the one-file matrix.rs diff, the prior #7119 split review and closeout comment, the empty current comment/review threads, and the now-green visible CI. I did not rerun local cargo or run a live Matrix channel soak.
@tidux Thanks for splitting this out from #7119. The new PR is much easier to review: it only carries the Matrix /sync timeout change, and the scope/body now match the one-file Matrix diff.
The code direction looks right. I am not comfortable approving it yet because the PR changes live channel runtime behavior and the central claim still needs live or near-live Matrix evidence.
✅ Resolved — The #7119 mixed-scope problem is fixed
This PR no longer bundles the runtime trim_history fix with the Matrix channel fix. The changed file list is just crates/zeroclaw-channels/src/matrix.rs, and the PR body now describes the Matrix timeout behavior, Matrix-only blast radius, and the split from #7119 clearly.
🟢 What looks good — The timeout relationship is explicit
Setting a 60-second Matrix client request timeout and a 30-second /sync long-poll timeout is the right shape for the described failure. The comments also state the important invariant: the server-side long-poll window must stay below the HTTP client deadline, so idle syncs can return normally before the request layer times out.
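For illustration, the invariant could also be enforced at compile time. The sketch below uses the constant names and values from the PR, but the `const` assertion and the `headroom` helper are assumptions of this example; the PR itself is described as documenting the ordering in comments.

```rust
use std::time::Duration;

/// HTTP request deadline applied to the Matrix client (60s, per the PR).
const CLIENT_REQUEST_TIMEOUT: Duration = Duration::from_secs(60);

/// Server-side `/sync` long-poll window (30s, per the PR).
const SYNC_LONGPOLL_TIMEOUT: Duration = Duration::from_secs(30);

// Hypothetical compile-time guard: the build fails if the long-poll window
// is ever raised to or above the HTTP request deadline.
const _: () = assert!(SYNC_LONGPOLL_TIMEOUT.as_secs() < CLIENT_REQUEST_TIMEOUT.as_secs());

/// Time left between the end of the long-poll window and the HTTP deadline
/// (hypothetical helper, not part of the PR).
fn headroom() -> Duration {
    CLIENT_REQUEST_TIMEOUT - SYNC_LONGPOLL_TIMEOUT
}
```

A guard like this turns a misordered pair of constants into a build error rather than a runtime symptom, at the cost of one extra line next to the definitions.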
🔴 Blocking — Add live or near-live Matrix idle-sync evidence
This PR's central behavior is not just that the code compiles. It is that an idle Matrix /sync request no longer hits the 30-second HTTP timeout / busy-poll pattern. The PR body says a full live Matrix soak was not run.
For channel runtime behavior, I need live or near-live evidence before approval unless we explicitly accept an exception. Please add a short Matrix smoke result that leaves a configured Matrix channel idle for longer than 30 seconds and shows that /sync no longer errors at the 30-second cadence. A concise log excerpt or run note is enough; it does not need to be a large soak. If live Matrix validation is impractical, say so explicitly and we can decide whether to accept the risk, but the current PR evidence stops at compile/CI plus timeout math.
this message sent by todixuclawbot confirming Matrix end to end connectivity.
Prior sync misbehavior has been corrected as demonstrated in ZeroClaw daemon logs. Test coming.
Reviewer @Audacity88 (CHANGES_REQUESTED on zeroclaw-labs#7404) asked for a short live-Matrix smoke that leaves a configured channel idle for >30s and shows that `/sync` no longer errors at the 30-second cadence.

Add `live_smoke::idle_sync_does_not_error_at_30s_cadence`. It is gated with `#[ignore]` and reuses the existing `ZEROCLAW_MATRIX_SMOKE_*` (and fallback `ZEROCLAW_MATRIX_*`) env contract from the sibling `same_room_partial_draft_lifecycle_uses_real_draft_ids` test, so it joins the same CI-skipped / locally-runnable lane.

The test exercises both halves of the fix end-to-end against a real homeserver:
* `ensure_client()` builds the client with `CLIENT_REQUEST_TIMEOUT` applied to the underlying `RequestConfig`; if that ever regresses below `SYNC_LONGPOLL_TIMEOUT`, the very first idle long-poll trips the HTTP deadline.
* Each `sync_once` call passes `SYNC_LONGPOLL_TIMEOUT` so the homeserver actually holds the request open.

Over a ~35s soak (tunable via `ZEROCLAW_MATRIX_SMOKE_IDLE_SECS`, must exceed 30) it asserts:
* No `sync_once` returns an error (primary reviewer ask; rules out the pre-fix 30s HTTP-deadline regression).
* At least one round-trip durably long-polls (rules out the pre-fix busy-poll pattern where the SDK busy-loops because no `?timeout=` was sent).
* The count of sub-`MIN_LONGPOLL_MS` returns stays inside a small budget (defense-in-depth against a partial regression).

Also emits a one-line `eprintln!` so a captured `cargo test -- --ignored --nocapture` run produces the "short Matrix smoke result" log line the reviewer asked for.

Scope: tests-only addition inside the existing `#[cfg(test)] mod tests { mod live_smoke { ... } }` block in `crates/zeroclaw-channels/src/matrix.rs`. No production code, no config/CLI surface, and no other channel touched.

Validation:
    cargo fmt -p zeroclaw-channels --check
    cargo clippy -p zeroclaw-channels -...
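The three assertions listed in this commit message can be sketched as a pure function over recorded round-trips. This is an illustrative reconstruction, not the test's actual code: `RoundTrip`, `check_soak`, and the parameter names are hypothetical, and the real threshold and budget values are not shown in this thread.

```rust
use std::time::Duration;

/// One timed `sync_once` round-trip: how long it took and whether it failed.
struct RoundTrip {
    elapsed: Duration,
    error: Option<String>,
}

/// Applies the three checks from the commit message to a finished soak.
/// Returns a one-line summary on success or the first violated check.
fn check_soak(
    trips: &[RoundTrip],
    min_longpoll: Duration,
    short_budget: usize,
) -> Result<String, String> {
    // 1. No round-trip may have returned an error.
    if let Some(msg) = trips.iter().find_map(|t| t.error.as_deref()) {
        return Err(format!("sync_once returned an error: {}", msg));
    }
    // 2. At least one round-trip must have been held open by the homeserver.
    let short = trips.iter().filter(|t| t.elapsed < min_longpoll).count();
    let long = trips.len() - short;
    if long == 0 {
        return Err("no round-trip long-polled (busy-poll pattern)".to_string());
    }
    // 3. Fast returns must stay within a small budget.
    if short > short_budget {
        return Err(format!("{} short returns exceed budget {}", short, short_budget));
    }
    Ok(format!(
        "idle soak ok: {} round-trips, {} long-polled, {} short",
        trips.len(),
        long,
        short
    ))
}
```

Keeping the checks separate from the network loop makes them testable without a homeserver; the summary string plays the role of the one-line result the reviewer asked for.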
@Audacity88 thanks for the clear blocker — addressed in Smoke test added. New It does exactly the loop you asked for:
It also emits a one-line
Near-live evidence is already in the previous comments: daemon log excerpt showing the sync loop entering cleanly with no 30s-cadence errors, plus the Element X end-to-end screenshot.
Local validation of the test commit:
Scope is still single-file (the new test sits inside the existing
Audacity88
left a comment
I reviewed current head aeef551, the updated PR body, the one-file matrix.rs diff, the follow-up top-level comments with Matrix daemon evidence, the new ignored live-smoke test, the empty current inline review threads, the prior CHANGES_REQUESTED review, and the now-green visible CI. I did not rerun local cargo or run the live Matrix smoke myself.
@tidux Thanks for the fast follow-up. This addresses the blocker I left on 8215ed7.
✅ Resolved — Matrix idle-sync evidence now covers the 30-second failure mode
The earlier blocker was that the PR proved the timeout math and compile path, but not the live Matrix behavior. The updated branch now adds matrix::tests::live_smoke::idle_sync_does_not_error_at_30s_cadence, and the test exercises the same client construction path that applies CLIENT_REQUEST_TIMEOUT, then repeatedly calls sync_once(SyncSettings::default().timeout(SYNC_LONGPOLL_TIMEOUT)) for a soak window longer than 30 seconds.
That gives us a runnable near-live check for both halves of the original bug: no request-deadline error at the old 30-second cadence, and no return to the busy-poll pattern where idle /sync calls come back immediately because no ?timeout= is sent.
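The shape of such a soak loop can be sketched without a homeserver. In this minimal example, `soak` is a hypothetical helper and the `sync_once` closure is a stand-in for the real matrix-sdk call, so nothing here reflects the test's actual implementation.

```rust
use std::time::{Duration, Instant};

/// Calls `sync_once` repeatedly until `idle_window` has elapsed and records
/// how long each call took together with its result.
/// `sync_once` is a stand-in closure; the real test talks to a homeserver.
fn soak<F>(idle_window: Duration, mut sync_once: F) -> Vec<(Duration, Result<(), String>)>
where
    F: FnMut() -> Result<(), String>,
{
    let started = Instant::now();
    let mut trips = Vec::new();
    while started.elapsed() < idle_window {
        let call_started = Instant::now();
        let result = sync_once();
        trips.push((call_started.elapsed(), result));
    }
    trips
}
```

With a healthy long-poll, a 35-second window should yield only a couple of entries, each lasting close to the long-poll timeout; a busy-polling client would instead produce a long list of near-zero durations.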
🟢 What looks good — The production fix stays small and targeted
The production diff is still just the Matrix channel path. Client::builder() now gets a 60-second request timeout, while the initial sync_once and long-running sync calls share the explicit 30-second long-poll setting. The constants document the important invariant directly at the two call sites: the server-side long-poll window must stay below the HTTP request deadline.
Approving this now. The remaining risk is ordinary Matrix live-environment variance, and the new ignored smoke gives us the right opt-in tool to re-check that path when credentials are available.

Summary
Base branch: `master` (all contributions)
What changed and why:
The `matrix-sdk` defaults to a 30-second per-request HTTP timeout while `SyncSettings::default()` sends no `?timeout=` parameter, so the homeserver returns immediately and the SDK busy-polls; every 30-second window then races the HTTP deadline and idle `/sync` requests error out at exactly 30s. Relates to #6576.
The fix has two parts:
* An explicit 60-second `RequestConfig` timeout on the `Client::builder()` so the HTTP layer doesn't fire before a long-poll can complete (new `CLIENT_REQUEST_TIMEOUT` const).
* A 30-second long-poll timeout passed to both `sync_once` and the long-running `sync` call so the homeserver holds idle requests open and returns before the HTTP deadline (new `SYNC_LONGPOLL_TIMEOUT` const).
The two constants are documented relative to each other: `SYNC_LONGPOLL_TIMEOUT` must stay strictly below `CLIENT_REQUEST_TIMEOUT` so the HTTP request deadline never fires before the long-poll completes server-side. The existing integration test in the file is also updated to pass `SYNC_LONGPOLL_TIMEOUT` to `sync_once`.
Scope boundary: only `crates/zeroclaw-channels/src/matrix.rs`; no runtime, agent, provider, schema, session persistence, or CLI surface touched.
Blast radius: Matrix channel only. Worst case if the timeout values are wrong is `/sync` behaves the same way it does today (request times out at 30s), i.e. the fix can only equal or improve the current behavior on the Matrix path. No other channel is affected.
Split note: this is the second half of the now-closed #7119, split per @singlerider's review. The other half (the `trim_history` runtime fix) is in its own PR (linked below). Each PR's body now honestly covers its own diff.
Linked issue(s): None; diagnosed from observed 30-second timeout pattern in Matrix sync logs.
Related PRs: Companion split-out: `trim_history` orphan-cascade guard in #7403. Half of the original #7119 (to be closed).
Labels: `bug`, `risk: medium`, `channel`, `channel:matrix`, `size: XS`.
Reviewer feedback addressed (since @Audacity88's CHANGES_REQUESTED)
@Audacity88 blocked on the lack of live-or-near-live Matrix evidence that an idle `/sync` no longer errors at the 30-second cadence. Two follow-ups in commit `cb2ccc3a`:
* Live daemon evidence: `matrix.todixuclawbot` ran against a real homeserver after the fix; sync loop logs show no 30s-cadence errors. Excerpt + Element X end-to-end screenshot are in the comments below.
* Smoke test added: new `matrix::tests::live_smoke::idle_sync_does_not_error_at_30s_cadence` (file: `crates/zeroclaw-channels/src/matrix.rs`). `#[ignore]`'d, reuses the existing `ZEROCLAW_MATRIX_SMOKE_*` env contract, soaks an idle channel for `> 30s` (default 35s, tunable via `ZEROCLAW_MATRIX_SMOKE_IDLE_SECS`), and asserts:
  * no `sync_once` call returns an error (primary anti-regression for the 30s HTTP-deadline bug),
It also emits a one-line `eprintln!` summary so a captured `cargo test -- --ignored --nocapture` run produces the "short Matrix smoke result" log line the reviewer asked for.
Scope still single-file: the new test is inside the existing `#[cfg(test)] mod tests { mod live_smoke { ... } }` block; no production code changed since the original review.
Validation Evidence (required)
Commands run and tail output:
`cargo check -p zeroclaw-channels` against current `upstream/master`:
Branch is based on `upstream/master` post-"fix(ollama): restore compiling master build" #7231 (the `ollama.rs` E0308 revert that was breaking CI on "fix(runtime): guard trim_history against orphan-cascade emptying all messages" #7119 has landed), so CI on this branch should be green rather than inheriting the prior master breaker.
Smoke test commit (`cb2ccc3a`) validation locally:
Beyond CI: what did you manually verify?
Confirmed the two constants are ordered correctly (`SYNC_LONGPOLL_TIMEOUT = 30s < CLIENT_REQUEST_TIMEOUT = 60s`) so the homeserver-side long-poll always completes before the HTTP deadline. The `?timeout=` parameter is now sent on every sync call (both `sync_once` for the initial boot and the long-running `sync` loop), and the integration test in the file (`matrix.rs` tests module) was updated to use the same long-poll timeout so it exercises the same path as production.
Live-Matrix soak was run against `matrix.todixuclawbot` after the fix: idle sync loop started cleanly, no 30s-cadence errors observed, end-to-end inbound message through Element X confirmed in the comments. The new `idle_sync_does_not_error_at_30s_cadence` smoke codifies that observation as a runnable test (`#[ignore]`'d so it stays out of CI by default).
If any command was intentionally skipped, why: Full `cargo test` on the channels crate was not run in this prep pass; I'd appreciate CI confirming green before merge.
Security & Privacy Impact (required)
`/sync` calls)
Compatibility (required)
`ZEROCLAW_MATRIX_SMOKE_IDLE_SECS` / `ZEROCLAW_MATRIX_SMOKE_MIN_LONGPOLL_MS` from env, but these are test-only opt-ins gated behind `#[ignore]`; no production code path reads them.)
Rollback (required for risk: medium and risk: high)
`git revert <commit-sha-of-this-PR>` (revert the runtime commit; the test-only commit `cb2ccc3a` is safe to leave in place or revert independently)
`/sync` requests erroring out at exactly 30 seconds and the SDK busy-polling. Look for repeated `WARN`/`ERROR` entries from `crates/zeroclaw-channels/src/matrix.rs` around the `sync`/`sync_once` paths with a ~30s cadence.
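One way to spot that symptom mechanically is to check the spacing between error timestamps. The sketch below is an assumption-laden illustration: `has_30s_cadence` is a hypothetical helper, the timestamps are taken to be seconds in ascending order, and extracting them from the daemon log is left out because the log format is not shown here.

```rust
/// Returns true when every gap between consecutive error timestamps is
/// within `tolerance_secs` of 30 seconds. At least three errors are
/// required so a single coincidental gap is not reported as a pattern.
/// Timestamps are assumed to be in seconds and sorted ascending.
fn has_30s_cadence(error_times_secs: &[u64], tolerance_secs: u64) -> bool {
    const CADENCE_SECS: u64 = 30;
    if error_times_secs.len() < 3 {
        return false;
    }
    error_times_secs.windows(2).all(|pair| {
        let gap = pair[1].saturating_sub(pair[0]);
        gap + tolerance_secs >= CADENCE_SECS && gap <= CADENCE_SECS + tolerance_secs
    })
}
```

For example, errors logged at 0s, 30s, 61s, and 90s match the cadence with a 2-second tolerance, while errors at 0s, 5s, and 400s do not.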