fix(matrix): Only advance sync token when allowed rooms present #133
Previously, the sync token would advance even when the sync response contained only events for non-allowed rooms. This caused messages to be missed when:

- Sync returned events only for non-allowed rooms
- Messages arrived during concurrent processing gaps

Now the token only advances when:

1. Messages were extracted, OR
2. Allowed rooms appeared in sync (even with no messages), OR
3. No room filters configured (process everything)

This prevents the token from advancing past unprocessed messages while still ensuring we don't get stuck re-processing the same events.

Adds debug logging when the token is NOT advanced to help diagnose issues.
    let should_advance_token = !messages.is_empty() || allowed_rooms_in_sync || !has_room_filters;

    if should_advance_token {
        let mut token = self.sync_token.lock().await;
        *token = Some(next_batch.clone());
        self.save_sync_token(&next_batch);
    } else {
        eprintln!("DEBUG: sync - NOT advancing token (no allowed rooms in response)");
    }
🔴 Frozen sync token causes repeated re-fetching of non-allowed room events indefinitely
When has_room_filters is true and only non-allowed rooms have activity, should_advance_token evaluates to false (messages is empty, allowed_rooms_in_sync is false, has_room_filters is true). The sync token never advances, so every subsequent call to sync() re-requests the same events from the Matrix server using the stale since parameter.

Because events already exist since the frozen token, the Matrix server returns immediately instead of long-polling for the configured timeout (1000ms at matrix_cli.rs:651), defeating the long-poll mechanism. Each poll cycle (every ~2 seconds per messenger_handler.rs:320) re-fetches and discards the same non-allowed room data. This continues indefinitely until an allowed room has activity.

In deployments where the bot is in many active non-allowed rooms and the allowed rooms are quiet, this causes persistent wasted network bandwidth and server CPU. There is no staleness limit or fallback to force token advancement.
Scenario illustrating the issue
Bot configured with allowed_chats = ["!roomA"] and joined to rooms A (quiet) and B (active).
- Sync returns events for room B only
- allowed_rooms_in_sync = false, messages = empty → token frozen
- Next sync: same token → server returns B's events immediately (no long-poll)
- Repeat forever until room A has activity
Each cycle wastefully fetches B's events and discards them.
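The scenario can be reproduced with a minimal sketch that models the predicate from the diff as a pure function. Only the boolean expression is taken from the change; `SyncState` and `apply_sync` are hypothetical names introduced for illustration and do not exist in the PR.

```rust
/// Mirrors the boolean expression from the diff as a pure function.
fn should_advance_token(
    has_messages: bool,
    allowed_rooms_in_sync: bool,
    has_room_filters: bool,
) -> bool {
    has_messages || allowed_rooms_in_sync || !has_room_filters
}

/// Hypothetical stand-in for the client's stored sync position.
struct SyncState {
    token: Option<String>,
}

impl SyncState {
    /// Applies one sync response and returns true if the token moved.
    fn apply_sync(
        &mut self,
        next_batch: &str,
        has_messages: bool,
        allowed_rooms_in_sync: bool,
        has_room_filters: bool,
    ) -> bool {
        if should_advance_token(has_messages, allowed_rooms_in_sync, has_room_filters) {
            self.token = Some(next_batch.to_string());
            true
        } else {
            // Token stays where it was; the next sync reuses the same `since`.
            false
        }
    }
}
```

Feeding this model repeated responses with `has_messages = false`, `allowed_rooms_in_sync = false`, and `has_room_filters = true` leaves `token` unchanged each time, which is the frozen-token loop described above. The token moves again only once an allowed room shows up in a response.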
Prompt for agents
In crates/rustyclaw-core/src/messengers/matrix_cli.rs, the should_advance_token logic at line 469 should be modified to always advance the sync token, while still ensuring messages from allowed rooms are not lost. Two possible approaches:
1. Add a staleness counter or timer: track how many consecutive syncs have had no allowed room events. After a threshold (e.g., 10 syncs or 30 seconds), advance the token anyway to avoid indefinite re-fetching. Reset the counter when allowed rooms appear.
2. Always advance the token but track processed event IDs: advance the token on every sync (as the old code did), but maintain a bounded set of recent event IDs to detect and skip duplicates if needed. This avoids the frozen token problem entirely.
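Approach 1 could look roughly like the following sketch, shown in its counter-based form. `StalenessGuard`, its fields, and the `decide` method are hypothetical names, and any threshold passed to `new` is an illustrative choice rather than a value from the codebase.

```rust
/// Hypothetical helper: counts consecutive syncs in which the existing
/// predicate declined to advance, and forces an advance at a threshold.
struct StalenessGuard {
    consecutive_skips: u32,
    max_skips: u32,
}

impl StalenessGuard {
    fn new(max_skips: u32) -> Self {
        Self { consecutive_skips: 0, max_skips }
    }

    /// `would_advance` is the result of the existing should_advance_token
    /// expression. Returns true if the token should advance on this sync.
    fn decide(&mut self, would_advance: bool) -> bool {
        if would_advance {
            // Normal path: reset the staleness counter.
            self.consecutive_skips = 0;
            return true;
        }
        self.consecutive_skips += 1;
        if self.consecutive_skips >= self.max_skips {
            // Too many skipped syncs in a row: advance anyway and start over.
            self.consecutive_skips = 0;
            true
        } else {
            false
        }
    }
}
```

A forced advance behaves like the old unconditional advance for that one sync, so the threshold is a trade-off between repeated re-fetching and whatever the conditional advance was meant to protect against.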
The fix should be applied in the sync() method around lines 462-477. The goal is to prevent the scenario where the bot never advances the sync token because non-allowed rooms have constant activity while allowed rooms are quiet.
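For approach 2, the bounded set of recent event IDs might be sketched as below. `RecentEvents` and `insert_if_new` are hypothetical names; the sketch assumes event IDs are available as strings and uses first-in-first-out eviction, which is one possible policy among several.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded record of recently processed event IDs.
struct RecentEvents {
    capacity: usize,
    order: VecDeque<String>, // insertion order, oldest at the front
    seen: HashSet<String>,   // fast membership check
}

impl RecentEvents {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self {
            capacity,
            order: VecDeque::new(),
            seen: HashSet::new(),
        }
    }

    /// Returns true and records the ID if it has not been seen recently;
    /// returns false if it is a duplicate that should be skipped.
    fn insert_if_new(&mut self, event_id: &str) -> bool {
        if self.seen.contains(event_id) {
            return false;
        }
        if self.order.len() == self.capacity {
            // Evict the oldest ID so memory use stays bounded.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(event_id.to_string());
        self.seen.insert(event_id.to_string());
        true
    }
}
```

With this in place the token could advance on every sync, and each extracted message would be passed through `insert_if_new` before processing. An ID evicted from the window is treated as new if it appears again, so the capacity should comfortably exceed the number of events expected in any replayed range.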
Problem
The sync token was advancing even when the sync response contained only events for non-allowed rooms. This caused messages to be missed.
Root Cause
The token was saved unconditionally after every sync, regardless of whether allowed rooms appeared in the response.
Solution
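Only advance the sync token when:
1. Messages were extracted, OR
2. Allowed rooms appeared in sync (even with no messages), OR
3. No room filters are configured (process everything)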
This ensures we don't skip past messages that arrived for allowed rooms while we were processing events from non-allowed rooms.
Testing
Related