
perf(composio/gmail): cut redundant fetches on incremental sync (#1404) #1474

Open
obchain wants to merge 1 commit into tinyhumansai:main from obchain:fix/1404-gmail-sync-efficiency


@obchain obchain commented May 11, 2026

Summary

  • Replace day-level after:YYYY/MM/DD Gmail cursor with second-precision after:<unix> so a same-day re-tick does not re-fetch every message Gmail has filed today.
  • New first-message early-stop: persist the freshest message id in SyncState.last_seen_id and bail out after page 1 when the server's head still matches it. Cuts up to MAX_PAGES_PER_SYNC - 1 redundant requests on quiet inboxes.
  • Adaptive page cap drops the 20-page ceiling to 2 when the previous successful sync wrote within the last 5 min — periodic-tick churn and trigger-driven retries stop blowing through the budget.
  • Trigger-driven, connection-created, and periodic syncs now share the in-process LAST_SYNC_AT map (provider bumps it on success), so concurrent paths coalesce instead of double-fetching.
  • Per-sync metrics fold into a single completion log + SyncOutcome.details: requests, messages_total, messages_new, dup_ratio, stop_reason, adaptive_cap.
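
The second-precision cursor from the first bullet can be pictured with a minimal sketch. It assumes the stored cursor is Gmail's `internalDate` in epoch milliseconds; `after_epoch_filter` is a hypothetical stand-in, not the PR's `cursor_to_gmail_after_epoch_filter`, and it handles only the numeric case.

```rust
/// Hypothetical helper (not the PR's implementation): turn an
/// epoch-milliseconds cursor into a Gmail `after:<unix_seconds>` term.
/// Returns `None` for anything that is not a non-negative integer.
fn after_epoch_filter(cursor_ms: &str) -> Option<String> {
    let millis: i64 = cursor_ms.trim().parse().ok()?;
    if millis < 0 {
        return None;
    }
    // Gmail's `after:` operator takes seconds, so drop the millisecond part.
    Some(format!("after:{}", millis / 1000))
}
```

Two ticks one minute apart on the same day produce different filters here (for example `after:1778482800` and `after:1778482860`), whereas a day-level filter would be identical for both and would keep returning the whole day.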

Problem

src/openhuman/composio/providers/gmail/provider.rs paginated up to MAX_PAGES_PER_SYNC = 20 per sync and filtered against a day-level after:YYYY/MM/DD cursor (gmail/sync.rs::cursor_to_gmail_after_filter). On a high-volume Gmail account that filter still returned the whole current day every tick, so the sync chewed through Composio quota even when nothing had actually changed since the previous run. Dedup (SyncState::synced_ids, content-hash ingest) only kicks in after the network round-trip, so by the time we knew a page was redundant the bandwidth was already gone. Three trigger paths (periodic.rs, connection-created, on_trigger) only coordinated via a coarse last_sync_at map populated by the periodic scheduler alone, so trigger-driven syncs and the periodic tick could fire back-to-back.
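
The coordination gap described above can be sketched as a shared last-sync map that every path both reads and writes. This is an illustration only: `LastSyncMap`, `record_success`, `should_skip`, and the `min_interval_ms` parameter are hypothetical names, and the real `LAST_SYNC_AT` map and `record_sync_success` signature may differ.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical in-process map: connection id -> last successful sync (ms).
struct LastSyncMap {
    inner: Mutex<HashMap<String, u64>>,
}

impl LastSyncMap {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Called by whichever path just finished a sync, not only the scheduler.
    fn record_success(&self, connection_id: &str, now_ms: u64) {
        self.inner
            .lock()
            .unwrap()
            .insert(connection_id.to_string(), now_ms);
    }

    /// True when another path synced this connection within `min_interval_ms`.
    fn should_skip(&self, connection_id: &str, now_ms: u64, min_interval_ms: u64) -> bool {
        match self.inner.lock().unwrap().get(connection_id) {
            Some(&last) => now_ms.saturating_sub(last) < min_interval_ms,
            None => false,
        }
    }
}
```

If only the periodic scheduler calls `record_success`, a trigger-driven sync never shows up in `should_skip`, which is the back-to-back behaviour the paragraph describes.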

Concrete pressure points:

  • Coarse cursor filter — cursor_to_gmail_after_filter emitted a day filter only, so any sync within the same calendar day refetched the whole day window.
  • Fixed page ceiling — 20 pages × PAGE_SIZE (25) = up to 20 requests and 500 messages fetched per pass even when 0 messages would be new.
  • No head-unchanged shortcut — the early-stop only fired after parsing a full page and finding every id already in synced_ids.
  • Disjoint coalesce — on_trigger never touched the scheduler's last-sync map, so trigger syncs and periodic syncs raced.
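
To make the missing head-unchanged shortcut concrete, here is a small simulation under stated assumptions: `pages` stands in for server responses ordered newest first, one request is counted per fetched page, and `simulate_pass` and `PassResult` are hypothetical names that do not appear in the PR. The stop-reason strings are borrowed from the PR description.

```rust
/// Outcome of one simulated sync pass.
#[derive(Debug, PartialEq)]
struct PassResult {
    requests: usize,
    stop_reason: &'static str,
}

/// Walk the simulated pages, stopping early when the first id on the first
/// page equals the id remembered from the previous pass.
fn simulate_pass(pages: &[Vec<&str>], last_seen_id: Option<&str>, max_pages: usize) -> PassResult {
    let mut requests = 0;
    for (page_num, page) in pages.iter().enumerate() {
        if page_num >= max_pages {
            return PassResult { requests, stop_reason: "max_pages" };
        }
        requests += 1; // one request per fetched page
        if page.is_empty() {
            return PassResult { requests, stop_reason: "empty_page" };
        }
        let head = page.first().copied();
        if page_num == 0 && last_seen_id.is_some() && head == last_seen_id {
            return PassResult { requests, stop_reason: "head_unchanged" };
        }
    }
    PassResult { requests, stop_reason: "no_more_pages" }
}
```

With an unchanged head the pass costs one request regardless of how many pages the server could return; without the shortcut the same inbox costs one request per page until some other stop condition fires.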

Solution

Pure rust-core, no FE / schema migration. Files touched:

  • composio/providers/gmail/sync.rs — new cursor_to_gmail_after_epoch_filter plus a shared parse_cursor_to_epoch_secs helper. The legacy day filter stays as a parse fallback for non-numeric cursors. 11 new unit tests on the helpers (including a round-trip check between the epoch filter and the recency parser).
  • composio/providers/sync_state.rs: SyncState gains optional last_seen_id and last_sync_at_ms fields, both #[serde(default)] so existing on-disk state blobs deserialize cleanly. New set_last_seen_id / set_last_sync_at_ms setters, plus tests for the legacy-blob path, the setters, and an extended serialization round-trip.
  • composio/providers/gmail/provider.rs:
    • Query construction prefers after:<epoch>; falls back to the day filter only if the cursor cannot be parsed as a timestamp.
    • Adaptive max_pages: RECENT_SYNC_MAX_PAGES = 2 when now - last_sync_at_ms < RECENT_SYNC_WINDOW_MS (5 min), full MAX_PAGES_PER_SYNC = 20 otherwise. Initial connection-created syncs always get the full ceiling.
    • First-page first-message early-stop keyed on last_seen_id. Captures the freshest server-side id (newest_id) on page 0 message 0 regardless of dedup status so the next sync has a stable head-marker.
    • Stop-reason tracking (max_pages, budget_exhausted, empty_page, head_unchanged, page_all_synced, no_more_pages) folded into the structured completion log and the SyncOutcome.details JSON. Adds requests, messages_total, messages_new, dup_ratio, adaptive_cap.
    • On success bumps crate::openhuman::composio::periodic::record_sync_success for all trigger paths so periodic ticks respect a trigger/connection-created sync that just landed.
  • composio/providers/gmail/tests.rs — locks in that both cursor_to_gmail_after_epoch_filter and cursor_to_gmail_after_filter accept the same internalDate input (so the epoch path is genuinely the preferred path and the day filter is a fallback, not a divergence) and sanity-bounds the epoch filter inside [2020, 2100].
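
The adaptive cap described in the provider.rs bullets can be sketched as a pure function. The constant names and values are taken from this description; the function name `adaptive_max_pages` and the use of `saturating_sub` (which treats a `last_sync_at_ms` slightly in the future as recent) are assumptions of this sketch, not confirmed details of the diff.

```rust
const MAX_PAGES_PER_SYNC: usize = 20;
const RECENT_SYNC_MAX_PAGES: usize = 2;
const RECENT_SYNC_WINDOW_MS: u64 = 5 * 60 * 1000; // 5 minutes

/// Pick the page ceiling for this pass. `None` models a first sync with no
/// recorded success, which keeps the full ceiling.
fn adaptive_max_pages(now_ms: u64, last_sync_at_ms: Option<u64>) -> usize {
    match last_sync_at_ms {
        Some(last) if now_ms.saturating_sub(last) < RECENT_SYNC_WINDOW_MS => {
            RECENT_SYNC_MAX_PAGES
        }
        _ => MAX_PAGES_PER_SYNC,
    }
}
```

A sync that lands one minute after the previous success gets the 2-page cap, while one that lands exactly five minutes later, or with no recorded success, gets all 20 pages.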

End-to-end ingest correctness is preserved: synced_ids, content-hash ingest, and the per-page deferred-mark pattern are unchanged. The optimisations only trim how much Composio quota we burn before reaching the existing dedup layer.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 66 gmail-scope tests pass, of which 15 cover the new behaviour (epoch filter round-trip, legacy-blob deserialize, setter idempotence, fallback ordering, sanity bounds).
  • Diff coverage ≥ 80% — every new helper / field has direct coverage. The mutated branches inside sync() (adaptive cap + head-unchanged) reuse parse_cursor_to_epoch_secs and last_seen_id, which are unit-tested.
  • Coverage matrix updated — N/A: behaviour change inside an existing capability surface (composio.gmail.sync); no new user-visible feature row.
  • All affected feature IDs from the matrix are listed in the PR description under ## Related
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces (docs/RELEASE-MANUAL-SMOKE.md) — N/A: provider-internal optimisation, no surface change.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime: desktop only. No new dependencies, no schema migration (SyncState keeps backward-compatible defaults). The new metrics log replaces the existing one, so log volume is unchanged.
  • Quota: high-volume inboxes that previously paginated 20 pages per tick will commonly stop after 1–2 pages now, both via the epoch-precision cursor and the head-unchanged shortcut. Initial backfills still get the full 20-page ceiling.
  • Risk: Gmail's search syntax supports after:<unix_seconds> (documented). If Composio's GMAIL_FETCH_EMAILS ever rejects the bare-int form, cursor_to_gmail_after_epoch_filter returns None for that cursor and we fall back to the day filter, preserving behaviour parity with main.
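
The preference order behind the Risk note (epoch filter first, day filter as fallback) can be sketched as below. All three function names are hypothetical, and the sketch is deliberately narrower than the PR: numeric cursors are read as epoch milliseconds, and the fallback accepts only a literal `YYYY-MM-DD` string.

```rust
/// Preferred path: numeric cursor, read as epoch milliseconds (assumption).
fn epoch_clause(cursor: &str) -> Option<String> {
    let millis: i64 = cursor.trim().parse().ok()?;
    (millis >= 0).then(|| format!("after:{}", millis / 1000))
}

/// Fallback path: accepts only `YYYY-MM-DD` and emits `after:YYYY/MM/DD`.
fn day_clause(cursor: &str) -> Option<String> {
    let c = cursor.trim();
    let bytes = c.as_bytes();
    let shape_ok = bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, ch)| match i {
            4 | 7 => *ch == b'-',
            _ => ch.is_ascii_digit(),
        });
    shape_ok.then(|| format!("after:{}", c.replace('-', "/")))
}

/// Try the epoch filter first; use the day filter only when it yields nothing.
fn build_after_clause(cursor: &str) -> Option<String> {
    epoch_clause(cursor).or_else(|| day_clause(cursor))
}
```

When neither path parses the cursor the sketch returns `None`, leaving the caller to issue an unfiltered query or skip the pass; which of those the provider does is not stated in this description.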

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/1404-gmail-sync-efficiency
  • Commit SHA: 9e5a7b3

Summary by CodeRabbit

  • Performance

    • Gmail sync now uses adaptive pagination to reduce redundant message re-fetching
    • Early-stop optimization accelerates initial synchronization checks
  • Improvements

    • Enhanced sync state persistence to track additional metadata for improved reliability
    • Refined email content processing with improved formatting during synchronization


Replaces the day-level `after:YYYY/MM/DD` cursor with a second-precision
`after:<unix>` filter so same-day re-ticks stop re-fetching today's
window. Adds a first-message head-unchanged early-stop keyed on a new
`SyncState.last_seen_id`, plus an adaptive page cap that drops the
20-page ceiling to 2 when the previous successful sync wrote inside a
5 minute window. Trigger-driven and connection-created syncs now bump
the periodic scheduler's last-sync map, so concurrent paths coalesce.
Adds per-sync metrics to the completion log + outcome details
(requests, messages_total, messages_new, dup_ratio, stop_reason,
adaptive_cap). Closes tinyhumansai#1404.
@obchain obchain requested a review from a team May 11, 2026 07:00

coderabbitai Bot commented May 11, 2026

📝 Walkthrough

Walkthrough

This PR optimizes Gmail incremental sync to reduce redundant Composio API calls by introducing epoch-second precision cursor filtering, adaptive pagination caps based on recent sync timing, early-stop optimizations on unchanged inbox heads, and detailed per-sync metrics. State model extensions enable tracking the freshest message ID and last successful sync timestamp for more intelligent fetch decisions.

Changes

Gmail Sync Efficiency Optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| State Model Extension | src/openhuman/composio/providers/sync_state.rs | SyncState gains last_seen_id: Option<String> and last_sync_at_ms: Option<u64> fields with serde defaults for backward compatibility; constructor and setter methods initialize and mutate these fields. |
| Epoch-Based Cursor Filter Helpers | src/openhuman/composio/providers/gmail/sync.rs | Removes day-level cursor_to_gmail_after_filter; adds cursor_to_gmail_after_epoch_filter to convert epoch-millis, YYYY-MM-DD, or RFC3339 cursors into unix-seconds Gmail after: filters, plus parse_cursor_to_epoch_secs utility for shared parsing logic. |
| Cursor Helper Test Coverage | src/openhuman/composio/providers/gmail/tests.rs | Adds imports and comparison test validating epoch-millis cursor conversion consistency with expected epoch-seconds and YYYY/MM/DD formats, including sanity bounds for post-2020/pre-2100 range. |
| Adaptive Pagination Parameters | src/openhuman/composio/providers/gmail/provider.rs | Defines RECENT_SYNC_WINDOW_MS and ADAPTIVE_PAGE_CAP constants to control pagination cap for recent syncs. |
| Adaptive Pagination and Budget Tracking | src/openhuman/composio/providers/gmail/provider.rs | Computes max_pages adaptively based on state.last_sync_at_ms (smaller cap for recent syncs, full cap for backfills); introduces total_requests counter and stop_reason tracking; stops pagination early when daily budget is exhausted. |
| Epoch Filter Integration | src/openhuman/composio/providers/gmail/provider.rs | Gmail query generation now prefers cursor_to_gmail_after_epoch_filter with fallback to day-level filtering when epoch parsing fails. |
| Per-Page Request Accounting | src/openhuman/composio/providers/gmail/provider.rs | Increments total_requests per fetched page for accurate Composio API usage tracking. |
| Markdown Formatting Preprocessing | src/openhuman/composio/providers/gmail/provider.rs | Applies provider's markdown_formatted content to response data before post-processing to ensure LLM-facing text derives from pre-rendered markdown. |
| Early-Stop Optimization | src/openhuman/composio/providers/gmail/provider.rs | Page 0 early-stop shortcut: if first message's ID matches state.last_seen_id, pagination stops because inbox head is unchanged; message iteration now uses (index, msg) to capture newest_id from page 0 independent of dedup status. |
| Sync State and Scheduler Bookkeeping | src/openhuman/composio/providers/gmail/provider.rs | Post-pagination updates persist last_seen_id and set last_sync_at_ms, save state, and record sync success via periodic scheduler bookkeeping. |
| Sync Outcome Reporting | src/openhuman/composio/providers/gmail/provider.rs | Success summary/details now include request count, stop reason, adaptive-cap flag, duplication ratio, and persisted cursor/head metadata for observability. |
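
The outcome metrics in the last row can be illustrated with a small struct. This sketch assumes dup_ratio means the fraction of fetched messages that were already synced; the PR does not spell out the formula, and `SyncMetrics` is a hypothetical type whose field names are borrowed from the PR description.

```rust
/// Hypothetical per-sync counters, named after the fields in the PR summary.
#[derive(Debug, Default)]
struct SyncMetrics {
    requests: u32,
    messages_total: u32,
    messages_new: u32,
}

impl SyncMetrics {
    /// Assumed definition: share of fetched messages that were duplicates.
    /// Returns 0.0 for an empty pass to avoid dividing by zero.
    fn dup_ratio(&self) -> f64 {
        if self.messages_total == 0 {
            return 0.0;
        }
        let dups = self.messages_total.saturating_sub(self.messages_new);
        f64::from(dups) / f64::from(self.messages_total)
    }
}
```

Under that definition, a pass that spends 2 requests to fetch 50 messages of which 5 are new reports a dup_ratio of 0.9, which is the kind of signal that shows the quota was mostly spent on already-synced mail.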

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#523: Directly modifies the Gmail sync helpers introduced in that PR—replacing day-level cursor_to_gmail_after_filter with new epoch-second parsing.
  • tinyhumansai/openhuman#519: Main PR modifies Gmail cursor/filter helpers and extends the SyncState schema that #519 introduced, with direct code-level dependencies.
  • tinyhumansai/openhuman#1056: Both PRs touch Gmail provider sync logic—specifically pagination/dedup handling and post-processing in provider.rs.

Suggested reviewers

  • senamakel

Poem

📧 With epochs precise and pages so wise,
Gmail's sync now sees through redundant disguise,
Adaptive caps and early stops shine bright,
Reducing API calls with each fetched bite. 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main objective: reducing redundant Gmail fetches during incremental sync through performance optimizations. |
| Linked Issues check | ✅ Passed | The PR implements all primary coding objectives from issue #1404: epoch-second cursor filtering, adaptive page caps, head-unchanged early stop via last_seen_id, metrics tracking, and SyncState persistence with backward compatibility. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with the optimization scope: cursor precision, adaptive pagination, early-stop conditions, request metrics, and SyncState enhancements for Gmail sync efficiency. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
src/openhuman/composio/providers/sync_state.rs (1)

256-265: 💤 Low value

Consider logging the new state fields on save/load for observability.

save() and load() debug logs already include cursor/synced_ids_count/budget but omit the new last_seen_id and last_sync_at_ms values. Including them would make it easier to debug head-unchanged early-stop behavior in the field without changing semantics.

📝 Suggested addition
         tracing::debug!(
             toolkit = %self.toolkit,
             connection_id = %self.connection_id,
             cursor = ?self.cursor,
             synced_ids_count = self.synced_ids.len(),
             budget_used = self.daily_budget.requests_used,
+            last_seen_id = ?self.last_seen_id,
+            last_sync_at_ms = ?self.last_sync_at_ms,
             "[sync_state] saved"
         );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/composio/providers/sync_state.rs` around lines 256 - 265,
Update the debug logs in the SyncState save() and load() methods to also include
the new fields last_seen_id and last_sync_at_ms for observability; specifically,
in the tracing::debug! call inside save() (and the analogous debug in load())
add last_seen_id = ?self.last_seen_id and last_sync_at_ms = ?self.last_sync_at_ms
so the logs show those values alongside toolkit, connection_id, cursor,
synced_ids_count and budget_used without changing any logic.
src/openhuman/composio/providers/gmail/sync.rs (1)

86-90: 💤 Low value

Numeric cursors are unconditionally treated as milliseconds.

parse::<i64>() succeeds on any bare integer, so a cursor like "1700000000" (10-digit unix seconds) gets divided by 1000 and silently lands in 1970. Today the only caller writes internalDate (Gmail-side millis), so this is an internal contract — but a magnitude check (e.g. treat ≥ 10¹² as ms, otherwise seconds) would make the helper safer against future call sites and accidental cursor pollution.

🔧 Suggested guard
 pub(crate) fn parse_cursor_to_epoch_secs(cursor: &str) -> Option<i64> {
     let cursor = cursor.trim();
-    if let Ok(millis) = cursor.parse::<i64>() {
-        return Some(millis / 1000);
+    if let Ok(n) = cursor.parse::<i64>() {
+        // Heuristic: values ≥ 10^12 (year 2001+ in ms) are millis;
+        // anything smaller is already in seconds.
+        return Some(if n.abs() >= 1_000_000_000_000 { n / 1000 } else { n });
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/composio/providers/gmail/sync.rs` around lines 86 - 90, The
helper parse_cursor_to_epoch_secs currently treats any integer cursor as
milliseconds; change it to detect magnitude so 64-bit integers >=
1_000_000_000_000 are interpreted as milliseconds (divide by 1000) and smaller
integers are interpreted as seconds (use as-is), ensuring values like
"1700000000" are not wrongly converted; update parse_cursor_to_epoch_secs to
apply this magnitude check after parsing and return None on parse failure as
before.
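
A worked comparison of the two behaviours, as a sketch rather than the repository's code: `to_epoch_secs_naive` mirrors the divide-by-1000 logic quoted in the diff above, and `to_epoch_secs_guarded` applies the suggested 10^12 threshold. Both names and the `MILLIS_THRESHOLD` constant are hypothetical, and the guard uses `unsigned_abs` instead of `abs` so that `i64::MIN` cannot overflow.

```rust
/// Values at or above 10^12 are assumed to be milliseconds.
const MILLIS_THRESHOLD: u64 = 1_000_000_000_000;

/// Every integer is treated as milliseconds.
fn to_epoch_secs_naive(cursor: &str) -> Option<i64> {
    cursor.trim().parse::<i64>().ok().map(|ms| ms / 1000)
}

/// Magnitude check: large values are millis, small values are already seconds.
fn to_epoch_secs_guarded(cursor: &str) -> Option<i64> {
    let n = cursor.trim().parse::<i64>().ok()?;
    Some(if n.unsigned_abs() >= MILLIS_THRESHOLD { n / 1000 } else { n })
}
```

For the 10-digit cursor `"1700000000"` the naive version yields 1_700_000 seconds (about 20 days after the epoch), while the guarded version keeps 1_700_000_000. A 13-digit millisecond cursor such as `"1700000000000"` yields 1_700_000_000 in both.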
src/openhuman/composio/providers/gmail/provider.rs (1)

419-426: 💤 Low value

Simplify the newest_id capture — the loop guard already pins it to page 0, index 0.

Inside the (page_num == 0, msg_index == 0) branch, msg_id is the same value extract_item_id(messages.first(), ...) already computed in the head-unchanged block above (line 373–375). You can pull that into a single binding at the top of the page-0 block and reuse it, avoiding the duplicate extraction and the nested if let Some(ref id) = msg_id deeper in the per-message loop.

♻️ Suggested consolidation
-            if page_num == 0 {
-                let first_id = messages
-                    .first()
-                    .and_then(|m| extract_item_id(m, MESSAGE_ID_PATHS));
-                if let (Some(seen), Some(first)) =
-                    (state.last_seen_id.as_deref(), first_id.as_deref())
-                {
-                    if seen == first {
+            if page_num == 0 {
+                let first_id = messages
+                    .first()
+                    .and_then(|m| extract_item_id(m, MESSAGE_ID_PATHS));
+                if let Some(ref id) = first_id {
+                    newest_id = Some(id.clone());
+                }
+                if let (Some(seen), Some(first)) =
+                    (state.last_seen_id.as_deref(), first_id.as_deref())
+                {
+                    if seen == first {
                         tracing::debug!(
                             connection_id = %connection_id,
                             first_id = %first,
                             "[composio:gmail] first page head matches last_seen_id — no new mail"
                         );
                         stop_reason = "head_unchanged";
-                        newest_id = Some(first.to_string());
                         break;
                     }
                 }
             }
@@
-                let msg_id = extract_item_id(msg, MESSAGE_ID_PATHS);
-                // Capture the very first id of page 0 as the
-                // freshest-id-on-server marker for next-sync's
-                // head-unchanged shortcut, regardless of dedup status.
-                if page_num == 0 && msg_index == 0 {
-                    if let Some(ref id) = msg_id {
-                        newest_id = Some(id.clone());
-                    }
-                }
+                let msg_id = extract_item_id(msg, MESSAGE_ID_PATHS);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/composio/providers/gmail/provider.rs` around lines 419 - 426,
The code duplicates extraction of the first message id for newest_id inside the
per-message loop even though the page-0 guard already identifies the first
message; instead, in the page_num == 0 branch obtain a single binding (use
extract_item_id(messages.first(), ...) or the existing head-unchanged binding)
and assign newest_id = Some(id.clone()) once, then remove the nested if-let
Some(ref id) = msg_id inside the loop in the function handling pages so the loop
reuses that top-level binding; refer to symbols page_num, msg_index,
messages.first(), extract_item_id, msg_id and newest_id to locate and
consolidate the extraction.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0f15e5de-82c6-48cf-a42a-2645ab95f4c5

📥 Commits

Reviewing files that changed from the base of the PR and between 838e6fc and 9e5a7b3.

📒 Files selected for processing (4)
  • src/openhuman/composio/providers/gmail/provider.rs
  • src/openhuman/composio/providers/gmail/sync.rs
  • src/openhuman/composio/providers/gmail/tests.rs
  • src/openhuman/composio/providers/sync_state.rs


Development

Successfully merging this pull request may close these issues.

Optimize Gmail sync to reduce redundant Composio calls and bandwidth
