
docs: document Gateway resume integrity — oldest_cursor, resync_required, snapshot, top-level seq (PraisonAI PR #2155 / Issue #2153) #722

Description

@MervinPraison

Summary

PraisonAI PR #2155 (closes issue #2153) added integrity checks to the gateway's resume-after-disconnect protocol. These are wire-level additions to the joined acknowledgement and to every outbound event — they are not yet documented in PraisonAIDocs. Client integrators reading docs/features/gateway.mdx today would still believe a resumed: true ack means they are caught up, when in fact the gateway can now explicitly signal that they must drop local state and resync from a fresh snapshot.

The PR added:

  1. oldest_cursor on the joined ack — floor of what can be replayed (oldest event still in the buffer).
  2. resync_required: bool on the joined ack — true when the requested since cursor is below oldest_cursor (i.e. the event buffer was trimmed past it). Tells the client: drop local state, take the snapshot.
  3. New snapshot message type — when resync_required=true, the gateway sends a {"type": "snapshot", "state": {...}} frame instead of a partial replay stream. The snapshot carries session_id, agent_id, state, full messages history, and current event_cursor.
  4. Top-level seq field on every delivered event — monotonic sequence so clients can detect a mid-stream gap cheaply (without digging into nested event.data.cursor). Applied to response, message, stream_end, error, token_stream, and tool_call_stream event types.
  5. since parameter validation — non-integer since now returns an {"type": "error", "message": "Invalid 'since' cursor. Must be an integer."} frame instead of being silently ignored.

Three new session-level helpers (get_oldest_cursor(), check_resync_required(), get_snapshot()) drive the above on the server.
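
To make the trimming rule concrete, here is a toy model of a bounded event buffer. It is a sketch, not the code in server.py: the class name ToyBuffer, its capacity, and the empty-buffer behaviour are assumptions made for illustration. Only the helper names get_oldest_cursor and check_resync_required, and the "since below oldest_cursor" rule, come from the description above.

```python
from collections import deque


class ToyBuffer:
    """Toy stand-in for a bounded per-session event buffer (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)
        self._cursor = 0  # cursor of the most recently added event

    def add_event(self, payload: dict) -> int:
        self._cursor += 1
        self._events.append((self._cursor, payload))
        return self._cursor

    def get_oldest_cursor(self) -> int:
        # Assumption: an empty buffer reports the head cursor.
        return self._events[0][0] if self._events else self._cursor

    def check_resync_required(self, since: int) -> bool:
        # Rule from the issue text: resync when since is below oldest_cursor.
        return since < self.get_oldest_cursor()

    def get_events_since(self, since: int) -> list:
        return [(c, p) for c, p in self._events if c > since]


buf = ToyBuffer(capacity=3)
for i in range(5):
    buf.add_event({"n": i})
```

After five events in a three-slot buffer, cursors 1 and 2 have been trimmed, so a client that reconnects with since=1 gets a resync while a client with since=3 can still be replayed.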

Reference PR: MervinPraison/PraisonAI#2155
Reference issue: MervinPraison/PraisonAI#2153

These additions are backward-compatible (additive only) — older clients that ignore the new fields keep working — but new clients should read them to avoid silent data loss.


Where to put the docs

Per AGENTS.md §1.8, AI agents must not touch docs/concepts/. All new content goes under docs/features/.

There are two reasonable shapes. Pick option A unless the maintainer signals otherwise — it keeps the resume contract co-located with the gateway page that already explains sessions and events.

Option A (recommended) — extend docs/features/gateway.mdx

Add a new top-level section ## Resume Integrity to docs/features/gateway.mdx, placed after ## Event Types and before ## Common Patterns. Content guidance below.

Also:

  • Add an oldest_cursor row + a resync_required row + a snapshot row to the existing event/message reference area.
  • Add a <Tip> near the top of the page pointing at the new section: "Reconnecting after a disconnect? See Resume Integrity before trusting a resumed: true ack."

Option B (only if section grows past ~150 lines) — new page docs/features/gateway-resume-integrity.mdx

If the content is large enough to deserve its own page, create docs/features/gateway-resume-integrity.mdx and add it to docs.json next to the other docs/features/gateway* entries (same Features group as gateway, gateway-overview, gateway-cli, gateway-error-handling). Then leave a one-line <Tip> + Related card on gateway.mdx pointing at it.

Do not modify docs/concepts/, docs/sdk/reference/** (auto-generated), docs/js/, or docs/rust/.


SDK ground truth (read before writing)

Per AGENTS.md §1.2 "SDK-First Documentation Cycle", read these files first — the doc must mirror them exactly:

| What | Path in MervinPraison/PraisonAI |
|---|---|
| Protocol envelope: wire-format docstring for joined/snapshot/seq | src/praisonai-agents/praisonaiagents/gateway/protocols.py (around the GatewayEvent dataclass; the PR added a "Wire Protocol Extensions" + "Resume Protocol" block to the docstring) |
| Session helpers: get_oldest_cursor, check_resync_required, get_snapshot | src/praisonai/praisonai/gateway/server.py (on the GatewaySession class, around the add_event / get_events_since block, ~lines 127–170 after the PR) |
| Join handler: since validation, oldest_cursor / resync_required emission, snapshot-vs-replay branch | src/praisonai/praisonai/gateway/server.py (_handle_client_message, join branch, ~lines 1347–1430 after the PR) |
| Top-level seq emission on outbound events | src/praisonai/praisonai/gateway/server.py (_send_to_client, ~lines 1626–1660 after the PR); note the expanded set of event types that now trigger session tracking (response, message, stream_end, error, token_stream, tool_call_stream) |
| Resume helper note about callers having to check check_resync_required | src/praisonai/praisonai/gateway/server.py (resume_or_create_session docstring, ~line 1938 after the PR) |
| Default buffer bound (why this matters at all) | src/praisonai-agents/praisonaiagents/gateway/config.py (max_messages = 1000; buffer rolls past ~2000 events) |

The PR diff is the most direct view of the surface area: https://github.com/MervinPraison/PraisonAI/pull/2155/files

Do not invent fields. Document only what exists in those files.


Required content for the Resume Integrity section

Follow AGENTS.md §2 "Page Structure Template" for tone/layout. Concrete guidance below.

One-sentence intro

Something like: "When the gateway's event buffer has rolled past your last cursor, the joined ack now tells you so — and hands you a full snapshot instead of a partial replay — so a reconnecting client can never silently miss events."

Hero Mermaid (use AGENTS.md §3.1 colors, white text)

Suggested — a decision diagram covering both happy and resync paths:

graph TB
    Join[📨 join + since=N] --> Check{🔍 N < oldest_cursor?}
    Check -->|No, in buffer| Replay[📜 joined + replay events N+1..head]
    Check -->|Yes, trimmed| Resync[⚠ joined + resync_required=true]
    Resync --> Snapshot[📦 snapshot with full state]
    Replay --> Done[✅ Client up to date]
    Snapshot --> Done

    classDef request fill:#6366F1,stroke:#7C90A0,color:#fff
    classDef decision fill:#F59E0B,stroke:#7C90A0,color:#fff
    classDef happy fill:#10B981,stroke:#7C90A0,color:#fff
    classDef warn fill:#8B0000,stroke:#7C90A0,color:#fff

    class Join request
    class Check decision
    class Replay,Done happy
    class Resync,Snapshot warn

How the resume contract works now (sequence diagram)

Two sequence diagrams (or one with an alt block). Both start with a client sending {"type": "join", "agent_id": "support", "since": N}.

Happy path (since still in buffer):

  1. Server replies {"type": "joined", "cursor": HEAD, "oldest_cursor": OLDEST, "resync_required": false, "sequence": ..., ...}.
  2. Server streams {"type": "replay", "event": {...}, "seq": <cursor>} frames for each missed event.

Resync path (since < oldest_cursor):

  1. Server replies {"type": "joined", "cursor": HEAD, "oldest_cursor": OLDEST, "resync_required": true, ...}.
  2. Server sends a single {"type": "snapshot", "state": {session_id, agent_id, state, messages, event_cursor}} frame.
  3. Client discards local state and rebuilds from state.messages + state.event_cursor.
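
The two paths above can be sketched as a small, transport-agnostic frame handler. This is a minimal illustration and not GatewayClient's implementation: LocalState and handle_frame are hypothetical names, and the handler reads only the frame fields listed in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    """Hypothetical client-side mirror of a gateway session."""
    messages: list = field(default_factory=list)
    events: list = field(default_factory=list)
    cursor: int = 0
    awaiting_snapshot: bool = False


def handle_frame(local: LocalState, frame: dict) -> None:
    """Apply one inbound frame to the local mirror."""
    kind = frame.get("type")
    if kind == "joined":
        # Older gateways omit resync_required; treat absence as False.
        local.awaiting_snapshot = bool(frame.get("resync_required", False))
    elif kind == "snapshot":
        # Resync path: drop everything local and rebuild from the snapshot.
        state = frame["state"]
        local.messages = list(state["messages"])
        local.events.clear()
        local.cursor = state["event_cursor"]
        local.awaiting_snapshot = False
    elif kind == "replay":
        # Happy path: append the missed event and advance the cursor.
        local.events.append(frame["event"])
        local.cursor = frame["seq"]
```

Whichever path runs, local.cursor ends up holding the value to send as since on the next join.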

joined ack — wire fields table

Extracted from server.py _handle_client_message after the PR. Mark the new fields clearly:

| Field | Type | New in #2155 | Description |
|---|---|---|---|
| type | "joined" | | Frame type |
| session_id | str | | Session UUID; pass back on next reconnect |
| agent_id | str | | Agent the client joined |
| resumed | bool | | true if an existing session was resumed |
| cursor | int | | Current head cursor on the server |
| oldest_cursor | int | yes | Oldest event still in the replay buffer |
| resync_required | bool | yes | true when since < oldest_cursor; drop local state |
| sequence | int \| null | | Sequence aligned with replay events |
| protocol_version | int | | Negotiated protocol version |
| server_min_version | int | | |
| server_max_version | int | | |
| presence | list[dict] | | Presence snapshot |
| health | dict | | Gateway health |

snapshot frame — wire fields table

Extracted from GatewaySession.get_snapshot():

| Field | Type | Description |
|---|---|---|
| type | "snapshot" | Frame type |
| state.session_id | str | Session UUID |
| state.agent_id | str | Agent ID |
| state.state | dict | Free-form session state |
| state.messages | list[dict] | Full message history (content, sender_id, session_id, message_id, timestamp, metadata) |
| state.event_cursor | int | Current event cursor; use as your next since |
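
One way to keep the documented table honest is to compare a captured snapshot frame against it. The helper below is a hypothetical sketch (snapshot_field_diff is not part of PraisonAI); the expected key set is copied from the table above and should be re-checked against GatewaySession.get_snapshot() before relying on it.

```python
# Sub-fields of snapshot.state as listed in this issue.
SNAPSHOT_STATE_FIELDS = {"session_id", "agent_id", "state", "messages", "event_cursor"}


def snapshot_field_diff(frame: dict) -> tuple:
    """Return (missing, unexpected) state keys relative to the documented table."""
    keys = set(frame.get("state", {}))
    missing = sorted(SNAPSHOT_STATE_FIELDS - keys)
    unexpected = sorted(keys - SNAPSHOT_STATE_FIELDS)
    return missing, unexpected
```

An empty pair means the captured frame and the table agree; anything else points at either a doc error or a server change.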

Top-level seq on every event

Document that, in addition to the cursor nested in event.data.cursor, every outbound frame of these types now carries a top-level seq:

  • response
  • message
  • stream_end
  • error
  • token_stream (new in #2155 — previously not session-tracked)
  • tool_call_stream (new in #2155 — previously not session-tracked)

Show a short client snippet for cheap gap detection:

expected = last_seq + 1
if event["seq"] != expected:
    # gap — resume from last good cursor or request snapshot
    await client.resync(since=last_seq)
last_seq = event["seq"]
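
A self-contained variant of the same check, written so it can be exercised without a live gateway. SeqTracker is a hypothetical helper, not an existing GatewayClient API. Unlike the fragment above, it leaves last_seq untouched when a gap is seen, on the assumption that the caller will rejoin with since set to the last contiguous value.

```python
class SeqTracker:
    """Hypothetical helper that tracks the top-level seq of delivered events."""

    def __init__(self, last_seq: int) -> None:
        self.last_seq = last_seq

    def observe(self, event: dict) -> bool:
        """Return True if the event directly follows last_seq, else False.

        Any non-successor (a skipped or repeated seq) is reported as a gap,
        and last_seq keeps pointing at the last contiguous event.
        """
        seq = event["seq"]
        contiguous = seq == self.last_seq + 1
        if contiguous:
            self.last_seq = seq
        return contiguous
```

When observe returns False, the client would typically rejoin with since=tracker.last_seq and let the joined ack decide between replay and snapshot.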

since validation

A one-paragraph note: since must be an integer. The gateway now rejects non-integer values with {"type": "error", "message": "Invalid 'since' cursor. Must be an integer."} and drops the join. Useful for catching off-by-string-type bugs in client code.
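
Clients can catch the same mistake before the frame leaves the process. The sketch below is an assumption about how a client might guard the value; build_join is a hypothetical helper, and the frame shape is the join example used earlier in this issue. Whether the gateway coerces numeric strings is not stated here, so the guard is deliberately strict.

```python
def build_join(agent_id: str, since: int) -> dict:
    """Build a join frame, refusing any since value that is not a real int."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(since, bool) or not isinstance(since, int):
        raise TypeError(f"since must be an int, got {type(since).__name__}")
    return {"type": "join", "agent_id": agent_id, "since": since}
```

Passing "42" (a string read back from storage, for example) raises TypeError locally instead of producing an error frame from the gateway.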

User interaction flow (AGENTS.md §1.1 item 11)

Short narrative describing what the user actually sees:

Your Telegram bot drops off the network during a 20-minute commute on flaky LTE. The session ID survives. On reconnect, the gateway normally just streams the dozen messages that arrived while you were offline. But if the buffer rolled (very long disconnect + busy session), the gateway sends a one-shot snapshot instead of a partial replay, and your client UI rebuilds from authoritative state in a single round trip. Either way, no message is silently missed.

Best Practices (<AccordionGroup>)

3–4 accordions. Suggestions:

  • "Don't trust resumed: true alone — check resync_required." — Old clients that only read resumed will silently diverge from gateway state if the buffer rolls. New field is the explicit signal.
  • "Persist event_cursor from the snapshot as your next since." — After applying a snapshot, the next reconnect should pass since=state.event_cursor, not the old pre-snapshot value.
  • "Use the top-level seq for cheap mid-stream gap detection." — Cheaper than digging into event.data.cursor; same value, exposed at the envelope.
  • "Raise max_messages if your sessions are bursty." — Default is 1000 in praisonaiagents/gateway/config.py, so the buffer rolls after ~2000 events. Bump it (or expect resyncs) for high-volume sessions.

Common Patterns (<Tabs> recommended)

  • Polite reconnect — minimal Python asyncio loop that handles joined, replay, snapshot correctly.
  • Snapshot-aware client — branches on resync_required, clears local state, rebuilds from state.messages.
  • Gap-driven resync — uses top-level seq to detect mid-stream gaps and triggers resync().

Imports must stay friendly per AGENTS.md §6.1:

from praisonai.gateway import GatewayClient

(Don't introduce from praisonai.gateway.server import GatewaySession examples — that's internal.)

Related (<CardGroup cols={2}>)


Required edits to the existing docs/features/gateway.mdx

(Only the additions — do not rewrite the existing page.)

  1. Add a <Tip> near the top, right after the hero Mermaid:

    "Reconnecting after a disconnect? See Resume Integrity — the joined ack now signals when you must take a snapshot instead of trusting partial replay."

  2. Append the new ## Resume Integrity section between ## Event Types and ## Common Patterns (see content guidance above).

  3. Extend the "Event Types" table with a note that token_stream and tool_call_stream are now also session-tracked (so they carry top-level seq and survive resume).

  4. Extend the Related CardGroup with a card pointing at the new section anchor — or, in Option B, a card pointing at gateway-resume-integrity.mdx.


docs.json change (only if Option B)

If we end up creating docs/features/gateway-resume-integrity.mdx, insert one line in the same Features group that already lists the other docs/features/gateway* pages:

                       "docs/features/gateway",
                       "docs/features/gateway-overview",
+                      "docs/features/gateway-resume-integrity",
                       "docs/features/gateway-cli",
                       "docs/features/gateway-error-handling",

Verify the file remains valid JSON after editing (AGENTS.md §1.9 rule 8).
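
A quick way to run that check with only the standard library is sketched below. The helper name json_error is hypothetical; it takes the file contents as text so it can be used from a script or a test.

```python
import json


def json_error(text: str):
    """Return None if text parses as JSON, else a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None
```

From the repository root, calling json_error(open("docs.json", encoding="utf-8").read()) should return None after the edit. A stray trailing comma after the inserted line is the most likely failure.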

For Option A (extend gateway.mdx), no docs.json change is needed.


Quality checklist for the agent picking this up

Before opening the PR, confirm every item from AGENTS.md §9:

  • Read protocols.py, server.py, and config.py first — no guessed field names or types.
  • All five new wire artefacts documented: oldest_cursor, resync_required, snapshot frame, top-level seq, since validation error.
  • joined ack table lists every field (existing + new) with the new ones flagged.
  • snapshot.state sub-fields match GatewaySession.get_snapshot() exactly (session_id, agent_id, state, messages, event_cursor).
  • Note that token_stream and tool_call_stream are now session-tracked.
  • since validation error message documented verbatim.
  • Hero Mermaid uses §3.1 color palette + white text + classDef blocks.
  • Section intros are one sentence (AGENTS.md §6.2); none of the forbidden phrases (§6.3).
  • Friendly imports only — no from praisonai.gateway.server import … in user-facing examples.
  • Examples copy-paste runnable, no your-key-here placeholders.
  • If Option B: docs.json is still valid JSON; new slot sits inside the existing Features group (not Concepts, not auto-generated SDK reference).
  • No edits under docs/concepts/, docs/sdk/reference/**, docs/js/, docs/rust/.
  • PR is a draft, branched off main, not committed directly to main.

Out of scope

  • The PR is server-side only. There is no Python GatewayClient change in this PR; GatewayClient's gap callback already exists from #2131 (see open docs issue #701, "docs: add Gateway Reconnecting Client + Protocol Version Negotiation (PraisonAI PR #2131 / Issue #2130)"). Don't change GatewayClient's docs here; only describe the wire contract a client should now read.
  • TypeScript and Rust SDKs are not touched by #2155. Their reference pages under docs/sdk/reference/typescript/** and docs/sdk/reference/rust/** will be regenerated by the parity tooling — do not hand-edit them.
  • Migration guide for old clients is not required — the fields are additive and old clients keep working (they just lose the new safety). A <Note> in Best Practices flagging this is enough.

cc @MervinPraison
