Skip to content

feat: migrate to pull model with gRPC unary RPCs#200

Draft
gfyrag wants to merge 9 commits into
mainfrom
feat/agent-pull-model
Draft

feat: migrate to pull model with gRPC unary RPCs#200
gfyrag wants to merge 9 commits into
mainfrom
feat/agent-pull-model

Conversation

@gfyrag

@gfyrag gfyrag commented May 29, 2026

Copy link
Copy Markdown
Contributor

Existing architecture and problems

The agent connects to membership via bidirectional gRPC streaming (Join RPC):

  • A permanent stream must be maintained, with a ping/pong mechanism every 5-10s
  • The agent receives orders (ExistingStack, DeletedStack, DisabledStack, EnabledStack) pushed via the stream
  • K8s status changes are reported back via the same stream (StatusChanged, ModuleStatusChanged, etc.)
  • Reconnection after a disconnection triggers a full sync of all stacks
  • Bidirectional coupling makes debugging and resilience complex
  • The code uses channels, worker pools (pond), and custom connection adapters with tracing

New architecture

Migration to a pull model with gRPC unary RPCs:

  • Polling client: periodic poll of ListStacks with a paginated (updated_at, id) cursor
  • Membership only returns stacks modified since the last cursor
  • Full sync on startup (empty cursor) with K8s orphan cleanup (deletes stacks with label formance.com/created-by-agent=true not returned by membership)
  • Status reporter: K8s informers call unary RPCs instead of Send() on a stream
  • Heartbeat replaces ping/pong (every 30s)
  • Auth metadata attached via UnaryClientInterceptor instead of stream connection

What we gain

  • Resilience: every poll is a reconciliation, no event loss
  • Simplicity: no stream, no channels, no connection adapter, no worker pool
  • Debugging: idempotent requests, standard logs
  • Orphan cleanup: automatic detection and deletion of orphaned K8s stacks on startup

Key changes

  • New polling_client.go: cursor-based poll loop with orphan cleanup
  • New membership_reporter.go: unary RPC wrapper for status reporting
  • Informers adapted to use MembershipReporter instead of stream Send()
  • Auth metadata via grpc.UnaryClientInterceptor
  • Removed stream client, connection adapter, gRPC tracing wrappers
  • New --poll-interval flag (default 10s)

Companion PR

Membership: formancehq/membership-api#755 — both PRs must be deployed together.

Test plan

  • Build passes
  • Unit tests (requires kubebuilder/etcd for K8s envtest)
  • Integration tests in CI
  • Deploy with updated membership simultaneously

Generated with Claude Code

Replace the bidirectional gRPC streaming client with a polling-based
architecture. The agent now pulls changed stacks from membership via
paginated ListStacks RPC and reports status via unary RPCs.

Key changes:
- New polling_client.go: cursor-based poll loop with orphan cleanup
- New membership_reporter.go: unary RPC wrapper for status reporting
- Informers adapted to use MembershipReporter instead of stream Send()
- Auth metadata attached via UnaryClientInterceptor
- Remove stream client, connection adapter, gRPC tracing wrappers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb314437-5c02-4b44-a08e-daae53eb43c1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/agent-pull-model

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

gfyrag and others added 7 commits May 29, 2026 15:54
The tracer was only used by the deleted gRPC streaming code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The informer must see the stack via AddFunc before it can report
a DeleteFunc. Wait for the StackStatus event before deleting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A newly created stack has no status, so the AddFunc doesn't emit a
StackStatus event. Use a sleep to allow the informer cache to sync
before deleting the stack.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add graceful disconnect RPC for agent shutdown signaling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CRITICAL:
- Fix agentID vs region.ID in UpsertVersion, DeleteVersion, regionPing
  (was using agent_id where DB primary key region.id is needed)
- Extract resolveRegion() helper for consistent region lookup

HIGH:
- Revert Justfile proto ref to refs/heads/main
- Delete dead code: agent/storage.go, grpc/storage.go, server_impl.go,
  manager.go and their generated mocks

MEDIUM:
- Sanitize gRPC error messages (don't leak DB errors to clients)
- Use AnyRegion consistently for region lookups
- Only set nextCursor when hasMore is true
- Add nil check in InputUpdate.Validate()
- Clean .gitignore of stale pulumi entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the last page has no results, the server returns an empty cursor.
The agent was overwriting its saved cursor with this empty value,
causing the next poll to re-scan from the beginning.

Now the agent only updates its cursor when the server returns a
non-empty one, preserving the previous position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@flemzord flemzord left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments from Codex.

Comment on lines +153 to +161
p.reconcileStack(ctx, stack)
}

if !resp.GetHasMore() {
// Only update cursor if the server returned one.
// An empty cursor means no results — keep the previous cursor
// to avoid re-scanning from the beginning.
if next := resp.GetNextCursor(); next != "" {
p.cursor = next

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcileStack cannot report failures, but SyncExistingStack can fail and return early after logging. This means the poll loop can still advance p.cursor and permanently skip a membership update that was not applied to Kubernetes. Return reconciliation errors and keep the cursor unchanged on failure, or add an explicit ack/retry path.

Comment on lines +169 to +171
if isFullSync {
p.cleanupOrphans(ctx, allStackNames)
p.fullSyncDone = true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first full-sync cleanup depends on p.k8sClient.List, which is backed by the informer cache, but polling is started before informers are started/synced. If this one-shot cleanup sees an empty or stale cache, fullSyncDone is still set and orphan cleanup is skipped permanently. Make cleanup return an error and only set fullSyncDone after a successful list/delete pass, or wait for cache sync first.

Restore the deleted `service Server { rpc Join(stream Message) returns
(stream Order) }` definition alongside the new AgentService. Old agent
binaries already deployed in production still speak the streaming
protocol; membership-api needs both service stubs generated to serve
them in parallel during the migration.

The agent binary itself only uses AgentService — the legacy schema is
shipped only so membership can keep streaming support behind a flag.
Sunset target: 2026-09-09 (~3 months).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants