Support pausing and resuming federation links #1841

Draft
markusgust wants to merge 3 commits into main from feat/federation-link-pause-resume

@markusgust
Contributor

Summary

  • Add pause/resume functionality for federation links, allowing operators to temporarily halt federation without deleting upstream definitions or policies
  • Paused state is persisted to disk and survives broker restarts
  • Add HTTP API endpoints: PUT /api/federation-links/:vhost/:upstream/:name/{pause,resume}
  • Add pause/resume button and status column to management console federation page

Closes #1770

Test plan

  • Pause a running federation link via API and verify status becomes "paused"
  • Resume a paused link and verify it reconnects and resumes federating messages
  • Verify messages are not federated while link is paused
  • Verify pause state persists across broker restart
  • Test pause/resume button in management console
  • Verify 404 for non-existing links, 422 for invalid state transitions

🤖 Generated with Claude Code

Add pause/resume functionality for federation links, allowing operators
to temporarily halt federation without deleting upstream definitions or
policies. Paused state is persisted to disk and survives broker restarts.

- Add Paused state and pause/resume methods to federation Link
- Add HTTP API endpoints: PUT /api/federation-links/:vhost/:upstream/:name/{pause,resume}
- Add pause/resume button and status column to management console
- Add specs for pause/resume behavior and API endpoints

Closes #1770

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude

claude bot commented Apr 7, 2026

PR Review: Federation Link Pause/Resume

Bug: Race condition in resume — old run_loop can corrupt new state

File: src/lavinmq/federation/link.cr

The resume method replaces @state_changed with a new channel and spawns a new run_loop fiber, but it does not wait for the old run_loop fiber (from before the pause) to finish. This causes two problems:

1. The old fiber can close the new channel.

When pause is called, the old run_loop will eventually break and hit its ensure block (line 145: @state_changed.close). If resume has already been called and assigned a new Channel to @state_changed, the old fiber closes the new channel, breaking the newly spawned run_loop.

Timeline:

  1. pause() → state = Paused, closes upstream connection
  2. resume() → creates new @state_changed, sets state to Stopped, spawns new run_loop
  3. Old run_loop ensure block runs → @state_changed.close closes the new channel
  4. New run_loop calls wait_before_reconnect → @state_changed.receive? returns nil immediately (closed channel)

2. If resume is called quickly after pause, two run_loops can run concurrently.

The resume method sets state from Paused to Stopped before the old fiber has checked should_stop_loop?. Since should_stop_loop? checks @state.paused? (which is now false) and stop_link? only checks for Terminating/Terminated (also false), the old fiber does not break out of its loop. You now have two fibers running run_loop on the same link, both trying to manage the same upstream connection.

Fix suggestion: resume should wait for the old run_loop to fully exit before starting a new one. One approach: save a reference to the old @state_changed channel before replacing it, and receive on it (the old fiber closes it in ensure, which will unblock the receive). Only then set the new state and spawn the new loop.
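
The suggested hand-off could look roughly like the following. This is a minimal sketch and not the PR's code: ToyLink, its State enum, and the loops_started/active_loops counters are hypothetical names made up for illustration, and the loop body is reduced to waiting on the channel. Only the pattern in resume (keep the old channel, receive on it until the old fiber closes it, then replace it and spawn) follows the suggestion above.

```crystal
# Toy model of the suggested fix. ToyLink is hypothetical and is not LavinMQ's
# federation Link; only the channel hand-off in `resume` mirrors the review.
class ToyLink
  enum State
    Stopped
    Running
    Paused
  end

  getter state : State = State::Stopped
  getter loops_started = 0 # how many run_loop fibers were ever spawned
  getter active_loops = 0  # how many run_loop fibers are alive right now

  def initialize
    @state_changed = Channel(State).new
  end

  def run
    @state = State::Running
    @loops_started += 1
    spawn { run_loop }
  end

  def pause
    return unless @state.running?
    @state = State::Paused
    @state_changed.send(State::Paused) # wake the loop so it can exit
  end

  def resume
    return unless @state.paused?
    old_channel = @state_changed
    # The old run_loop closes this channel at the end of its ensure block,
    # so receive? returns nil only after that fiber is done.
    old_channel.receive? unless old_channel.closed?
    @state_changed = Channel(State).new
    run
  end

  private def run_loop
    @active_loops += 1
    loop do
      event = @state_changed.receive?
      break if event.nil? || @state.paused?
    end
  ensure
    @active_loops -= 1
    @state_changed.close
  end
end

link = ToyLink.new
link.run
link.pause
link.resume # returns only after the first run_loop has closed its channel
Fiber.yield # let the new run_loop start
puts "loops started: #{link.loops_started}, active: #{link.active_loops}"
```

Because resume blocks on the old channel before assigning a new one, the old fiber's ensure block can only ever close the channel it was started with, and at most one run_loop is alive at a time.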


Reviewed with Claude Code

Wait for the old run_loop fiber to fully exit before spawning a new one
on resume. Without this, the old fiber could close the new state channel
or two run_loops could run concurrently on the same link.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude

claude bot commented Apr 7, 2026

PR Review

Bug: Deadlock in resume when link starts paused from disk persistence

File: src/lavinmq/federation/link.cr:78-90

When a federation link is loaded with a persisted paused state (the .paused file exists on disk), initialize sets @state = State::Paused and then run returns early without ever spawning run_loop:

def run
  return if @state.paused?  # returns immediately
  ...
  spawn(run_loop, ...)      # never reached
end

Since run_loop never runs, its ensure block never executes, and @state_changed is never closed.

When resume is later called:

def resume
  return unless @state.paused?
  delete_paused_file
  @log.info { "Resuming federation link #{name}" }
  old_channel = @state_changed
  old_channel.receive? unless old_channel.closed?  # BLOCKS FOREVER
  ...
end

old_channel is not closed and has no senders, so receive? blocks indefinitely. This deadlocks the calling fiber (the HTTP handler), making the link permanently un-resumable after a server restart.

Fix: Skip the receive? wait when no run_loop fiber is active, for example by tracking whether run_loop was ever spawned (e.g., a @running_loop flag). The simplest fix might be to close the channel in run when returning early for the paused case:

def run
  if @state.paused?
    @state_changed.close  # so resume won't block
    return
  end
  ...
end

This way resume sees old_channel.closed? is true, skips the receive?, creates a new channel, and proceeds normally.


No other blocking issues found. Specs are thorough, the HTTP endpoints and JS UI look correct.

When a link starts paused from disk persistence, run() returns early
without spawning run_loop, leaving @state_changed open. A subsequent
resume() would block forever on old_channel.receive?. Fix by closing
the channel when returning early from run().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude

claude bot commented Apr 7, 2026

PR Review

Bug: QueueLink monitor_consumers fiber not respawned on resume

src/lavinmq/federation/link.cr:81-92

The resume method creates a new @state_changed channel and spawns a new run_loop, but for QueueLink subclasses, the monitor_consumers fiber (spawned only once in the constructor at line 251) is not respawned.

When a QueueLink is paused:

  1. pause sets state to Paused and closes the upstream connection
  2. run_loop breaks out and its ensure block closes @state_changed
  3. monitor_consumers detects the closed @state_changed channel (line 270) and exits permanently

When resume is later called:

  1. A new @state_changed channel is created and run_loop is spawned
  2. But monitor_consumers is never respawned — it only runs from the constructor
  3. QueueLink#start_link (line 373) blocks on @consumer_available.receive?, which monitor_consumers is responsible for signaling
  4. The queue federation link hangs forever and never federates messages

This affects both runtime pause/resume and links that were paused on disk (where run closes @state_changed immediately, also causing monitor_consumers to exit).

The resume method needs to be overridden in QueueLink to also respawn monitor_consumers, or monitor_consumers needs to be resilient to channel replacement.
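
The first option (overriding resume) might be shaped like the sketch below. Everything here is a hypothetical stand-in: SketchLink and SketchQueueLink are not the real classes, pause simply closes the channel in place of run_loop's ensure block, and the monitor body only waits for that close. The point is limited to showing resume calling super and then respawning the monitor against the new channel.

```crystal
# Hypothetical stand-ins for Link and QueueLink; only the respawn-on-resume
# shape is taken from the review.
class SketchLink
  @state_changed = Channel(Bool).new

  def pause
    @state_changed.close # stands in for run_loop's ensure block
  end

  def resume
    @state_changed = Channel(Bool).new
  end
end

class SketchQueueLink < SketchLink
  getter monitors_started = 0 # monitor fibers ever spawned
  getter active_monitors = 0  # monitor fibers alive right now

  def initialize
    spawn_monitor
  end

  def resume
    super
    spawn_monitor # the previous monitor exited when the old channel closed
  end

  private def spawn_monitor
    @monitors_started += 1
    channel = @state_changed # bind this monitor to the current channel
    spawn { monitor_consumers(channel) }
  end

  private def monitor_consumers(state_changed : Channel(Bool))
    @active_monitors += 1
    until state_changed.receive?.nil?
      # The real monitor would signal consumer availability here.
    end
  ensure
    @active_monitors -= 1
  end
end

queue_link = SketchQueueLink.new
Fiber.yield # monitor starts and blocks on the channel
queue_link.pause
Fiber.yield # monitor sees the closed channel and exits
queue_link.resume
Fiber.yield # a fresh monitor starts on the new channel
puts "monitors started: #{queue_link.monitors_started}, active: #{queue_link.active_monitors}"
```

Passing the channel into the monitor as an argument, rather than re-reading the instance variable, keeps each monitor tied to the channel that was current when it was spawned.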

Missing require "file_utils"

src/lavinmq/federation/link.cr:104

FileUtils.rm is called but "file_utils" is never required in this file. It compiles only because src/lavinmq/message_store.cr happens to require it transitively. This is fragile — if that transitive dependency changes, this file breaks. Add require "file_utils" at the top of the file.

No specs for QueueLink pause/resume

All pause/resume specs (both unit and API) only exercise exchange federation links. Given the QueueLink bug above, there should be specs covering queue link pause and resume to catch regressions.
