Skip to content

Conversation

@lalinsky
Copy link
Owner

@lalinsky lalinsky commented Sep 4, 2025

Summary

Implements a public reconnect() method that allows users to manually trigger the reconnection process, addressing issue #86.

This provides flexibility for scenarios like:

  • Refreshing authentication credentials
  • Rebalancing client connections
  • Testing reconnection behavior

Implementation Details

  • Public API: nc.reconnect() method with comprehensive error handling
  • Error Types: Added ReconnectDisabled, AlreadyReconnecting, NotConnected, ManualReconnect
  • Thread Safety: Reuses existing robust reconnection infrastructure with proper mutex protection
  • Full Recovery: Automatically re-establishes all subscriptions and flushes pending messages
  • Callbacks: Properly invokes disconnected/reconnected callbacks like automatic reconnection

Test Coverage

  • ✅ Manual reconnection with subscription survival verification
  • ✅ Error handling for disabled reconnection
  • ✅ Error handling for closed connections
  • ✅ Memory leak fixes and comprehensive testing
  • ✅ All existing tests continue to pass

Usage Example

// Force a reconnection to refresh auth or rebalance
try nc.reconnect();

🤖 Generated with Claude Code

Add public reconnect() method that allows users to manually trigger
reconnection process for scenarios like refreshing authentication
credentials or rebalancing client connections.

Features:
- Manual reconnection with full subscription recovery
- Comprehensive error handling for invalid states
- Reuses existing robust reconnection infrastructure
- Thread-safe implementation with proper callback support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 4, 2025

Walkthrough

Adds a public Connection.reconnect() method, extends the public ConnectionError union with four new variants, integrates manual reconnect with the existing background reconnect thread, and adds tests for manual reconnect success and error cases plus a small test-helper mutex call change.

Changes

Cohort / File(s) Summary of Changes
Connection API and reconnection flow
src/connection.zig
Added pub fn reconnect(self: *Self) !void; added ReconnectDisabled, AlreadyReconnecting, NotConnected, ManualReconnect to pub const ConnectionError; checks connection status, logs, updates status to .reconnecting, performs cleanup using a synthetic ManualReconnect error, spawns background reconnect thread (doReconnectThread), schedules disconnected_cb invocation, and handles spawn failure by marking for close.
Reconnection tests and helper tweak
tests/reconnection_test.zig
Changed CallbackTracker.timedWait to pass &self.mutex to cond.timedWait; added tests asserting manual reconnect success (callbacks, subscription survives reconnect, message delivery) and error paths: ReconnectDisabled when reconnect disallowed and ConnectionClosed when connection closed before reconnect.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch implement-manual-reconnect

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
src/connection.zig (3)

507-513: Make the synthetic reconnect cause internal (don’t expose ManualReconnect publicly)

ManualReconnect is only used as an internal sentinel to drive cleanup. Keeping it in the public ConnectionError set leaks an implementation detail.

  • Remove ManualReconnect from ConnectionError (see separate diff comment).
  • Add a private error set and use it here:
-        const synthetic_error = ConnectionError.ManualReconnect;
+        const synthetic_error = InternalReconnect.ManualReconnect;

Add near the top-level (outside this range):

const InternalReconnect = error{ ManualReconnect };

112-116: Scope public errors to user-facing cases only; move ManualReconnect to a private error set

ManualReconnect need not be part of the public API.

Apply:

 pub const ConnectionError = error{
@@
-    NotConnected,
-    ManualReconnect,
+    NotConnected,
 } || PublishError || std.Thread.SpawnError || std.posix.WriteError || std.posix.ReadError;

Then define a private set (outside this block), e.g.:

const InternalReconnect = error{ ManualReconnect };

496-499: Minor: tweak wording for connecting-state log

The log says “initial connection not yet established” but returns error.NotConnected. Consider clarifying the log to “Cannot reconnect while initial connect is in progress”.

-            .connecting => {
-                log.warn("Cannot reconnect: initial connection not yet established", .{});
+            .connecting => {
+                log.warn("Cannot reconnect: initial connect is in progress", .{});
                 return ConnectionError.NotConnected;
             },
tests/reconnection_test.zig (2)

154-160: Use the canonical condvar pattern: hold the lock while checking the predicate

As written, there’s a small race between reading reconnected_called and waiting, which can cause an extra 100ms delay. Hold the lock across the predicate check.

-    var timer = try std.time.Timer.start();
-    while (tracker.reconnected_called == 0) {
-        if (timer.read() >= 5000 * std.time.ns_per_ms) {
-            return error.ReconnectionTimeout;
-        }
-        try tracker.timedWait(100);
-    }
+    var timer = try std.time.Timer.start();
+    tracker.mutex.lock();
+    defer tracker.mutex.unlock();
+    while (tracker.reconnected_called == 0) {
+        if (timer.read() >= 5000 * std.time.ns_per_ms) {
+            return error.ReconnectionTimeout;
+        }
+        _ = tracker.cond.timedWait(&tracker.mutex, 100 * std.time.ns_per_ms) catch {};
+    }

183-202: Add coverage for AlreadyReconnecting (optional)

Call nc.reconnect() twice back-to-back and assert error.AlreadyReconnecting on the second call.

test "reconnect() returns AlreadyReconnecting when called twice" {
    tracker.reset();
    const nc = try utils.createConnection(.node1, .{
        .reconnect = .{ .allow_reconnect = true, .reconnect_wait_ms = 100 },
    });
    defer utils.closeConnection(nc);

    try nc.reconnect();
    try testing.expectError(error.AlreadyReconnecting, nc.reconnect());
}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b088486 and 9f91f1b.

📒 Files selected for processing (2)
  • src/connection.zig (2 hunks)
  • tests/reconnection_test.zig (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.zig

📄 CodeRabbit inference engine (CLAUDE.md)

Format code using zig fmt before commit

Files:

  • src/connection.zig
  • tests/reconnection_test.zig
🪛 GitHub Actions: CI
tests/reconnection_test.zig

[error] 1-1: zig fmt --check . failed. Formatting issues detected in ./tests/reconnection_test.zig. Run 'zig fmt' to fix formatting.

🔇 Additional comments (2)
tests/reconnection_test.zig (2)

65-65: Correct mutex usage with cond.timedWait

Passing &self.mutex is the right fix; this matches Zig’s condvar API.


1-203: Fix CI: run zig fmt
CI requires all Zig sources formatted. Unable to verify here (no zig available); please run zig fmt --check . locally and commit any formatting changes.

When std.Thread.spawn() fails in reconnect(), ensure proper thread cleanup
by using deferred close pattern instead of setting status under mutex.
This mirrors the pattern used in markNeedsReconnect() and prevents
reader/flusher threads from hanging on acquireSocket().

Co-authored-by: Lukáš Lalinský <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/connection.zig (1)

467-530: Manual reconnect flow looks solid; spawn-failure path now safe

Locking, status transitions, buffer freezing, and the deferred disconnected_cb are correct. The spawn error fallback with needs_close + deferred close happens outside the mutex, resolving the prior deadlock risk.

🧹 Nitpick comments (6)
src/connection.zig (3)

112-116: Don’t expose internal “ManualReconnect” in the public error set

It’s only used as a synthetic trigger and never surfaced to callers. Keeping it public widens the error set type and can ripple through downstream code. Make it a local error set in reconnect() instead.

Apply:

@@
     NoResponders,
-    ReconnectDisabled,
-    AlreadyReconnecting,
-    NotConnected,
-    ManualReconnect,
+    ReconnectDisabled,
+    AlreadyReconnecting,
+    NotConnected,
@@
-        const synthetic_error = ConnectionError.ManualReconnect;
+        const SyntheticReconnectError = error{ ManualReconnect };
+        const synthetic_error: anyerror = SyntheticReconnectError.ManualReconnect;

467-530: Minor API polish: document non-blocking behavior and returned errors

Add to the doc comment that reconnect() is non-blocking and list the possible errors (ReconnectDisabled, AlreadyReconnecting, ConnectionClosed, NotConnected).


1400-1512: Reduce mutex hold during reconnect attempts (optional)

doReconnect holds the main mutex across connect(), handshake, sleeps, and server-pool walks. That can stall publishers/subscribers during long network delays. Consider narrowing the critical sections: compute next server + update state under the lock, perform blocking socket/handshake outside, then re-enter to commit results.

tests/reconnection_test.zig (3)

154-161: Avoid data race on tracker.reconnected_called; hold the mutex across the wait loop

The loop reads reconnected_called without the mutex while callbacks increment it under the mutex. Make the read+wait atomic with the condition variable.

Apply:

-    var timer = try std.time.Timer.start();
-    while (tracker.reconnected_called == 0) {
-        if (timer.read() >= 5000 * std.time.ns_per_ms) {
-            return error.ReconnectionTimeout;
-        }
-        try tracker.timedWait(100);
-    }
+    var timer = try std.time.Timer.start();
+    tracker.mutex.lock();
+    defer tracker.mutex.unlock();
+    while (tracker.reconnected_called == 0) {
+        if (timer.read() >= 5000 * std.time.ns_per_ms) return error.ReconnectionTimeout;
+        _ = tracker.cond.timedWait(&tracker.mutex, 100 * std.time.ns_per_ms) catch {};
+    }

183-193: Optional: assert no callbacks fired when reconnect is disabled

ReconnectDisabled should not change state. Assert disconnected_called and reconnected_called remain 0 to catch regressions.


195-202: Optional: add tests for AlreadyReconnecting and NotConnected

  • Call nc.reconnect() twice back-to-back; the second should yield error.AlreadyReconnecting.
  • Trigger during initial connect (or fake .connecting) to validate error.NotConnected.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9f91f1b and aff3f4a.

📒 Files selected for processing (2)
  • src/connection.zig (2 hunks)
  • tests/reconnection_test.zig (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.zig

📄 CodeRabbit inference engine (CLAUDE.md)

Format code using zig fmt before commit

Files:

  • src/connection.zig
  • tests/reconnection_test.zig
🧠 Learnings (1)
📚 Learning: 2025-08-30T18:48:28.333Z
Learnt from: lalinsky
PR: lalinsky/nats.zig#66
File: src/connection.zig:317-323
Timestamp: 2025-08-30T18:48:28.333Z
Learning: In src/connection.zig, the establishConnection() call in connectToServer() is during initial connection setup where no race conditions exist - the connection starts disconnected and other threads would see the connecting status and wait appropriately.

Applied to files:

  • src/connection.zig
🔇 Additional comments (1)
tests/reconnection_test.zig (1)

65-65: Good fix: pass &self.mutex to timedWait

This matches Condition.timedWait’s signature and avoids UB.

@lalinsky lalinsky merged commit 007ce50 into main Sep 5, 2025
4 checks passed
@lalinsky lalinsky deleted the implement-manual-reconnect branch September 5, 2025 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant