
fix(ipc): retry unix socket comms if none available #117

Merged
LeoBorai merged 5 commits into main from fix/wait-for-components on Feb 8, 2026


LeoBorai (Owner) commented on Feb 8, 2026

Instead of sleeping for a fixed time to let processes start, we now use IPC to check that they are up.

LeoBorai changed the title from "fix: retry unix socket comms if none available" to "fix(ipc): retry unix socket comms if none available" on Feb 8, 2026
LeoBorai requested a review from Copilot on February 8, 2026 14:06

Copilot AI left a comment


Pull request overview

This PR aims to make IPC over Unix sockets more resilient during component startup by retrying sends when the target socket isn’t immediately available, and by switching the hub startup flow from a fixed sleep to explicit component readiness checks.

Changes:

  • Add a short retry loop waiting for the target Unix socket file to appear before connecting/sending.
  • Add tracing dependency to the mate_ipc crate to support new debug logging.
  • Replace a fixed startup delay in the hub with wait_for_components() readiness checks (plus minor import cleanup in CLI).
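
A readiness check of the kind described in the last bullet might look roughly like the sketch below. It is not the PR's `wait_for_components()`: the function name `wait_for_components_sketch`, the `is_ready` callback (standing in for an IPC ping), and the 10 ms poll interval are all assumptions for illustration, and it is synchronous so that it needs only the standard library.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical readiness check: poll every component until all report ready
/// or `timeout` elapses. On timeout, returns the names that are still pending.
fn wait_for_components_sketch<F>(
    names: &[&str],
    timeout: Duration,
    mut is_ready: F,
) -> Result<(), Vec<String>>
where
    F: FnMut(&str) -> bool,
{
    let deadline = Instant::now() + timeout;
    let mut pending: Vec<&str> = names.to_vec();

    loop {
        // Keep only the components that have not answered yet.
        pending.retain(|name| !is_ready(*name));
        if pending.is_empty() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(pending.iter().map(|name| name.to_string()).collect());
        }
        // Assumed poll interval; a real implementation would likely make this configurable.
        sleep(Duration::from_millis(10));
    }
}
```

Compared with a fixed sleep, this returns as soon as every component answers and reports exactly which ones did not when the timeout is hit.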

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/ipc/src/transport/unix_socket.rs | Adds socket-availability polling + debug logging before connect/send. |
| src/ipc/Cargo.toml | Adds tracing as a dependency for IPC crate. |
| src/cli/src/transport.rs | Import formatting cleanup only. |
| src/cli/src/process/hub.rs | Replaces fixed sleep with wait_for_components() during process spawn. |
| Cargo.lock | Locks tracing dependency addition. |
Comments suppressed due to low confidence (2)

src/ipc/src/transport/unix_socket.rs:200

  • The new socket-availability wait only sleeps 10ms * 3 plus connect retries (10ms + 20ms), so send_message_internal will fail after ~60ms if the target component is still starting. Previously the CLI waited 1s before pinging components, so this change can reintroduce flaky startup failures on slower machines/CI. Consider retrying up to a time-based deadline (e.g., a few seconds) with exponential backoff, or make the retry count/delay configurable.
        let target_socket = Self::socket_path_for_process(&self.base_path, &msg.to);
        let mut tries = 0;

        if !target_socket.exists() {
            loop {
                if target_socket.exists() {
                    break;
                }

                if tries >= UNIX_SOCKET_CONNECTION_RETRIES {
                    return Err(anyhow!(
                        "Target process {:?} socket does not exist at {:?}",
                        msg.to,
                        target_socket
                    ));
                }

                sleep(Duration::from_millis(10)).await;

                tries += 1;

                debug!(
                    "Waiting for target process {:?} socket to be available at {:?} (attempt {}/{})",
                    msg.to, target_socket, tries, UNIX_SOCKET_CONNECTION_RETRIES
                );
            }
        }

        let mut stream =
            Self::connect_with_retry(&target_socket, UNIX_SOCKET_CONNECTION_RETRIES).await?;
        let serialized = serde_json::to_vec(msg)?;
        let len = (serialized.len() as u32).to_le_bytes();
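
The time-based deadline with exponential backoff suggested in this comment could be sketched as follows. This is a minimal synchronous version using only the standard library, not the PR's async code; `backoff_delay`, `connect_with_deadline`, and the 10 ms base and 500 ms cap are assumptions chosen for illustration.

```rust
use std::io;
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: delay before retry `attempt` (0-based), doubling from
/// `base` on each attempt and capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Hypothetical helper: keep trying to connect until `timeout` has elapsed,
/// then return the last connect error.
fn connect_with_deadline(path: &Path, timeout: Duration) -> io::Result<UnixStream> {
    let deadline = Instant::now() + timeout;
    let mut attempt: u32 = 0;

    loop {
        match UnixStream::connect(path) {
            Ok(stream) => return Ok(stream),
            Err(err) => {
                let now = Instant::now();
                if now >= deadline {
                    return Err(err);
                }
                // Assumed values: start at 10 ms, never wait more than 500 ms at once.
                let delay = backoff_delay(
                    attempt,
                    Duration::from_millis(10),
                    Duration::from_millis(500),
                );
                // Never sleep past the deadline.
                sleep(delay.min(deadline - now));
                attempt = attempt.saturating_add(1);
            }
        }
    }
}
```

With a deadline, the total wait is set by one number (for example a few seconds) rather than by the product of a retry count and per-attempt delays, which is what makes the ~60ms budget in the current code easy to overlook.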

src/ipc/src/transport/unix_socket.rs:200

  • There are now two layers of retry logic: a manual target_socket.exists() polling loop and then connect_with_retry(...) which already retries on connect errors (including ENOENT). This duplication increases complexity and extends the overall wait in a hard-to-reason-about way; it would be simpler to consolidate into a single retry/backoff path (and optionally improve the error message when the last error is ENOENT).
        if !target_socket.exists() {
            loop {
                if target_socket.exists() {
                    break;
                }

                if tries >= UNIX_SOCKET_CONNECTION_RETRIES {
                    return Err(anyhow!(
                        "Target process {:?} socket does not exist at {:?}",
                        msg.to,
                        target_socket
                    ));
                }

                sleep(Duration::from_millis(10)).await;

                tries += 1;

                debug!(
                    "Waiting for target process {:?} socket to be available at {:?} (attempt {}/{})",
                    msg.to, target_socket, tries, UNIX_SOCKET_CONNECTION_RETRIES
                );
            }
        }

        let mut stream =
            Self::connect_with_retry(&target_socket, UNIX_SOCKET_CONNECTION_RETRIES).await?;
        let serialized = serde_json::to_vec(msg)?;
        let len = (serialized.len() as u32).to_le_bytes();
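
One way to read the consolidation suggested in this comment is a single loop in which the connect call itself detects a missing socket file, with the ENOENT case turned into a clearer message at the end. The sketch below is synchronous and standard-library only, so it differs from the PR's async `connect_with_retry`; the name `connect_single_path`, its parameters, and the error wording are hypothetical.

```rust
use std::io;
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical single retry path: `connect` reports a missing socket file as
/// `NotFound` (ENOENT), so no separate `exists()` polling loop is needed.
fn connect_single_path(
    path: &Path,
    max_attempts: u32,
    delay: Duration,
) -> io::Result<UnixStream> {
    let mut last_err = None;

    for attempt in 1..=max_attempts {
        match UnixStream::connect(path) {
            Ok(stream) => return Ok(stream),
            Err(err) => {
                last_err = Some(err);
                // Only wait if another attempt will follow.
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
        }
    }

    match last_err {
        // Improve the message when the socket file never appeared.
        Some(err) if err.kind() == io::ErrorKind::NotFound => Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!(
                "target socket does not exist at {:?} after {} attempts",
                path, max_attempts
            ),
        )),
        Some(err) => Err(err),
        None => Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "max_attempts must be at least 1",
        )),
    }
}
```

Because there is only one loop, the worst-case wait is simply `(max_attempts - 1) * delay`, and a socket that appears between attempts is picked up without a second code path.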



@LeoBorai LeoBorai merged commit 663a89a into main Feb 8, 2026
8 checks passed