[Design Question] IPC timeout behaviour undefined - transport has no socket timeout and fail-open/fail-closed is unresolved per hook #23

@M-Masood4

Summary

While reviewing the Phase 1 implementation I found that the Python transport
has no socket timeout set on the UDS path. A stalled or slow sidecar will
block the agent thread indefinitely. This also surfaces a broader design
question that was raised in the project Slack but never formally resolved:
what is the intended fail behaviour per hook when IPC fails or times out?

The concrete issue in transport.py

In _connect_and_send_uds (transport.py:75):

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(self.socket_path)
    sock.sendall(frame_bytes)
    return self._read_response(sock)

No sock.settimeout() is called. If the sidecar stalls mid-response,
_read_response blocks on sock.recv() forever.
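
A minimal sketch of one way to bound the read, assuming the response length is known before reading. recv_exact_with_deadline is a hypothetical helper, not something that exists in transport.py, and the real framing inside _read_response may differ. It enforces a single overall deadline rather than a per-call timeout, because sock.settimeout() on its own limits each recv() separately and a sidecar that trickles bytes could still exceed the budget.

```python
import socket
import time


def recv_exact_with_deadline(sock: socket.socket, n: int, timeout_s: float) -> bytes:
    """Read exactly n bytes, or raise TimeoutError once the overall deadline passes.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout_s
    chunks = []
    remaining = n
    while remaining > 0:
        budget = deadline - time.monotonic()
        if budget <= 0:
            raise TimeoutError("sidecar response exceeded the deadline")
        # Shrink the per-call timeout to whatever is left of the total budget.
        sock.settimeout(budget)
        try:
            chunk = sock.recv(remaining)
        except socket.timeout as exc:
            # socket.timeout is an alias of TimeoutError on Python 3.10+;
            # re-raising keeps behaviour the same on older versions.
            raise TimeoutError("sidecar response exceeded the deadline") from exc
        if not chunk:
            raise ConnectionError("sidecar closed the connection mid-response")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The same deadline could also cover sock.connect() and sock.sendall() by calling sock.settimeout() before them; whether that belongs in the SDK is part of the question below.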

Additionally, the retry logic in send() only catches
ConnectionRefusedError and FileNotFoundError. Once a timeout is set, the
resulting socket.timeout (an alias of TimeoutError since Python 3.10)
would not be retried and would propagate to the caller unhandled.
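
To make the two options concrete, here is a rough sketch of a retry loop where including timeouts in the retry set is an explicit switch. send_with_retry, attempt, and retry_on_timeout are hypothetical names; the actual signature of send() is not shown in this issue, so treat this as an illustration of the choice rather than a proposed patch.

```python
import socket
import time

# The two exceptions this issue says send() retries today.
BASE_RETRYABLE = (ConnectionRefusedError, FileNotFoundError)


def send_with_retry(attempt, retries=3, backoff_s=0.005, retry_on_timeout=False):
    """Call attempt() up to `retries` times, retrying only the selected errors."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    retryable = BASE_RETRYABLE
    if retry_on_timeout:
        retryable = retryable + (socket.timeout,)
    for i in range(retries):
        try:
            return attempt()
        except retryable:
            if i == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff_s * (i + 1))  # linear backoff between attempts
```

One thing to weigh: retrying on timeout multiplies the worst-case latency by the number of attempts, so with a 10ms timeout and three attempts a stalled sidecar costs at least 30ms before the failure surfaces.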

Why this matters for a security layer

The architecture doc specifies a 4-8ms typical latency budget and ~10ms
worst-case, but nothing on the SDK side currently enforces that budget. An
attacker who can induce load on the sidecar process can therefore stall
every agent call that goes through the firewall.

The consequence depends on the fail behaviour, which is currently
undefined:

  • Fail-open (let the call proceed if firewall is unreachable):
    the attacker has bypassed the enforcement layer entirely
  • Fail-closed (block the call if firewall is unreachable):
    the attacker has created a denial of service

Neither is acceptable as a silent default. This needs to be an explicit,
configurable decision.
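
As a sketch of what "explicit" could look like on the SDK side: the fail mode is a required argument with no default, and every fallback is logged. FailMode, verdict_or_fallback, and the "allow"/"block" verdict strings are all hypothetical names chosen for illustration.

```python
from enum import Enum


class FailMode(Enum):
    OPEN = "fail-open"      # let the call proceed when the firewall is unreachable
    CLOSED = "fail-closed"  # block the call when the firewall is unreachable


def verdict_or_fallback(check, mode, log=print):
    """Return check()'s verdict, or the fallback verdict selected by `mode`.

    `mode` has no default on purpose, so callers must choose one.
    """
    try:
        return check()
    except OSError as exc:
        # OSError covers timeouts, refused connections and a missing socket file.
        fallback = "allow" if mode is FailMode.OPEN else "block"
        log(f"firewall unavailable ({exc!r}); applying {mode.value} -> {fallback}")
        return fallback
```

For example, verdict_or_fallback(check, FailMode.CLOSED) returns "block" if check() raises a timeout, and writes a log line so the fallback is never silent.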

The design question

Different hooks have different risk profiles and the right fail mode
probably differs per hook:

| Hook         | Suggested default | Reasoning                                          |
|--------------|-------------------|----------------------------------------------------|
| on_prompt    | fail-closed       | blocking a turn is recoverable                     |
| on_tool_call | fail-closed       | tool execution without inspection is unsafe        |
| on_context   | configurable      | degraded RAG is acceptable in some deployments     |
| on_memory    | fail-closed       | a poisoned write that bypasses inspection persists |

This could be expressed in sidecar.yaml under each hook definition,
similar to how on_ipc_timeout was proposed in the policy taxonomy
discussion.
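
A possible shape for the per-hook lookup, assuming sidecar.yaml has already been parsed into a dict. The hooks / on_ipc_timeout layout and the resolve_fail_mode name are assumptions for illustration; the real sidecar.yaml schema is not shown here. Because on_context is listed as "configurable", this sketch gives it no default and requires an explicit setting.

```python
VALID_MODES = {"fail-open", "fail-closed"}

# Defaults taken from the suggested-default table in this issue.
# None means "no default: the deployment must choose explicitly".
HOOK_DEFAULTS = {
    "on_prompt": "fail-closed",
    "on_tool_call": "fail-closed",
    "on_context": None,
    "on_memory": "fail-closed",
}


def resolve_fail_mode(hook, config):
    """Pick the fail mode for `hook` from parsed config, else the hook's default."""
    if hook not in HOOK_DEFAULTS:
        raise KeyError(f"unknown hook: {hook}")
    hook_cfg = config.get("hooks", {}).get(hook, {})
    mode = hook_cfg.get("on_ipc_timeout", HOOK_DEFAULTS[hook])
    if mode not in VALID_MODES:
        raise ValueError(
            f"{hook}: on_ipc_timeout must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )
    return mode
```

With config = {"hooks": {"on_context": {"on_ipc_timeout": "fail-open"}}}, on_context resolves to "fail-open" and on_memory falls back to "fail-closed"; an empty config raises ValueError for on_context instead of silently picking a mode.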

Questions for the mentor

  1. What is the intended fail behaviour when the sidecar is unreachable
    or times out: fail-open or fail-closed?
  2. Should this be configurable per hook, or a single global setting for v1?
  3. Should the SDK enforce the latency budget with a configurable timeout
    (e.g. ACF_TIMEOUT_MS env var), or is that the sidecar's responsibility?
  4. Should socket.timeout be included in the retry set, or should it
    propagate immediately as a hard failure?
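
For question 3, a small sketch of how the SDK might read the proposed ACF_TIMEOUT_MS variable. The 10ms default is an assumption taken from the worst-case figure above, and ipc_timeout_seconds is a hypothetical helper name.

```python
import math
import os

# Assumption: default to the ~10ms worst-case figure from the architecture doc.
DEFAULT_TIMEOUT_MS = 10.0


def ipc_timeout_seconds(env=None):
    """Return the IPC timeout in seconds, read from ACF_TIMEOUT_MS if set."""
    env = os.environ if env is None else env
    raw = env.get("ACF_TIMEOUT_MS", "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_MS / 1000.0
    try:
        ms = float(raw)
    except ValueError:
        raise ValueError(f"ACF_TIMEOUT_MS must be a number, got {raw!r}") from None
    # Reject zero and negatives: settimeout(0) would switch the socket to
    # non-blocking mode rather than apply a timeout.
    if not math.isfinite(ms) or ms <= 0:
        raise ValueError(f"ACF_TIMEOUT_MS must be a positive finite number, got {raw!r}")
    return ms / 1000.0
```

The returned value can be passed straight to sock.settimeout(), which takes seconds as a float.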

Once the direction is decided, I can put together a PR implementing it.
