Summary
While reviewing the Phase 1 implementation I found that the Python transport
has no socket timeout set on the UDS path. A stalled or slow sidecar will
block the agent thread indefinitely. This also surfaces a broader design
question that was raised in the project Slack but never formally resolved:
what is the intended fail behaviour per hook when IPC fails or times out?
The concrete issue in transport.py
In _connect_and_send_uds (transport.py:75):
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(self.socket_path)
        sock.sendall(frame_bytes)
        return self._read_response(sock)
No sock.settimeout() is called. If the sidecar stalls mid-response,
_read_response blocks on sock.recv() forever.
Additionally, the retry logic in send() only catches
ConnectionRefusedError and FileNotFoundError. Once a timeout is set, the
resulting socket.timeout (an alias of TimeoutError since Python 3.10)
would not be retried and would propagate as an unhandled error.
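As a minimal sketch of the missing timeout, assuming a standalone helper rather than the real method: the name send_frame_with_timeout and the parameters timeout_s and max_response_bytes are hypothetical, and the single recv() call stands in for _read_response, whose framing is not shown in this issue.

```python
import socket


def send_frame_with_timeout(
    socket_path: str,
    frame_bytes: bytes,
    timeout_s: float = 0.010,          # hypothetical default, not a project setting
    max_response_bytes: int = 4096,    # hypothetical cap for this sketch
) -> bytes:
    """Send one frame over a Unix domain socket with a per-operation timeout."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        # Applies to connect(), sendall() and recv() individually.
        # Each blocking call raises socket.timeout if it exceeds timeout_s.
        sock.settimeout(timeout_s)
        sock.connect(socket_path)
        sock.sendall(frame_bytes)
        # Stand-in for self._read_response(sock); the real framing is unknown here.
        return sock.recv(max_response_bytes)
```

Note that settimeout() bounds each socket operation, not the call as a whole. A read loop that issues several recv() calls would need its own deadline tracking to enforce a total budget.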
Why this matters for a security layer
The architecture doc specifies a 4-8ms typical latency budget and ~10ms
worst-case. But there is currently no enforcement of that budget on the
SDK side. An attacker who can induce load on the sidecar process can
stall every agent call that goes through the firewall.
The consequence depends on the fail behaviour, which is currently
undefined:
- Fail-open (let the call proceed if firewall is unreachable):
the attacker has bypassed the enforcement layer entirely
- Fail-closed (block the call if firewall is unreachable):
the attacker has created a denial of service
Neither is acceptable as a silent default. This needs to be an explicit,
configurable decision.
The design question
Different hooks have different risk profiles and the right fail mode
probably differs per hook:
| Hook | Suggested default | Reasoning |
|---|---|---|
| on_prompt | fail-closed | blocking a turn is recoverable |
| on_tool_call | fail-closed | tool execution without inspection is unsafe |
| on_context | configurable | degraded RAG is acceptable in some deployments |
| on_memory | fail-closed | a poisoned write that bypasses inspection persists |
This could be expressed in sidecar.yaml under each hook definition,
similar to how on_ipc_timeout was proposed in the policy taxonomy
discussion.
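One way the SDK side could resolve a per-hook fail mode is sketched below. The names FailMode, DEFAULT_FAIL_MODES and allow_on_ipc_failure are hypothetical, and the overrides mapping stands in for whatever would be parsed from sidecar.yaml. Treating on_context and unknown hooks as fail-closed until overridden is an assumption of this sketch, not a decided behaviour.

```python
from enum import Enum
from typing import Mapping, Optional


class FailMode(Enum):
    OPEN = "fail-open"
    CLOSED = "fail-closed"


# Defaults mirror the table above. on_context is listed as "configurable";
# this sketch ASSUMES fail-closed until a deployment overrides it.
DEFAULT_FAIL_MODES = {
    "on_prompt": FailMode.CLOSED,
    "on_tool_call": FailMode.CLOSED,
    "on_context": FailMode.CLOSED,
    "on_memory": FailMode.CLOSED,
}


def allow_on_ipc_failure(
    hook: str,
    overrides: Optional[Mapping[str, FailMode]] = None,
) -> bool:
    """Return True if the call may proceed when IPC fails or times out."""
    # Unknown hooks fall back to fail-closed (assumption of this sketch).
    default = DEFAULT_FAIL_MODES.get(hook, FailMode.CLOSED)
    mode = (overrides or {}).get(hook, default)
    return mode is FailMode.OPEN
```

For example, a deployment that tolerates degraded RAG could pass {"on_context": FailMode.OPEN} as overrides while every other hook stays fail-closed.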
Questions for the mentor
- What is the intended fail behaviour when the sidecar is unreachable
or times out: fail-open or fail-closed?
- Should this be configurable per hook, or a single global setting for v1?
- Should the SDK enforce the latency budget with a configurable timeout
(e.g. ACF_TIMEOUT_MS env var), or is that the sidecar's responsibility?
- Should socket.timeout be included in the retry set, or should it
propagate immediately as a hard failure?
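To make the last two questions concrete, here is a rough sketch of both options. ACF_TIMEOUT_MS is the variable name suggested above. The helper names timeout_seconds_from_env and send_with_retry, the 10 ms fallback, and the retry_on_timeout flag are hypothetical and only illustrate the choice.

```python
import os
import socket
import time


def timeout_seconds_from_env(default_ms: int = 10) -> float:
    """Read ACF_TIMEOUT_MS; fall back to default_ms if unset, invalid, or <= 0."""
    raw = os.environ.get("ACF_TIMEOUT_MS", "")
    try:
        ms = int(raw)
    except ValueError:
        ms = default_ms
    if ms <= 0:
        ms = default_ms
    return ms / 1000.0


def send_with_retry(attempt, retries=2, retry_on_timeout=False, backoff_s=0.0):
    """Call attempt() up to retries + 1 times, retrying only selected errors."""
    # The two exceptions the issue says send() already retries.
    retryable = (ConnectionRefusedError, FileNotFoundError)
    if retry_on_timeout:
        retryable += (socket.timeout,)
    for i in range(retries + 1):
        try:
            return attempt()
        except retryable:
            if i == retries:
                raise  # out of attempts: surface the last error
            time.sleep(backoff_s)
```

With retry_on_timeout=False a timeout propagates on the first attempt. With it enabled, the worst-case latency grows to roughly (retries + 1) times the timeout, which is worth weighing against the latency budget.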
Once the direction is decided, I can put together a PR implementing it.