Fix async ACP restart loopback secret lookup deadlock #3737
Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
✅ Review complete. This review was performed through OpenHands Cloud Automation.
all-hands-bot
left a comment
✅ QA Report: PASS
Verified the async ACP restart path with a real loopback LookupSecret call: main reproduces the 30s timeout, while this PR resolves the same lookup immediately off the caller event loop.
Does this PR achieve its stated goal?
Yes. The goal is to prevent async ACP restart from blocking the agent-server event loop during loopback secret resolution. I reproduced the old behavior on origin/main: ACPAgent.astep() ran restart init on the caller thread and the real LookupSecret HTTP request timed out after ~30s. On the PR commit, the same astep() restart ran init off the caller thread, the same loopback server returned ok in 0.008s, and execution reached the post-restart path.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the uv-managed workspace dependencies. |
| CI Status | ⏳ At review time: 23 successful, 15 skipped, 7 pending, 0 failing. |
| Functional Verification | ✅ Before/after runtime reproduction confirms the deadlock/timeout is fixed. |
Functional Verification
Test 1: Async ACP restart serving a same-loop LookupSecret request
Step 1 — Reproduce / establish baseline without the fix:
I checked out origin/main (cabe776b) and ran a temporary SDK script that:
- starts an asyncio loopback HTTP server on the caller event loop,
- creates an ACPAgent conversation,
- triggers ACPAgent.astep() with _restart_session_on_next_turn=True, and
- makes restart initialization resolve a real LookupSecret(url=http://127.0.0.1:<port>/api/settings/secrets/LOOPBACK_SECRET).
Ran:
git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_acp_loopback_probe.py
Observed excerpt:
HEAD is now at cabe776b [AgentProfile][sdk] AgentProfileStore + FK lifecycle (#3775)
Restarting ACP session after cancelled prompt drain timeout
init_state thread_is_caller=True
lookupsecret_result=ReadTimeout elapsed=30.037s
astep_result=ReadTimeout
httpx.ReadTimeout: timed out
This confirms the bug: restart initialization ran on the caller event-loop thread, so the same loop could not serve the loopback secret request until LookupSecret hit its 30s timeout.
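The starvation described above can be reproduced in miniature with only the standard library. The sketch below is not the QA probe script (its contents are not shown in this thread); `handle`, `blocking_lookup`, `probe`, the `/secret` path, and the 1-second timeout are all illustrative assumptions. It starts an asyncio HTTP server and then issues a synchronous GET from the same loop thread, so the server cannot answer until the client gives up.

```python
import asyncio
import time
import urllib.request


async def handle(reader, writer):
    # Minimal HTTP responder standing in for a secrets endpoint (hypothetical).
    await reader.readuntil(b"\r\n\r\n")
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()


def blocking_lookup(port, timeout=1.0):
    # Synchronous GET; proxies are disabled so the request stays on loopback.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://127.0.0.1:{port}/secret"
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read().decode()
    except OSError as exc:  # socket timeouts and URLError are both OSError
        return f"error:{type(exc).__name__}"


async def probe():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        start = time.monotonic()
        # Called directly on the loop thread: the loop cannot accept the
        # connection or run handle() while this call is blocking.
        result = blocking_lookup(port)
        elapsed = time.monotonic() - start
    return result, elapsed


if __name__ == "__main__":
    print(asyncio.run(probe()))
```

Under these assumptions the lookup should return an `error:` result only after roughly the full timeout has elapsed, which mirrors the 30s ReadTimeout in the excerpt on a smaller scale.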
Step 2 — Apply the PR's changes:
I checked out the PR branch at 3c4c40d51ed353151f43692efd979700eb0cc5ca.
Step 3 — Re-run with the fix in place:
Ran the same script on the PR branch:
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_acp_loopback_probe.py
Observed excerpt:
Restarting ACP session after cancelled prompt drain timeout
init_state thread_is_caller=False
lookupsecret_result=success body='ok' elapsed=0.008s
astep_result=reached_post_restart
This shows the fix works: the async astep() restart path moved initialization off the caller loop, allowing the loopback LookupSecret HTTP request to be served immediately instead of timing out.
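The `thread_is_caller` flag in both excerpts suggests a simple way to check where initialization runs. The following is a minimal sketch of that idea, not the actual probe: `init_probe` and `run_both` are hypothetical names, and it assumes Python 3.9+ for `asyncio.to_thread()`. It compares the thread identifier seen inside a sync function with the identifier of the thread running the event loop.

```python
import asyncio
import threading


def init_probe(caller_ident):
    # Hypothetical stand-in for restart initialization: reports whether it
    # is executing on the thread that owns the caller's event loop.
    return threading.get_ident() == caller_ident


async def run_both():
    caller = threading.get_ident()
    # Direct call: runs on the loop thread, like the behavior seen on main.
    on_loop = init_probe(caller)
    # Offloaded call: runs on a worker thread from the default executor.
    off_loop = await asyncio.to_thread(init_probe, caller)
    return on_loop, off_loop


if __name__ == "__main__":
    on_loop, off_loop = asyncio.run(run_both())
    print(f"direct call: thread_is_caller={on_loop}")
    print(f"to_thread:   thread_is_caller={off_loop}")
```

A check like this is cheap to assert in a unit test, since it does not depend on timing or on a real HTTP round trip.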
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
Code Review: Fix async ACP restart loopback secret lookup deadlock
🟢 Good taste - Elegant, minimal fix that solves a real concurrency problem.
Analysis
This PR fixes a deadlock that occurs when ACP restart initialization synchronously resolves LookupSecret values via HTTP loopback to the same agent server. By wrapping the sync _restart_session_after_drain_timeout() call in asyncio.to_thread(), the restart work is offloaded to a thread pool, keeping the async loop free to serve HTTP requests.
Key design decisions:
- Async wrapper pattern - The new _arestart_session_after_drain_timeout() async method delegates to the existing sync method via asyncio.to_thread(), avoiding code duplication.
- Minimal surface area - Only the call site in astep() is modified; no changes to the underlying restart logic.
- Follows existing precedent - Consistent with LocalConversation.arun(), which similarly offloads agent initialization to avoid blocking the server loop.
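The wrapper pattern in the list above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `SketchAgent`, `_restart_session`, `_arestart_session`, `handle`, `demo`, and the `/secret` path are hypothetical names, and the real ACPAgent methods are more involved. The sync method performs a blocking loopback GET, and the async wrapper runs it on a worker thread so the event loop can serve that request.

```python
import asyncio
import urllib.request


class SketchAgent:
    """Hypothetical stand-in; not the real ACPAgent."""

    def __init__(self, secret_url):
        self.secret_url = secret_url
        self.secret = None

    def _restart_session(self):
        # Existing sync path: a blocking HTTP GET (proxies disabled so the
        # request stays on loopback).
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(self.secret_url, timeout=5.0) as resp:
            self.secret = resp.read().decode()

    async def _arestart_session(self):
        # Async wrapper: same logic, executed on a worker thread so the
        # event loop stays free to answer the loopback request.
        await asyncio.to_thread(self._restart_session)

    async def astep(self):
        await self._arestart_session()
        return self.secret


async def handle(reader, writer):
    # Minimal HTTP responder standing in for a secrets endpoint.
    await reader.readuntil(b"\r\n\r\n")
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        agent = SketchAgent(f"http://127.0.0.1:{port}/secret")
        return await agent.astep()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the wrapper only awaits the existing sync method, an exception raised inside that method surfaces at the `await`, so the caller still sees initialization failures.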
No Issues Found
- No breaking changes to public APIs
- No security concerns
- No performance regressions (thread pool overhead is negligible for this path)
- Test follows project conventions and properly exercises the fix
Evidence Check
The PR description references manual testing confirming that the deadlock reproduces on main and is resolved by the fix. This is appropriate evidence for a concurrency fix where the symptom is a hang (hard to capture in automated tests alone).
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
The change is minimal, well-isolated, and follows an established pattern in the codebase. The thread pool delegation is a standard async Python pattern with no new dependencies or complex state changes.
VERDICT:
✅ Worth merging: Clean, well-tested fix that solves a real concurrency deadlock without introducing complexity.
KEY INSIGHT:
The deadlock occurs because the async loop can't serve HTTP requests to resolve secrets while blocked waiting for those same secrets - a classic self-deadlock pattern that asyncio.to_thread() elegantly breaks.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.
HUMAN:
I tested this manually and can confirm that there is a long pause on main after a follow-up message, and a minimal pause on the PR branch.
[Attachment: main]
[Attachment: pr]
AGENT:
Live failure diagnosis on June 15, 2026: after an interrupted ACP turn, the next turn logged "Restarting ACP session after cancelled prompt drain timeout", then two 30s LookupSecret read timeouts for loopback /api/settings/secrets/... URLs before ACP initialization resumed. A direct local secret endpoint request completed in milliseconds, so the issue was event-loop starvation during restart, not slow key storage.
Why
ACP restart after a cancelled prompt drain timeout calls init_state(). ACP startup resolves registry secrets into subprocess environment variables, and LookupSecret.get_value() performs a synchronous HTTP GET back to the same agent-server. In async ACPAgent.astep(), the restart ran on the caller/server loop, so the loopback request could not be served until the synchronous secret lookup timed out.
Summary
- [...] asyncio.to_thread().
- [...] ACPAgent.astep() to use the async wrapper; sync step() behavior is unchanged.
Issue Number
N/A
How to Test
- uv run pytest tests/sdk/agent/test_acp_agent.py -k "astep_restarts_session_off_caller_loop or cancel_drain_restart"
  Result: 3 passed, 352 deselected.
- uv run pytest tests/sdk/agent/test_acp_agent.py
  Result: 355 passed, 9 warnings.
- uv run ruff check openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
  Result: All checks passed!
- uv run ruff format --check openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
  Result: 2 files already formatted.
- uv run pre-commit run --files openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
- uv run python script that starts an async loopback HTTP server, registers a real LookupSecret against that server, and calls ACPAgent._arestart_session_after_drain_timeout() while init_state() resolves the secret.
  Output included: loopback ACP restart resolved LOOPBACK_SECRET=ok without timeout.
Video/Screenshots
Backend/runtime deadlock fix; no UI screenshot. Runtime reproduction output is included above.
Type
Notes
This keeps the existing restart implementation and error propagation. If restart initialization fails, astep() still raises so the conversation transitions to error rather than continuing with an ambiguous ACP session.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3c4c40d-python
Run
All tags pushed for this build
About Multi-Architecture Support
- [...] 3c4c40d-python) is a multi-arch manifest supporting both amd64 and arm64
- [...] 3c4c40d-python-amd64) are also available if needed
Issue
Fixes #3762.