Skip to content

Fix async ACP restart loopback secret lookup deadlock#3737

Merged
neubig merged 2 commits into
mainfrom
codex/acp-restart-secret-lookup-off-loop
Jun 17, 2026
Merged

Fix async ACP restart loopback secret lookup deadlock#3737
neubig merged 2 commits into
mainfrom
codex/acp-restart-secret-lookup-off-loop

Conversation

@neubig

@neubig neubig commented Jun 15, 2026

Copy link
Copy Markdown
Member

HUMAN:

I tested this manually and can confirm that on main there is a long pause on main after a follow-up message, and a minimal pause on the PR branch.

main
Screenshot 2026-06-17 at 2 33 43 PM

pr
Screenshot 2026-06-17 at 2 39 03 PM

  • A human has tested these changes.

AGENT:

Live failure diagnosis on June 15, 2026: after an interrupted ACP turn, the next turn logged Restarting ACP session after cancelled prompt drain timeout, then two 30s LookupSecret read timeouts for loopback /api/settings/secrets/... URLs before ACP initialization resumed. A direct local secret endpoint request completed in milliseconds, so the issue was event-loop starvation during restart, not slow key storage.


Why

ACP restart after a cancelled prompt drain timeout calls init_state(). ACP startup resolves registry secrets into subprocess environment variables, and LookupSecret.get_value() performs a synchronous HTTP GET back to the same agent-server. In async ACPAgent.astep(), the restart ran on the caller/server loop, so the loopback request could not be served until the synchronous secret lookup timed out.

Summary

  • Added an async ACP restart wrapper that runs the existing sync restart path via asyncio.to_thread().
  • Switched only ACPAgent.astep() to use the async wrapper; sync step() behavior is unchanged.
  • Added regression coverage that verifies async restart initialization runs off the caller loop.

Issue Number

N/A

How to Test

  • uv run pytest tests/sdk/agent/test_acp_agent.py -k "astep_restarts_session_off_caller_loop or cancel_drain_restart"
    • Result: 3 passed, 352 deselected.
  • uv run pytest tests/sdk/agent/test_acp_agent.py
    • Result: 355 passed, 9 warnings.
  • uv run ruff check openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
    • Result: All checks passed!.
  • uv run ruff format --check openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
    • Result: 2 files already formatted.
  • uv run pre-commit run --files openhands-sdk/openhands/sdk/agent/acp_agent.py tests/sdk/agent/test_acp_agent.py
    • Result: Ruff format/lint, pycodestyle, pyright, dependency rules, and tool registration checks passed.
  • Runtime reproduction check: ran a temporary uv run python script that starts an async loopback HTTP server, registers a real LookupSecret against that server, and calls ACPAgent._arestart_session_after_drain_timeout() while init_state() resolves the secret. Output included: loopback ACP restart resolved LOOPBACK_SECRET=ok without timeout.

Video/Screenshots

Backend/runtime deadlock fix; no UI screenshot. Runtime reproduction output is included above.

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Notes

This keeps the existing restart implementation and error propagation. If restart initialization fails, astep() still raises so the conversation transitions to error rather than continuing with an ambiguous ACP session.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:3c4c40d-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-3c4c40d-python \
  ghcr.io/openhands/agent-server:3c4c40d-python

All tags pushed for this build

ghcr.io/openhands/agent-server:3c4c40d-golang-amd64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-golang-amd64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-golang-amd64
ghcr.io/openhands/agent-server:3c4c40d-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:3c4c40d-golang-arm64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-golang-arm64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-golang-arm64
ghcr.io/openhands/agent-server:3c4c40d-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:3c4c40d-java-amd64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-java-amd64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-java-amd64
ghcr.io/openhands/agent-server:3c4c40d-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:3c4c40d-java-arm64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-java-arm64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-java-arm64
ghcr.io/openhands/agent-server:3c4c40d-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:3c4c40d-python-amd64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-python-amd64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-python-amd64
ghcr.io/openhands/agent-server:3c4c40d-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:3c4c40d-python-arm64
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-python-arm64
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-python-arm64
ghcr.io/openhands/agent-server:3c4c40d-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:3c4c40d-golang
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-golang
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-golang
ghcr.io/openhands/agent-server:3c4c40d-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:3c4c40d-java
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-java
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-java
ghcr.io/openhands/agent-server:3c4c40d-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:3c4c40d-python
ghcr.io/openhands/agent-server:3c4c40d51ed353151f43692efd979700eb0cc5ca-python
ghcr.io/openhands/agent-server:codex-acp-restart-secret-lookup-off-loop-python
ghcr.io/openhands/agent-server:3c4c40d-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 3c4c40d-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 3c4c40d-python-amd64) are also available if needed

Issue

Fixes #3762.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/agent
   acp_agent.py116910191%429, 794–796, 1033–1034, 1077, 1079, 1083, 1087, 1113, 1176–1177, 1182, 1249, 1577, 1580–1581, 1598–1599, 1635, 1640, 1748, 1753, 2087–2088, 2373–2376, 2380–2382, 2385–2389, 2391, 2616, 2630–2631, 2634–2636, 2644, 2648, 2652–2653, 2659–2660, 2672–2673, 2676, 2724, 2728–2730, 2734–2735, 2767, 2851, 3038–3040, 3043–3044, 3084, 3230, 3238–3240, 3278–3279, 3282, 3290–3292, 3294, 3296, 3300, 3303, 3312–3314, 3316, 3352–3353, 3371–3374, 3377, 3381–3383, 3385, 3389–3390, 3605–3606
TOTAL31499857372% 

@neubig neubig marked this pull request as ready for review June 17, 2026 18:46
@neubig neubig added the review-this This label triggers a PR review by OpenHands label Jun 17, 2026

all-hands-bot commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the async ACP restart path with a real loopback LookupSecret call: main reproduces the 30s timeout, while this PR resolves the same lookup immediately off the caller event loop.

Does this PR achieve its stated goal?

Yes. The goal is to prevent async ACP restart from blocking the agent-server event loop during loopback secret resolution. I reproduced the old behavior on origin/main: ACPAgent.astep() ran restart init on the caller thread and the real LookupSecret HTTP request timed out after ~30s. On the PR commit, the same astep() restart ran init off the caller thread, the same loopback server returned ok in 0.008s, and execution reached the post-restart path.

Phase Result
Environment Setup make build completed and installed the uv-managed workspace dependencies.
CI Status ⏳ At review time: 23 successful, 15 skipped, 7 pending, 0 failing.
Functional Verification ✅ Before/after runtime reproduction confirms the deadlock/timeout is fixed.
Functional Verification

Test 1: Async ACP restart serving a same-loop LookupSecret request

Step 1 — Reproduce / establish baseline without the fix:

I checked out origin/main (cabe776b) and ran a temporary SDK script that:

  • starts an asyncio loopback HTTP server on the caller event loop,
  • creates an ACPAgent conversation,
  • triggers ACPAgent.astep() with _restart_session_on_next_turn=True, and
  • makes restart initialization resolve a real LookupSecret(url=http://127.0.0.1:<port>/api/settings/secrets/LOOPBACK_SECRET).

Ran:

git checkout --detach origin/main &&   OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_acp_loopback_probe.py

Observed excerpt:

HEAD is now at cabe776b [AgentProfile][sdk] AgentProfileStore + FK lifecycle (#3775)
Restarting ACP session after cancelled prompt drain timeout
init_state thread_is_caller=True
lookupsecret_result=ReadTimeout elapsed=30.037s
astep_result=ReadTimeout
httpx.ReadTimeout: timed out

This confirms the bug: restart initialization ran on the caller event-loop thread, so the same loop could not serve the loopback secret request until LookupSecret hit its 30s timeout.

Step 2 — Apply the PR's changes:

I checked out the PR branch at 3c4c40d51ed353151f43692efd979700eb0cc5ca.

Step 3 — Re-run with the fix in place:

Ran the same script on the PR branch:

OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_acp_loopback_probe.py

Observed excerpt:

Restarting ACP session after cancelled prompt drain timeout
init_state thread_is_caller=False
lookupsecret_result=success body='ok' elapsed=0.008s
astep_result=reached_post_restart

This shows the fix works: the async astep() restart path moved initialization off the caller loop, allowing the loopback LookupSecret HTTP request to be served immediately instead of timing out.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Fix async ACP restart loopback secret lookup deadlock

🟢 Good taste - Elegant, minimal fix that solves a real concurrency problem.

Analysis

This PR fixes a deadlock that occurs when ACP restart initialization synchronously resolves LookupSecret values via HTTP loopback to the same agent server. By wrapping the sync _restart_session_after_drain_timeout() call in asyncio.to_thread(), the restart work is offloaded to a thread pool, keeping the async loop free to serve HTTP requests.

Key design decisions:

  1. Async wrapper pattern - The new _arestart_session_after_drain_timeout() async method delegates to the existing sync method via asyncio.to_thread(), avoiding code duplication.
  2. Minimal surface area - Only the call site in astep() is modified; no changes to the underlying restart logic.
  3. Follows existing precedent - Consistent with LocalConversation.arun() which similarly offloads agent initialization to avoid blocking the server loop.

No Issues Found

  • No breaking changes to public APIs
  • No security concerns
  • No performance regressions (thread pool overhead is negligible for this path)
  • Test follows project conventions and properly exercises the fix

Evidence Check

The PR description references manual testing that confirms the fix resolves the deadlock on main. This is appropriate evidence for a concurrency fix where the symptom is a hang (hard to capture in automated tests alone).

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    The change is minimal, well-isolated, and follows an established pattern in the codebase. The thread pool delegation is a standard async Python pattern with no new dependencies or complex state changes.

VERDICT:
Worth merging: Clean, well-tested fix that solves a real concurrency deadlock without introducing complexity.

KEY INSIGHT:
The deadlock occurs because the async loop can't serve HTTP requests to resolve secrets while blocked waiting for those same secrets - a classic self-deadlock pattern that asyncio.to_thread() elegantly breaks.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@neubig neubig merged commit ab886d3 into main Jun 17, 2026
54 checks passed
@neubig neubig deleted the codex/acp-restart-secret-lookup-off-loop branch June 17, 2026 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this This label triggers a PR review by OpenHands

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track PR #3737: Fix async ACP restart loopback secret lookup deadlock

2 participants