fix(agent-server): propagate subscriber init errors#3200
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
|||||||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste - Clean error propagation fix
Clean bug fix that properly surfaces subscriber initialization errors. The changes are minimal and well-targeted:
✅ Removes error-swallowing try/except (fail fast = good taste)
✅ Properly awaits webhook subscriber registration
✅ Moves subscription after start() so _conversation is set (clever fix)
✅ Existing xfail test now passes
✅ Behavior change is explicitly documented
Breaking Change Note: The PR correctly flags that webhooks failing during initialization will now cause start_conversation to raise instead of silently logging. This is the correct behavior - loud failures are better than silent ones. Operators with misconfigured webhooks will see this, but that's intentional.
Suggestion: Consider adding release notes documenting this behavior change for operators who may have misconfigured webhooks that were previously failing silently.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
Bug fix in error handling infrastructure. Doesn't affect core agent decision logic or benchmark behavior. The breaking change is intentional and improves correctness by surfacing configuration errors that were previously hidden. Proper cleanup via event_service.close() ensures no half-started conversations are persisted.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
This PR successfully fixes subscriber initialization error handling.
Does this PR achieve its stated goal?
Yes. The PR delivers exactly what it promises: subscriber initialization errors now propagate to the caller instead of being silently swallowed. I verified this end-to-end by actually running the agent-server conversation service with a failing webhook subscriber. Before the fix, errors were logged but ignored; after the fix, errors properly propagate and prevent half-initialized conversations.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Built successfully with make build |
| CI Status | ✅ All core checks passing (agent-server-tests, sdk-tests, workspace-tests, pre-commit) |
| Functional Verification | ✅ Verified fix with real agent-server execution |
Functional Verification
Test Scenario
Created a functional test that simulates a real-world failure: a webhook subscriber that fails during initial registration (e.g., webhook endpoint unreachable).
Step 1 — Reproduce the bug (main branch):
Checked out main branch and ran test with ConversationService configured with a webhook spec. Monkeypatched WebhookSubscriber.call to raise RuntimeError on first invocation.
Result:
✅ Conversation started successfully
Conversation ID: 8fa91434-1110-466b-949f-6604e5f12f9f
----------------------------------------------------------------------
❌ RESULT: ERROR WAS SWALLOWED
----------------------------------------------------------------------
The webhook subscriber raised an error, but the conversation
started anyway. This means errors are being silently logged
rather than propagating to the caller.
Interpretation: This confirms the bug described in issue #3121. The WebhookSubscriber.call raised a RuntimeError during initialization (visible in logs at the end), but the conversation service swallowed it and proceeded with conversation creation. Users would have no indication that their webhook notifications are broken.
Step 2 — Apply the PR's changes:
Checked out PR branch 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites.
Step 3 — Re-run with the fix in place:
Ran the identical test scenario:
Result:
📍 WebhookSubscriber.__call__ invoked during registration
💥 Raising RuntimeError to simulate webhook registration failure
💥 RuntimeError caught: Webhook endpoint unavailable: connection refused
----------------------------------------------------------------------
✅ RESULT: ERROR PROPAGATED CORRECTLY
----------------------------------------------------------------------
The webhook subscriber error was raised to the caller,
preventing a half-initialized conversation from being created.
Interpretation: The fix works perfectly. The error now propagates immediately when start_conversation is called, preventing the creation of a conversation with a broken webhook subscriber. This is the correct fail-fast behavior that allows operators to detect and fix configuration issues immediately rather than silently losing events.
Code Changes Verified
- event_service.py: Removed try/except wrapper around initial subscriber call ✅
- conversation_service.py:
- Moved subscriber registration to after event_service.start() ✅
- Changed fire-and-forget asyncio.gather() to awaited gather() ✅
- Test update: Removed xfail marker from test_webhook_subscribe_errors_surface ✅
Issues Found
None.
Behavior change note: As the PR description correctly flags, operators relying on "start always succeeds even if a webhook is broken" will now see failures. This is the intended loud-fail behavior and is a significant improvement over silent data loss.
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste - Clean error propagation fix
Clean bug fix that properly surfaces subscriber initialization errors. The changes are minimal and well-targeted:
✅ Removes error-swallowing try/except (fail fast = good taste)
✅ Properly awaits webhook subscriber registration
✅ Moves subscription after start() so _conversation is set (clever fix)
✅ Existing xfail test now passes
✅ Behavior change is explicitly documented
✅ Cleanup logic prevents partial state persistence
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
Minimal code changes fixing a real bug. The behavior change (startup now fails on broken webhooks instead of silent logging) is intentional and improves observability. Proper cleanup ensures no orphaned state.
VERDICT:
✅ Worth merging - Fixes subscriber init error handling as requested in #3121
KEY INSIGHT:
The fix addresses the root cause by ensuring subscriber registration happens when _conversation is set (after start), removing error swallowing, and awaiting async operations - three simple changes that eliminate silent failures.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Subscriber initialization errors are now properly raised instead of silently swallowed. The fix correctly addresses issue #3121.
Does this PR achieve its stated goal?
Yes. The PR successfully fixes the bug where subscriber initialization errors were being silently swallowed. Previously, errors during webhook subscriber initialization were caught by two "swallow sites" (try/except in event_service.py and fire-and-forget asyncio.gather in conversation_service.py). The PR removes both swallow sites, ensuring errors propagate to the caller as expected. I verified this by running the actual test that was designed to catch this bug: on main it's marked xfail (bug exists), on this PR branch it passes (bug fixed).
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies synced successfully |
| CI Status | ✅ All critical checks passing (build, pre-commit, sdk-tests, etc.) |
| Functional Verification | ✅ Error propagation verified, no regressions |
Functional Verification
Test 1: Reproduce the bug (baseline - main branch)
Step 1 — Establish baseline on main branch:
Checked out main branch at commit d690628:
git checkout origin/mainExamined the test file and confirmed it has the xfail marker:
@pytest.mark.xfail(
strict=True,
reason=(
"Subscribe-time errors from a subscriber's initial __call__ never "
"reach the start_conversation caller. Two swallow sites in series: "
"event_service.py:411-416 wraps the initial state sync in "
"try/except + logger.error (no re-raise), and "
"conversation_service.py:862 fires asyncio.gather() on the webhook "
"subscribes without awaiting. Both must be fixed for init errors "
"to surface. "
"Tracked in https://github.com/OpenHands/software-agent-sdk/issues/3121."
),
)
@pytest.mark.timeout(30)
async def test_webhook_subscribe_errors_surface(tmp_path, monkeypatch):Step 2 — Run the test on main:
uv run pytest tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface -vOutput:
tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface XFAIL
...
1 xfailed, 5 warnings in 0.71s
Interpretation: The test is marked as expected to fail (XFAIL) on main, confirming the bug exists. The test expects a RuntimeError to be raised when a webhook subscriber fails during initialization, but the error is being silently swallowed, so the RuntimeError is never raised and the test fails as expected.
Test 2: Verify the fix (PR branch)
Step 3 — Check out PR branch:
git checkout 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sitesVerified the xfail marker has been removed from the test (only @pytest.mark.timeout(30) remains).
Step 4 — Run the same test on PR branch:
uv run pytest tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface -vOutput:
tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface PASSED [100%]
...
1 passed, 5 warnings in 0.12s
Interpretation: The test now passes! This confirms that subscriber initialization errors are now properly raised to the caller. The test creates a webhook subscriber that raises RuntimeError("webhook subscriber init failed") on the first call, and with the PR changes, this error correctly propagates out of start_conversation instead of being silently logged.
Test 3: Verify no regressions
Step 5 — Run broader test suite:
uv run pytest tests/agent_server/test_webhook_subscriber.py \
tests/agent_server/test_event_service.py \
tests/agent_server/test_conversation_service.py \
tests/agent_server/test_conversation_router.py \
tests/agent_server/test_conversation_tags.py \
tests/agent_server/test_conversation_service_plugin.py -vOutput:
250 passed, 1 xfailed, 19 warnings in 20.99s
Interpretation: All 250 related tests pass with no new failures. The 1 xfailed test is unrelated to this PR. This confirms that:
- Normal webhook subscriber operation is unaffected
- Event service functionality is unaffected
- Conversation service startup and lifecycle work correctly
- No regressions introduced
Also ran conversation_service tests specifically:
uv run pytest tests/agent_server/test_conversation_service.py -vResult: 68 passed, confirming normal conversation operations work correctly.
Issues Found
None. The fix works as intended and introduces no regressions.
|
@OpenHands fix conflicts |
|
I'm on it! VascoSch92 can track my progress at all-hands.dev |
8d69df9 to
2fec5b0
Compare
|
OpenHands encountered an error: Request timeout after 30 seconds to https://pexkrixaggtzqhfk.prod-runtime.all-hands.dev/api/conversations/88c2bc74-a51c-459b-b617-05854d0562e2/ask_agent See the conversation for more information. |
Summary
later via pub_sub, which swallows by design for inter-subscriber isolation. Registering after start() exercises the propagating path uniformly.
Behavior change to flag
A subscriber failing during initial state push now causes start_conversation to raise instead of logging silently. event_service.close() runs as part of cleanup, so save_meta() no longer persists half-started conversations. This is the loud-fail behavior the issue calls for, but operators relying on "start always succeeds even if a webhook is broken" will see a difference.
Issue Number
Closes #3121
How to Test
tests/agent_server/test_conversation_tags.py tests/agent_server/test_conversation_service_plugin.py — 249 passed, 1 unrelated pre-existing failure (TestAutoTitle::test_autotitle_sets_title_on_first_user_message, reproduces on
baseline)
Type
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:2fec5b0-pythonRun
All tags pushed for this build
About Multi-Architecture Support
2fec5b0-python) is a multi-arch manifest supporting both amd64 and arm642fec5b0-python-amd64) are also available if needed