Skip to content

fix(agent-server): propagate subscriber init errors#3200

Merged
VascoSch92 merged 1 commit into
mainfrom
3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites
May 11, 2026
Merged

fix(agent-server): propagate subscriber init errors#3200
VascoSch92 merged 1 commit into
mainfrom
3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites

Conversation

@VascoSch92

@VascoSch92 VascoSch92 commented May 11, 2026

Copy link
Copy Markdown
Member
  • A human has tested these changes.

Summary

  • Removes the try/except + logger.error wrap around the initial-state push in EventService.subscribe_to_events (event_service.py:411-416) so failures from a new subscriber's first call propagate to the caller.
  • awaits the previously fire-and-forget asyncio.gather(...) of WebhookSubscriber registrations in ConversationService._start_event_service.
  • Moves all subscriber registration to after event_service.start(). Required because at the prior call site _conversation was still None, so subscribe_to_events's initial-state branch was skipped — the broken call only fired
    later via pub_sub, which swallows by design for inter-subscriber isolation. Registering after start() exercises the propagating path uniformly.
  • Drops the xfail marker on test_webhook_subscribe_errors_surface (was tracking this issue).

Behavior change to flag
A subscriber failing during initial state push now causes start_conversation to raise instead of logging silently. event_service.close() runs as part of cleanup, so save_meta() no longer persists half-started conversations. This is the loud-fail behavior the issue calls for, but operators relying on "start always succeeds even if a webhook is broken" will see a difference.

Issue Number

Closes #3121

How to Test

  • uv run pytest tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface — passes (previously xfail)
  • uv run pytest tests/agent_server/test_webhook_subscriber.py tests/agent_server/test_event_service.py tests/agent_server/test_conversation_service.py tests/agent_server/test_conversation_router.py
    tests/agent_server/test_conversation_tags.py tests/agent_server/test_conversation_service_plugin.py — 249 passed, 1 unrelated pre-existing failure (TestAutoTitle::test_autotitle_sets_title_on_first_user_message, reproduces on
    baseline)
  • uv run pre-commit run --files — ruff format, ruff lint, pycodestyle, pyright, import-rules, tool-registration all pass

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2fec5b0-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2fec5b0-python \
  ghcr.io/openhands/agent-server:2fec5b0-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2fec5b0-golang-amd64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-golang-amd64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-golang-amd64
ghcr.io/openhands/agent-server:2fec5b0-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2fec5b0-golang-arm64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-golang-arm64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-golang-arm64
ghcr.io/openhands/agent-server:2fec5b0-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2fec5b0-java-amd64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-java-amd64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-java-amd64
ghcr.io/openhands/agent-server:2fec5b0-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2fec5b0-java-arm64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-java-arm64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-java-arm64
ghcr.io/openhands/agent-server:2fec5b0-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2fec5b0-python-amd64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-python-amd64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-python-amd64
ghcr.io/openhands/agent-server:2fec5b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:2fec5b0-python-arm64
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-python-arm64
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-python-arm64
ghcr.io/openhands/agent-server:2fec5b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:2fec5b0-golang
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-golang
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-golang
ghcr.io/openhands/agent-server:2fec5b0-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:2fec5b0-java
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-java
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-java
ghcr.io/openhands/agent-server:2fec5b0-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:2fec5b0-python
ghcr.io/openhands/agent-server:2fec5b0c689afc20855a2daf2481ecf4b11b56f1-python
ghcr.io/openhands/agent-server:3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites-python
ghcr.io/openhands/agent-server:2fec5b0-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 2fec5b0-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2fec5b0-python-amd64) are also available if needed

@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-agent-server/openhands/agent_server
   conversation_service.py56011878%99–100, 127–128, 132–133, 138, 280–281, 309, 312, 314, 321–327, 355, 362, 383, 482, 488, 493, 499, 507–508, 517–520, 529, 543–545, 552, 587–588, 626–627, 631, 660–664, 666–667, 670–671, 674–679, 776, 783–787, 790–791, 795–799, 802–803, 807–811, 814–815, 837–838, 842–843, 845–847, 849, 852, 860–864, 867, 874–879, 881–882, 892, 902, 906, 908–909, 914–915, 921–922, 930, 945–946, 964, 988, 1278, 1281
   event_service.py4389578%78–79, 104, 108, 111–112, 116–117, 124, 134–138, 141–144, 164, 268, 285, 326, 336, 360–361, 365, 373, 376, 441, 443, 447–449, 453, 462–463, 465, 469, 475, 477, 507–512, 538, 541, 592, 596, 742, 744–745, 749, 763–765, 767, 771–774, 778–781, 789–792, 840–841, 843–850, 852–853, 862–863, 865–866, 873–874, 876–877, 897, 903, 909, 918–919
TOTAL266191159956% 

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟢 Good taste - Clean error propagation fix

Clean bug fix that properly surfaces subscriber initialization errors. The changes are minimal and well-targeted:

✅ Removes error-swallowing try/except (fail fast = good taste)
✅ Properly awaits webhook subscriber registration
✅ Moves subscription after start() so _conversation is set (clever fix)
✅ Existing xfail test now passes
✅ Behavior change is explicitly documented

Breaking Change Note: The PR correctly flags that webhooks failing during initialization will now cause start_conversation to raise instead of silently logging. This is the correct behavior - loud failures are better than silent ones. Operators with misconfigured webhooks will see this, but that's intentional.

Suggestion: Consider adding release notes documenting this behavior change for operators who may have misconfigured webhooks that were previously failing silently.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Bug fix in error handling infrastructure. Doesn't affect core agent decision logic or benchmark behavior. The breaking change is intentional and improves correctness by surfacing configuration errors that were previously hidden. Proper cleanup via event_service.close() ensures no half-started conversations are persisted.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

This PR successfully fixes subscriber initialization error handling.

Does this PR achieve its stated goal?

Yes. The PR delivers exactly what it promises: subscriber initialization errors now propagate to the caller instead of being silently swallowed. I verified this end-to-end by actually running the agent-server conversation service with a failing webhook subscriber. Before the fix, errors were logged but ignored; after the fix, errors properly propagate and prevent half-initialized conversations.

Phase Result
Environment Setup ✅ Built successfully with make build
CI Status ✅ All core checks passing (agent-server-tests, sdk-tests, workspace-tests, pre-commit)
Functional Verification ✅ Verified fix with real agent-server execution
Functional Verification

Test Scenario

Created a functional test that simulates a real-world failure: a webhook subscriber that fails during initial registration (e.g., webhook endpoint unreachable).

Step 1 — Reproduce the bug (main branch):

Checked out main branch and ran test with ConversationService configured with a webhook spec. Monkeypatched WebhookSubscriber.call to raise RuntimeError on first invocation.

Result:

✅ Conversation started successfully
   Conversation ID: 8fa91434-1110-466b-949f-6604e5f12f9f

----------------------------------------------------------------------
❌ RESULT: ERROR WAS SWALLOWED
----------------------------------------------------------------------
The webhook subscriber raised an error, but the conversation
started anyway. This means errors are being silently logged
rather than propagating to the caller.

Interpretation: This confirms the bug described in issue #3121. The WebhookSubscriber.call raised a RuntimeError during initialization (visible in logs at the end), but the conversation service swallowed it and proceeded with conversation creation. Users would have no indication that their webhook notifications are broken.

Step 2 — Apply the PR's changes:

Checked out PR branch 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites.

Step 3 — Re-run with the fix in place:

Ran the identical test scenario:

Result:

📍 WebhookSubscriber.__call__ invoked during registration
💥 Raising RuntimeError to simulate webhook registration failure
💥 RuntimeError caught: Webhook endpoint unavailable: connection refused

----------------------------------------------------------------------
✅ RESULT: ERROR PROPAGATED CORRECTLY
----------------------------------------------------------------------
The webhook subscriber error was raised to the caller,
preventing a half-initialized conversation from being created.

Interpretation: The fix works perfectly. The error now propagates immediately when start_conversation is called, preventing the creation of a conversation with a broken webhook subscriber. This is the correct fail-fast behavior that allows operators to detect and fix configuration issues immediately rather than silently losing events.

Code Changes Verified

  1. event_service.py: Removed try/except wrapper around initial subscriber call
  2. conversation_service.py:
    • Moved subscriber registration to after event_service.start() ✅
    • Changed fire-and-forget asyncio.gather() to awaited gather() ✅
  3. Test update: Removed xfail marker from test_webhook_subscribe_errors_surface ✅

Issues Found

None.


Behavior change note: As the PR description correctly flags, operators relying on "start always succeeds even if a webhook is broken" will now see failures. This is the intended loud-fail behavior and is a significant improvement over silent data loss.

@VascoSch92 VascoSch92 marked this pull request as ready for review May 11, 2026 09:30

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟢 Good taste - Clean error propagation fix

Clean bug fix that properly surfaces subscriber initialization errors. The changes are minimal and well-targeted:

✅ Removes error-swallowing try/except (fail fast = good taste)
✅ Properly awaits webhook subscriber registration
✅ Moves subscription after start() so _conversation is set (clever fix)
✅ Existing xfail test now passes
✅ Behavior change is explicitly documented
✅ Cleanup logic prevents partial state persistence

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Minimal code changes fixing a real bug. The behavior change (startup now fails on broken webhooks instead of silent logging) is intentional and improves observability. Proper cleanup ensures no orphaned state.

VERDICT:
Worth merging - Fixes subscriber init error handling as requested in #3121

KEY INSIGHT:
The fix addresses the root cause by ensuring subscriber registration happens when _conversation is set (after start), removing error swallowing, and awaiting async operations - three simple changes that eliminate silent failures.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Subscriber initialization errors are now properly raised instead of silently swallowed. The fix correctly addresses issue #3121.

Does this PR achieve its stated goal?

Yes. The PR successfully fixes the bug where subscriber initialization errors were being silently swallowed. Previously, errors during webhook subscriber initialization were caught by two "swallow sites" (try/except in event_service.py and fire-and-forget asyncio.gather in conversation_service.py). The PR removes both swallow sites, ensuring errors propagate to the caller as expected. I verified this by running the actual test that was designed to catch this bug: on main it's marked xfail (bug exists), on this PR branch it passes (bug fixed).

Phase Result
Environment Setup ✅ Dependencies synced successfully
CI Status ✅ All critical checks passing (build, pre-commit, sdk-tests, etc.)
Functional Verification ✅ Error propagation verified, no regressions
Functional Verification

Test 1: Reproduce the bug (baseline - main branch)

Step 1 — Establish baseline on main branch:

Checked out main branch at commit d690628:

git checkout origin/main

Examined the test file and confirmed it has the xfail marker:

@pytest.mark.xfail(
    strict=True,
    reason=(
        "Subscribe-time errors from a subscriber's initial __call__ never "
        "reach the start_conversation caller. Two swallow sites in series: "
        "event_service.py:411-416 wraps the initial state sync in "
        "try/except + logger.error (no re-raise), and "
        "conversation_service.py:862 fires asyncio.gather() on the webhook "
        "subscribes without awaiting. Both must be fixed for init errors "
        "to surface. "
        "Tracked in https://github.com/OpenHands/software-agent-sdk/issues/3121."
    ),
)
@pytest.mark.timeout(30)
async def test_webhook_subscribe_errors_surface(tmp_path, monkeypatch):

Step 2 — Run the test on main:

uv run pytest tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface -v

Output:

tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface XFAIL
...
1 xfailed, 5 warnings in 0.71s

Interpretation: The test is marked as expected to fail (XFAIL) on main, confirming the bug exists. The test expects a RuntimeError to be raised when a webhook subscriber fails during initialization, but the error is being silently swallowed, so the RuntimeError is never raised and the test fails as expected.


Test 2: Verify the fix (PR branch)

Step 3 — Check out PR branch:

git checkout 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites

Verified the xfail marker has been removed from the test (only @pytest.mark.timeout(30) remains).

Step 4 — Run the same test on PR branch:

uv run pytest tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface -v

Output:

tests/agent_server/test_webhook_subscriber.py::test_webhook_subscribe_errors_surface PASSED [100%]
...
1 passed, 5 warnings in 0.12s

Interpretation: The test now passes! This confirms that subscriber initialization errors are now properly raised to the caller. The test creates a webhook subscriber that raises RuntimeError("webhook subscriber init failed") on the first call, and with the PR changes, this error correctly propagates out of start_conversation instead of being silently logged.


Test 3: Verify no regressions

Step 5 — Run broader test suite:

uv run pytest tests/agent_server/test_webhook_subscriber.py \
    tests/agent_server/test_event_service.py \
    tests/agent_server/test_conversation_service.py \
    tests/agent_server/test_conversation_router.py \
    tests/agent_server/test_conversation_tags.py \
    tests/agent_server/test_conversation_service_plugin.py -v

Output:

250 passed, 1 xfailed, 19 warnings in 20.99s

Interpretation: All 250 related tests pass with no new failures. The 1 xfailed test is unrelated to this PR. This confirms that:

  • Normal webhook subscriber operation is unaffected
  • Event service functionality is unaffected
  • Conversation service startup and lifecycle work correctly
  • No regressions introduced

Also ran conversation_service tests specifically:

uv run pytest tests/agent_server/test_conversation_service.py -v

Result: 68 passed, confirming normal conversation operations work correctly.

Issues Found

None. The fix works as intended and introduces no regressions.

@VascoSch92

Copy link
Copy Markdown
Member Author

@OpenHands fix conflicts

@openhands-ai

openhands-ai Bot commented May 11, 2026

Copy link
Copy Markdown

I'm on it! VascoSch92 can track my progress at all-hands.dev

@VascoSch92 VascoSch92 force-pushed the 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites branch from 8d69df9 to 2fec5b0 Compare May 11, 2026 16:00
@openhands-ai

openhands-ai Bot commented May 11, 2026

Copy link
Copy Markdown

OpenHands encountered an error: Request timeout after 30 seconds to https://pexkrixaggtzqhfk.prod-runtime.all-hands.dev/api/conversations/88c2bc74-a51c-459b-b617-05854d0562e2/ask_agent

See the conversation for more information.

@VascoSch92 VascoSch92 merged commit 7275215 into main May 11, 2026
42 of 45 checks passed
@VascoSch92 VascoSch92 deleted the 3121-agent-server-subscriber-init-errors-are-silently-swallowed-two-sites branch May 11, 2026 16:30
StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

agent-server: subscriber init errors are silently swallowed (two sites)

3 participants