bug(dashboard-api): async hygiene — extension_logs blocks event loop, broad except in _call_agent, fire-and-forget cleanup #455

@yasinBursali

Severity: High
Category: Error Handling / Async Hygiene
Platform: All
Confidence: Confirmed

Description

Three async and exception-handling defects in routers/extensions.py together degrade dashboard reliability and debuggability. The most severe is a blocking urllib.request.urlopen(timeout=30) call made directly on the event loop inside an async def handler: opening the Console modal, which auto-polls every 2s, can freeze the entire dashboard API for up to 30 seconds when the host agent is slow. Two adjacent issues: a broad except Exception in _call_agent discards root causes, and _cleanup_stale_progress is launched fire-and-forget with no exception handling.

Affected File(s)

  • dream-server/extensions/services/dashboard-api/routers/extensions.py:851-881 (extension_logs — blocking urllib in async def)
  • dream-server/extensions/services/dashboard-api/routers/extensions.py:488-492 (_call_agent broad except Exception)
  • dream-server/extensions/services/dashboard-api/routers/extensions.py:648 (_cleanup_stale_progress fire-and-forget)
  • dream-server/extensions/services/dashboard-api/routers/extensions.py:476 (_AGENT_LOG_TIMEOUT = 30)

Root Cause

1. extension_logs blocks the event loop. async def extension_logs(...) calls urllib.request.urlopen(req, timeout=_AGENT_LOG_TIMEOUT) directly (L868). Starlette runs async def handlers on the event-loop thread, so any blocking I/O inside one stalls every other coroutine until it returns. The Console modal auto-polls every 2 seconds (Extensions.jsx L1018), so while the modal is open and the host agent is slow, the event loop can be frozen for up to 30 out of every ~32 seconds.

2. _call_agent broad except Exception (L488-492). The handler catches the expected ConnectionRefusedError, URLError, and TimeoutError, but also AttributeError, NameError, JSONDecodeError, and any other programming bug in the call path. It always logs "Host agent unreachable at %s — fallback to restart_required" without the exception type or message, so a code defect is indistinguishable from an offline agent. This violates the project's CLAUDE.md rule "Narrow exceptions at I/O boundaries".

3. _cleanup_stale_progress fire-and-forget (L648). The Future returned by asyncio.get_running_loop().run_in_executor(None, _cleanup_stale_progress) is not stored, awaited, or attached to a task group. An exception raised in the worker is never observed; it surfaces only as "Future exception was never retrieved" noise on stderr when the Future is garbage-collected.
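
The effect described in item 1 can be reproduced without the dashboard. The following is a minimal sketch, not code from the repository: slow_fetch, count_ticks, and measure are hypothetical names, and time.sleep stands in for the blocking urllib.request.urlopen call. It counts how often a concurrent coroutine gets to run while the "fetch" is in progress, once called directly and once offloaded with asyncio.to_thread (Python 3.9+).

```python
import asyncio
import time


def slow_fetch(delay: float) -> str:
    # Stand-in (assumption) for a blocking urlopen(...).read() call:
    # it blocks whichever thread calls it for `delay` seconds.
    time.sleep(delay)
    return "log lines"


async def count_ticks(stop: asyncio.Event, interval: float) -> int:
    # Simulates other requests: counts how many times this coroutine
    # is allowed to run before `stop` is set.
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def measure(offload: bool, delay: float = 0.3, interval: float = 0.02) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop, interval))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if offload:
        # Runs in a worker thread; the event loop keeps serving the ticker.
        await asyncio.to_thread(slow_fetch, delay)
    else:
        # Runs on the event-loop thread; the ticker cannot run meanwhile.
        slow_fetch(delay)
    stop.set()
    return await ticker
```

Running asyncio.run(measure(offload=False)) and asyncio.run(measure(offload=True)) should show the direct call starving the ticker (at most one tick) while the offloaded call lets it tick repeatedly. In the real handler the stall lasts as long as the agent takes to respond, bounded by _AGENT_LOG_TIMEOUT.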

Evidence

# extensions.py L851-881
@router.get("/{service_id}/logs")
async def extension_logs(service_id: str, ...):
    ...
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=_AGENT_LOG_TIMEOUT) as resp:  # BLOCKS EVENT LOOP
        data = resp.read().decode()
    ...

# extensions.py L488-492
try:
    with urllib.request.urlopen(req, timeout=_AGENT_REQUEST_TIMEOUT) as resp:
        return True, json.loads(resp.read().decode())
except Exception:  # too broad
    logger.warning("Host agent unreachable at %s — fallback to restart_required", url)
    return False, None

# extensions.py L648
asyncio.get_running_loop().run_in_executor(None, _cleanup_stale_progress)
# return value discarded; exceptions lost

Platform Analysis

  • macOS / Linux / Windows-WSL2: Identical behavior — dashboard-api runs inside the same Docker container on all three.

Reproduction

Impact

  • Dashboard API becomes unresponsive to all requests while the Console modal is open if the host agent is slow.
  • Support-debugging is hampered: maintainers can't distinguish "agent offline" from "agent rejected request" from logs.
  • Stale progress cleanup failures accumulate silently.

Suggested Approach

  • Wrap the extension_logs blocking call in asyncio.to_thread(...), matching the pattern used in main.py:api_settings_env_save. Keep the HTTPError/URLError catches around the threaded result, not the top-level handler.
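  • Replace _call_agent's broad except Exception with specific types (urllib.error.URLError, urllib.error.HTTPError, OSError, TimeoutError). Note that HTTPError, URLError, and TimeoutError are all OSError subclasses, so any handler that needs distinct treatment must come before a plain OSError clause. Log the exception type and message. Let unexpected types propagate. Apply the same to _call_agent_compose_rename (L584-588).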
  • Attach a log-on-exception callback to the _cleanup_stale_progress future, or move it to the existing background service-poll loop in main.py where exceptions are already logged.
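
The three suggestions above could look roughly like the sketch below. It is illustrative only and assumes nothing about the real module beyond what is quoted in this issue: fetch_agent_logs, call_agent, schedule_cleanup, and the two underscore-prefixed helpers are hypothetical names, and the real handlers take additional parameters that are elided here.

```python
import asyncio
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)

_AGENT_LOG_TIMEOUT = 30  # value quoted in the issue (extensions.py:476)


def _fetch_text(req: urllib.request.Request, timeout: float) -> str:
    # Blocking helper (hypothetical); intended to run in a worker thread.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


async def fetch_agent_logs(url: str, headers: dict) -> str:
    # Fix 1: offload the blocking fetch so the event loop stays responsive.
    # HTTPError/URLError raised in the thread re-raise here at the await.
    req = urllib.request.Request(url, headers=headers)
    return await asyncio.to_thread(_fetch_text, req, _AGENT_LOG_TIMEOUT)


def call_agent(req: urllib.request.Request, timeout: float):
    # Fix 2: catch only I/O failures. URLError, HTTPError, and TimeoutError
    # all derive from OSError; AttributeError, NameError, and
    # json.JSONDecodeError now propagate instead of being swallowed.
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return True, json.loads(resp.read().decode())
    except OSError as exc:
        logger.warning(
            "Host agent call to %s failed (%s: %s); fallback to restart_required",
            req.full_url, type(exc).__name__, exc,
        )
        return False, None


def _log_future_exception(fut: asyncio.Future) -> None:
    # Retrieving the exception here also prevents the
    # "Future exception was never retrieved" message.
    if fut.cancelled():
        return
    exc = fut.exception()
    if exc is not None:
        logger.error("Stale-progress cleanup failed", exc_info=exc)


def schedule_cleanup(cleanup) -> asyncio.Future:
    # Fix 3: still fire-and-forget, but failures are logged.
    # Must be called from within a running event loop.
    fut = asyncio.get_running_loop().run_in_executor(None, cleanup)
    fut.add_done_callback(_log_future_exception)
    return fut
```

Catching OSError alone keeps the clause short; if HTTP status errors from the agent need a different fallback than connection failures, a separate except urllib.error.HTTPError clause placed before it would be one way to distinguish them.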

Labels

bug, error-handling, async, dashboard-api, all-platforms
