Severity: High
Category: Error Handling / Async Hygiene
Platform: All
Confidence: Confirmed
Description
routers/extensions.py has three async / exception-handling defects that together degrade dashboard reliability and debuggability. The standout is a blocking urllib.request.urlopen(timeout=30) call sitting directly on the event loop in an async def handler: opening the Console modal (which auto-polls every 2s) can freeze the entire dashboard API for up to 30 seconds. Adjacent issues: a broad except Exception in _call_agent discards root causes, and _cleanup_stale_progress is launched fire-and-forget with no exception handling.
Affected File(s)
dream-server/extensions/services/dashboard-api/routers/extensions.py:851-881 (extension_logs — blocking urllib in async def)
dream-server/extensions/services/dashboard-api/routers/extensions.py:488-492 (_call_agent broad except Exception)
dream-server/extensions/services/dashboard-api/routers/extensions.py:648 (_cleanup_stale_progress fire-and-forget)
dream-server/extensions/services/dashboard-api/routers/extensions.py:476 (_AGENT_LOG_TIMEOUT = 30)
Root Cause
1. extension_logs blocks the event loop. async def extension_logs(...) calls urllib.request.urlopen(req, timeout=_AGENT_LOG_TIMEOUT) directly (L868). Starlette runs async def handlers on the event-loop thread; any blocking I/O freezes all concurrent coroutines. The Console modal auto-polls every 2 seconds (Extensions.jsx L1018), so while open the event loop can be frozen up to 30 out of every ~32 seconds.
2. _call_agent broad except Exception (L488-492). Catches ConnectionRefusedError, URLError, TimeoutError, but ALSO AttributeError, NameError, JSONDecodeError, and any programming bug in the call path. Always logs "Host agent unreachable at %s — fallback to restart_required" with no exception type/message. Violates the project's CLAUDE.md rule "Narrow exceptions at I/O boundaries" — this is a catch-all.
3. _cleanup_stale_progress fire-and-forget (L648). asyncio.get_running_loop().run_in_executor(None, _cleanup_stale_progress) — returned Future is not stored, awaited, or attached to a task group. Any uncaught exception is GC'd with "Future exception was never retrieved" noise in stderr.
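The first root cause can be observed in isolation, without the dashboard. The following is a minimal sketch, not code from extensions.py: slow_fetch is a hypothetical stand-in for the urlopen call (it just sleeps), and count_ticks plays the role of any other request being served concurrently. It compares how often the concurrent coroutine gets scheduled when the blocking call runs directly in the coroutine versus through asyncio.to_thread (Python 3.9+).

```python
import asyncio
import time


def slow_fetch(delay: float) -> str:
    """Hypothetical stand-in for urllib.request.urlopen against a slow host agent."""
    time.sleep(delay)  # blocking call, like a socket read waiting on the agent
    return "log lines"


async def count_ticks(stop: asyncio.Event, interval: float) -> int:
    """Count how many times this coroutine is resumed before `stop` is set."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def measure(use_thread: bool, delay: float = 0.3) -> int:
    """Run slow_fetch once and report how often a concurrent coroutine ran."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop, 0.02))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if use_thread:
        # The event loop stays free while a worker thread waits.
        await asyncio.to_thread(slow_fetch, delay)
    else:
        # The event loop thread itself waits; nothing else can run.
        slow_fetch(delay)
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocking directly:", asyncio.run(measure(use_thread=False)))
    print("ticks with asyncio.to_thread:", asyncio.run(measure(use_thread=True)))
```

In the direct case the ticker is only resumed after the blocking call has returned, which is the same starvation that every other dashboard request would see while extension_logs waits on the agent.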
Evidence
# extensions.py L851-881
@router.get("/{service_id}/logs")
async def extension_logs(service_id: str, ...):
    ...
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=_AGENT_LOG_TIMEOUT) as resp:  # BLOCKS EVENT LOOP
        data = resp.read().decode()
    ...

# extensions.py L488-492
try:
    with urllib.request.urlopen(req, timeout=_AGENT_REQUEST_TIMEOUT) as resp:
        return True, json.loads(resp.read().decode())
except Exception:  # too broad
    logger.warning("Host agent unreachable at %s — fallback to restart_required", url)
    return False, None

# extensions.py L648
asyncio.get_running_loop().run_in_executor(None, _cleanup_stale_progress)
# return value discarded; exceptions lost
Platform Analysis
- macOS / Linux / Windows-WSL2: Identical behavior — dashboard-api runs inside the same Docker container on all three.
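Reproduction
1. Open the Console modal, then call any other dashboard endpoint in a loop (e.g. /api/extensions/catalog); it will block for up to 30s on any iteration where the host agent is slow.
2. Cause _call_agent to fail with a non-network error (e.g. agent returns 422 with malformed JSON). Logs always say "unreachable" even though the agent was reachable.
3. Cause _cleanup_stale_progress to fail (e.g. revoke read permission on a progress file). Python emits a "Future exception was never retrieved" warning to stderr; the caller never knows.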
Impact
- Dashboard API becomes unresponsive to all requests while the Console modal is open if the host agent is slow.
- Support-debugging is hampered: maintainers can't distinguish "agent offline" from "agent rejected request" from logs.
- Stale progress cleanup failures accumulate silently.
Suggested Approach
- Wrap the extension_logs blocking call in asyncio.to_thread(...), matching the pattern used in main.py:api_settings_env_save. Keep the HTTPError/URLError catches around the threaded result, not the top-level handler.
- Replace _call_agent's broad except Exception with specific types (urllib.error.URLError, urllib.error.HTTPError, OSError, TimeoutError). Log exception type and message. Let unexpected types propagate. Apply the same to _call_agent_compose_rename (L584-588).
- Attach a log-on-exception callback to the _cleanup_stale_progress future, or move it to the existing background service-poll loop in main.py where exceptions are already logged.
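The second and third suggestions could look roughly like the sketch below. It is an illustration under assumptions, not the actual patch: call_agent, log_future_exception, and schedule_cleanup are hypothetical names, the real _call_agent builds its request with headers and a payload that are omitted here, and the exact log wording is up to the maintainers.

```python
import asyncio
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("extensions_sketch")  # hypothetical logger name

# Failures that plausibly mean "could not talk to the agent". HTTPError,
# URLError and TimeoutError are all OSError subclasses; they are spelled
# out here only to mirror the list in the suggestion above.
_AGENT_IO_ERRORS = (urllib.error.HTTPError, urllib.error.URLError, TimeoutError, OSError)


def call_agent(url: str, timeout: float = 5.0):
    """Hypothetical stand-in for _call_agent with a narrowed except clause."""
    req = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode()
    except _AGENT_IO_ERRORS as exc:
        # Include the exception type and message so logs show the real cause.
        logger.warning("Host agent call to %s failed: %s: %s", url, type(exc).__name__, exc)
        return False, None
    # Parsing sits outside the try block, so a malformed body raises
    # json.JSONDecodeError to the caller instead of being logged as a network failure.
    return True, json.loads(body)


def log_future_exception(fut: asyncio.Future) -> None:
    """Done-callback that retrieves and logs an executor job's exception."""
    if fut.cancelled():
        return
    exc = fut.exception()  # retrieving it also silences the "never retrieved" warning
    if exc is not None:
        logger.error("Background cleanup failed: %s: %s", type(exc).__name__, exc)


def schedule_cleanup(cleanup) -> asyncio.Future:
    """Run `cleanup` in the default executor; must be called from a running event loop."""
    fut = asyncio.get_running_loop().run_in_executor(None, cleanup)
    fut.add_done_callback(log_future_exception)
    return fut
```

With this shape, a programming error such as an AttributeError in the call path surfaces as a traceback rather than an "unreachable" warning, and a failing cleanup job produces one explicit log line at the point where it is scheduled.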
Labels
bug, error-handling, async, dashboard-api, all-platforms