AppProxyClient endpoint methods do not surface error responses as domain exceptions #11331

@rapsealk

Summary

AppProxyClient in src/ai/backend/manager/clients/appproxy/client.py has inconsistent error handling across its methods. Only fetch_status maps low-level aiohttp exceptions to the AppProxyConnectionError / AppProxyResponseError domain exceptions defined in manager/errors/appproxy.py. The four endpoint methods — create_endpoint, create_endpoints_bulk, delete_endpoint, delete_endpoints_bulk — leak raw aiohttp exceptions or, in the case of delete_endpoint, silently swallow non-2xx responses entirely.

This was flagged during review of #11328 (BA-1929) as an out-of-scope but real robustness gap.

Concrete defects

  1. delete_endpoint uses async with ... as resp: pass with no resp.raise_for_status() and no body read. A 4xx/5xx response from the coordinator is silently dropped — the manager logs and returns successfully, even though the deletion never happened.
  2. create_endpoint, create_endpoints_bulk, and delete_endpoints_bulk call resp.raise_for_status() followed by await resp.json(). The first can raise a raw aiohttp.ClientResponseError and the second a raw aiohttp.ContentTypeError, and neither is wrapped into a BackendAIError subclass. Callers in manager/sokovan/deployment/executor.py then see a non-domain exception, and the eventual DeploymentExecutionError does not carry the AppProxy domain.
  3. The structured BackendAIError JSON body that the coordinator returns on validation failures (since #11329, "fix(BA-1929): Return JSON instead of HTML for coordinator API errors") is not preserved through to the caller. The error is re-raised as an aiohttp exception with only the status code.
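
Defect 1 can be reproduced in isolation. The following is a minimal sketch, not the actual client code: FakeResponse is a hypothetical stand-in for the response object an aiohttp request yields, and RuntimeError stands in for aiohttp.ClientResponseError so the example needs only the standard library.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for the response an aiohttp request yields."""

    def __init__(self, status: int) -> None:
        self.status = status

    def raise_for_status(self) -> None:
        # aiohttp raises ClientResponseError here; RuntimeError keeps
        # this sketch free of third-party imports.
        if self.status >= 400:
            raise RuntimeError(f"HTTP {self.status}")

    async def __aenter__(self) -> "FakeResponse":
        return self

    async def __aexit__(self, *exc_info: object) -> bool:
        return False  # never suppress exceptions


async def delete_unchecked(resp: FakeResponse) -> str:
    # Shape described in defect 1: the status is never inspected.
    async with resp:
        pass
    return "deleted"


async def delete_checked(resp: FakeResponse) -> str:
    # Same call with the status check added.
    async with resp as r:
        r.raise_for_status()
    return "deleted"
```

With a 500 response, delete_unchecked still returns normally, while delete_checked raises before the caller can treat the deletion as done.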

Proposed fix

Apply the same try / except (ClientConnectorError, ClientResponseError, ContentTypeError, JSONDecodeError) pattern that fetch_status uses to all four endpoint methods, and add raise_for_status() to delete_endpoint. Where possible, attach the parsed coordinator error body to AppProxyResponseError.extra_data so the upstream JSON error survives the translation.
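
One possible shape for the translation, sketched under assumptions: the two exception classes below are stand-ins for the ones in manager/errors/appproxy.py (their real constructors may differ), while FakeResponse and call_endpoint are hypothetical names invented for this example. OSError stands in for aiohttp.ClientConnectorError. The sketch reads the body before checking the status, because a status-only exception cannot carry the coordinator's JSON error.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Optional


class AppProxyConnectionError(Exception):
    """Stand-in for the domain exception in manager/errors/appproxy.py."""


class AppProxyResponseError(Exception):
    """Stand-in; the real constructor and extra_data handling may differ."""

    def __init__(self, message: str, extra_data: Optional[dict] = None) -> None:
        super().__init__(message)
        self.extra_data = extra_data


class FakeResponse:
    """Hypothetical stand-in exposing only status and an async text()."""

    def __init__(self, status: int, body: str) -> None:
        self.status = status
        self._body = body

    async def text(self) -> str:
        return self._body


async def call_endpoint(send: Callable[[], Awaitable[FakeResponse]]) -> Any:
    """Run one request and translate failures into domain exceptions."""
    try:
        resp = await send()
    except OSError as e:
        # Stand-in for catching aiohttp.ClientConnectorError.
        raise AppProxyConnectionError(str(e)) from e

    # Read the body first so an error payload is available below.
    raw = await resp.text()
    try:
        parsed = json.loads(raw) if raw else None
    except json.JSONDecodeError as e:
        if resp.status < 400:
            raise AppProxyResponseError("response body is not valid JSON") from e
        parsed = None  # error response without a JSON body

    if resp.status >= 400:
        # Attach the coordinator's structured error, if there was one.
        extra = parsed if isinstance(parsed, dict) else None
        raise AppProxyResponseError(f"HTTP {resp.status}", extra_data=extra)
    return parsed
```

Routing all four endpoint methods through one helper of this kind would keep their public signatures unchanged, which matches the out-of-scope constraints listed in this issue.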

Out of scope

  • Re-architecting the resilience policy or retry behavior.
  • Changing the public signatures of the four methods.
