feat(BA-5528): add deployment chat CLI for vLLM-backed model services by jopemachine · Pull Request #11344 · lablup/backend.ai

jopemachine · 2026-04-27T07:48:58Z

📚 Stacked PRs

This PR is part of a 2-PR stack. Merge in order:

👉 feat(BA-5528): add deployment chat CLI for vLLM-backed model services #11344 — feat(BA-5528): add deployment chat CLI for vLLM-backed model services ← you are here
⬇️ feat(BA-5903): persist deployment chat history and replay as request context #11412 — feat(BA-5903): persist deployment chat history and replay as request context

Summary

Add ./bai deployment chat <id> "<content>" for one-shot OpenAI-compatible chat against deployed inference services. Requests are sent directly to the deployment's inference endpoint with optional Authorization: Bearer <token> (the value the runtime — vLLM/SGLang/NIM/TGI/custom — was started with), bypassing the Backend.AI manager. Use --params to forward runtime-variant-specific sampling knobs.
Add ./bai deployment chat-config set/show/clear/clear-cache to register, inspect, and remove per-deployment chat state.
Auto-derive the request model when the user did not specify one: the CLI calls GET /v1/models on the inference endpoint, picks data[0].id, and caches it as cache.default_model for subsequent calls (matches the webui ChatCard.tsx fallback). The user is no longer required to run chat-config set --model before the first chat.
Persist state under ~/.backend.ai/deployment_chat/, grouped per-feature (matching the existing ~/.backend.ai/session/ layout used by ./bai login):
- cache.json — auto-managed: endpoint_url, default_model (auto-fetched from /v1/models), last_synced_at (24-h TTL).
- config.json — user-managed: per-deployment { token, model } entries. The user's model takes precedence over cache.default_model.
Both files are written via plain path.write_text() to match the existing CLI credential-storage convention (client/cli/v2/config_cmd.py). On 401/403 from the inference endpoint, the cached token for that deployment is cleared and the user is prompted to re-register.
Add an SDK-side BackendAIAppProxyClient base in client/v2/base_client.py for direct-to-deployment HTTP traffic (Bearer-token auth, app-proxy-aware JSON parsing) and a thin DeploymentChatClient subclass exposing chat_completion() and list_models() (returning a typed ListModelsResponse).

Model resolution order

When the runtime needs a model field for a chat call, the CLI walks this list and stops at the first hit:

--model <name> on the chat command line.
config.<deployment-id>.model — the user's pinned model in config.json.
cache.<deployment-id>.default_model — the auto-derived value in cache.json.
GET /v1/models on the inference endpoint, taking data[0].id. The result is written to cache.default_model so subsequent calls skip the round trip.

This means a fresh deployment works with zero configuration as long as the runtime serves /v1/models; you only need chat-config set --model for multi-model deployments where [0] is not the right pick.

Command usage

# One-shot chat — model is auto-derived on first call from /v1/models
./bai deployment chat <deployment-id> "Hello, who are you?"

# Override the model for one call
./bai deployment chat <deployment-id> "..." --model llama-3-8b-instruct

# Forward runtime-specific sampling knobs as a JSON object
./bai deployment chat <deployment-id> "..." \
    --params '{"temperature": 0.7, "max_tokens": 256}'

# Register a Bearer token for a token-gated deployment
./bai deployment chat-config set <deployment-id> --token <runtime-token>

# Pin a model (overrides the cached default; useful for multi-model deployments)
./bai deployment chat-config set <deployment-id> --model llama-3-8b-instruct

# Set both at once
./bai deployment chat-config set <deployment-id> \
    --token <runtime-token> --model llama-3-8b-instruct

# Inspect what's currently registered/cached (token is masked)
./bai deployment chat-config show <deployment-id>

# Remove the user-managed config entry (token + model) for a deployment
./bai deployment chat-config clear <deployment-id>

# Force-invalidate the auto-managed cache entry (endpoint_url, default_model)
./bai deployment chat-config clear-cache <deployment-id>

chat-config set writes to config.json only — it does not contact the manager, so it stays usable while the deployment is still provisioning or the manager is unreachable. chat-config clear and clear-cache operate on the two storage files independently: clearing user config never touches the cache, and vice versa.

On-disk state

State lives under ~/.backend.ai/deployment_chat/ so it stays grouped with the other Backend.AI CLI state directories.

cache.json — auto-managed by the CLI; do not hand-edit.

{
  "deployments": {
    "d55e251a-3a70-408d-97a9-ca305502aa58": {
      "endpoint_url": "https://app-proxy.example.com/v1/some-deployment",
      "default_model": "llama-3-8b-instruct",
      "last_synced_at": "2026-04-29T12:34:56.789012+00:00"
    }
  }
}

endpoint_url — fetched from the manager's deployment.network_access.endpoint_url and refreshed on a 24-hour TTL.
default_model — auto-derived from GET /v1/models on first use; never written by chat-config set.
last_synced_at — UTC timestamp of the last manager fetch; entries past CACHE_ENTRY_TTL (24 h) are treated as a cache miss.

config.json — user-managed: one { token, model } entry per deployment.

{
  "deployments": {
    "d55e251a-3a70-408d-97a9-ca305502aa58": {
      "token": "sk-runtime-token-here",
      "model": "llama-3-8b-instruct"
    }
  }
}

Either field may be null — chat-config set upserts only the fields you pass, and an entry is dropped automatically once both fields are cleared. The token is also cleared automatically on 401/403 from the inference endpoint so the next chat call surfaces the re-register hint instead of silently re-sending a stale credential.

Resolves BA-5528.

Copilot

Pull request overview

Adds a new CLI workflow for one-shot OpenAI-compatible chat calls against deployed vLLM inference endpoints, backed by a per-deployment local cache and a dedicated SDK-side HTTP client/DTOs (bypassing the Backend.AI manager API).

Changes:

Add ./bai deployment chat and ./bai deployment chat-config set/show/clear commands plus a JSON cache at ~/.backend.ai/deployment_chat.json (0600, atomic write).
Add DeploymentChatClient (direct aiohttp client) and OpenAI-compatible Pydantic DTOs under the v2 client package.
Add unit tests for the direct chat client and the cache load/save semantics.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/unit/client/v2/test_deployment_chat_client.py	Unit tests for direct vLLM chat posting, auth error handling, serialization, and session ownership.
tests/unit/client/cli/test_deployment_chat_cache.py	Unit tests for cache schema/version guard, permissions, atomic write, masking, and tolerant loading.
src/ai/backend/client/v2/domains_v2/deployment_chat.py	New direct-to-inference chat client (aiohttp) with OpenAI-compatible request/response handling.
src/ai/backend/client/v2/chat_dto.py	New Pydantic DTOs for `/v1/chat/completions` request/response payloads with forward-compatible extra fields.
src/ai/backend/client/cli/v2/deployment_chat_cache.py	New cache implementation for endpoint URL + vLLM API key persistence with 0600 permissions and atomic writes.
src/ai/backend/client/cli/v2/deployment/chat_config.py	New `chat-config` CLI group to set/show/clear cache entries.
src/ai/backend/client/cli/v2/deployment/chat.py	New `chat` CLI command to send one-shot chat completions and invalidate cached key on 401/403.
src/ai/backend/client/cli/v2/deployment/init.py	Registers the new `chat` and `chat-config` commands under `deployment`.
changes/5528.feature.md	Changelog entry for the new CLI commands and cache behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address review comments from #11344: - Drop chat_dto.py and switch the SDK to plain dict[str, Any] for both request and response, so it doesn't try to track every runtime variant's extension fields (vllm reasoning_content, tool_calls, etc.) - Rename DeploymentChatClient -> InferenceChatClient and decouple it from the vllm runtime variant: works against any OpenAI-compatible endpoint (vllm, tgi, sglang, nim) and exposes a configurable path plus a list_models helper - Rename the cached api key field vllm_api_key -> api_key throughout the cache schema, CLI options, show output, and tests - chat-config set: --token is now optional and pairs with a new --no-token flag for deployments started without --api-key. The served model name is auto-discovered via GET /v1/models (option B from the discussion) so users no longer have to know it - chat: replace the local _abort helper with click.ClickException, validate --max-tokens via click.IntRange(min=1) and the sampling knobs via click.FloatRange, and add --top-p, --frequency-penalty, --presence-penalty, --seed, --stop options - inference_chat client: add ClientTimeout (sock_connect/sock_read) to the owned aiohttp session and normalize trailing slashes when building the chat / models URL - cache loader: tolerate corrupted JSON (OSError/JSONDecodeError) and skip individual malformed entries instead of aborting the whole load - tests: drop redundant atomic-write/permission-reset cases, add loader resilience cases, and shorten the changelog entry

Address review comments on PR #11344: - chat.py: - Drop the auto-clear of the cached API key on inference 401/403 — it was deleting user-supplied config out from under them. Just raise the error and ask the user to re-register. - Use print() instead of sys.stdout.write() for the response payload. - chat_config.py: - Remove --no-token; clearing is the dedicated chat-config clear command's job. Resolved-key handling collapses to a single expression. - Use print() instead of click.echo() for status output. - Parse the inference endpoint's /v1/models response with a typed Pydantic model (_ServedModelsResponse) instead of manual dict.get walking. - _print_entry now delegates the entry portion to DeploymentChatCacheEntry.format_summary() so the per-entry fields are owned by the cache type. - deployment_chat_cache.py / deployment_chat_config.py: - Drop schema_version as a Pydantic field on the wrapper model. The version is metadata, not data — emit it manually around model_dump in save_*, and check it manually in load_* before validating individual records. - DeploymentChatCacheEntry gains a format_summary() method returning the endpoint/default_model/last_synced_at lines so consumers don't duplicate that formatting.

…Args type Address review comments on PR #11344: - Drop _owns_session and the optional session= kwarg on DeploymentChatClient. Match BackendAIAuthClient: __init__ takes a pre-built session, factory method create() builds one, close() always closes. Removes the dual-ownership branch. - Introduce DeploymentChatClientArgs (frozen dataclass) for connection knobs (skip_ssl_verification, connect_timeout, read_timeout). Callers use DeploymentChatClient.create(args) instead of passing multiple kwargs to the constructor. - Rename chat_completion's 'request' parameter to 'body'. - Tests: rename the cache-entry helper to _make_entry, the chat-body helper to _make_body. Drop TestExternalSession since the new contract is 'whatever you pass to __init__ gets closed'.

The cache file holds endpoint URL, model name and a sync timestamp — no secrets. The 0600 chmod was copy-pasted from the config file path where it actually matters (plaintext API keys). Default umask applies to the cache; only save_chat_config keeps the chmod. Module/function docstrings updated and the corresponding cache permission test goes away.

…oning Token registration is purely user-side state — it should not block on the deployment's runtime status. Previously set_ went through _resolve_endpoint_entry which raises 'no endpoint_url yet' when the deployment is in DEPLOYING/PROVISIONING, dropping the user's token along with the cache write. Restructure set_: 1. Always fetch the deployment record (so a typo in deployment_id still surfaces a 404). 2. Save the token unconditionally when --token is provided. 3. Write the cache entry only when endpoint_url is already populated; otherwise warn that --default-model will be picked up on the first chat call once the deployment is READY. The chat command's _resolve_endpoint_entry is unchanged — chat still requires a usable endpoint to talk to.

- DeploymentChatCache/Config gain `save()` instance methods (paired with the existing `load()` classmethods); free functions in utils.py removed. - `_write_text_file` writes via tmp+rename and creates the file with the target permission directly, closing the brief world-readable window that `write_text() + chmod(0600)` left open on the config file. - `is_fresh()` flipped to `is_expired()` to align with the cache miss call site. - `_resolve_endpoint_entry` had a single caller and an unused `default_model_override` parameter; inlined into `chat`. - Renamed local `connection` to `connection_config` to match `V2ConnectionConfig`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ish naming - 401/403 from the inference endpoint now clears the stored API key for that deployment so the user is not silently retried with a known-bad token. The error message tells the user the cache was cleared. - Replace the ad-hoc ``dict[str, Any]`` chat body with ``ChatCompletionRequest`` (pydantic, ``extra="allow"``) so runtime- variant-specific knobs supplied via ``--params`` still flow through while the model/messages shape is enforced. - Rename ``chat_config_store`` → ``chat_config`` in the ``chat`` command and ``config`` inside the ``chat-config`` subcommands to match the reviewer's preferred naming and avoid shadowing the click group. - Clarify ``_ensure_dict`` wording: payloads that are valid JSON but not an object now report ``non-object payload (type=...)`` instead of the misleading ``non-JSON response``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…der, named timeout consts - Rename ``api_key`` to ``token`` across CLI flag binding, local variables, client method signatures, error messages, and the ``chat-config show`` summary label so the user-facing vocabulary matches the storage method names (``get_token``/``set_token``). - Replace the length-leaking ``sk-***...***xxxx``-style mask with a fixed ``********`` placeholder that never reveals the token's prefix, suffix, or length. - Pull ``DeploymentChatClientArgs`` magic numbers into named module constants (``DEFAULT_CONNECT_TIMEOUT_SEC``, ``DEFAULT_READ_TIMEOUT_SEC``). - Update the affected test names and assertions accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…edential convention - Drop the bespoke tmp-and-rename / 0600-permission helper used for ``deployment_chat_config.json``. The existing CLI credential store (``client/cli/v2/config_cmd.py``) writes plain TOML without atomic semantics or explicit permissions; the chat config now matches that convention rather than introducing a stricter parallel one. - Introduce ``write_json_file`` in ``utils.py`` so the cache and config models share a single, plain ``mkdir`` + ``write_text`` helper. - Drop the ``test_config_save_enforces_0600`` test along with the no-longer-needed ``os``/``stat`` imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…chat tests with aioresponses - Collapse ``_read_payload``/``_ensure_dict`` into a single read-and-parse block inside ``_request``: parse ``resp.text()`` as JSON in one step, surface ``BackendAPIError`` (with the raw body in ``detail``) when the status is already a 4xx/5xx, and only raise ``BackendClientError`` when a 2xx body is unparsable. The clarified comment now names Backend.AI's app-proxy as the layer that produces non-JSON 5xx pages. - Remove ``--path`` from ``./bai deployment chat``. The CLI body is fixed to OpenAI-shaped ``{model, messages}`` via ``ChatCompletionRequest``, so a custom path never paired with a matching custom body — keeping the option encouraged the misconception that arbitrary inference contracts could be driven through this command. The SDK still accepts ``path`` as a kwarg for programmatic callers. - Migrate ``test_deployment_chat_client.py`` from a real ``aiohttp.web`` test server to ``aioresponses``-based mocks, matching the existing client-test convention (see ``tests/unit/client/test_resource_usage.py``). Headers and JSON body are asserted via ``m.requests``. New coverage: HTML 5xx now produces a ``BackendAPIError`` whose ``detail`` carries the upstream body verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ndAIAppProxyClient - Add :class:`BackendAIAppProxyClient` to ``client/v2/base_client.py``: a ``ClientConfig``-driven base for SDK-side, direct-to-deployment HTTP traffic. It owns the aiohttp session, ``_request`` (with Bearer-token auth, app-proxy-aware JSON parsing, status-to-exception mapping), URL normalization, and the lifecycle hooks. The name is deliberately distinct from ``manager/clients/appproxy/client.py``'s ``AppProxyClient`` (control plane: coordinator admin API with ``X-BackendAI-Token``); this base sits in the SDK and handles the data plane (per-deployment Bearer-token traffic). - Trim ``DeploymentChatClient`` to a single OpenAI Chat Completions method on top of the new base. Drop the ABC layer / separate ``OpenAICompatibleChatClient`` / ``DeploymentChatClientArgs`` / per-module timeout constants — those duties now live on ``BackendAIAppProxyClient`` and ``ClientConfig``. The path constant is renamed ``_OPENAI_COMPATIBLE_CHAT_PATH`` to make the contract explicit at the call site. - Rename ``DeploymentChatAuthError`` → ``DeploymentAuthError`` since the 401/403 mapping now lives on the AppProxy base and is no longer chat-specific. - Update the CLI to build a ``ClientConfig`` from ``V2ConnectionConfig`` and instantiate ``DeploymentChatClient`` directly. Tests follow the same construction path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…polish help text - ``DeploymentChatCache.remove`` → ``pop`` and ``DeploymentChatConfig.clear_token`` → ``pop_token`` so the names match the underlying ``dict.pop`` semantics (return value indicates whether something was actually removed). - Inline the ``TOKEN_PLACEHOLDER`` constant into ``mask_token`` — the literal only has one call site. - Reword ``./bai deployment chat-config set --token`` help text: "Omit when the deployment is open to public" instead of the previous runtime-startup phrasing. - Update tests for the renames. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…at/ subdirectory Match the existing per-feature subdirectory layout used by ``./bai login`` (``~/.backend.ai/session/cookie.dat`` + ``session/config.json``): - ``~/.backend.ai/deployment_chat.json`` → ``~/.backend.ai/deployment_chat/cache.json`` - ``~/.backend.ai/deployment_chat_config.json`` → ``~/.backend.ai/deployment_chat/config.json`` Drops the ``deployment_chat_`` filename prefix duplication and lets future chat-related files (logs, sessions, etc.) land naturally under the same directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… omitted Previously ``./bai deployment chat`` errored out when neither ``--model`` nor a cached ``default_model`` was provided. Now the CLI calls ``GET /v1/models`` on the deployment's inference endpoint, picks the first ``id`` (matches webui ChatCard.tsx fallback), and caches it as the deployment's ``default_model`` so subsequent ``chat`` calls reuse it. Add ``DeploymentChatClient.list_models()`` returning a typed ``ListModelsResponse`` so the CLI consumes ``models_response.data[0].id`` instead of dict-drilling. Hoist the ``DeploymentAuthError`` handler to the whole ``async with`` block (auth handling is identical for both ``/v1/models`` and ``/v1/chat/completions``) and drop the per-call ``BackendAPIError`` handlers — ``_run_async`` already formats them.

…under entry Introduce ``DeploymentChatConfigEntry { token, model }`` so per-deployment user state lives in one nested record (mirrors ``DeploymentChatCacheEntry``) instead of two parallel ``tokens`` / ``models`` dicts. Resolution order in ``chat`` becomes: ``--model`` flag > ``config.model`` (user-set, ``config.json``) > ``cache.default_model`` (auto, ``cache.json``) > ``GET /v1/models[0].id`` (auto-fetched and cached). Both fields can co-exist; the user-set value always wins, matching the user's "config는 사용자, cache는 자동" mental model. CLI surface changes: - Rename ``chat-config set --default-model`` to ``--model``; the flag now writes to ``config.json`` (user store) instead of ``cache.json`` (auto store), so the new name matches the field it sets. - Drop the manager fetch from ``chat-config set`` — both token and model go to ``config.json`` only, so the command stays usable while the deployment is still provisioning or the manager is unreachable. - Rename ``chat-config clear-config`` to ``chat-config clear``; clears the whole user config entry (token + model) for that deployment. - Keep ``chat-config clear-cache`` for invalidating the auto-managed cache entry (``endpoint_url``, ``default_model``, ``last_synced_at``) on demand rather than waiting for the 24h TTL. - ``chat-config show`` now prints both the user-set ``model`` and the auto-cached ``default_model`` so the resolved value is clear at a glance.

…es to one line Replace the manual ``self.deployments.get(id) or DeploymentChatConfigEntry()`` + ``self.deployments[id] = entry`` dance with a ``defaultdict``-backed store so ``set_token`` / ``set_model`` reduce to a single bracket assignment. Pydantic v2 cannot infer a default factory for a ``defaultdict`` whose value is a ``BaseModel`` subclass, so the field annotation uses ``Annotated[..., Field(default_factory=...)]`` per the explicit form ``PydanticSchemaGenerationError`` directs callers to. Without it, importing the module raises at class-construction time: Unable to infer a default factory for keys of type DeploymentChatConfigEntry. Only set, bool, str, tuple, dict, int, frozenset, float, list are supported, other types require an explicit default factory set using DefaultDict[..., Annotated[..., Field( default_factory=...)]] Read paths (``get`` / ``get_token`` / ``get_model`` / ``pop_*``) still go through ``dict.get`` / ``dict.pop`` so a missing-key lookup never plants a stale empty entry.

…onfig The block was restating things the code already says (method names already imply read vs write paths) and explaining pydantic boilerplate that the import-time error message itself points at, so it was net noise.

Relocate the wire-format and persistence Pydantic models added in this PR into the shared `common/` tree so any backend.ai component can consume them, not just the CLI: - OpenAI-compat wire DTOs (`ChatCompletionMessage`, `ChatCompletionRequest`, `ListModelsResponse`, `ModelEntry`) → `common/dto/clients/openai_compat/{request,response}.py`, paralleling the existing `common/dto/clients/prometheus/` layout for third-party HTTP service contracts. - Chat persistence data types (`DeploymentChatCache(Entry)`, `DeploymentChatConfig(Entry)`, `CACHE_ENTRY_TTL`) → `common/data/deployment_chat/types.py` as pure Pydantic models with no I/O coupling. Disk load/save lives in a new `client/cli/v2/deployment/chat/storage.py` (`load_chat_cache`, `save_chat_cache`, `load_chat_config`, `save_chat_config`) so the data types stay free of `client.cli` imports — `common/` MUST NOT depend on component-specific packages per `common/dto/CLAUDE.md`. The previous `DeploymentChatCache.load`/`.save` classmethods that pulled in `client.cli.v2.deployment.chat.utils` are removed in favor of these free functions, eliminating the backward dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e in Reverse the relocation done in 69fd070: - `DeploymentChatCache(Entry)` and `DeploymentChatConfig(Entry)` (and `CACHE_ENTRY_TTL`) move from `common/data/deployment_chat/` back to `client/cli/v2/deployment/chat/types.py`. - `load_chat_*` / `save_chat_*` free functions go away; the corresponding `.load()` / `.save()` classmethods are restored on `DeploymentChatCache` / `DeploymentChatConfig`. - `client/cli/v2/deployment/chat/storage.py` is removed — the typed models own their own disk format directly. `common/dto/clients/openai_compat/{request,response}.py` (the OpenAI-compat wire DTOs) are left in place, since those are reused by the SDK and may grow more component consumers.

Address review feedback on PR #11344 — the OpenAI-compat chat endpoint treats each turn as a "message" with role/content, so the user-facing CLI argument is more naturally named `message`. Update the click argument declaration, the function parameter, the help text, and the request-body construction site. The JSON key on the wire stays `content` (that's the OpenAI spec); only the local variable / argument name changes.

…onfig `chat-config show` was printing both the cache (auto-managed `endpoint_url` / `default_model` / `last_synced_at`) and the user's config (`token` / `model`) in one block, which blurred the responsibility split between the two files. Trim the command to print only the config entry it owns. Drop ``DeploymentChatFormatter.print_summary``/``entry_lines`` (the only consumers of the cache half) in favor of a dedicated ``print_config(deployment_id, entry)``. Update the formatter test to match.

…mand group The auto-managed cache and the user-managed config are two separate files (``cache.json`` vs ``config.json``); having a `clear-cache` subcommand under `chat-config` mixed the two responsibilities. Replace ``./bai deployment chat-config clear-cache`` with a dedicated ``chat-cache`` group: - ``./bai deployment chat-cache show <id>`` — print the cached endpoint metadata (``endpoint_url``, ``default_model``, ``last_synced_at``) for inspection / debugging. - ``./bai deployment chat-cache clear <id>`` — drop the cache entry, forcing the next ``chat`` call to refetch endpoint and re-derive the default model. ``DeploymentChatFormatter`` gains ``print_cache(deployment_id, entry)`` to render the cache view; the `chat-config clear` docstring is updated to reference the new path.

…args Address review feedback (PR #11412 discussion r3165318334) — the deployment id values flowing through the chat data classes, the formatter, and the click handlers represent a deployment, not a generic UUID. Switch the static signatures to ``ai.backend.common.identifier.deployment.DeploymentID`` (a ``NewType(UUID)``) so type checkers can distinguish deployment ids from other UUIDs without any runtime cost. The click ``type=click.UUID`` parser still emits a plain ``UUID`` at runtime; ``DeploymentID`` is structurally identical, so the wrap is implicit and no conversion is needed at the boundary.

…context (#11412)

…history None vs empty - Extract `_OpenAICompatModel` base class so all OpenAI-compat response DTOs share a single `ConfigDict(extra="allow")` declaration instead of repeating it on each subclass. - In `history_show`, distinguish "no history record" (`messages is None`) from the invariant-violating "record exists but empty list" case so the CLI message reflects the actual state instead of conflating both as falsy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Python's `pop` convention (`dict.pop`, `list.pop`) implies the popped value is returned, but these methods return a plain `bool` because every caller only needs "did anything actually get removed?" Renaming so the method names match what the calls do: - `DeploymentChatCache.pop` → `delete` (removes the entry) - `DeploymentChatConfig.pop` → `delete` (removes the entry) - `DeploymentChatConfig.pop_token` → `clear_token` (nulls the field, drops the entry only when both fields are unset) - `DeploymentChatConfig.pop_model` → `clear_model` (same shape as `clear_token`) `pop_token`/`pop_model` were already misnomers — they null one field rather than fully popping the entry, so `clear_*` reflects the actual behavior. Return types stay `bool` since no caller uses the popped value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lient._request` Address PR #11344 review: split JSON parsing and payload validation into a dedicated method so `_request` only orchestrates the HTTP call and status handling.

`PrometheusQueryPresetRepository.preview_template` was rewired to call `PrometheusClient.preview_query_template` in #11274, but the component test added in #11482 still mocked the now-unused `query_instant`. The real client method falls through and returns an `AsyncMock`, so the PrometheusResponse model fails validation and the API returns 500. Mock the method actually called so the preview-endpoint tests cover the success path and the FailedToGetMetric → PrometheusQueryEvaluationFailed mapping again.

Copilot AI review requested due to automatic review settings April 27, 2026 07:49

github-actions Bot assigned jopemachine Apr 27, 2026

github-actions Bot added size:XL 500~ LoC comp:client Related to Client component comp:cli Related to CLI component labels Apr 27, 2026

jopemachine marked this pull request as draft April 27, 2026 07:49

Copilot started reviewing on behalf of jopemachine April 27, 2026 07:49 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes