feat(BA-5528): add deployment chat CLI for vLLM-backed model services#11344
Merged
Pull request overview
Adds a new CLI workflow for one-shot OpenAI-compatible chat calls against deployed vLLM inference endpoints, backed by a per-deployment local cache and a dedicated SDK-side HTTP client/DTOs (bypassing the Backend.AI manager API).
Changes:
- Add `./bai deployment chat` and `./bai deployment chat-config set/show/clear` commands plus a JSON cache at `~/.backend.ai/deployment_chat.json` (0600, atomic write).
- Add `DeploymentChatClient` (direct aiohttp client) and OpenAI-compatible Pydantic DTOs under the v2 client package.
- Add unit tests for the direct chat client and the cache load/save semantics.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit/client/v2/test_deployment_chat_client.py | Unit tests for direct vLLM chat posting, auth error handling, serialization, and session ownership. |
| tests/unit/client/cli/test_deployment_chat_cache.py | Unit tests for cache schema/version guard, permissions, atomic write, masking, and tolerant loading. |
| src/ai/backend/client/v2/domains_v2/deployment_chat.py | New direct-to-inference chat client (aiohttp) with OpenAI-compatible request/response handling. |
| src/ai/backend/client/v2/chat_dto.py | New Pydantic DTOs for /v1/chat/completions request/response payloads with forward-compatible extra fields. |
| src/ai/backend/client/cli/v2/deployment_chat_cache.py | New cache implementation for endpoint URL + vLLM API key persistence with 0600 permissions and atomic writes. |
| src/ai/backend/client/cli/v2/deployment/chat_config.py | New chat-config CLI group to set/show/clear cache entries. |
| src/ai/backend/client/cli/v2/deployment/chat.py | New chat CLI command to send one-shot chat completions and invalidate cached key on 401/403. |
| `src/ai/backend/client/cli/v2/deployment/__init__.py` | Registers the new chat and chat-config commands under deployment. |
| changes/5528.feature.md | Changelog entry for the new CLI commands and cache behavior. |
jopemachine
commented
Apr 27, 2026
jopemachine added a commit that referenced this pull request on Apr 27, 2026
Address review comments from #11344:
- Drop chat_dto.py and switch the SDK to plain dict[str, Any] for both request and response, so it doesn't try to track every runtime variant's extension fields (vllm reasoning_content, tool_calls, etc.)
- Rename DeploymentChatClient -> InferenceChatClient and decouple it from the vllm runtime variant: works against any OpenAI-compatible endpoint (vllm, tgi, sglang, nim) and exposes a configurable path plus a list_models helper
- Rename the cached api key field vllm_api_key -> api_key throughout the cache schema, CLI options, show output, and tests
- chat-config set: --token is now optional and pairs with a new --no-token flag for deployments started without --api-key. The served model name is auto-discovered via GET /v1/models (option B from the discussion) so users no longer have to know it
- chat: replace the local _abort helper with click.ClickException, validate --max-tokens via click.IntRange(min=1) and the sampling knobs via click.FloatRange, and add --top-p, --frequency-penalty, --presence-penalty, --seed, --stop options
- inference_chat client: add ClientTimeout (sock_connect/sock_read) to the owned aiohttp session and normalize trailing slashes when building the chat / models URL
- cache loader: tolerate corrupted JSON (OSError/JSONDecodeError) and skip individual malformed entries instead of aborting the whole load
- tests: drop redundant atomic-write/permission-reset cases, add loader resilience cases, and shorten the changelog entry
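The tolerant cache loader mentioned in this commit could look roughly like the sketch below. It is a stdlib-only illustration, not the PR's code: the names `parse_entries` and `load_entries`, and the rule that a usable entry needs a string `endpoint_url`, are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any, Dict


def parse_entries(text: str) -> Dict[str, Dict[str, Any]]:
    """Return the well-formed entries found in a cache document (hypothetical helper)."""
    try:
        raw = json.loads(text)
    except json.JSONDecodeError:
        # Corrupted file: behave like an empty cache instead of aborting.
        return {}
    deployments = raw.get("deployments") if isinstance(raw, dict) else None
    if not isinstance(deployments, dict):
        return {}
    entries: Dict[str, Dict[str, Any]] = {}
    for deployment_id, entry in deployments.items():
        # Skip malformed records one by one; the rest of the cache stays usable.
        if isinstance(entry, dict) and isinstance(entry.get("endpoint_url"), str):
            entries[deployment_id] = entry
    return entries


def load_entries(path: Path) -> Dict[str, Dict[str, Any]]:
    """Read the cache file, treating a missing or unreadable file as empty."""
    try:
        text = path.read_text(encoding="utf-8")
    except OSError:
        return {}
    return parse_entries(text)
```

Returning an empty mapping on any failure means the caller simply sees a cache miss and refetches, which is the cheapest recovery for auto-managed state.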
jopemachine added a commit that referenced this pull request on Apr 28, 2026
Address review comments on PR #11344:
- chat.py:
  - Drop the auto-clear of the cached API key on inference 401/403: it was deleting user-supplied config out from under them. Just raise the error and ask the user to re-register.
  - Use print() instead of sys.stdout.write() for the response payload.
- chat_config.py:
  - Remove --no-token; clearing is the dedicated chat-config clear command's job. Resolved-key handling collapses to a single expression.
  - Use print() instead of click.echo() for status output.
  - Parse the inference endpoint's /v1/models response with a typed Pydantic model (_ServedModelsResponse) instead of manual dict.get walking.
  - _print_entry now delegates the entry portion to DeploymentChatCacheEntry.format_summary() so the per-entry fields are owned by the cache type.
- deployment_chat_cache.py / deployment_chat_config.py:
  - Drop schema_version as a Pydantic field on the wrapper model. The version is metadata, not data: emit it manually around model_dump in save_*, and check it manually in load_* before validating individual records.
  - DeploymentChatCacheEntry gains a format_summary() method returning the endpoint/default_model/last_synced_at lines so consumers don't duplicate that formatting.
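To illustrate the "version is metadata, not data" point, here is a minimal stdlib sketch of wrapping the version around the dumped records and checking it before looking at any record. `SCHEMA_VERSION`, `dump_with_version`, and `load_with_version` are hypothetical names; the PR itself does this around Pydantic's `model_dump`, which is not reproduced here.

```python
import json
from typing import Any, Dict

SCHEMA_VERSION = 1  # assumed value, for illustration only


def dump_with_version(deployments: Dict[str, Any]) -> str:
    # The version is written next to the data rather than stored as a model field.
    return json.dumps({"schema_version": SCHEMA_VERSION, "deployments": deployments}, indent=2)


def load_with_version(text: str) -> Dict[str, Any]:
    raw = json.loads(text)
    # Check the version first; records are only inspected when it matches.
    if not isinstance(raw, dict) or raw.get("schema_version") != SCHEMA_VERSION:
        return {}
    deployments = raw.get("deployments", {})
    return deployments if isinstance(deployments, dict) else {}
```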
jopemachine added a commit that referenced this pull request on Apr 28, 2026
…Args type
Address review comments on PR #11344:
- Drop _owns_session and the optional session= kwarg on DeploymentChatClient. Match BackendAIAuthClient: __init__ takes a pre-built session, factory method create() builds one, close() always closes. Removes the dual-ownership branch.
- Introduce DeploymentChatClientArgs (frozen dataclass) for connection knobs (skip_ssl_verification, connect_timeout, read_timeout). Callers use DeploymentChatClient.create(args) instead of passing multiple kwargs to the constructor.
- Rename chat_completion's 'request' parameter to 'body'.
- Tests: rename the cache-entry helper to _make_entry, the chat-body helper to _make_body. Drop TestExternalSession since the new contract is 'whatever you pass to __init__ gets closed'.
The cache file holds endpoint URL, model name and a sync timestamp — no secrets. The 0600 chmod was copy-pasted from the config file path where it actually matters (plaintext API keys). Default umask applies to the cache; only save_chat_config keeps the chmod. Module/function docstrings updated and the corresponding cache permission test goes away.
…oning
Token registration is purely user-side state, so it should not block on the deployment's runtime status. Previously set_ went through _resolve_endpoint_entry which raises 'no endpoint_url yet' when the deployment is in DEPLOYING/PROVISIONING, dropping the user's token along with the cache write.
Restructure set_:
1. Always fetch the deployment record (so a typo in deployment_id still surfaces a 404).
2. Save the token unconditionally when --token is provided.
3. Write the cache entry only when endpoint_url is already populated; otherwise warn that --default-model will be picked up on the first chat call once the deployment is READY.
The chat command's _resolve_endpoint_entry is unchanged: chat still requires a usable endpoint to talk to.
- DeploymentChatCache/Config gain `save()` instance methods (paired with the existing `load()` classmethods); free functions in utils.py removed.
- `_write_text_file` writes via tmp+rename and creates the file with the target permission directly, closing the brief world-readable window that `write_text() + chmod(0600)` left open on the config file.
- `is_fresh()` flipped to `is_expired()` to align with the cache miss call site.
- `_resolve_endpoint_entry` had a single caller and an unused `default_model_override` parameter; inlined into `chat`.
- Renamed local `connection` to `connection_config` to match `V2ConnectionConfig`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
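The tmp+rename write with the permission applied at creation time can be sketched with the standard library as follows. This is an illustration of the technique only: `write_text_atomic` and the `.tmp` suffix are assumed names, POSIX permission semantics are assumed, and a later commit in this PR drops the helper again in favor of a plain write.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_text_atomic(path: Path, text: str, mode: int = 0o600) -> None:
    """Write text via a sibling temp file that is created with `mode` from the start."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.unlink(missing_ok=True)  # a stale temp file would keep its old permissions
    # O_EXCL guarantees a fresh file, so `mode` (minus the umask) applies immediately
    # and there is no window in which the content is world-readable.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        tmp_path.unlink(missing_ok=True)
        raise


# Usage: write into a throwaway directory and inspect the result.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "deployment_chat" / "config.json"
    write_text_atomic(target, '{"deployments": {}}')
    written_text = target.read_text(encoding="utf-8")
    written_mode = stat.S_IMODE(target.stat().st_mode)
```

Compared with `write_text()` followed by `chmod(0600)`, the file never exists with broader permissions, and readers never observe a half-written file.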
…ish naming
- 401/403 from the inference endpoint now clears the stored API key for that deployment so the user is not silently retried with a known-bad token. The error message tells the user the cache was cleared.
- Replace the ad-hoc ``dict[str, Any]`` chat body with ``ChatCompletionRequest`` (pydantic, ``extra="allow"``) so runtime-variant-specific knobs supplied via ``--params`` still flow through while the model/messages shape is enforced.
- Rename ``chat_config_store`` → ``chat_config`` in the ``chat`` command and ``config`` inside the ``chat-config`` subcommands to match the reviewer's preferred naming and avoid shadowing the click group.
- Clarify ``_ensure_dict`` wording: payloads that are valid JSON but not an object now report ``non-object payload (type=...)`` instead of the misleading ``non-JSON response``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…der, named timeout consts
- Rename ``api_key`` to ``token`` across CLI flag binding, local variables, client method signatures, error messages, and the ``chat-config show`` summary label so the user-facing vocabulary matches the storage method names (``get_token``/``set_token``).
- Replace the length-leaking ``sk-***...***xxxx``-style mask with a fixed ``********`` placeholder that never reveals the token's prefix, suffix, or length.
- Pull ``DeploymentChatClientArgs`` magic numbers into named module constants (``DEFAULT_CONNECT_TIMEOUT_SEC``, ``DEFAULT_READ_TIMEOUT_SEC``).
- Update the affected test names and assertions accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
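The fixed-placeholder mask is small enough to show in full. The sketch below assumes a `mask_token` function that takes an optional string; the `(not set)` label for a missing token is a hypothetical choice, not taken from the PR.

```python
from typing import Optional


def mask_token(token: Optional[str]) -> str:
    """Render a token for display without leaking its prefix, suffix, or length."""
    if not token:
        return "(not set)"  # hypothetical label for a missing token
    # Same output for every non-empty token, whatever its content or size.
    return "********"
```

Because every non-empty token maps to the same string, the output of `chat-config show` carries no information that could narrow down the credential.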
…edential convention
- Drop the bespoke tmp-and-rename / 0600-permission helper used for ``deployment_chat_config.json``. The existing CLI credential store (``client/cli/v2/config_cmd.py``) writes plain TOML without atomic semantics or explicit permissions; the chat config now matches that convention rather than introducing a stricter parallel one.
- Introduce ``write_json_file`` in ``utils.py`` so the cache and config models share a single, plain ``mkdir`` + ``write_text`` helper.
- Drop the ``test_config_save_enforces_0600`` test along with the no-longer-needed ``os``/``stat`` imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chat tests with aioresponses
- Collapse ``_read_payload``/``_ensure_dict`` into a single read-and-parse
block inside ``_request``: parse ``resp.text()`` as JSON in one step,
surface ``BackendAPIError`` (with the raw body in ``detail``) when the
status is already a 4xx/5xx, and only raise ``BackendClientError``
when a 2xx body is unparsable. The clarified comment now names
Backend.AI's app-proxy as the layer that produces non-JSON 5xx pages.
- Remove ``--path`` from ``./bai deployment chat``. The CLI body is
fixed to OpenAI-shaped ``{model, messages}`` via
``ChatCompletionRequest``, so a custom path never paired with a
matching custom body — keeping the option encouraged the
misconception that arbitrary inference contracts could be driven
through this command. The SDK still accepts ``path`` as a kwarg for
programmatic callers.
- Migrate ``test_deployment_chat_client.py`` from a real ``aiohttp.web``
test server to ``aioresponses``-based mocks, matching the existing
client-test convention (see ``tests/unit/client/test_resource_usage.py``).
Headers and JSON body are asserted via ``m.requests``. New
coverage: HTML 5xx now produces a ``BackendAPIError`` whose
``detail`` carries the upstream body verbatim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
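The status-to-exception mapping described in the first bullet of this commit can be sketched as a pure function over the status code and body text. `SketchAPIError` and `SketchClientError` are stand-ins defined here for illustration; the real `BackendAPIError` and `BackendClientError` constructors are not reproduced, and `parse_response` is a hypothetical name.

```python
import json
from typing import Any, Dict


class SketchAPIError(Exception):
    """Stand-in for BackendAPIError: the server answered with an error status."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status
        self.detail = detail


class SketchClientError(Exception):
    """Stand-in for BackendClientError: a success response could not be used."""


def parse_response(status: int, text: str) -> Dict[str, Any]:
    if status >= 400:
        # Error statuses keep the raw body, so an HTML 5xx page from a proxy
        # is surfaced verbatim instead of being reported as a parse failure.
        raise SketchAPIError(status, detail=text)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        raise SketchClientError(f"unparsable body for status {status}") from None
    if not isinstance(payload, dict):
        raise SketchClientError(f"non-object payload (type={type(payload).__name__})")
    return payload
```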
…ndAIAppProxyClient
- Add :class:`BackendAIAppProxyClient` to ``client/v2/base_client.py``: a ``ClientConfig``-driven base for SDK-side, direct-to-deployment HTTP traffic. It owns the aiohttp session, ``_request`` (with Bearer-token auth, app-proxy-aware JSON parsing, status-to-exception mapping), URL normalization, and the lifecycle hooks. The name is deliberately distinct from ``manager/clients/appproxy/client.py``'s ``AppProxyClient`` (control plane: coordinator admin API with ``X-BackendAI-Token``); this base sits in the SDK and handles the data plane (per-deployment Bearer-token traffic).
- Trim ``DeploymentChatClient`` to a single OpenAI Chat Completions method on top of the new base. Drop the ABC layer / separate ``OpenAICompatibleChatClient`` / ``DeploymentChatClientArgs`` / per-module timeout constants; those duties now live on ``BackendAIAppProxyClient`` and ``ClientConfig``. The path constant is renamed ``_OPENAI_COMPATIBLE_CHAT_PATH`` to make the contract explicit at the call site.
- Rename ``DeploymentChatAuthError`` → ``DeploymentAuthError`` since the 401/403 mapping now lives on the AppProxy base and is no longer chat-specific.
- Update the CLI to build a ``ClientConfig`` from ``V2ConnectionConfig`` and instantiate ``DeploymentChatClient`` directly. Tests follow the same construction path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
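Two of the duties assigned to the base client, URL normalization and optional Bearer-token auth, can be sketched without aiohttp. `build_url` and `auth_headers` are hypothetical helper names used only for this example; how the real base class structures them is not shown in this thread.

```python
from typing import Dict, Optional


def build_url(endpoint_url: str, path: str) -> str:
    """Join endpoint and path with exactly one slash between them."""
    return endpoint_url.rstrip("/") + "/" + path.lstrip("/")


def auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Send a Bearer token only when one is registered for the deployment."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Plain string joining is used here instead of `urllib.parse.urljoin` because `urljoin` with an absolute path such as `/v1/models` would discard any path prefix the endpoint URL already carries.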
…polish help text
- ``DeploymentChatCache.remove`` → ``pop`` and ``DeploymentChatConfig.clear_token`` → ``pop_token`` so the names match the underlying ``dict.pop`` semantics (return value indicates whether something was actually removed).
- Inline the ``TOKEN_PLACEHOLDER`` constant into ``mask_token``, since the literal only has one call site.
- Reword ``./bai deployment chat-config set --token`` help text: "Omit when the deployment is open to public" instead of the previous runtime-startup phrasing.
- Update tests for the renames.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…at/ subdirectory
Match the existing per-feature subdirectory layout used by ``./bai login`` (``~/.backend.ai/session/cookie.dat`` + ``session/config.json``):
- ``~/.backend.ai/deployment_chat.json`` → ``~/.backend.ai/deployment_chat/cache.json``
- ``~/.backend.ai/deployment_chat_config.json`` → ``~/.backend.ai/deployment_chat/config.json``
Drops the ``deployment_chat_`` filename prefix duplication and lets future chat-related files (logs, sessions, etc.) land naturally under the same directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… omitted Previously ``./bai deployment chat`` errored out when neither ``--model`` nor a cached ``default_model`` was provided. Now the CLI calls ``GET /v1/models`` on the deployment's inference endpoint, picks the first ``id`` (matches webui ChatCard.tsx fallback), and caches it as the deployment's ``default_model`` so subsequent ``chat`` calls reuse it. Add ``DeploymentChatClient.list_models()`` returning a typed ``ListModelsResponse`` so the CLI consumes ``models_response.data[0].id`` instead of dict-drilling. Hoist the ``DeploymentAuthError`` handler to the whole ``async with`` block (auth handling is identical for both ``/v1/models`` and ``/v1/chat/completions``) and drop the per-call ``BackendAPIError`` handlers — ``_run_async`` already formats them.
…under entry
Introduce ``DeploymentChatConfigEntry { token, model }`` so per-deployment
user state lives in one nested record (mirrors ``DeploymentChatCacheEntry``)
instead of two parallel ``tokens`` / ``models`` dicts.
Resolution order in ``chat`` becomes: ``--model`` flag > ``config.model``
(user-set, ``config.json``) > ``cache.default_model`` (auto, ``cache.json``)
> ``GET /v1/models[0].id`` (auto-fetched and cached). Both fields can
co-exist; the user-set value always wins, matching the user's
"config is user-managed, cache is automatic" mental model.
CLI surface changes:
- Rename ``chat-config set --default-model`` to ``--model``; the flag now
writes to ``config.json`` (user store) instead of ``cache.json`` (auto
store), so the new name matches the field it sets.
- Drop the manager fetch from ``chat-config set`` — both token and model
go to ``config.json`` only, so the command stays usable while the
deployment is still provisioning or the manager is unreachable.
- Rename ``chat-config clear-config`` to ``chat-config clear``; clears the
whole user config entry (token + model) for that deployment.
- Keep ``chat-config clear-cache`` for invalidating the auto-managed cache
entry (``endpoint_url``, ``default_model``, ``last_synced_at``) on demand
rather than waiting for the 24h TTL.
- ``chat-config show`` now prints both the user-set ``model`` and the
auto-cached ``default_model`` so the resolved value is clear at a glance.
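The resolution order stated in this commit reduces to a first-hit walk with a network fallback. The sketch below is an assumption-laden illustration: `resolve_model` and its parameters are hypothetical, and the `GET /v1/models` call is represented by a callback so the example stays free of HTTP code.

```python
from typing import Callable, Optional, Tuple


def resolve_model(
    flag_model: Optional[str],
    config_model: Optional[str],
    cache_default_model: Optional[str],
    fetch_first_model_id: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (model, fetched); `fetched` tells the caller to cache the value."""
    # --model flag > config.model > cache.default_model
    for candidate in (flag_model, config_model, cache_default_model):
        if candidate:
            return candidate, False
    # Nothing stored anywhere: ask the endpoint and let the caller cache the result.
    return fetch_first_model_id(), True
```

Passing the fetch as a callback keeps the network round trip lazy: it only runs when the three stored sources are all empty.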
…es to one line
Replace the manual ``self.deployments.get(id) or DeploymentChatConfigEntry()``
+ ``self.deployments[id] = entry`` dance with a ``defaultdict``-backed store
so ``set_token`` / ``set_model`` reduce to a single bracket assignment.
Pydantic v2 cannot infer a default factory for a ``defaultdict`` whose value
is a ``BaseModel`` subclass, so the field annotation uses
``Annotated[..., Field(default_factory=...)]`` per the explicit form
``PydanticSchemaGenerationError`` directs callers to. Without it, importing
the module raises at class-construction time:
Unable to infer a default factory for keys of type
DeploymentChatConfigEntry. Only set, bool, str, tuple, dict, int,
frozenset, float, list are supported, other types require an explicit
default factory set using DefaultDict[..., Annotated[..., Field(
default_factory=...)]]
Read paths (``get`` / ``get_token`` / ``get_model`` / ``pop_*``) still go
through ``dict.get`` / ``dict.pop`` so a missing-key lookup never plants a
stale empty entry.
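The read/write asymmetry described here is a property of `collections.defaultdict` itself and can be shown without Pydantic. In this sketch `ConfigEntry` and `ChatConfig` are simplified stand-ins for the PR's models, written as a plain dataclass and class.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigEntry:
    token: Optional[str] = None
    model: Optional[str] = None


class ChatConfig:
    def __init__(self) -> None:
        # Missing keys are created on bracket access using the entry's defaults.
        self.deployments = defaultdict(ConfigEntry)

    def set_token(self, deployment_id: str, token: str) -> None:
        # Write path: bracket access creates the entry on demand.
        self.deployments[deployment_id].token = token

    def get_token(self, deployment_id: str) -> Optional[str]:
        # Read path: .get() bypasses the default factory, so nothing is planted.
        entry = self.deployments.get(deployment_id)
        return entry.token if entry is not None else None
```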
…onfig The block was restating things the code already says (method names already imply read vs write paths) and explaining pydantic boilerplate that the import-time error message itself points at, so it was net noise.
Relocate the wire-format and persistence Pydantic models added in this
PR into the shared `common/` tree so any backend.ai component can
consume them, not just the CLI:
- OpenAI-compat wire DTOs (`ChatCompletionMessage`,
`ChatCompletionRequest`, `ListModelsResponse`, `ModelEntry`) →
`common/dto/clients/openai_compat/{request,response}.py`,
paralleling the existing `common/dto/clients/prometheus/` layout for
third-party HTTP service contracts.
- Chat persistence data types (`DeploymentChatCache(Entry)`,
`DeploymentChatConfig(Entry)`, `CACHE_ENTRY_TTL`) →
`common/data/deployment_chat/types.py` as pure Pydantic models with
no I/O coupling.
Disk load/save lives in a new
`client/cli/v2/deployment/chat/storage.py` (`load_chat_cache`,
`save_chat_cache`, `load_chat_config`, `save_chat_config`) so the data
types stay free of `client.cli` imports — `common/` MUST NOT depend on
component-specific packages per `common/dto/CLAUDE.md`. The previous
`DeploymentChatCache.load`/`.save` classmethods that pulled in
`client.cli.v2.deployment.chat.utils` are removed in favor of these
free functions, eliminating the backward dependency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e in
Reverse the relocation done in 69fd070:
- `DeploymentChatCache(Entry)` and `DeploymentChatConfig(Entry)` (and `CACHE_ENTRY_TTL`) move from `common/data/deployment_chat/` back to `client/cli/v2/deployment/chat/types.py`.
- `load_chat_*` / `save_chat_*` free functions go away; the corresponding `.load()` / `.save()` classmethods are restored on `DeploymentChatCache` / `DeploymentChatConfig`.
- `client/cli/v2/deployment/chat/storage.py` is removed, since the typed models own their own disk format directly.
`common/dto/clients/openai_compat/{request,response}.py` (the OpenAI-compat wire DTOs) are left in place, since those are reused by the SDK and may grow more component consumers.
Address review feedback on PR #11344 — the OpenAI-compat chat endpoint treats each turn as a "message" with role/content, so the user-facing CLI argument is more naturally named `message`. Update the click argument declaration, the function parameter, the help text, and the request-body construction site. The JSON key on the wire stays `content` (that's the OpenAI spec); only the local variable / argument name changes.
…onfig `chat-config show` was printing both the cache (auto-managed `endpoint_url` / `default_model` / `last_synced_at`) and the user's config (`token` / `model`) in one block, which blurred the responsibility split between the two files. Trim the command to print only the config entry it owns. Drop ``DeploymentChatFormatter.print_summary``/``entry_lines`` (the only consumers of the cache half) in favor of a dedicated ``print_config(deployment_id, entry)``. Update the formatter test to match.
…mand group
The auto-managed cache and the user-managed config are two separate files (``cache.json`` vs ``config.json``); having a `clear-cache` subcommand under `chat-config` mixed the two responsibilities. Replace ``./bai deployment chat-config clear-cache`` with a dedicated ``chat-cache`` group:
- ``./bai deployment chat-cache show <id>``: print the cached endpoint metadata (``endpoint_url``, ``default_model``, ``last_synced_at``) for inspection / debugging.
- ``./bai deployment chat-cache clear <id>``: drop the cache entry, forcing the next ``chat`` call to refetch endpoint and re-derive the default model.
``DeploymentChatFormatter`` gains ``print_cache(deployment_id, entry)`` to render the cache view; the `chat-config clear` docstring is updated to reference the new path.
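The cache entry that `chat-cache show` prints carries `last_synced_at`, and the PR description states that entries older than the 24-hour `CACHE_ENTRY_TTL` count as a cache miss. A minimal sketch of that check follows; the function signature and the strict `>` comparison at the boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CACHE_ENTRY_TTL = timedelta(hours=24)


def is_expired(last_synced_at: datetime, now: Optional[datetime] = None) -> bool:
    """True when the entry is older than the TTL and should be refetched."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_synced_at > CACHE_ENTRY_TTL


# Usage with the timestamp format shown in cache.json.
synced = datetime.fromisoformat("2026-04-29T12:34:56.789012+00:00")
still_fresh = is_expired(synced, synced + timedelta(hours=23))
gone_stale = is_expired(synced, synced + timedelta(hours=25))
```

Accepting `now` as a parameter keeps the check deterministic in tests while defaulting to the current UTC time in normal use.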
…args Address review feedback (PR #11412 discussion r3165318334) — the deployment id values flowing through the chat data classes, the formatter, and the click handlers represent a deployment, not a generic UUID. Switch the static signatures to ``ai.backend.common.identifier.deployment.DeploymentID`` (a ``NewType(UUID)``) so type checkers can distinguish deployment ids from other UUIDs without any runtime cost. The click ``type=click.UUID`` parser still emits a plain ``UUID`` at runtime; ``DeploymentID`` is structurally identical, so the wrap is implicit and no conversion is needed at the boundary.
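The zero-runtime-cost claim follows from how `typing.NewType` works: calling the new type returns its argument unchanged. In the sketch below `DeploymentID` is redefined locally as a stand-in for `ai.backend.common.identifier.deployment.DeploymentID`, and `cache_key` is a hypothetical consumer.

```python
from typing import NewType
from uuid import UUID

# Local stand-in for the real DeploymentID, which the commit describes as NewType(UUID).
DeploymentID = NewType("DeploymentID", UUID)


def cache_key(deployment_id: DeploymentID) -> str:
    """Hypothetical consumer that only accepts deployment ids statically."""
    return str(deployment_id)


raw = UUID("d55e251a-3a70-408d-97a9-ca305502aa58")
dep_id = DeploymentID(raw)  # identity at runtime; only the static type changes
key = cache_key(dep_id)
```

A type checker would flag `cache_key(raw)`, while at runtime `dep_id` is the very same `UUID` object.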
…history None vs empty
- Extract `_OpenAICompatModel` base class so all OpenAI-compat response DTOs share a single `ConfigDict(extra="allow")` declaration instead of repeating it on each subclass.
- In `history_show`, distinguish "no history record" (`messages is None`) from the invariant-violating "record exists but empty list" case so the CLI message reflects the actual state instead of conflating both as falsy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Python's `pop` convention (`dict.pop`, `list.pop`) implies the popped value is returned, but these methods return a plain `bool` because every caller only needs "did anything actually get removed?" Renaming so the method names match what the calls do:
- `DeploymentChatCache.pop` → `delete` (removes the entry)
- `DeploymentChatConfig.pop` → `delete` (removes the entry)
- `DeploymentChatConfig.pop_token` → `clear_token` (nulls the field, drops the entry only when both fields are unset)
- `DeploymentChatConfig.pop_model` → `clear_model` (same shape as `clear_token`)
`pop_token`/`pop_model` were already misnomers: they null one field rather than fully popping the entry, so `clear_*` reflects the actual behavior. Return types stay `bool` since no caller uses the popped value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
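The difference between `delete` and `clear_token` as described in this commit can be sketched on a plain dict-backed store. The classes below are simplified stand-ins rather than the PR's Pydantic models, and returning `False` when the token was already unset is an assumption about the "did anything get removed?" contract.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ConfigEntry:
    token: Optional[str] = None
    model: Optional[str] = None


class ChatConfig:
    def __init__(self) -> None:
        self.deployments: Dict[str, ConfigEntry] = {}

    def delete(self, deployment_id: str) -> bool:
        """Remove the whole entry; report whether anything was removed."""
        return self.deployments.pop(deployment_id, None) is not None

    def clear_token(self, deployment_id: str) -> bool:
        """Null the token; drop the entry only when the model is unset too."""
        entry = self.deployments.get(deployment_id)
        if entry is None or entry.token is None:
            return False
        entry.token = None
        if entry.model is None:
            del self.deployments[deployment_id]
        return True
```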
…lient._request` Address PR #11344 review: split JSON parsing and payload validation into a dedicated method so `_request` only orchestrates the HTTP call and status handling.
`PrometheusQueryPresetRepository.preview_template` was rewired to call `PrometheusClient.preview_query_template` in #11274, but the component test added in #11482 still mocked the now-unused `query_instant`. The real client method falls through and returns an `AsyncMock`, so the PrometheusResponse model fails validation and the API returns 500. Mock the method actually called so the preview-endpoint tests cover the success path and the FailedToGetMetric → PrometheusQueryEvaluationFailed mapping again.
f269ef6 to
3e59aaf
Compare
fregataa
approved these changes
May 6, 2026
📚 Stacked PRs
This PR is part of a 2-PR stack. Merge in order:
1. feat(BA-5528): add deployment chat CLI for vLLM-backed model services ← you are here
2. feat(BA-5903): persist deployment chat history and replay as request context

Summary
- Adds `./bai deployment chat <id> "<content>"` for one-shot OpenAI-compatible chat against deployed inference services. Requests are sent directly to the deployment's inference endpoint with optional `Authorization: Bearer <token>` (the value the runtime, whether vLLM, SGLang, NIM, TGI, or custom, was started with), bypassing the Backend.AI manager. Use `--params` to forward runtime-variant-specific sampling knobs.
- Adds `./bai deployment chat-config set/show/clear/clear-cache` to register, inspect, and remove per-deployment chat state.
- Auto-resolves `model` when the user did not specify one: the CLI calls `GET /v1/models` on the inference endpoint, picks `data[0].id`, and caches it as `cache.default_model` for subsequent calls (matches the webui `ChatCard.tsx` fallback). The user is no longer required to run `chat-config set --model` before the first `chat`.
- Stores state under `~/.backend.ai/deployment_chat/`, grouped per-feature (matching the existing `~/.backend.ai/session/` layout used by `./bai login`):
  - `cache.json` (auto-managed): `endpoint_url`, `default_model` (auto-fetched from `/v1/models`), `last_synced_at` (24-h TTL).
  - `config.json` (user-managed): per-deployment `{ token, model }` entries. The user's `model` takes precedence over `cache.default_model`.
- Writes files with plain `path.write_text()` to match the existing CLI credential-storage convention (`client/cli/v2/config_cmd.py`). On `401`/`403` from the inference endpoint, the cached token for that deployment is cleared and the user is prompted to re-register.
- Adds a `BackendAIAppProxyClient` base in `client/v2/base_client.py` for direct-to-deployment HTTP traffic (Bearer-token auth, app-proxy-aware JSON parsing) and a thin `DeploymentChatClient` subclass exposing `chat_completion()` and `list_models()` (returning a typed `ListModelsResponse`).

Model resolution order
When the runtime needs a `model` field for a `chat` call, the CLI walks this list and stops at the first hit:
1. `--model <name>` on the `chat` command line.
2. `config.<deployment-id>.model`: the user's pinned model in `config.json`.
3. `cache.<deployment-id>.default_model`: the auto-derived value in `cache.json`.
4. `GET /v1/models` on the inference endpoint, taking `data[0].id`. The result is written to `cache.default_model` so subsequent calls skip the round trip.

This means a fresh deployment works with zero configuration as long as the runtime serves `/v1/models`; you only need `chat-config set --model` for multi-model deployments where `[0]` is not the right pick.

Command usage
- `chat-config set` writes to `config.json` only. It does not contact the manager, so it stays usable while the deployment is still provisioning or the manager is unreachable.
- `chat-config clear` and `clear-cache` operate on the two storage files independently: clearing user config never touches the cache, and vice versa.

On-disk state
State lives under `~/.backend.ai/deployment_chat/` so it stays grouped with the other Backend.AI CLI state directories.

`cache.json`: auto-managed by the CLI; do not hand-edit.

    {
      "deployments": {
        "d55e251a-3a70-408d-97a9-ca305502aa58": {
          "endpoint_url": "https://app-proxy.example.com/v1/some-deployment",
          "default_model": "llama-3-8b-instruct",
          "last_synced_at": "2026-04-29T12:34:56.789012+00:00"
        }
      }
    }

- `endpoint_url`: fetched from the manager's `deployment.network_access.endpoint_url` and refreshed on a 24-hour TTL.
- `default_model`: auto-derived from `GET /v1/models` on first use; never written by `chat-config set`.
- `last_synced_at`: UTC timestamp of the last manager fetch; entries past `CACHE_ENTRY_TTL` (24 h) are treated as a cache miss.

`config.json`: user-managed, one `{ token, model }` entry per deployment.

    {
      "deployments": {
        "d55e251a-3a70-408d-97a9-ca305502aa58": {
          "token": "sk-runtime-token-here",
          "model": "llama-3-8b-instruct"
        }
      }
    }

Either field may be `null`: `chat-config set` upserts only the fields you pass, and an entry is dropped automatically once both fields are cleared. The token is also cleared automatically on `401`/`403` from the inference endpoint so the next `chat` call surfaces the re-register hint instead of silently re-sending a stale credential.

Resolves BA-5528.