-
Notifications
You must be signed in to change notification settings - Fork 315
docs(ai-agents): design doc for context-aware next-step guidance #8057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
antriksh30
wants to merge
5
commits into
main
Choose a base branch
from
antrikshjain/agent-next-step-design
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+835
−0
Open
Changes from 2 commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
2329299
docs(ai-agents): design doc for context-aware next-step guidance
18bc658
docs(ai-agents): add separate design for doctor remote checks (7-11)
4358e68
docs(agent-nextsteps): address PR review feedback
575a450
docs(azure.ai.agents): address therealjohn doctor doc review
9595e06
docs(azure.ai.agents): address trangevi review and tighten spec voice
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
260 changes: 260 additions & 0 deletions
260
cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,260 @@ | ||
| # Design: `azd ai agent doctor` — Remote Checks (7–11) | ||
|
|
||
| ## Status | ||
|
|
||
| **Draft — for design review.** | ||
| Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975). | ||
| Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** | ||
|
|
||
| ## Goal | ||
|
|
||
| Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"* | ||
|
|
||
| Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix. | ||
|
|
||
| Non-goals: | ||
| - Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1. | ||
| - Re-implementing what `az` / Foundry portal already do well — we just point the user there when relevant. | ||
| - Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. | ||
|
|
||
| ## Background | ||
|
|
||
| The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions. | ||
|
|
||
| The remaining ~30% — and arguably the most frustrating cases — sit in remote land: | ||
| - Token expired silently → every command 401s with no obvious "you're logged out" hint. | ||
| - Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace. | ||
| - Model deployment got deleted out from under the project → first invoke fails with a cryptic 404. | ||
| - Azure-AI Developer role missing → vague 403. | ||
| - Agent deployed but unhealthy → user thinks `azd deploy` succeeded so it must be fine. | ||
|
|
||
| These all require live calls and credentials, which is why they're carved out into this separate design. | ||
|
|
||
| ## Scope of this doc | ||
|
|
||
| | Check | What it answers | | ||
| |---|---| | ||
| | 7 — Authentication | Am I logged in? Is the token still valid? | | ||
| | 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? | | ||
| | 9 — Model deployments | Do the model(s) my agent references exist on the project? | | ||
| | 10 — RBAC | Does my principal have the roles needed to deploy / invoke? | | ||
| | 11 — Agent status | Does the deployed agent actually exist and is it active? | | ||
|
therealjohn marked this conversation as resolved.
|
||
|
|
||
| ## Architecture | ||
|
|
||
| ### Component layout (additive only) | ||
|
|
||
| ``` | ||
| cli/azd/extensions/azure.ai.agents/internal/cmd/ | ||
| └── doctor/ | ||
| ├── checks_local.go ← (MVP) checks 1–6 | ||
| ├── checks_remote.go ← NEW: checks 7–11 | ||
| ├── runner.go ← orchestrates ordered execution | ||
| └── types.go ← Check, Result, Severity | ||
| ``` | ||
|
|
||
| Existing MVP files unchanged. Only `runner.go` learns to dispatch the new check set, gated by a flag (see [Execution Model](#execution-model)). | ||
|
|
||
| ### `Check` contract | ||
|
|
||
| Same shape as MVP — keeps the renderer, sort order, and `Next:` resolver code path identical. | ||
|
|
||
| ```go | ||
| type Check interface { | ||
| ID() string // stable, e.g. "auth.token" | ||
| Title() string // human label | ||
| Run(ctx context.Context) Result | ||
| } | ||
|
|
||
| type Result struct { | ||
| Status Status // Pass | Warn | Fail | Skip | ||
| Detail string // one-line description | ||
| Fix string // single command or instruction; empty for Pass | ||
| Links []string // doc links (optional) | ||
| Duration time.Duration // captured by runner, not the check | ||
| } | ||
| ``` | ||
|
|
||
| A remote check that can't run (e.g., not authenticated when running check 9) returns `Status: Skip` with `Detail: "skipped: authentication required (see check 7)"`. Skips never block subsequent checks — the user sees the full picture in one pass. | ||
|
|
||
| ## Execution Model | ||
|
|
||
| ### Default: local-only | ||
|
|
||
| `azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode. | ||
|
therealjohn marked this conversation as resolved.
Outdated
|
||
|
|
||
| ### Opt-in: `--remote` | ||
|
therealjohn marked this conversation as resolved.
Outdated
|
||
|
|
||
| `azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)). | ||
|
|
||
| Rationale for opt-in (vs. always-on or auto-detect): | ||
| - Remote checks need credentials and add seconds to a command users may run frequently. | ||
| - Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI. | ||
| - A flag is discoverable, reversible, and self-documenting. | ||
|
|
||
| We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about. | ||
|
|
||
| ### Dependency matrix | ||
|
|
||
| | Check | Depends on | Action if dependency fails | | ||
| |---|---|---| | ||
| | 7 — Auth | check 3 (env selected) | Skip with "select an env first" | | ||
| | 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip | | ||
| | 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip | | ||
| | 10 — RBAC | 7 (auth), 8 (reachability) | Skip | | ||
| | 11 — Agent status | 8, 9, 10 | Skip | | ||
|
|
||
| Skips are explicit in the rendered report. We never quietly drop a check. | ||
|
|
||
| ## Check Specifications | ||
|
|
||
| ### Check 7 — Authentication | ||
|
|
||
| **What it does:** acquires a token via `azidentity.NewAzureDeveloperCLICredential` (the same credential `agent_context.go` already uses), then introspects expiry. | ||
|
|
||
| **Pass:** "<UPN> · token valid for <N> minutes" | ||
| **Warn:** token expires in <5 min (suggest re-login proactively) | ||
| **Fail:** token acquisition error → `azd auth login` | ||
|
|
||
| **Why a separate fail vs warn:** users hit this all the time when a CLI session outlives a token; the warn variant tells them to re-login *before* a long-running deploy 401s. | ||
|
antriksh30 marked this conversation as resolved.
|
||
|
|
||
| ### Check 8 — Foundry project reachability | ||
|
|
||
| **What it does:** issues a single `GET <AZURE_AI_PROJECT_ENDPOINT>/agents?api-version=…&$top=1` with the credential from check 7. | ||
|
|
||
| **Pass:** "endpoint reachable (HTTP 200)" | ||
| **Fail:** maps the HTTP status to one of: | ||
| - `403` → check 10 (RBAC) will explain. Fix: see check 10 output. | ||
| - `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var. | ||
| - network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`". | ||
|
|
||
| The single-shot probe avoids paging and works on tiny test projects. Timeout: 10s, no retries — doctor is diagnostic, not resilient. | ||
|
|
||
| ### Check 9 — Model deployments | ||
|
|
||
| **What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally. | ||
|
therealjohn marked this conversation as resolved.
|
||
|
|
||
| **Pass:** "all <N> referenced models present" | ||
| **Fail:** "missing: <model-name> (referenced by <service>)" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link). | ||
|
|
||
| **Edge cases:** | ||
| - Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here. | ||
| - Agents with no model reference (orchestration-only) — pass with "no model references". | ||
|
|
||
| ### Check 10 — RBAC | ||
|
|
||
| **What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set: | ||
| - `Azure AI Developer` (or stronger) on the project — required for deploy + invoke. | ||
| - `Cognitive Services User` on the AI account — required for model usage. | ||
|
|
||
| **Pass:** "<role-1>, <role-2>" | ||
| **Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role. | ||
| **Fail:** missing both → command + portal link. | ||
|
|
||
| The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example: | ||
| ``` | ||
| fix: az role assignment create \ | ||
| --role "Azure AI Developer" \ | ||
| --assignee <principal-id> \ | ||
| --scope <project-resource-id> | ||
| ``` | ||
|
|
||
| **Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited. | ||
|
|
||
| ### Check 11 — Agent status | ||
|
therealjohn marked this conversation as resolved.
|
||
|
|
||
| **What it does:** for each service in `azure.yaml` with `host: ai-agent`, calls the Foundry agents-list endpoint and finds the matching agent by name. | ||
|
|
||
| **Pass:** "<agent-name>: active (v<N>)" | ||
| **Warn:** "<agent-name>: deploying (v<N>)" — running `monitor --follow` is suggested. | ||
| **Fail:** | ||
| - agent not found → `azd deploy` | ||
| - agent in failed state → `azd ai agent monitor --follow` (look at logs, not redeploy blindly) | ||
|
|
||
| This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly. | ||
|
|
||
| ## Output Format | ||
|
|
||
| Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible: | ||
|
|
||
| ``` | ||
| azd ai agent doctor --remote | ||
| ✓ PASS azd reachable | ||
| ✓ PASS azure.yaml valid | ||
| ✓ PASS azd environment selected | ||
| ✓ PASS agent service in azure.yaml | ||
| ✓ PASS AZURE_AI_PROJECT_ENDPOINT set | ||
| ✓ PASS agent.yaml valid | ||
| ✓ PASS authentication | ||
| alice@contoso.com · token valid for 47 minutes | ||
| ✗ FAIL Foundry project reachability | ||
| GET /agents → 403 | ||
| fix: see RBAC check below | ||
| - SKIP model deployments | ||
| skipped: reachability check failed | ||
| ✗ FAIL RBAC | ||
| principal alice@contoso.com has no role on project scope | ||
| fix: az role assignment create --role "Azure AI Developer" \ | ||
| --assignee 0000... --scope /subscriptions/.../projects/... | ||
| - SKIP agent status | ||
| skipped: reachability check failed | ||
|
|
||
| Next: az role assignment create ... -- grant yourself the required role | ||
| azd ai agent doctor --remote -- re-run to verify | ||
| ``` | ||
|
|
||
| The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands. | ||
|
|
||
| ## Performance & Side Effects | ||
|
|
||
| Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use. | ||
|
|
||
| | Check | Network calls | Worst-case time | | ||
| |---|---|---| | ||
| | 7 — Auth | 0 (cached token) – 1 (refresh) | 2s | | ||
| | 8 — Reach | 1 GET | 10s (timeout) | | ||
| | 9 — Models | 1 list | 5s | | ||
| | 10 — RBAC | 1 list-role-assignments | 5s | | ||
| | 11 — Status | 1 list-agents | 5s | | ||
|
|
||
| Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode. | ||
|
|
||
| ## Testing Strategy | ||
|
|
||
| - **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8). | ||
| - **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test. | ||
| - **Functional test** (recorded cassette via `mage record`) covering one happy-path full run. | ||
| - No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step. | ||
|
|
||
| ## Risks & Mitigations | ||
|
|
||
| | Risk | Mitigation | | ||
| |---|---| | ||
| | Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. | | ||
|
therealjohn marked this conversation as resolved.
Outdated
|
||
| | RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. | | ||
| | Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. | | ||
| | Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). | | ||
| | User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. | | ||
|
|
||
| ## Performance Budget | ||
|
|
||
| Local-only `doctor` (today): <100ms. Don't regress. | ||
| `--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s. | ||
| We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic. | ||
|
|
||
| ## Implementation Phases | ||
|
|
||
| 1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step. | ||
| 2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own. | ||
| 3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions. | ||
| 4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording. | ||
|
|
||
| Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4. | ||
|
|
||
| ## Appendix: Why not fold these into MVP? | ||
|
|
||
| Three reasons, in order of weight: | ||
|
|
||
| 1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit. | ||
| 2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it. | ||
| 3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path. | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.