Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
260 changes: 260 additions & 0 deletions cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,260 @@
# Design: `azd ai agent doctor` — Remote Checks (7–11)

## Status

**Draft — for design review.**
Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975).
Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.**

## Goal

Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"*

Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix.

Non-goals:
- Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1.
- Re-implementing what `az` / Foundry portal already do well — we just point the user there when relevant.
- Any modification to files outside `cli/azd/extensions/azure.ai.agents/`.

## Background

The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions.
Comment thread
therealjohn marked this conversation as resolved.
Outdated

The remaining ~30% — and arguably the most frustrating cases — sit in remote land:
- Token expired silently → every command 401s with no obvious "you're logged out" hint.
- Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace.
- Model deployment got deleted out from under the project → first invoke fails with a cryptic 404.
- Azure-AI Developer role missing → vague 403.
- Agent deployed but unhealthy → user thinks `azd deploy` succeeded so it must be fine.

These all require live calls and credentials, which is why they're carved out into this separate design.

## Scope of this doc

| Check | What it answers |
|---|---|
| 7 — Authentication | Am I logged in? Is the token still valid? |
| 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? |
| 9 — Model deployments | Do the model(s) my agent references exist on the project? |
| 10 — RBAC | Does my principal have the roles needed to deploy / invoke? |
| 11 — Agent status | Does the deployed agent actually exist and is it active? |
Comment thread
therealjohn marked this conversation as resolved.

## Architecture

### Component layout (additive only)

```
cli/azd/extensions/azure.ai.agents/internal/cmd/
└── doctor/
├── checks_local.go ← (MVP) checks 1–6
├── checks_remote.go ← NEW: checks 7–11
├── runner.go ← orchestrates ordered execution
└── types.go ← Check, Result, Severity
```

Existing MVP files unchanged. Only `runner.go` learns to dispatch the new check set, gated by a flag (see [Execution Model](#execution-model)).

### `Check` contract

Same shape as MVP — keeps the renderer, sort order, and `Next:` resolver code path identical.

```go
type Check interface {
ID() string // stable, e.g. "auth.token"
Title() string // human label
Run(ctx context.Context) Result
}

type Result struct {
Status Status // Pass | Warn | Fail | Skip
Detail string // one-line description
Fix string // single command or instruction; empty for Pass
Links []string // doc links (optional)
Duration time.Duration // captured by runner, not the check
}
```

A remote check that can't run (e.g., not authenticated when running check 9) returns `Status: Skip` with `Detail: "skipped: authentication required (see check 7)"`. Skips never block subsequent checks — the user sees the full picture in one pass.

## Execution Model

### Default: local-only

`azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode.
Comment thread
therealjohn marked this conversation as resolved.
Outdated

### Opt-in: `--remote`
Comment thread
therealjohn marked this conversation as resolved.
Outdated

`azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)).

Rationale for opt-in (vs. always-on or auto-detect):
- Remote checks need credentials and add seconds to a command users may run frequently.
- Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI.
- A flag is discoverable, reversible, and self-documenting.

We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about.

### Dependency matrix

| Check | Depends on | Action if dependency fails |
|---|---|---|
| 7 — Auth | check 3 (env selected) | Skip with "select an env first" |
| 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip |
| 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip |
| 10 — RBAC | 7 (auth), 8 (reachability) | Skip |
| 11 — Agent status | 8, 9, 10 | Skip |

Skips are explicit in the rendered report. We never quietly drop a check.

## Check Specifications

### Check 7 — Authentication

**What it does:** acquires a token via `azidentity.NewAzureDeveloperCLICredential` (the same credential `agent_context.go` already uses), then introspects expiry.

**Pass:** "<UPN> · token valid for <N> minutes"
**Warn:** token expires in <5 min (suggest re-login proactively)
**Fail:** token acquisition error → `azd auth login`

**Why a separate fail vs warn:** users hit this all the time when a CLI session outlives a token; the warn variant tells them to re-login *before* a long-running deploy 401s.
Comment thread
antriksh30 marked this conversation as resolved.

### Check 8 — Foundry project reachability

**What it does:** issues a single `GET <AZURE_AI_PROJECT_ENDPOINT>/agents?api-version=…&$top=1` with the credential from check 7.

**Pass:** "endpoint reachable (HTTP 200)"
**Fail:** maps the HTTP status to one of:
- `403` → check 10 (RBAC) will explain. Fix: see check 10 output.
- `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var.
- network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`".

The single-shot probe avoids paging and works on tiny test projects. Timeout: 10s, no retries — doctor is diagnostic, not resilient.

### Check 9 — Model deployments

**What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally.
Comment thread
therealjohn marked this conversation as resolved.

**Pass:** "all <N> referenced models present"
**Fail:** "missing: <model-name> (referenced by <service>)" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link).

**Edge cases:**
- Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here.
- Agents with no model reference (orchestration-only) — pass with "no model references".

### Check 10 — RBAC

**What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set:
- `Azure AI Developer` (or stronger) on the project — required for deploy + invoke.
- `Cognitive Services User` on the AI account — required for model usage.

**Pass:** "<role-1>, <role-2>"
**Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role.
**Fail:** missing both → command + portal link.

The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example:
```
fix: az role assignment create \
--role "Azure AI Developer" \
--assignee <principal-id> \
--scope <project-resource-id>
```

**Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited.

### Check 11 — Agent status
Comment thread
therealjohn marked this conversation as resolved.

**What it does:** for each service in `azure.yaml` with `host: ai-agent`, calls the Foundry agents-list endpoint and finds the matching agent by name.

**Pass:** "<agent-name>: active (v<N>)"
**Warn:** "<agent-name>: deploying (v<N>)" — running `monitor --follow` is suggested.
**Fail:**
- agent not found → `azd deploy`
- agent in failed state → `azd ai agent monitor --follow` (look at logs, not redeploy blindly)

This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly.

## Output Format

Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible:

```
azd ai agent doctor --remote
✓ PASS azd reachable
✓ PASS azure.yaml valid
✓ PASS azd environment selected
✓ PASS agent service in azure.yaml
✓ PASS AZURE_AI_PROJECT_ENDPOINT set
✓ PASS agent.yaml valid
✓ PASS authentication
alice@contoso.com · token valid for 47 minutes
✗ FAIL Foundry project reachability
GET /agents → 403
fix: see RBAC check below
- SKIP model deployments
skipped: reachability check failed
✗ FAIL RBAC
principal alice@contoso.com has no role on project scope
fix: az role assignment create --role "Azure AI Developer" \
--assignee 0000... --scope /subscriptions/.../projects/...
- SKIP agent status
skipped: reachability check failed

Next: az role assignment create ... -- grant yourself the required role
azd ai agent doctor --remote -- re-run to verify
```

The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands.

## Performance & Side Effects

Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use.

| Check | Network calls | Worst-case time |
|---|---|---|
| 7 — Auth | 0 (cached token) – 1 (refresh) | 2s |
| 8 — Reach | 1 GET | 10s (timeout) |
| 9 — Models | 1 list | 5s |
| 10 — RBAC | 1 list-role-assignments | 5s |
| 11 — Status | 1 list-agents | 5s |

Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode.

## Testing Strategy

- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8).
- **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test.
- **Functional test** (recorded cassette via `mage record`) covering one happy-path full run.
- No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step.

## Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. |
Comment thread
therealjohn marked this conversation as resolved.
Outdated
| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. |
| Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. |
| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). |
| User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. |

## Performance Budget

Local-only `doctor` (today): <100ms. Don't regress.
`--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s.
We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic.

## Implementation Phases

1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step.
2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own.
3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions.
4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording.

Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4.

## Appendix: Why not fold these into MVP?

Three reasons, in order of weight:

1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit.
2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it.
3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path.
Loading
Loading