Commit ec6aa9e
feat(azure.ai.agents): add doctor check remote.foundry-endpoint (P5.1 C12)
Implements Check 8 from the doctor remote-checks design as the second
populated entry in the remote chain. The check proves that the
configured Foundry project endpoint is reachable AND that the bearer
token minted by C11 actually authorizes against it — a combined
"can we talk to the project at all" signal that gates every other
remote check (models / agent status / RBAC) from the design's
dependency matrix.
The probe issues a single `GET <endpoint>/agents?api-version=<v>&limit=1`
with a 10s timeout, no retries, using the same credential + scope
as the production runtime path (`agent_context.go:newAgentCredential`
+ `agent_api/operations.go`). The `limit=1` parameter matches the
production agent_api client exactly so a Pass here proves the same
query shape the runtime invoke flow uses (the earlier `$top=1`
choice was a divergence flagged by reviewers).
The status-code response is mapped 1:1 to user-actionable outcomes:
- 200 → Pass: "endpoint reachable (HTTP 200)"
- 401 → Fail: "token expired or scope mismatch" +
  suggest `azd auth login` (link: auth login docs)
- 403 → Fail: "wrong tenant or insufficient RBAC" +
  suggest re-auth in correct tenant and let
  `remote.rbac` flag the role-assignment gap
  (deliberately NO `azd auth login` here, but DO
  carry a docs Link to the Foundry RBAC quickstart
  so the suggestion is actionable; this matches the
  C11 "every actionable Fail carries Links"
  convention)
- 404 → Fail: "endpoint is wrong or project is gone" +
  suggest `azd provision` / `azd env set
  AZURE_AI_PROJECT_ENDPOINT`
- 5xx → Fail: "service-side error" + suggest retry
- other → Fail: "unexpected HTTP <N>" + verbose hint
- transport err → Fail: "could not reach <host>: <first line>" +
  network / VPN / firewall guidance
- ctx canceled → Skip (user aborted)
- 10s elapsed → Fail: "did not respond within 10s" + retry hint
Skip-cascade: `local.project-endpoint-set` AND `remote.auth`. The
former gives us the endpoint to probe; the latter gives us the
token to authenticate with. Skipping when either failed prevents
double-reporting the same root cause. `local.environment-selected`
is transitively cascaded via `local.project-endpoint-set`.
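As a rough illustration of the skip-cascade rule, assuming prior results are available as a simple map from check ID to "passed" (a hypothetical shape; the real doctor package may model results differently):

```go
package main

// prerequisites lists the checks that must have passed before the
// endpoint probe runs. The IDs come from the commit message; the slice
// itself and shouldSkip are illustrative assumptions.
var prerequisites = []string{"local.project-endpoint-set", "remote.auth"}

// shouldSkip reports whether the probe should be skipped and, if so,
// which prerequisite caused it. A missing entry counts as not passed.
func shouldSkip(passed map[string]bool) (bool, string) {
	for _, id := range prerequisites {
		if !passed[id] {
			return true, id
		}
	}
	return false, ""
}
```

Returning the blocking prerequisite lets the Skip message point at the root cause instead of repeating it as a second failure.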
Implementation notes:
- The api-version is read from the package-level
  `DefaultAgentAPIVersion` constant by the Cobra wiring in
  `cmd/doctor.go` and passed in through `Dependencies.AgentAPIVersion`.
  This honors the design's "single source of truth" requirement
  while keeping the doctor package self-contained (no import cycle
  against `internal/cmd`).
- `makeRealProbeFoundryEndpoint(apiVersion string) func(...)` is
  a closure factory rather than a top-level function so the
  api-version is captured at construction time without becoming a
  global.
- `buildFoundryProbeURL` parses the user-supplied endpoint FIRST,
  then mutates `u.Path` (trim-right + "/agents"), clears
  `u.Fragment` / `u.RawFragment`, and only then sets RawQuery.
  This prevents a stray `?api-version=evil` or `#fragment` in the
  endpoint from displacing the `/agents` path segment, a
  silent-misdiagnosis bug the prior `endpoint + "/agents"` string
  concatenation could trigger. Regression tests now positively
  assert `/agents?` is present in every successful build output.
  The builder also returns an error for any endpoint that is not
  an absolute HTTPS URL with a non-empty host, so a malformed
  env value cannot leak a bearer token to the wrong scheme/host.
- A new `validateFoundryEndpoint` helper runs at the check-level
  BEFORE the probe is invoked, BEFORE any token is acquired. A
  non-HTTPS, relative, or otherwise-malformed
  `AZURE_AI_PROJECT_ENDPOINT` surfaces a precise Fail with an
  `azd env set AZURE_AI_PROJECT_ENDPOINT <https://...>` suggestion
  instead of either a generic transport error (with the token
  leak that would have come with it) or the builder's defensive
  error wrapped in a less-helpful network-VPN-firewall message.
- Cancellation classification mirrors C11's pattern:
  `errors.Is(ctx.Err(), context.Canceled)` → StatusSkip (user
  aborted); `errors.Is(probeCtx.Err(), context.DeadlineExceeded)`
  → StatusFail (we hit our own 10s bound, not the parent ctx).
- Multi-line transport errors are reduced to their first line
  via the shared `firstLine` helper from C11 so the resulting
  Message stays readable.
- The `Details` map carries the endpoint, request URL, and
  HTTP status code (when available) for `--output json` consumers
  and `--unredacted` debugging. No raw tokens, no response body
  excerpts.
- 24 tests cover skip-cascade (env-not-selected, endpoint-not-set,
  auth-failed, AgentAPIVersion-missing), every status-code branch,
  cancellation vs timeout disambiguation, URL builder safety
  (junk query in endpoint, trailing slashes, fragment, blank
  api-version, non-HTTPS / relative / malformed endpoint),
  endpoint validation (HTTPS-only, non-empty host, well-formed),
  helper functions (`endpointHost`, `readProjectEndpoint`,
  `firstLine` reuse), a TLS httptest server smoke test asserting
  the built URL lands on `/agents` on the wire, and a token-leak
  sanity check on Details / Message / Suggestion strings.
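The parse-first URL construction described in the notes above might look roughly like this. It is a sketch under assumptions: `buildProbeURL` is a hypothetical stand-in for `buildFoundryProbeURL`, the error messages are invented, and rejecting a blank api-version with an error is assumed from the test list rather than confirmed:

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// buildProbeURL parses the endpoint first, then rewrites path, fragment,
// and query on the parsed value instead of concatenating strings.
func buildProbeURL(endpoint, apiVersion string) (string, error) {
	if strings.TrimSpace(apiVersion) == "" {
		return "", errors.New("api-version is empty")
	}
	u, err := url.Parse(strings.TrimSpace(endpoint))
	if err != nil {
		return "", err
	}
	// Only absolute HTTPS URLs with a host may receive a bearer token.
	if u.Scheme != "https" || u.Hostname() == "" {
		return "", errors.New("endpoint must be an absolute HTTPS URL with a non-empty host")
	}

	// Trim trailing slashes, then append the fixed path segment.
	u.Path = strings.TrimRight(u.Path, "/") + "/agents"
	u.RawPath = ""

	// Drop any fragment the user pasted along with the endpoint.
	u.Fragment, u.RawFragment = "", ""

	// Overwrite (not append to) whatever query the endpoint carried.
	q := url.Values{}
	q.Set("api-version", apiVersion)
	q.Set("limit", "1")
	u.RawQuery = q.Encode()

	return u.String(), nil
}
```

With plain concatenation, an endpoint such as `https://host/p?api-version=evil` would yield `https://host/p?api-version=evil/agents`, so the request would hit the project root rather than `/agents`; operating on the parsed URL avoids that.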
Behavior: with this commit, `azd ai agent doctor` now produces two
remote checks (`remote.auth` from C11 + `remote.foundry-endpoint`
from C12) instead of just one. The full remote chain still requires
C13+ to be useful end-to-end, but every subsequent check can now
take a Pass on this one as proof that the project URL works.
Refs: Azure#7975, PR Azure#8057 design-spec
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 10f40f8 commit ec6aa9e
6 files changed
Lines changed: 1181 additions & 15 deletions
File tree
- cli/azd/extensions/azure.ai.agents/internal/cmd
- doctor