Skip to content

Commit 10f40f8

Browse files
Antriksh JainCopilot
andcommitted
feat(azure.ai.agents): add doctor check remote.auth (P5.1 C11)
Implements Check 7 from the doctor remote-checks design as the first populated entry in the remote chain wired up by C10. The check proves that `azd auth token` can mint a token for the Foundry data-plane scope and surfaces the result with branched severity: - Pass: "<UPN> · token valid for <N> minutes" - Warn (< 5 min remaining): suggest `azd auth login` proactively - Fail (expired or acquisition error): suggest `azd auth login` with a learn.microsoft.com link to the `azd auth login` reference - Skip (user cancelled): no suggestion — Ctrl-C is not an auth bug - Fail (probe deadline): distinct "timed out" message that does NOT immediately accuse the user of being logged out Skip-cascade: `local.environment-selected`. Per the design dependency matrix, an env must be selected before remote checks have a project to test against; otherwise the remote chain would run against the wrong context. `local.grpc-extension` is intentionally NOT a precondition — the auth probe goes straight to azidentity and does not need the AzdClient gRPC channel. Implementation notes: - Uses `azidentity.NewAzureDeveloperCLICredential` (same as `agent_context.go:newAgentCredential`) so the probe mirrors the production credential exactly. - Requests the production data-plane scope `https://ai.azure.com/ .default` (same as `agent_api/operations.go`) so a Pass here matches what the runtime invoke flow needs — not a different scope that might succeed when the runtime would fail. - Wraps the probe in a 10s timeout to bound stuck shells. After the probe returns the check classifies `context.Canceled` / `context.DeadlineExceeded` separately from generic auth errors so user Ctrl-C maps to Skip (not Fail) and a probe timeout maps to a distinct Fail message instead of telling the user to log in again when the real cause is a stuck `azd` invocation. - UPN extraction is best-effort JWT payload decoding (stdlib only, no third-party JWT lib). Tries `upn`, `unique_name`, `preferred_username`, `email` in order. Decode failures return an empty UPN and never an error: the auth check cares about the token's validity, not how readable its claims are. - The raw access token is never logged, returned, or exposed outside `realProbeAuth`. - Wrong-tenant detection is intentionally NOT done here — that's the job of `remote.foundry-endpoint` (C12) where a 403 maps to a precise "wrong tenant or insufficient RBAC" suggestion. Surfacing the same failure mode here would produce false positives — flagging auth as broken when the user IS authenticated, just against the wrong tenant. - `formatTokenWindow` substitutes "less than 1 minute" for any sub-minute positive window so the Warn message can never read "token expires in 0 minutes" — that wording is indistinguishable from expiry to a reader scanning the report. - `firstLine` strips a trailing `\r` after splitting on `\n` so Windows CRLF stderr output from `os/exec`-invoked `azd` does not leak a carriage return into the doctor report message. - Every Fail branch that suggests `azd auth login` now carries the same `authLoginLink` (a single package constant), so all four suggestion paths point at the canonical MS Learn reference. Testability: - New `probeAuth` test seam on `Dependencies` matches the pattern used by `assembleState`. Production wiring leaves it nil; the check falls back to `realProbeAuth`. Tests inject deterministic `authProbeResult` values to exercise each branch. - 22 new tests cover: skip-cascade (Fail + Skip cases); default seam fallback; all severity branches (Pass, Warn, sub-minute Warn, expired Fail with Links, acquisition-error Fail with Links, cancellation Skip, deadline Fail); singular / plural and sub-minute formatting; 5-minute Warn/Pass boundary; UPN claim ordering / empty / whitespace / non-string variants; and JWT decode failure modes (empty token, wrong segment count, invalid base64, non-JSON payload, non-object JSON root). Files: - checks_auth.go (new): newCheckAuth + realProbeAuth + extractUPN + format helpers + authLoginLink constant - checks_auth_test.go (new): comprehensive table-driven coverage - checks_local.go: add `probeAuth` test seam to Dependencies - checks_remote.go: append newCheckAuth(deps) to the remote chain - checks_remote_test.go: replace the empty-slice contract test with one pinning the auth-only shape, so any future addition has to touch this one assertion Reviewer findings applied (3-reviewer pass: Opus xhigh + Sonnet 4.6 + GPT-5.5): - HIGH (Sonnet): Expired-token Fail was missing `Links` — fixed, every Fail with `azd auth login` suggestion now carries the reference link via the new `authLoginLink` constant. - MEDIUM (GPT-5.5): Cancellation / timeout was classified as generic auth failure — fixed, classified separately with appropriate Skip / Fail and distinct messages. - MEDIUM (Sonnet + Opus convergent): Sub-minute positive validity rendered as "token expires in 0 minutes" — fixed via `formatTokenWindow` substituting "less than 1 minute". - MEDIUM (Sonnet): `firstLine` left trailing `\r` on Windows CRLF — fixed with `strings.TrimRight(s[:i], "\r")` + CRLF test case. - LOW (Sonnet): Doc said "false negatives" where logic required "false positives" — corrected. - LOW (Opus): 5-minute Warn/Pass boundary was not directly tested — added `TestCheckAuth_WarnPassBoundaryAtFiveMinutes`. - LOW (Opus): Doc comment cited a non-existent design path — updated to the actual `.tmp/pr-8057/...` location. Preflight: gofmt -s -w . ✓, go vet ./... ✓, go test ./... -count=1 (all packages) ✓, golangci-lint run ./internal/cmd/... ✓, cspell ✓. Refs: Azure#7975 Refs: design `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md` Check 7 / dependency matrix lines 112-131 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 8846fd5 commit 10f40f8

5 files changed

Lines changed: 790 additions & 24 deletions

File tree

Lines changed: 297 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,297 @@
1+
// Copyright (c) Microsoft Corporation. All rights reserved.
2+
// Licensed under the MIT License.
3+
4+
package doctor
5+
6+
import (
7+
"context"
8+
"encoding/base64"
9+
"encoding/json"
10+
"errors"
11+
"fmt"
12+
"strings"
13+
"time"
14+
15+
"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
16+
"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
17+
)
18+
19+
// authProbeTimeout caps the per-probe network call. The design's
20+
// Performance Budget section (`.tmp/pr-8057/azd-ai-agent-doctor-
21+
// remote-checks.md`) allows 2s for cached-token reads and one token-
22+
// refresh round trip; 10s gives generous headroom for slow shells
23+
// without making the user wait through a stuck `azd auth token`
24+
// invocation.
25+
const authProbeTimeout = 10 * time.Second
26+
27+
// authLoginLink is the MS Learn URL for the `azd auth login` command
28+
// reference, reused across every Fail/Warn branch whose suggestion
29+
// is "run `azd auth login`". Keeping it as a package constant ensures
30+
// every branch points to the same canonical doc and prevents drift
31+
// between the four `Result` returns below.
32+
const authLoginLink = "https://learn.microsoft.com/azure/developer/" +
33+
"azure-developer-cli/reference#azd-auth-login"
34+
35+
// authWarnThreshold is the validity floor below which the check warns
36+
// the user to re-login proactively. Set to 5 minutes so a long-running
37+
// deploy or invoke started immediately after the check has a fresh
38+
// token instead of 401'ing mid-flight.
39+
const authWarnThreshold = 5 * time.Minute
40+
41+
// authScope is the Azure resource scope requested for the probe token.
42+
// It matches the production scope used by `agent_api/operations.go`
43+
// (`https://ai.azure.com/.default`), so a Pass here exactly mirrors
44+
// what the agent invoke flow needs at runtime — not a different scope
45+
// that might succeed when the runtime scope would fail.
46+
const authScope = "https://ai.azure.com/.default"
47+
48+
// authProbeResult is the structured outcome of one auth probe. The
49+
// shape is dictated by the testability split: the production probe
50+
// (realProbeAuth) talks to azidentity and returns the same fields a
51+
// test seam fills synthetically.
52+
//
53+
// upn is the User Principal Name extracted from the access token's
54+
// `upn` / `unique_name` / `preferred_username` / `email` claim
55+
// (whichever is present first). Empty when none of those claims are
56+
// readable — token is still valid, the check just renders without
57+
// the identifier.
58+
//
59+
// validFor is the duration from probe time to token expiry. Zero or
60+
// negative means the token is expired (defensive — GetToken normally
61+
// refreshes before returning, but we surface this honestly if it
62+
// happens).
63+
//
64+
// err captures token acquisition failure. When non-nil the other
65+
// fields are zero-valued; callers must branch on err first.
66+
type authProbeResult struct {
67+
upn string
68+
validFor time.Duration
69+
err error
70+
}
71+
72+
// newCheckAuth produces Check `remote.auth`. It runs after the local
73+
// chain because its skip-cascade reads `local.environment-selected`'s
74+
// result from `prior` — without an active env there is no project to
75+
// reason about and the check Skips with "select an env first" rather
76+
// than running unconditionally.
77+
//
78+
// The check itself is intentionally narrow: it answers "does
79+
// `azd auth token` succeed and how long until the token expires?" and
80+
// nothing else. Wrong-tenant detection is the job of check
81+
// `remote.foundry-endpoint` (C12) where a 403 maps to the precise
82+
// "wrong tenant or insufficient RBAC" suggestion. Conflating the two
83+
// here would produce false positives — flagging auth as broken when
84+
// the user IS authenticated, just against the wrong tenant.
85+
//
86+
// Skip-cascade: only on `local.environment-selected`. Per the design
87+
// dependency matrix, an env must be selected before remote checks
88+
// have a project to test against. Other local checks (e.g.,
89+
// `local.grpc-extension`) do not gate auth — the probe uses
90+
// `azidentity.NewAzureDeveloperCLICredential` directly and does not
91+
// require an AzdClient.
92+
func newCheckAuth(deps Dependencies) Check {
93+
return Check{
94+
ID: "remote.auth",
95+
Name: "authentication",
96+
Remote: true,
97+
Fn: func(ctx context.Context, _ Options, prior []Result) Result {
98+
if priorBlocked(prior, "local.environment-selected") {
99+
return Result{
100+
Status: StatusSkip,
101+
Message: "skipped: select an azd environment first " +
102+
"(see check `local.environment-selected`).",
103+
}
104+
}
105+
106+
probe := deps.probeAuth
107+
if probe == nil {
108+
probe = realProbeAuth
109+
}
110+
probeCtx, cancel := context.WithTimeout(ctx, authProbeTimeout)
111+
defer cancel()
112+
res := probe(probeCtx)
113+
114+
if res.err != nil {
115+
// Classify cancellation / timeout separately so we
116+
// don't tell the user to run `azd auth login` when
117+
// the real cause is a cancelled doctor command or a
118+
// probe timeout. `errors.Is` correctly walks the
119+
// wrap chain that azidentity returns. We check the
120+
// outer ctx first so user-initiated cancellation
121+
// (Ctrl-C) shadows the timeout that would also fire.
122+
if errors.Is(ctx.Err(), context.Canceled) ||
123+
errors.Is(res.err, context.Canceled) {
124+
return Result{
125+
Status: StatusSkip,
126+
Message: "skipped: auth probe was cancelled before completion.",
127+
}
128+
}
129+
if errors.Is(probeCtx.Err(), context.DeadlineExceeded) ||
130+
errors.Is(res.err, context.DeadlineExceeded) {
131+
return Result{
132+
Status: StatusFail,
133+
Message: fmt.Sprintf(
134+
"token acquisition timed out after %s.",
135+
authProbeTimeout),
136+
Suggestion: "Retry `azd ai agent doctor`; if the timeout " +
137+
"persists, check your network or run " +
138+
"`azd auth login` to refresh the credential cache.",
139+
}
140+
}
141+
return Result{
142+
Status: StatusFail,
143+
Message: "token acquisition failed: " + firstLine(res.err.Error()),
144+
Suggestion: "Run `azd auth login` to authenticate.",
145+
Links: []string{authLoginLink},
146+
}
147+
}
148+
149+
if res.validFor <= 0 {
150+
return Result{
151+
Status: StatusFail,
152+
Message: composeAuthMessage(res.upn, "token has expired"),
153+
Suggestion: "Run `azd auth login` to refresh the token.",
154+
Links: []string{authLoginLink},
155+
}
156+
}
157+
minutes := int(res.validFor.Minutes())
158+
if res.validFor < authWarnThreshold {
159+
return Result{
160+
Status: StatusWarn,
161+
Message: composeAuthMessage(res.upn,
162+
"token expires in "+formatTokenWindow(res.validFor)),
163+
Suggestion: "Run `azd auth login` to refresh the token " +
164+
"before it expires.",
165+
Links: []string{authLoginLink},
166+
Details: map[string]any{
167+
"validForMinutes": minutes,
168+
"upn": res.upn,
169+
},
170+
}
171+
}
172+
return Result{
173+
Status: StatusPass,
174+
Message: composeAuthMessage(res.upn,
175+
"token valid for "+formatTokenWindow(res.validFor)),
176+
Details: map[string]any{
177+
"validForMinutes": minutes,
178+
"upn": res.upn,
179+
},
180+
}
181+
},
182+
}
183+
}
184+
185+
// realProbeAuth is the production implementation of the auth probe.
186+
// It constructs the same AzureDeveloperCLICredential the extension
187+
// uses at runtime (`internal/cmd/agent_context.go:103` —
188+
// `newAgentCredential`), requests a token for the production scope,
189+
// and decodes the access token's JWT payload for a UPN claim.
190+
//
191+
// We intentionally use an empty AzureDeveloperCLICredentialOptions
192+
// (no TenantID override) so the probe targets the user's home tenant.
193+
// Wrong-tenant scenarios are detected by check `remote.foundry-endpoint`
194+
// (C12) where a 403 against the project endpoint maps to a precise
195+
// "wrong tenant or insufficient RBAC" suggestion — surfacing the same
196+
// failure mode here would double-report and dilute the diagnosis.
197+
//
198+
// The function never logs the raw token. JWT parsing is delegated to
199+
// `extractUPN`, which returns empty on any decode failure (the token
200+
// is still valid; the check just renders without the identifier).
201+
func realProbeAuth(ctx context.Context) authProbeResult {
202+
cred, err := azidentity.NewAzureDeveloperCLICredential(
203+
&azidentity.AzureDeveloperCLICredentialOptions{},
204+
)
205+
if err != nil {
206+
return authProbeResult{err: fmt.Errorf("create credential: %w", err)}
207+
}
208+
tok, err := cred.GetToken(ctx, policy.TokenRequestOptions{
209+
Scopes: []string{authScope},
210+
})
211+
if err != nil {
212+
return authProbeResult{err: err}
213+
}
214+
return authProbeResult{
215+
upn: extractUPN(tok.Token),
216+
validFor: time.Until(tok.ExpiresOn),
217+
}
218+
}
219+
220+
// extractUPN best-effort decodes a JWT's payload and returns the first
221+
// non-empty UPN-like claim. Order: `upn`, `unique_name`,
222+
// `preferred_username`, `email`. Returns "" on any parse error or
223+
// when none of the claims are present — never an error: the auth
224+
// check cares about the token's validity, not how readable its
225+
// claims are. The raw token is never returned, logged, or otherwise
226+
// exposed by this function.
227+
func extractUPN(token string) string {
228+
parts := strings.Split(token, ".")
229+
if len(parts) != 3 {
230+
return ""
231+
}
232+
payload, err := base64.RawURLEncoding.DecodeString(parts[1])
233+
if err != nil {
234+
return ""
235+
}
236+
var claims map[string]any
237+
if err := json.Unmarshal(payload, &claims); err != nil {
238+
return ""
239+
}
240+
for _, key := range []string{"upn", "unique_name", "preferred_username", "email"} {
241+
if v, ok := claims[key].(string); ok {
242+
if s := strings.TrimSpace(v); s != "" {
243+
return s
244+
}
245+
}
246+
}
247+
return ""
248+
}
249+
250+
// composeAuthMessage formats the user-visible Message for the auth
251+
// check, prepending the UPN (when known) so the report identifies the
252+
// authenticated identity at a glance — matching the design's example
253+
// "<UPN> · token valid for <N> minutes".
254+
func composeAuthMessage(upn, body string) string {
255+
if upn == "" {
256+
return body
257+
}
258+
return upn + " · " + body
259+
}
260+
261+
// formatMinutes renders a minute count with correct singular /
262+
// plural unit. "1 minute" vs "47 minutes" reads less awkward than a
263+
// fixed "minute(s)" suffix in the doctor report.
264+
func formatMinutes(n int) string {
265+
if n == 1 {
266+
return "1 minute"
267+
}
268+
return fmt.Sprintf("%d minutes", n)
269+
}
270+
271+
// formatTokenWindow renders a positive validity duration for the
272+
// user-visible Pass / Warn messages. For sub-minute windows we
273+
// substitute "less than 1 minute" so the message can never read
274+
// "0 minutes" — that wording is indistinguishable from expiry to a
275+
// reader scanning the report quickly and would obscure the Warn
276+
// severity. Sub-second windows are rounded up to the same bucket.
277+
// Callers must have already classified `<= 0` as Fail before calling
278+
// this function.
279+
func formatTokenWindow(d time.Duration) string {
280+
if d < time.Minute {
281+
return "less than 1 minute"
282+
}
283+
return formatMinutes(int(d.Minutes()))
284+
}
285+
286+
// firstLine returns s up to the first newline (exclusive) with any
287+
// trailing carriage return stripped. Used to elide multi-line stack
288+
// traces returned by azidentity (which on Windows commonly uses CRLF
289+
// because `azd` is invoked via `os/exec`); the doctor report should
290+
// be one line per failure, and the trailing suggestion already tells
291+
// the user what to do.
292+
func firstLine(s string) string {
293+
if i := strings.IndexByte(s, '\n'); i >= 0 {
294+
return strings.TrimRight(s[:i], "\r")
295+
}
296+
return s
297+
}

0 commit comments

Comments
 (0)