Skip to content

Commit d478803

Browse files
antriksh30Antriksh JainCopilottrangevi
authored
Azure AI Agents: add next-step guidance and doctor diagnostics (#8198)
* feat(azure.ai.agents): scaffold nextstep package and isTerminal helper Add the foundation for context-aware `Next:` guidance described in PR #8057: - New `internal/cmd/nextstep` package with `Suggestion`, `State`, `ServiceState`, `AuthState` types and a format-agnostic `PrintNext` writer that aligns commands on the longest entry and caps output at one primary + one secondary line. - Add an `isTerminal(fd uintptr) bool` helper in `internal/cmd/helpers.go` wrapping `golang.org/x/term`; promote that module from indirect to direct in `go.mod`. - Register `nextstep` in the repo cspell dictionary. No callers yet; resolvers, state assembly, and command wiring land in subsequent commits. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): add AssembleState to nextstep package Introduces nextstep.AssembleState, a single best-effort probe of the current azd world that the resolver (next commit) will read from. It captures three things the design relies on: 1. Whether AZURE_AI_PROJECT_ENDPOINT is set in the active environment (HasProjectEndpoint). 2. The agent services declared in azure.yaml, in alphabetical order (Services). 3. For each service, whether azd recorded a successful deploy. The signal is AGENT_<KEY>_VERSION non-empty in the env, matching the convention written by registerAgentEnvironmentVariables in service_target_agent.go. KEY is derived via the same spaces+hyphens-to-underscore upper-case transform getServiceKey uses (lines 222-226 of service_target_agent.go). Probes are best-effort: transport errors are collected and returned alongside a partial State so resolvers can still degrade gracefully (e.g., suggest azd init when project load fails). A small Source interface decouples the assembler from *azdext.AzdClient so tests can be hand-rolled fakes; production wraps the real client via NewSource. WithAuthProbe / WithOpenAPIProbe options are plumbed but inert until commit 1.3 / 1.4 land keeps the public API stable from day one so callers and tests don't need rewriting later. Plan refs closed: D4 (IsDeployed rule). Closes the data-gathering half of Phase 1 commit 1.2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): scope nextstep state to azure.ai.agent services AssembleState's collectServices iterated every azure.yaml service, so a project with mixed hosts (e.g. one agent + one containerapp web tier) would have leaked the web service into nextstep's view and triggered spurious AGENT_<KEY>_VERSION env lookups for it. Filter at the boundary on Host == agentHost (mirrors the cmd.AiAgentHost literal; intentional duplication to keep nextstep importable from cmd without a cycle). Tests: existing fixtures updated to use the canonical host; new 'non-agent services are filtered out' case pins the behavior; TestAgentHostConstant pins the literal to guard against drift. Resolves the F1 finding from the cross-pollinated review of 5ab18b7d1 (3/3 reviewer consensus). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): add nextstep resolver, OpenAPI extractor, and error vocabulary Phase 1 commit 1.3. Three pure-Go source files plus tests, all under `internal/cmd/nextstep/`. No callers yet; nothing prints. Wires the remaining "decide what to print" machinery so Phase 2 commits can swap out the hardcoded hint blocks in init/run/invoke/show/deploy. resolver.go Pure decision functions over *State, one per command outcome: - ResolveAfterInit - ResolveAfterRun - ResolveAfterInvoke (success + typed failure) - ResolveAfterShow - ResolveAfterDeploy Filesystem and OpenAPI-cache access flow through caller-injected closures (cachedPayload, readmeExists) so the resolver stays pure and unit-testable. No I/O, no globals. openapi.go - ExtractInvokeExample(spec []byte) string: walks paths./invocations.post.requestBody.content.application/json with explicit $ref short-circuit at both requestBody and schema levels. Resolution order: content.example -> schema.example -> generated from required+properties[*].example -> "". Silent on any miss. - ReadCachedOpenAPISpec(configDir, agentName, suffix): mirrors the writer-side path shape from helpers.go (fetchOpenAPISpec) so the two stay in lockstep. Returns (nil, nil) on os.ErrNotExist. error_codes.go Typed wire-level vocabulary, sourced verbatim from the vienna platform's authoritative enums: - UserErrorCode (HostedAgentVersionManager.cs) - SessionErrorCode (Session/Exceptions/SessionErrorCode.cs) - AgentVersionStatus (Contracts/V2/Generated/Agents/.../...Status.cs) Plus RemediationForUserErrorCode / RemediationForSessionErrorCode helpers returning the platform's troubleshooting URL + suggestion text. Surfaces codes verbatim; no re-classification. The platform appends its own aka.ms TSG link via WithTroubleshootingInfo, so the extension just passes Code + Message through. Strategy delta D5 (will be recorded in STRATEGY-DELTA.md): the plan assumed cache path .azd/agent-cache/<env>-<agent>-openapi.json; the actual writer in helpers.go:317-374 uses <configDir>/openapi-<safeName>-<suffix>.json where safeName runs strings.ReplaceAll on "..", "/", and "\\". The reader mirrors that shape byte-for-byte so the two halves never drift. Tests cover every branch in every resolver, every $ref-short-circuit path in the extractor, the writer/reader sanitization contract, every remediation arm in the error_codes mapping, and pin every wire-level string against the platform contract (so a typo in a Go const can't silently diverge from what the service emits). Closes plan items C5, C11 (foundation). Sets up the Phase 2 callers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): apply consensus fixes to nextstep resolver Three findings emerged from the 3-model code review of commit 0b395756f (Opus 4.7 xhigh, Sonnet 4.6, GPT-5.5) and were corroborated via cross-pollination across the reviewers. Three were adopted; one was dropped after the author empirically tested the affected shell. F-A: shell-escape single quotes in OpenAPI-derived payloads. resolver.go lines 104 and 313 wrapped state.OpenAPIPayload / cached payload in single quotes via raw concatenation. The payload comes from json.Marshal in ExtractInvokeExample, which does not escape apostrophes, so an OpenAPI example such as {"q":"don't"} terminated the surrounding single-quoted shell argument and broke the suggested invoke command. Introduce shellEscapeSingleQuoted using the POSIX '\'' idiom and route both sites through it. Cross-pollinated: 3 of 3 reviewers concurred. F-B: honor ServiceState.Protocol in ResolveAfterShow Active branch. The Active case unconditionally passed ProtocolResponses to invokeCommandFor, so an invocations-protocol agent was suggested the responses-style "Hello!" payload (which the agent rejects). Look up the service via findService and default to ProtocolResponses only on miss. Existing test asserted only a substring containing "azd ai agent invoke echo", which passed for either payload that is why the bug slipped past code review on 1.3. Replace the substring assertion with exact matches and add explicit subtests for invocations vs responses. Cross-pollinated: 3 of 3 concurred. F-D: populate ServiceState.Protocol from agent.yaml in collectServices. The Protocol field was declared in types.go but never written by the production code path, so F-B's lookup would have silently fallen back to ProtocolResponses for every agent in real use. Add loadServiceProtocol(projectPath, relativePath) that reads <root>/<rel>/agent.yaml, parses agent_yaml.ContainerAgent, and picks ProtocolResponses when declared (broadest compatibility), ProtocolInvocations when only invocations is declared, or "" on any error. All failure modes are silent the resolver degrades to responses-default rather than surfacing transient I/O errors through the next-step hint. Cross-pollinated: Opus, Sonnet, and GPT-5.5 all confirmed the field was production-dead. F-C dropped: bash !" history expansion. Sonnet flagged that "Hello!" would trigger bash history expansion. Opus empirically refuted by running bash 5.1.16: !" is not a history designator and bash leaves it literal. GPT-5.5 confirmed on cross-pollination. No change. Tests: TestResolveAfterRun gains an apostrophe-in-payload case. TestResolveAfterDeploy gains an apostrophe-in-payload case. TestResolveAfterShow Active row split into an explicit substring assertion plus three subtests asserting protocol-driven payload selection. TestLoadServiceProtocol covers single/multi/empty/malformed manifests and missing files. TestAssembleState_PopulatesProtocolFromAgentYaml exercises the end-to-end path on a temp dir. No user-visible change yet; resolvers remain wired only to themselves. Phase 2 will surface the corrected suggestions to real users when init.go is the first caller. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): clarify shellEscapeSingleQuoted doc comment The previous doc comment named the POSIX escape idiom literally using backtick-delimited examples that included backslash-apostrophe sequences. Those byte sequences proved fragile through PowerShell heredoc / editor format-on-save round-trips, and ended up showing U+201D smart-quotes in the committed file instead of the intended ASCII characters. A user reading the comment would also have been misled: the names given (after the smart-quote substitution) did not match what the function actually emits on line 397. Rewrites the comment in prose, anchoring the byte-pattern reference to the implementation line (which uses a Go raw string so the literal cannot be mangled). Also restates the PowerShell adaptation guidance in terms of PowerShell's own two-consecutive-apostrophes convention instead of referencing the POSIX byte pattern. 3-of-3 reviewer consensus on the underlying finding (Sonnet flagged the original; Opus and GPT-5.5 cross-pollinated confirmation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): drop stale line reference in shellEscapeSingleQuoted doc The previous doc rewrite pointed at "line 397" for the byte pattern, but in the committed file line 397 is mid-paragraph prose about json.Marshal. The actual implementation line moved to 404 once the prose rewrite expanded the comment by six lines. A reader following the cross-reference would land in the wrong place. Drops the line-number reference in favor of "the implementation below uses a Go raw string for that sequence so its byte pattern is stable across edits." Hard-coded line numbers inside the same file are inherently fragile and should be avoided. 3-of-3 reviewer consensus on the stale-reference finding: GPT-5.5 and Opus and Sonnet independently flagged it on the 787145acc review pass. Fix mirrors what all three reviewers suggested. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): silent fetchOpenAPISpec + wire cache-only OpenAPI probe Refactors fetchOpenAPISpec so callers control the "OpenAPI spec saved to %s" output, and wires the previously-placeholder WithOpenAPIProbe option in the nextstep package to actually populate State.HasOpenAPI / OpenAPIPayload from the on-disk cache the invoke flow writes. Closes critique items C5 (silent fetch) and C6 (probe wiring) from the implementation plan. No user-visible behavior change in this commit; the "OpenAPI spec saved to ..." line still surfaces on fresh writes from invoke, and stays silent on cache hits and errors. helpers.go - fetchOpenAPISpec now returns (specFile string, fresh bool). fresh==true means this call wrote a new spec to disk; fresh==false means cache hit OR any failure. Callers print the "saved to" line gated on fresh; future callers (doctor, run-time probe) that want silence simply ignore the bool. The print is no longer inside the helper. invoke.go - Both call sites (local fresh fetch, remote conditional fetch) now emit the "OpenAPI spec saved to %s" line themselves via the (path, fresh) return. Behavior is byte-identical to before; only the ownership of the print moved. nextstep/state.go - WithOpenAPIProbe(enabled bool) becomes WithOpenAPIProbe(agentName, suffix string). Empty agentName or suffix disables the probe (the zero value). - assembleState now runs a strictly cache-only OpenAPI lookup when the probe is enabled and the project + env name are both known. configDir is computed as filepath.Join(project.Path, ".azure", envName) the same directory fetchOpenAPISpec writes into, so reader and writer stay in lockstep without an extra round-trip to the gRPC source. Cache miss, malformed spec, no extractable payload all silently leave HasOpenAPI=false and the resolver falls back to the protocol-generic <payload> literal. nextstep/state_test.go - TestOptionsApplyCleanly updated for the new WithOpenAPIProbe shape. - TestWithOpenAPIProbe_EmptyArgsDisableProbe pins the disabled-default semantics (empty agentName / suffix means probe is off). - TestAssembleState_WithOpenAPIProbe_PopulatesPayloadFromCache exercises the happy path: a real on-disk spec under .azure/<env>/ produces a populated State.OpenAPIPayload via ExtractInvokeExample. - TestAssembleState_WithOpenAPIProbe_MissingCacheLeavesPayloadUnset pins the cache-miss fallback. - TestAssembleState_WithOpenAPIProbe_DisabledWhenAgentEmpty proves an on-disk cache is ignored when the option is called with an empty agentName, so callers can centrally disable the probe. Records strategy delta D9 (fetchOpenAPISpec silencing shape) and D10 (WithOpenAPIProbe shape) in .tmp/pr-8057/STRATEGY-DELTA.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): detect missing env vars during nextstep state assembly `State.MissingInfraVars` / `State.MissingManualVars` were declared in commit 1.2 but never populated; the resolver branches in commit 1.3 that consume them only ever saw nil slices. This commit adds the detection step inside `assembleState` so the resolver can suggest the right next action when the user has unprovisioned `${VAR}` references in any agent.yaml. What the helper does - For every azure.ai.agent service in `azure.yaml`, opens the matching `<projectPath>/<svc.RelativePath>/agent.yaml` and walks the `environment_variables` block. - Extracts unique `${VAR}` references via a small package-level regex (`envVarRefPattern`). The optional `(?::-[^}]*)?` non-capturing tail tolerates POSIX-style defaults like `${VAR:-fallback}` without pulling them into the captured name. - Looks each name up against the current azd environment. Names whose value is set are skipped. Names whose value is unset get partitioned: - leading `AZURE_` -> `MissingInfraVars` (`azd provision` outputs in the AI Foundry templates uniformly start with this prefix: `AZURE_AI_*`, `AZURE_OPENAI_*`, `AZURE_SUBSCRIPTION_*`, etc.) - everything else -> `MissingManualVars` (`azd env set` candidates) - Results are deduplicated cross-service (so two services referencing `${AZURE_AI_PROJECT_ENDPOINT}` collapse to one entry) and returned sorted ascending, matching the existing `slices.Sorted` style. Error / partial-state behavior - agent.yaml read or parse errors are silent (return nil refs). The resolver falls back to its default branch rather than emitting guidance about variables we cannot prove are needed. - `src.EnvValue` transport errors append to `*errs` so the snapshot caller can surface them in --debug output, but never abort. This mirrors the existing `isDeployed` contract. - `detectMissingVars` is only invoked when both `project != nil` and `envName != ""`; otherwise both lists stay nil and the existing resolver code paths are unaffected. Why classification is `AZURE_` prefix only The heuristic is intentionally coarse. Documented in the helper godoc: misclassifying a manual var as infra at worst points the user at `azd provision` instead of `azd env set`; the inverse still yields an actionable hint. A future commit can swap in a richer rule (consult `main.bicep` outputs, project-level allow-list) without touching the public API of `AssembleState`. Why split this from the init.go wiring (commit 2.2) The resolver's "no MissingVars" branch suggests `azd ai agent run`, which fails for an unprovisioned env. Wiring init.go without first populating MissingVars would be a behavior regression versus the old hardcoded `azd up` hint. Splitting also keeps each commit reviewable in isolation: 2.1 is pure state-assembly logic with no command wiring, 2.2 is a small swap-in at the call site. Tests added in state_test.go - TestExtractAgentYamlEnvRefs: table with 7 cases covering bare refs, defaulted refs, multiple-refs-per-value, cross-value dedupe, no env block, literal-only values, malformed YAML. - TestExtractAgentYamlEnvRefs_MissingFileOrArgs: empty args + missing manifest all return nil. - TestAssembleState_PopulatesMissingVars: end-to-end via assembleState with a real agent.yaml fixture mixing set + unset infra + manual vars. - TestAssembleState_MissingVarsDedupedAcrossServices: two services with overlapping refs collapse to one entry each list. - TestAssembleState_AllVarsSetLeavesMissingEmpty: regression guard for the "everything provisioned" path. - TestAssembleState_MissingVarTransportErrorSurfaced: EnvValue errors propagate to errs slice without crashing or mis-populating. No production caller of `AssembleState` exists yet, so runtime behavior is unchanged. Commit 2.2 swaps init.go to call the resolver, at which point the populated state takes effect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): exclude defaulted env refs from missing-vars detection Before this change, an agent.yaml ref written as `${VAR:-fallback}` would classify VAR as missing whenever it was unset in the azd environment, and the resolver would prompt the user to `azd provision` or `azd env set` it. That hint is misleading: the deploy-time expander (drone/envsubst, used by service_target_agent.go) honors the `:-` default, so the deploy succeeds with the fallback value and the user has no real action to take. Fix: make the regex's default-tail group capturing (`(:-[^}]*)?`) and skip matches where group 2 is non-empty. Bare `${VAR}` still surfaces as missing when unset, matching the runtime requirement. Bare-dash `${VAR-fallback}` (POSIX "if unset, use fallback") continues to be silently dropped — its deploy-time semantics also carry a fallback, so the same user-visible result holds. Tests: * `TestExtractAgentYamlEnvRefs` table: rename + flip "reference with default tail captured as bare name" -> "reference with default tail is skipped" (want: nil). Add "bare ref alongside defaulted ref returns only the bare one". * New `TestAssembleState_DefaultedRefsAreExcludedFromMissingVars` end- to-end: agent.yaml mixes one bare unset ref (must surface) with two defaulted unset refs (must NOT surface, including the manual-vars bucket). Confirms the partition stays correct when only AZURE_AI_ refs would have surfaced through the infra heuristic. Reviewer consensus (2/3): Sonnet's option (b) — drop the regex-broadening half of its companion finding and keep this change minimal. GPT-5.5 originated the misleading-hint observation; Sonnet cross-pollinated and recommended this exact path. Opus REJECTed with the position that the deploy-time hint is "wrong but right" (template intent), which holds for template-supplied AZURE_ refs but breaks for manual vars such as `${MY_API_KEY:-dev-fallback}`. Tie-breaker: the manual-vars case. Verified clean against gofmt, go vet, go build, go test ./internal/cmd/nextstep/..., golangci-lint, cspell, copyright-check. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): wire init success path to nextstep resolver Replace the hardcoded `azd up` / `azd deploy <svc>` conditional at init.go:1592-1607 with a call to nextstep.AssembleState + ResolveAfterInit + PrintNext. The resolver inspects the active azd environment plus each azure.ai.agent service's agent.yaml to emit context-aware guidance: - MissingInfraVars -> `azd provision` + trailing `azd deploy` - MissingManualVars -> up to 3 `azd env set <KEY> <value>` lines - clean -> `azd ai agent run` + trailing `azd deploy` First user-visible behavior change in this PR. The legacy AZURE_AI_PROJECT_ID dichotomy is replaced by the more informative missing-vars partition; the new trailing line is the generic `azd deploy` (no service-name suffix) per the design spec. State-assembly errors are intentionally ignored: the resolver degrades gracefully on partial state per the design spec. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): reserve trailing slot in nextstep renderer for follow-up nudges 3-of-3 reviewer consensus on commit 077c550ba surfaced that PrintNext silently truncates ResolveAfterInit's trailing `azd deploy` line when there are 2+ missing manual vars. The resolver assigns the trailing nudge Priority 90 but the renderer sorts ascending and caps at maxRendered=2; once the manual-vars branch emits 2 or 3 `azd env set` lines (priorities 20-22) the deploy nudge is the first thing dropped. Fix: add a Trailing flag to Suggestion. renderBlock now partitions on the flag and reserves one of its maxRendered slots for the lowest- priority trailing entry. Primary suggestions fill the remaining slots in ascending Priority order, as before. ResolveAfterInit marks its `azd deploy` footer Trailing:true; other resolvers are unchanged (none of them currently emit a structural footer). Net effect for end users finishing `azd ai agent init` with N missing manual variables: N=1 -> `azd env set X` + `azd deploy` (unchanged) N=2 -> `azd env set A` + `azd deploy` (was: A + B, deploy lost) N=3 -> `azd env set A` + `azd deploy` (was: A + B, deploy lost) The user is named one missing variable plus the deploy nudge. The previous behavior was equally lossy -- it just dropped the wrong thing. Naming every missing var would need a higher maxRendered, which trades the design's two-line UX cap for completeness; the design spec chose the cap, so the fix preserves it. Coverage: - TestPrintNext gains "trailing suggestion survives truncation", "trailing-only block renders as the single line", and "multiple Trailing entries collapse to the lowest-priority one". - TestResolveAfterInit (table) + TestResolveAfterInit_ManualVarsCapAtThree now assert `out[len(out)-1].Trailing == true`. Reviewer provenance: Opus 4.7 xhigh (High, empirically reproduced), Sonnet 4.6 (Medium), GPT-5.5 (Medium) all independently surfaced the same truncation bug on the b39188643..077c550ba diff; 3/3 consensus, no cross-pollination needed. Opus's Option B (sticky tail) is the approach implemented here -- the alternatives (cap manual-var lines to 1; introduce a renderer-limit parameter) either lose more user info or pollute the API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents/nextstep): trailing collision now keeps the most-deferred footer Cross-pollinated 3-model review on 8fa72db2a flipped the Trailing-tiebreaker policy. Previous implementation used "first Trailing wins" (lowest Priority on ascending sort). That defeats the regression-prevention purpose of the sticky- tail fix: if a future resolver accidentally flags a Priority < 90 entry as Trailing, current code silently drops the intended `azd deploy` footer (Priority 90) the exact regression 2.2.1 was meant to prevent. Switch to "last Trailing wins" (highest Priority on ascending sort = most- deferred footer). Mistake-likelihood is asymmetric: copy-pasting Trailing onto a low-priority hint is plausible; inventing a higher-than-deploy Priority is not. Reviewers ratifying last-wins: Sonnet 4.6 (original finder, Medium), GPT-5.5 (swung after cross-pollination), Opus 4.7 xhigh (reversed his initial Q5 ratification after cross-pollination). 3/3 consensus. Changes: - format.go: remove `if trailing == nil` guard so every Trailing entry overwrites; the loop terminates with a pointer to the highest-Priority Trailing entry. - types.go: docstring now spells out "highest Priority wins on collision". - format_test.go: rename "multiple Trailing entries collapse" case and pin `tail-b` (Priority 90) as the survivor instead of `tail-a` (Priority 80). Preflight: gofmt clean, go vet clean, go build clean, nextstep tests pass (10.9s), cmd tests pass (15.5s), golangci-lint 0 issues, cspell 0 issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents/run): resolver-driven Next: block on local run Replace the hardcoded `azd ai agent invoke --local "Hello!"` follow-up hint with the new nextstep package. The resolver picks a protocol- appropriate sample payload (`{"message": "Hello!"}` for invocations protocol agents; `"Hello!"` literal for responses protocol agents) and, when a cached OpenAPI spec from a prior `invoke` is available, replaces the default with the exact request body the agent expects. A tip line pointing at the OpenAPI doc is appended when the cache is empty. Smoke-tested against the hello-world-python-invocations sample (Foundry bring-your-own template). Output: Next: azd ai agent invoke --local '{"message": "Hello!"}' -- send a sample request to the running agent curl http://localhost:<port>/invocations/docs/openapi.json -- tip: inspect the spec to learn the agent's exact payload Starting agent on http://localhost:18347 (Ctrl+C to stop) The `After startup, in another terminal, try:` preamble is dropped in favor of consistency with the `init` success path (`init.go:1607`): the `Next:` header + the `Starting agent on <url> (Ctrl+C to stop)` line directly below convey the temporal ordering. If user testing shows confusion, a follow-up commit can wire `After startup...` text back in via the resolver's Description column. Out of scope: the `<port>` placeholder in the curl tip is a known documentation-grade hole. Substituting the live port requires plumbing `flags.port` through `State.Port` and the resolver deferred to keep this commit small. Preflight: gofmt clean, vet clean, build clean, cmd tests pass (14.9s), golangci-lint 0 issues, cspell 0 issues. Smoke-tested end-to-end. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): align local OpenAPI cache key + restore "After startup" preamble Two 3/3-consensus findings from the multi-model review of commit 2.3 (12aa2bb2d). F1 (HIGH, 3/3 consensus) - local OpenAPI cache filename mismatch: invoke.go:520 wrote the on-disk cache using the composite agentKey (e.g. openapi-localhost:8088_<projectHash>_agents_hello-world_versions_ latest_local-local.json), while run.go reads it via nextstep.WithOpenAPIProbe(serviceName, "local") which expands to the plain name (openapi-hello-world-local.json). The two filenames could never match, so state.HasOpenAPI was permanently false and the "subsequent runs surface the cached OpenAPI sample" path in ResolveAfterRun was dead code. Fix: extract resolveLocalAgentName from resolveLocalAgentKeyWithPort and use the plain name at the cache write site. The session/conversation store at invoke.go:504 keeps the composite key (it needs the port + projectHash to avoid cross-project collisions in the shared config store); only the cache file was buggy. The split matches the existing remote-write pattern at invoke.go:629 (remote already passes the plain name) and adds an explanatory comment block at the asymmetry site. F2 (Medium, 3/3 consensus) - "In another terminal" signaling restored: commit 2.3 dropped the explicit "After startup, in another terminal, try:" preamble in favor of init.go-style uniformity. But init exits and hands the prompt back, while run holds the foreground TTY for the agent. Without the preamble, top-down readers see the Next: block before the "Starting agent on http://..." line and have no clue the current terminal is about to be busy. Common failure mode: paste the suggested invoke into the same terminal, Ctrl+C the agent, and ask what just happened. Restore the preamble (8 words, pure revert, proven UX). Reviewer trace: - Sonnet 4.6 surfaced both findings on the first pass. - GPT-5.5 independently surfaced F1, ratified F2 fix-shape (Option C: restore preamble) on cross-pollination. - Opus 4.7 xhigh missed both on first pass (checked URL endpoint symmetry but not filename-key symmetry on F1; anchored on commit message intent on F2) and reversed to AGREE on cross-pollination. Verification: - Preflight clean: gofmt, go vet, go build, full cmd tests (14.7s, nextstep 3.0s), golangci-lint 0 issues, cspell 0 issues. - Live smoke test against hello-world-python-invocations sample: `azd ai agent run --port 18348` now prints "After startup, in another terminal, try:" followed by the Next: block + the "Starting agent on http://..." line in the expected order. - Code inspection confirms cache writer (after fix) and reader both use the plain service name, so filenames align. Files: 3 changed, +36 / -11. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): resolve local agent service once in invocationsLocal (no double prompt) 3/3-consensus regression from the multi-model review of commit 2.3.1 (f4a7f68aa). Severity: Medium. The 2.3.1 refactor collapsed the previous single `resolveLocalAgentKey` call at `invoke.go:498` into two paired calls: the existing one PLUS a new `resolveLocalAgentName` at line 504. Both funnel through `resolveLocalAgentName` (helpers.go:161), which unconditionally calls `resolveAgentServiceFromProject` even when the result is only needed for the `name == ""` branch. In the interactive multi-agent case (project with >=2 azure.ai.agent services in azure.yaml + no `--no-prompt`), this fires `azdClient.Prompt().Select` TWICE. The CLI validation at `invoke.go:125-131` rejects `--local` + a positional name, so every invoke that reaches `invocationsLocal` enters with `name=""` the double prompt is reliably hit, not a corner case. Worse, the two prompts are independent. If a user picks different services on the two prompts (alphabetic list, not anchored to the previous choice), `agentKey` (used by `resolveStoredID` for the session/conversation store) refers to service A while `agentName` (used by `fetchOpenAPISpec` for the OpenAPI cache filename) refers to service B. The session ID resolved against A is then used to invoke service B's `/invocations` endpoint silent cross-service state corruption. In `--no-prompt` + multi-service, the second `resolveLocalAgentName` fails inside `resolveAgentServiceFromProject`, the error is swallowed at helpers.go:165, and `agentName` falls back to `"local"`. The session store still gets the correctly resolved composite key but the cache filename mismatches re-introducing a flavor of the original F1 bug. (Note: this `--no-prompt` failure mode pre-existed before 2.3.1 as well; it's a known gap, not a new regression.) Fix: resolve the agent service ONCE via `resolveLocalAgentName` and derive the composite key locally via `buildLocalAgentKey`. Net change at the call site is two lines (collapse from two paired calls to one resolve + one local derive). The 5-line comment block is expanded to explain why both values are needed and why we resolve once. `DefaultPort` is retained (not switched to `a.flags.port`) to preserve pre-2.3.1 session-store semantics a port switch would change cross-invocation session compatibility and is out of scope here. Reviewer trace: - Opus 4.7 xhigh surfaced the finding on the ratification review of f4a7f68aa. - Sonnet 4.6 AGREE on cross-pollination; argued HIGH severity over a `--no-prompt` regression that turned out to pre-exist 2.3.1 (so we land at Medium). - GPT-5.5 AGREE on cross-pollination; suggested using `a.flags.port` instead of `DefaultPort` deferred because that would change session-store semantics (per-port isolation) and needs its own consensus. Verification: - Preflight clean: gofmt, vet, build, full cmd tests (14.6s), nextstep tests (3.8s), golangci-lint 0, cspell 0. - Live smoke test against hello-world-python-invocations sample: session prefix `[localhost:8088/9c184b88be3f8efb/agents/hello- world-python-invocations/versions/latest/local]` is byte- identical to the pre-fix output, and the resurfaced session ID (882e2e6a-...) matches the prior invocation confirms the composite key is unchanged and session-store backward compat is preserved. 1 file, +12 / -6. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): wire invoke.go success paths to nextstep.ResolveAfterInvoke Phase 2 commit 2.4. Replaces the four `invoke` success-path returns with calls to `nextstep.ResolveAfterInvoke` `PrintNext`, so the `Next:` block at the end of a successful invoke is policy-driven (InvokeLocal `azd deploy`; InvokeRemote `azd ai agent show <name>` + `azd ai agent monitor --follow`) instead of silent. Adds a small file-local helper `(a *InvokeAction).emitInvokeSuccessNextStep` so all four success paths funnel through one place keeps the call sites symmetric and makes the future failure-path commit a single edit point. State is intentionally nil at every success call site: `ResolveAfterInvoke`'s success branches (`resolveInvokeSuccess` at resolver.go:160) don't read State, and `AssembleState` is not free (Project + CurrentEnvName + per-service EnvValue gRPC roundtrips for `nextstep.WithOpenAPIProbe`). The companion follow-up commit that wires invoke-failure paths will assemble state at the failure site, where it actually feeds `RemediationForSessionErrorCode`. Touch points (4 success returns rewritten from `return foo(...)` to `if err := foo(...); err != nil { return err }; emit(); return nil`): - `responsesLocal`: both JSON-and-not-JSON success branches. - `responsesRemote`: post-`readSSEStream` success. - `invocationsLocal`: post-`handleInvocationResponse` success. - `invocationsRemote`: post-`handleInvocationResponse` success. For the InvokeLocal branches, the resolver's success path ignores agentName (returns the same `azd deploy` line regardless), so the helper is called with empty agentName at the two local sites. This also dodges the resolve-once-derive-both concern: no new `resolveLocalAgentName` calls are added in `responsesLocal`, so there's no risk of re-introducing the double-prompt regression that commit 2.3.2 fixed for `invocationsLocal`. Out of scope (deferred): - Failure-path wiring (`SessionErrorCode` parsing from `x-adc-response-details` header + body, then `RemediationForSessionErrorCode` mapping). Tracked as commit 2.5. - `nextstep.AssembleState` calls at the failure sites. Same commit. Verification: - Preflight clean: gofmt, vet, build, full cmd tests (14.3s), nextstep tests (cached), golangci-lint 0, cspell 0. - Smoke-tested against hello-world-python-invocations sample: 1. Local invocations protocol emits `Next: azd deploy -- the local invoke worked ship it to Azure`. 2. Remote invocations protocol emits two-line block with `azd ai agent show hello-world-python-invocations` (the resolved agent name) + `azd ai agent monitor --follow`. - Session ID `882e2e6a-...` resurfaced under the byte-identical composite key `[localhost:8088/9c184b88be3f8efb/agents/ hello-world-python-invocations/versions/latest/local]` confirms backward compat across 2.3.1 2.3.2 2.4. - `responsesLocal` / `responsesRemote` not exercised on the wire (sample uses invocations protocol, not responses); verified by code inspection that the two return points were rewritten correctly and the helper call is reached on success. 1 file, +38 / -4. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): TTY-gate Next: emission in invoke success helper emitInvokeSuccessNextStep wrote nextstep.PrintNext to os.Stdout unconditionally, violating the call-site-gating contract documented in three places in the nextstep package: - nextstep/types.go:19-22 -- "Output discipline lives at the call sites: the package never writes to os.Stdout directly and never inspects --output flags. Callers gate on the isTerminal helper..." - nextstep/format.go:30-33 -- "PrintNext does not inspect TTY state or output-format flags -- those decisions live at the call site..." - helpers.go:810-811 -- isTerminal's own doc: "Used to gate human-only output such as the next-step guidance block." Symptom: `azd ai agent invoke ... > file` / `... | tee log` / CI-captured stdout received the trailing "Next:" block mixed in with the agent's reply, corrupting files and logs the user reasonably expected to contain only the model output. printAgentResponse's fallback path (invoke.go raw-body branch + json.MarshalIndent dump) is particularly affected: it emits structured-ish data that the Next: block then invalidates. Fix: one-line `if !isTerminal(os.Stdout.Fd()) { return }` at the top of the helper. All four success paths funnel through this one helper (the original 2.4 design), so a single gate covers every invocation mode (responses local/remote, invocations local/remote). No behavior change on TTY stdout. Smoke-tested against hello-world-python-invocations sample: - direct TTY: Next: block emits as before (2-line "show + monitor") - file redirect (`invoke > out.txt`): Next: block suppressed; only the agent's streamed reply lands in the file - pipe (`invoke ... | jq`): same -- block suppressed Scope deliberately limited to invoke. Two other call sites have the same pre-existing omission: - init.go:1608 -- theoretical only; init is interactive and writes no machine-readable output. Can be folded into a separate cleanup commit. - run.go:180 -- coupled to a "After startup, in another terminal, try:" preamble at run.go:179. TTY-gating just the PrintNext call would leave a dangling sentence. Needs a small design pass and its own commit -- not folded here. Code-review consensus on commit eb01d184b: - Sonnet 4.6 raised the finding at HIGH severity. - GPT-5.5 independently raised it at Medium. - Opus 4.7 (xhigh, cross-pollination pass) confirmed the bug, agreed Medium is the right severity (no `invoke --output json` today; existing output already not jq-clean), and explicitly recommended invoke-only scope -- echoed the rationale above for deferring init.go and run.go. Pre-flight: gofmt clean, vet, build, golangci-lint 0, cspell 0, cmd-package tests (14.5s) + nextstep tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents): wire invoke failure paths to ResolveAfterInvoke Phase 2 commit 2.5 surfaces the platform's recommended remediation when an invoke fails. Mirrors the 2.4 success-helper pattern with one new file-local helper and four wire-up sites. New helper: emitInvokeFailureNextStep(mode, agentName, sessionCode) funnels all four failure paths through one place. It builds an InvokeFailure{SessionCode: SessionErrorCode(sessionCode)} and passes it to ResolveAfterInvoke; the resolver's failure branch turns each known SessionErrorCode into the canonical remediation line (with optional secondary action) via RemediationForSessionErrorCode, and falls back to `azd ai agent monitor --tail 100` for empty or unknown codes. Local-invoke failures pass empty agentName + empty sessionCode and get a single `see local server output` line per the resolver's InvokeLocal branch. Wire-up sites in invoke.go: * responsesLocal (HTTP 4xx/5xx branch) emit before fmt.Errorf * responsesRemote (HTTP 4xx/5xx branch) extract x-adc-response-details from resp.Header before reading body, then emit * invocationsLocal (handleInvocationResponse err) local server doesn't set the header; pass "" * invocationsRemote (handleInvocationResponse err) capture x-adc-response-details from resp.Header BEFORE calling handleInvocationResponse (handler reads the body, header survives) Decisions baked in: * State is nil at every failure site. resolveInvokeFailure's signature (_ *State, ...) reflects that it doesn't read State today. Avoids the gRPC AssembleState roundtrip at the exact moment the user is staring at an error. If a future failure branch grows state-aware behavior, switch to AssembleState at that one site. * Output order: Next: BEFORE the error message (host renders error last, via SilenceErrors=true + ReportError). Mirrors git's `hint: ... error: ...` pattern. Smaller diff than the alternative (sentinel-error + silent-stderr + bespoke printing) and acceptable on an interactive terminal: Trace ID -> response body -> Next: block -> Error: line. * Separate helper from emitInvokeSuccessNextStep. Keeps the 2.4 success call sites byte-for-byte unchanged (already 3/3 reviewer-clean) and saves reviewers from re-verifying them. * TTY-gate inherited at the helper boundary (same isTerminal check that 2.4.1 added on the success helper). Pipe/redirect/CI capture suppresses the human-only block; the error itself still flows through stderr. NOT in scope (deliberate): * Connect-failure paths (responsesLocal:335-340, responsesRemote:495-497, invocationsLocal:586-591, invocationsRemote:699-701). Existing error messages already include actionable guidance like `Start it with: azd ai agent run`; Next: would be redundant. * Agent-error envelopes (200 OK with error-shaped JSON or SSE error event in handleInvocationSync / handleInvocationSSE). These are agent-level errors, not platform errors; the platform's SessionErrorCode vocabulary doesn't apply. Separate follow-up if user feedback indicates value. Smoke-tested against the deployed hello-world-python-invocations sample: * remote 4xx (invalid agent name) -> Trace ID -> blank line -> `Next: azd ai agent monitor --tail 100 -- inspect recent container logs for the failure` -> ERROR (host stderr). Pipe/redirect to file: Next: line correctly suppressed; ERROR still flows through stderr. Confirmed via Select-String against redirected stdout file. * success path unchanged: `azd ai agent invoke 'Hello!'` still streams tokens and emits the 2.4 `Next: azd ai agent show ... + monitor --follow` block at the end. 1 file, +46/-3. Pre-flight clean (gofmt, go vet, go build, cmd-package tests 14.2s, golangci-lint 0, cspell 0). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): gate invocations-protocol failure Next: by HTTP status Follow-up to b2e58f8fc (Phase 2 commit 2.5). All three reviewers (Opus 4.7 xhigh, Sonnet 4.6, GPT-5.5) independently flagged the same Medium-severity bug: the new wire-up at invocationsLocal:640 and invocationsRemote:754 fires emitInvokeFailureNextStep on every non-nil return from handleInvocationResponse, but that function returns errors for THREE distinct cases: 1. HTTP 4xx/5xx platform failures (invoke.go:782-785) 2. Agent error envelope in 200 OK JSON (invoke.go:819-821, via handleInvocationSync) 3. Agent error in SSE error event (invoke.go:868-870, via handleInvocationSSE) Cases 2 and 3 are agent-level errors carrying no x-adc-response-details header, so sessionCode is "" and the resolver falls to the empty-code branch: `azd ai agent monitor --tail 100 -- inspect recent container logs for the failure`. The agent process is healthy in those cases its logs likely contain nothing useful; the issue is in the request payload or the agent code. Per Opus's cross-protocol review, the responses protocol's analogous agent-level errors (printAgentResponse failed status at invoke.go:1175-1177, readSSEStream failed / error events at invoke.go:1127-1129 and :1148-1150) are correctly NOT wired, so the invocations protocol was inconsistent with itself and with the commit's stated architectural decision 5. Fix: gate the emit on resp.StatusCode >= 400 at both invocations sites. Adds a 5-line doc comment at the remote site explaining the rationale and a one-line cross-reference at the local site. The responses-protocol wire-ups (responsesLocal:391, responsesRemote:548) already short-circuit on resp.StatusCode >= 400 before reaching handleInvocationResponse, so they don't need the guard. Consensus pipeline: - GPT-5.5: proposed call-site guard exactly as applied here. - Sonnet 4.6: proposed call-site guard or accept-and-document. - Opus 4.7 (xhigh): proposed moving the emit INSIDE handleInvocationResponse's 4xx branch or accept-and-document. The "move inside" option would require restructuring handleInvocationResponse (it's a free function, not a method on *InvokeAction; can't reference a.emitInvokeFailureNextStep without adding a callback or making it a method). Call-site guard wins on minimal-diff grounds and matches 2/3 explicit endorsement. 1 file, +14/-2. Pre-flight clean (gofmt, go vet, go build, cmd-package tests 14.6s, golangci-lint 0, cspell 0). Live smoke: 4xx (invalid agent name) still emits Next: above the error; success path unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents/nextstep): match AgentVersionStatus wire values to API The Foundry Hosted Agents API returns AgentVersionObject.Status as lowercase (verified empirically: 'azd ai agent show' returns 'active'). The resolver's AgentVersionStatus constants were title-case, so ResolveAfterShow's typed switch never matched live data and every successful show would have hit the 'unknown / transitional' fallback branch. Lowercase the five constants (creating/active/failed/deleting/deleted) and the matching keys in the wire-drift test. Doc comment on the type now points at agent_api/models.go and the empirical evidence. No resolver logic change; no other callers. Found during commit 2.6 (show.go wire-up) implementation, before any user-visible exposure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(azure.ai.agents/cmd): wire show.go to nextstep resolver Adds context-aware `Next:` guidance to `azd ai agent show`. **Table output (--output table)**: render the existing field table, then on TTY emit a blank line + `nextstep.PrintNext`. Pipes and file redirects suppress the block (consistent with invoke's TTY gate from commits 2.4.1 and 2.5.1). **JSON output (--output json, the default)**: surface the same guidance under a new optional `next_step` envelope field. JSON is for machines: emitted unconditionally regardless of TTY. The envelope wrapper type omits the resolver's internal `Priority` and `Trailing` renderer hints - consumers only need `{command, description}`. **State assembly**: `a.resolveNextStep` calls `nextstep.AssembleState` once (best-effort), overrides `state.AgentStatus` with the live `version.Status` returned by the API, then calls `ResolveAfterShow( state, a.serviceName)`. Passes `info.ServiceName` (azure.yaml service name) rather than `info.AgentName` so `findService` matches `state.Services[].Name` for protocol lookup; the CLI's invoke command re-resolves either name, so the suggested command works in both common and divergent-name configurations. **Backward compat**: `next_step` is `omitempty`; existing JSON parsers continue to work. The existing `TestPrintAgentVersionJSON_*` test keeps passing. `TestPrintAgentVersionJSON_NoLinks` gains one assertion for the omit-when-nil contract. New `TestShowResultJSON_NextStepEnvelope` locks the envelope shape (single suggestion, exact keys, no priority/trailing leak). Updated `printShowResultTable` signature to accept `[]nextstep.Suggestion` (one new arg). The two test call sites pass `nil` to assert the no-suggestions path renders cleanly. Smoke verified against the deployed `hello-world-python-invocations` sample (status=active). JSON: `next_step.suggestions[0].command` is `azd ai agent invoke hello-world-python-invocations '{"message": "Hello!"}'` - the protocol-aware payload pulled from the OpenAPI cache, confirming the lowercased AgentVersionStatus constants from commit 2.5.2 are necessary and working end-to-end. Non-TTY table path suppresses the block (PowerShell tool not a TTY). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents/cmd/show): drop dead nil guard + doubled blank line Consensus fix-up on commit 2.6 (be72cdba2), addressing two findings that reached 3/3 reviewer agreement (both Low severity). S1 — dead nil guard in `resolveNextStep`: `AssembleState` (nextstep/state.go:174-215) unconditionally initializes `state := &State{}` and has a single return path. The `if state == nil` guard at the old line 206 was therefore unreachable. Removing it prevents future contributors from inferring a non-existent contract branch on `AssembleState`. The package's godoc already promises a non-nil partial state on every call. Updated the function's doc comment to cite this explicitly. Opus-1 — doubled blank line between table and Next: block on TTY: `printShowResultTable` was emitting `fmt.Println()` immediately before `nextstep.PrintNext`. But `PrintNext` → `renderBlock` already prepends its own leading `\n` (`nextstep/format.go:106-108`, "Leading blank line separates the block from preceding output."). Combined with the tabwriter's trailing `\n` on the last row, the result on TTY was three line terminators before "Next:" — two visible blank lines, not one. Verified by Opus xhigh against Format-Hex output. Sibling sites (`init.go:1607-1608`, `invoke.go`'s `emitInvokeSuccessNextStep`) call `PrintNext` directly without a preceding `Println` and produce a single blank-line separator. Both fixes are minimal — total diff is 5 inserted / 7 deleted, comment text adjusted to capture the contract going forward. Preflight: gofmt clean, vet clean, build clean, show tests pass (4.89s), golangci-lint 0 issues. Skipping the 3-reviewer pass on this commit per the precedent from 2.2.2 / 2.3.2 (trivial fix-ups don't need another full review pass). The remaining 3 findings from the 2.6 review (G1 ServiceName/AgentName divergence; G2 WithOpenAPIProbe wire-up; S2 untested status override) land in commit 2.6.2, which is more substantive and does get a review pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): correctness fixes from 2.6 cross-pollinated review Lands the 3/3-consensus correctness findings from the cross-pollinated review of `2930395cf..be72cdba2`. Splits the trivial cleanup (commit 2.6.1) from substantive behavior changes (this commit) so the diffs are independently reviewable. G1 (Medium, 3/3 with Opus's prior dismissal reversed) — ServiceName vs AgentName divergence: ResolveAfterShow previously took a single `agentName` parameter and used it for both `findService` (which keys on the azure.yaml service name) and `invokeCommandFor` (which embeds the deployed Foundry agent name in the suggested URL). When the two diverge (typical when deploy appends a suffix — `<service>-<suffix>` is a common Foundry naming pattern), the emitted suggestion `azd ai agent invoke <serviceName> ...` produces a URL path `/agents/<serviceName>/...` that 404s on the Foundry API. Fix: split the signature to `ResolveAfterShow(state, serviceName, agentName)`. Active branch uses serviceName for protocol lookup and agentName for command emission. Unknown-status fallback uses serviceName (matches show.go's lookup contract — `resolveAgentService` matches by service name). show.go now passes both via a new `agentName` field on ShowAction populated from `info.AgentName` alongside the existing `serviceName`. Option B chosen over modifying invoke.go's gated `if name == "" && info.AgentName != ""` translation because the latter would change CLI semantics for users passing Foundry names positionally. G2 (Medium, 3/3) — OpenAPI cache wiring: show.go's `resolveNextStep` previously called `AssembleState` without `WithOpenAPIProbe`, so `state.HasOpenAPI` was always false and the Active-branch invoke suggestion always used the protocol-generic literal. Fix: pass `nextstep.WithOpenAPIProbe(agentName, "remote")` (matches `invoke.go:725`'s `fetchOpenAPISpec(... name, "remote", ...)` cache write convention). `invokeCommandFor` now accepts `*State` and prefers `shellEscapeSingleQuoted(state.OpenAPIPayload)` over the protocol literal when `state.HasOpenAPI && state.OpenAPIPayload != ""`. Mirrors the pattern at `resolver.go:104-106` in `ResolveAfterRun`. Best-effort silent fallback: when the cache is empty (no prior `invoke` populated it, or cache lookup errored), the protocol-generic literal is emitted unchanged. Same UX contract as `ResolveAfterRun`. S2 (Low, 2/3 after Sonnet's Medium downgraded) — resolveNextStep end-to-end wiring test: Extracts `resolveNextStepFromSource` as the testable core of show.go's `resolveNextStep` method, taking a `nextstep.Source` directly instead of building one from `*azdext.AzdClient`. Production path (`(*ShowAction).resolveNextStep`) calls it with `NewSource(a.azdClient)`; tests inject a `fakeShowSource`. New public `AssembleStateFromSource` in the nextstep package wraps the existing private `assembleState` function. Three wiring tests added in show_test.go: - ActiveBranch_InvocationsProtocol: writes a real agent.yaml under a `t.TempDir` project root and verifies AssembleState reads it, detecting the invocations protocol, and ResolveAfterShow emits the protocol-aware payload with the Foundry agent name (locks both G1 and G2's no-cache fallback path). - UnknownStatusFallsBackToServiceName: locks G1's fallback choice. - NonActiveBranches: sanity-checks the remaining status branches don't depend on either name. Plus three more cases added to TestResolveAfterShow_* in resolver_test.go: DivergentNames (locks G1 directly), and ActiveConsumesOpenAPICache subcases for plain payload, apostrophe escaping, and empty-payload fallback (locks G2 directly). Scope discipline (not in this commit): - The OpenAPI cache for the deployed `hello-world-python-invocations` sample is empty in this session — the smoke run after this commit shows the protocol-generic literal `'{"message": "Hello!"}'` falling through unchanged, confirming graceful degradation. Cache hit path is covered by unit test. - Pre-existing api-version bug on remote OpenAPI URL is tracked separately ("Next up" #1 in plan.md). Pre-flight: gofmt clean, vet clean, full extension test suite green (cmd 14.2s, nextstep cached, all other packages green), golangci-lint 0 issues, cspell 0 issues. Live smoke against `hello-world-python- invocations`: `azd ai agent show --output json` still returns the expected `next_step` envelope with the protocol-aware command. 5 files changed, +173/-37. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): correct divergent-name invoke suggestion (G3) Background ---------- Commit 84bfc741f (2.6.2) split ResolveAfterShow's signature into (state, serviceName, agentName) and emitted the deployed Foundry agent name as the positional of the suggested invoke command. The stated rationale was: invoke's remote URL path embeds the agent name verbatim, so passing the azure.yaml service name there would yield a 404 from Foundry. That rationale was right about the URL path but missed the upstream failure point. All three 2.6.2 reviewers re-traced the consumer end to end and reached consensus that the previously rejected fix (unconditionally translate inside invoke.go) was the correct one. G3 — Medium, 3/3 consensus (GPT-5.5, Sonnet 4.6, Opus xhigh) ------------------------------------------------------------ Trace of the broken case (azure.yaml services: { echo }, env AGENT_ECHO_NAME=echo-deployed-x7q9): 1. Resolver emits: azd ai agent invoke echo-deployed-x7q9 '<payload>' 2. InvokeAction.Run calls a.resolveProtocol(ctx) FIRST, before any URL is constructed (invoke.go:167). 3. resolveProtocol falls through to: resolveAgentProtocol(ctx, azdClient, "echo-deployed-x7q9", ...) (invoke.go:273). 4. resolveAgentProtocol delegates to resolveAgentService (helpers.go:728). 5. resolveAgentService loops projectResponse.Project.Services and matches by s.Name == name (helpers.go:562). No svc.Name equals "echo-deployed-x7q9", so svc stays nil and the function returns: "no azure.ai.agent service named 'echo-deployed-x7q9' found in azure.yaml" 6. Error propagates back to Run. invocationsRemote / responsesRemote are never called. The URL-path correctness 2.6.2 paid for never fires. The 2.6.2 G1 fix was a half-measure: the signature split gave the resolver access to serviceName, but line 253 still passed agentName to invokeCommandFor — handing the resolver a knife and choosing to stab itself. The new TestResolveAfterShow_DivergentNames test in 2.6.2 only asserted the emitted *string*, not what happens when that string is fed back into Run. Fix (Option A — both halves are load-bearing together) ------------------------------------------------------ 1. resolver.go — emit serviceName as the positional. The signature reverts to ResolveAfterShow(state *State, serviceName string). agentName is no longer needed by the resolver because: - protocol lookup keys on serviceName via findService (already) - the positional is now serviceName (this commit) - the OpenAPI probe runs in show.go BEFORE the resolver and populates state.OpenAPIPayload from the cache; the cache key still uses agentName, but that's an internal contract owned by show.go and invoke.go, not the resolver's API surface. 2. invoke.go — flip the translation gate in BOTH protocol-specific remote functions: invocationsRemote (invoke.go:663-665): before: if name == "" && info.AgentName != "" after: if info.AgentName != "" responsesRemote (invoke.go:425-427): before: if name == "" && info.AgentName != "" after: if info.AgentName != "" The flip is safe by construction: - When user passes the SERVICE name positionally, the lookup at helpers.go:560 succeeds, info.AgentName is populated, the gate fires, name is translated to the deployed Foundry name, and the URL is correct. - When user passes the DEPLOYED Foundry name positionally (legacy behavior; never reaches this path in practice today because resolveProtocol fails first), the lookup at line 560 fails, err != nil, the entire if-block is skipped, the gate is never reached, and behavior is unchanged. - When names match (no divergence), translation is a no-op. Opus's retrospective: the rejection reason cited in 2.6.2 ("would change CLI semantics for users passing Foundry names positionally") does not hold — those users hit the err != nil branch and never reach the gate. Why not the alternatives the reviewers also analyzed: - Option B: add `--protocol <protocol>` to the suggestion. Skips protocol resolution, but resolveAgentServiceFromProject still fails silently inside the protocol-specific remote, leaving agentEndpoint = "" — silent loss of session persistence. A regression in disguise. - Option C: fall back to AgentName search in resolveAgentService. Wider blast radius. Affects `run` / `init` / `monitor` / files / session — none of which this PR has any business touching. End-to-end trace with the fix ----------------------------- azure.yaml services: { echo }, AGENT_ECHO_NAME=echo-deployed-x7q9 Resolver emits: azd ai agent invoke echo '<payload>' 1. resolveProtocol → resolveAgentService("echo") matches svc.Name → protocol resolved ✓ 2. invocationsRemote called with name = "echo" 3. resolveAgentServiceFromProject("echo") succeeds → info.AgentName = "echo-deployed-x7q9", info.AgentEndpoint set 4. New gate fires → name = "echo-deployed-x7q9" 5. URL: …/agents/echo-deployed-x7q9/endpoint/protocols/… ✓ 6. agentEndpoint != "" → session persistence active ✓ 7. Cache key for fetchOpenAPISpec uses the post-translation name (Foundry name), aligned with show.go's WithOpenAPIProbe(agentName, "remote") read key ✓ Files changed (5, +55/-43) -------------------------- M cli/azd/extensions/azure.ai.agents/internal/cmd/invoke.go — flip gate at lines 425 and 664 (2 char-level changes) M cli/azd/extensions/azure.ai.agents/internal/cmd/nextstep/resolver.go — ResolveAfterShow signature: drop agentName parameter — Active branch: invokeCommandFor receives serviceName — invokeCommandFor: rename agentName→name in signature/doc; behavior unchanged — doc comment rewritten to explain that the resolver emits service name end-to-end and invoke translates internally M cli/azd/extensions/azure.ai.agents/internal/cmd/nextstep/resolver_test.go — 11 call sites updated to drop second arg — TestResolveAfterShow_DivergentNames Active subcase flipped: was "echo-suffix-abc123", is now "svc-echo" — test doc-comment rewritten M cli/azd/extensions/azure.ai.agents/internal/cmd/show.go — ResolveAfterShow call site drops agentName — agentName field on ShowAction is retained: still used by resolveNextStepFromSource for the OpenAPI probe M cli/azd/extensions/azure.ai.agents/internal/cmd/show_test.go — TestResolveNextStepFromSource_ActiveBranch_InvocationsProtocol assertion flipped to service name Pre-flight ---------- ✓ gofmt -s -d clean ✓ go vet ./... clean ✓ go build ./... clean ✓ go test ./... full suite green (cmd 11.1s, nextstep 1.7s, all other packages green) ✓ golangci-lint 0 issues ✓ cspell 0 issues ✓ live invoke smoke azd ai agent invoke hello-world-python- invocations '{"message": "Hello!"}' against the deployed sample — full SSE stream, agent responded "Hello! How can I assist you today?" (same-name path; divergent-name path locked by unit tests) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(azure.ai.agents): close symmetric G3 in invoke-success suggestion (G4) Background ---------- Commit 211d1f334 (2.6.3) fixed the divergent-name path in one direction: `azd ai agent show` now emits `azd ai agent invoke <serviceName> ...` and the gate flip in invoke.go translates to the deployed Foundry name internally. Three reviewers (Opus xhigh, Sonnet 4.6, GPT-5.5) ran on 211d1f334. GPT and Sonnet each flagged a LOW doc-comment issue (S4 — stale `ShowAction.agentName` doc; deferred to a doc-cleanup commit). Opus xhigh surfaced an unrelated MEDIUM finding (G4) that the other two missed. On cross-pollination, Sonnet and GPT both independently traced it and endorsed it at MEDIUM with the same proposed fix. 3/3 consensus. G4 — Medium, 3/3 consensus -------------------------- 2.6.3's gate flip translates the local variable `name` *in place* from the azure.yaml service name to the deployed Foundry agent name (the correct value for the URL path). But that post-translation `name` is then passed to `emitInvokeSuccessNextStep(mode, name)` in both protocol-specific remote functions, which feeds `nextstep.ResolveAfterInvoke` → `resolveInvokeSuccess`. The resolver embeds the value verbatim: primary = fmt.Sprintf("azd ai agent show %s", agentName) Trace of the broken case (azure.yaml: { echo }, env AGENT_ECHO_NAME=echo-deployed-x7q9): 1. User runs the resolver's recommended azd ai agent invoke echo '<payload>' (correct after 2.6.3). 2. invocationsRemote: gate fires → name = "echo-deployed-x7q9". 3. HTTP call → URL is correct → SSE stream returns. 4. Post-success block: emitInvokeSuccessNextStep(mode, "echo-deployed-x7q9") → resolver emits: Next: azd ai agent show echo-deployed-x7q9 (confirm health) azd ai agent monitor --follow (stream logs) 5. User follows that first suggestion. 6. show.go:85 → resolveAgentServiceFromProject( ctx, azdClient, "echo-deployed-x7q9", ...) → resolveAgentService → helpers.go:560-569 matches s.Name == "echo-deployed-x7q9" against azure.yaml → no match → error: "no azure.ai.agent service na…
1 parent d6cf42b commit d478803

77 files changed

Lines changed: 22858 additions & 152 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

cli/azd/extensions/azure.ai.agents/cspell.yaml

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,11 @@ words:
3939
- CLIENTSECRET
4040
- curr
4141
- dataagent
42+
- envkey
4243
- exterrors
4344
- helloworld
4445
- hostedagent
46+
- hostedagents
4547
- kval
4648
- logstream
4749
- mcpservertoolalwaysrequireapprovalmode
@@ -56,5 +58,18 @@ words:
5658
- protocolversionrecord
5759
- Qdrant
5860
- Toolsets
61+
- underscoped
5962
- Vnext
6063
- webp
64+
# Doctor / next-step terms
65+
- nextstep
66+
- nextsteps
67+
- undeployed
68+
- unredacted
69+
- UNKN
70+
- inlines
71+
- Remediations
72+
- remediations
73+
- uppercases
74+
- parseable
75+
- azd's

cli/azd/extensions/azure.ai.agents/go.mod

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,8 @@ require (
3030

3131
require github.com/denormal/go-gitignore v0.0.0-20180930084346-ae8ad1d07817
3232

33+
require golang.org/x/term v0.41.0
34+
3335
require (
3436
dario.cat/mergo v1.0.2 // indirect
3537
github.com/AlecAivazis/survey/v2 v2.3.7 // indirect
@@ -110,7 +112,6 @@ require (
110112
golang.org/x/exp v0.0.0-20260112195511-716be5621a96 // indirect
111113
golang.org/x/net v0.52.0 // indirect
112114
golang.org/x/sys v0.42.0 // indirect
113-
golang.org/x/term v0.41.0 // indirect
114115
golang.org/x/text v0.35.0 // indirect
115116
golang.org/x/time v0.14.0 // indirect
116117
)
Lines changed: 279 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,279 @@
1+
// Copyright (c) Microsoft Corporation. All rights reserved.
2+
// Licensed under the MIT License.
3+
4+
package cmd
5+
6+
import (
7+
"context"
8+
"fmt"
9+
"io"
10+
"os"
11+
"path/filepath"
12+
13+
"azureaiagent/internal/cmd/doctor"
14+
"azureaiagent/internal/cmd/nextstep"
15+
"azureaiagent/internal/pkg/paths"
16+
"azureaiagent/internal/version"
17+
18+
"github.com/azure/azure-dev/cli/azd/pkg/azdext"
19+
"github.com/spf13/cobra"
20+
)
21+
22+
// doctorFlags are the Cobra-bound flags for `azd ai agent doctor`.
23+
type doctorFlags struct {
24+
localOnly bool
25+
unredacted bool
26+
}
27+
28+
func newDoctorCommand() *cobra.Command {
29+
flags := &doctorFlags{}
30+
31+
cmd := &cobra.Command{
32+
Use: "doctor",
33+
Short: "Diagnose problems with an azd ai agent project.",
34+
Long: `Diagnose problems with an azd ai agent project.
35+
36+
Runs a sequence of local and remote checks against the current azd project,
37+
reporting on each one and (when all checks pass) suggesting the next
38+
command to run. Use this when you have lost terminal context or hit a
39+
confusing error and want a complete picture of the project's state.
40+
41+
Exit codes:
42+
0 — at least one check passed and no checks failed
43+
1 — any check failed
44+
2 — all checks were skipped (e.g. preconditions unmet)`,
45+
Example: ` # Run the full check suite
46+
azd ai agent doctor`,
47+
Args: cobra.NoArgs,
48+
RunE: func(cmd *cobra.Command, args []string) error {
49+
ctx := azdext.WithAccessToken(cmd.Context())
50+
logCleanup := setupDebugLogging(cmd.Flags())
51+
defer logCleanup()
52+
53+
// `--debug` (persistent root flag) also toggles the verbose per-check
54+
// detail block in the doctor report.
55+
debug := isDebug(cmd.Flags())
56+
57+
// Let `local.grpc-extension` report client creation failures so
58+
// downstream checks can skip instead of duplicating the error.
59+
azdClient, clientErr := azdext.NewAzdClient()
60+
if azdClient != nil {
61+
defer azdClient.Close()
62+
}
63+
64+
deps := doctor.Dependencies{
65+
AzdClient: azdClient,
66+
AzdClientErr: clientErr,
67+
ExtensionVersion: version.Version,
68+
AgentAPIVersion: DefaultAgentAPIVersion,
69+
}
70+
71+
opts := doctor.Options{
72+
LocalOnly: flags.localOnly,
73+
Unredacted: flags.unredacted,
74+
}
75+
76+
report, err := runAndRenderDoctorText(ctx, deps, opts, azdClient, os.Stdout, debug)
77+
if err != nil {
78+
return err
79+
}
80+
81+
// Use os.Exit to preserve doctor's 0/1/2 exit-code contract;
82+
// Cobra/azdext would otherwise collapse all errors to 1.
83+
// os.Exit skips defers, so do not add cleanup-critical defers here.
84+
code := doctor.ExitCode(report)
85+
if code == 0 {
86+
return nil
87+
}
88+
os.Exit(code)
89+
return nil // unreachable
90+
},
91+
}
92+
93+
cmd.Flags().BoolVar(
94+
&flags.localOnly, "local-only", false,
95+
"Skip remote (network-dependent) checks. "+
96+
"Useful when offline, behind a proxy, or for a fast local triage.",
97+
)
98+
cmd.Flags().BoolVar(
99+
&flags.unredacted, "unredacted", false,
100+
"Show raw principal IDs, scope ARNs, and UPNs in the report.",
101+
)
102+
103+
return cmd
104+
}
105+
106+
// runAndRenderDoctorText streams human-readable output as checks complete.
107+
// `debug` switches between the default concise rendering and the verbose
108+
// per-check Message/Suggestion/Links block.
109+
func runAndRenderDoctorText(
110+
ctx context.Context,
111+
deps doctor.Dependencies,
112+
opts doctor.Options,
113+
azdClient *azdext.AzdClient,
114+
w io.Writer,
115+
debug bool,
116+
) (doctor.Report, error) {
117+
renderer := newDoctorRenderer(w, debug)
118+
if err := renderer.writeHeader(); err != nil {
119+
return doctor.Report{}, err
120+
}
121+
122+
report, trailing, err := runDoctorWithObserver(
123+
ctx,
124+
deps,
125+
opts,
126+
azdClient,
127+
func(result doctor.Result) error {
128+
return renderer.writeCheck(result)
129+
},
130+
)
131+
if err != nil {
132+
return report, err
133+
}
134+
135+
showNext := len(trailing) > 0 && writerIsTerminal(w)
136+
if err := renderer.writeFooter(report, trailing, showNext); err != nil {
137+
return report, err
138+
}
139+
return report, nil
140+
}
141+
142+
func runDoctorWithObserver(
143+
ctx context.Context,
144+
deps doctor.Dependencies,
145+
opts doctor.Options,
146+
azdClient *azdext.AzdClient,
147+
observer doctor.ResultObserver,
148+
) (doctor.Report, []nextstep.Suggestion, error) {
149+
// Keep local checks first so remote checks can inspect their prior
150+
// results for skip-cascade decisions.
151+
checks := append(doctor.NewLocalChecks(deps), doctor.NewRemoteChecks(deps)...)
152+
runner := doctor.Runner{Checks: checks}
153+
report, err := runner.RunWithObserver(ctx, opts, observer)
154+
if err != nil {
155+
return report, nil, err
156+
}
157+
158+
// Show trailing Next: only on clean reports; otherwise it competes with
159+
// the failing check's remediation.
160+
if doctor.ExitCode(report) != 0 {
161+
return report, nil, nil
162+
}
163+
164+
trailing := resolveDoctorTrailing(ctx, azdClient)
165+
return report, trailing, nil
166+
}
167+
168+
// resolveDoctorTrailing returns the doctor's trailing Next block, or nil on
169+
// error. It chooses deployed-agent suggestions when any service is deployed;
170+
// otherwise it reuses the post-init guidance.
171+
func resolveDoctorTrailing(ctx context.Context, azdClient *azdext.AzdClient) []nextstep.Suggestion {
172+
if azdClient == nil {
173+
return nil
174+
}
175+
176+
state, _ := nextstep.AssembleStateFromSource(ctx, nextstep.NewSource(azdClient))
177+
if len(state.Services) == 0 {
178+
// Avoid repeating the missing-service guidance already reported by
179+
// `local.agent-service-detected`.
180+
return nil
181+
}
182+
183+
if anyServiceDeployed(state.Services) {
184+
// Filter to deployed services so the generated invoke/show commands
185+
// stay copy-paste correct.
186+
return nextstep.ResolveAfterDeploy(
187+
filterDeployedServices(state),
188+
doctorCachedPayload(ctx, azdClient),
189+
doctorReadmeExists(ctx, azdClient),
190+
)
191+
}
192+
193+
return nextstep.ResolveAfterInit(state)
194+
}
195+
196+
func anyServiceDeployed(services []nextstep.ServiceState) bool {
197+
for _, s := range services {
198+
if s.IsDeployed {
199+
return true
200+
}
201+
}
202+
return false
203+
}
204+
205+
// filterDeployedServices returns a shallow clone with only deployed services.
206+
func filterDeployedServices(state *nextstep.State) *nextstep.State {
207+
if state == nil {
208+
return nil
209+
}
210+
clone := *state
211+
clone.Services = make([]nextstep.ServiceState, 0, len(state.Services))
212+
for _, s := range state.Services {
213+
if s.IsDeployed {
214+
clone.Services = append(clone.Services, s)
215+
}
216+
}
217+
return &clone
218+
}
219+
220+
// doctorCachedPayload returns a remote-cache lookup closure for ResolveAfterDeploy.
221+
// It returns "" on failure and tries deployed Foundry agent names before
222+
// falling back to azure.yaml service names.
223+
func doctorCachedPayload(ctx context.Context, azdClient *azdext.AzdClient) func(string) string {
224+
var envName string
225+
if azdClient != nil {
226+
if envResp, err := azdClient.Environment().GetCurrent(ctx, &azdext.EmptyRequest{}); err == nil &&
227+
envResp != nil && envResp.Environment != nil {
228+
envName = envResp.Environment.Name
229+
}
230+
}
231+
232+
return func(serviceName string) string {
233+
if azdClient == nil || serviceName == "" {
234+
return ""
235+
}
236+
configPath, err := resolveConfigPath(ctx, azdClient)
237+
if err != nil {
238+
return ""
239+
}
240+
configDir := filepath.Dir(configPath)
241+
242+
if envName != "" {
243+
nameKey := fmt.Sprintf("AGENT_%s_NAME", toServiceKey(serviceName))
244+
if v, err := azdClient.Environment().GetValue(ctx, &azdext.GetEnvRequest{
245+
EnvName: envName,
246+
Key: nameKey,
247+
}); err == nil && v != nil && v.Value != "" && v.Value != serviceName {
248+
if spec, err := nextstep.ReadCachedOpenAPISpec(configDir, v.Value, "remote"); err == nil {
249+
if payload := nextstep.ExtractInvokeExample(spec); payload != "" {
250+
return payload
251+
}
252+
}
253+
}
254+
}
255+
256+
spec, err := nextstep.ReadCachedOpenAPISpec(configDir, serviceName, "remote")
257+
if err != nil {
258+
return ""
259+
}
260+
return nextstep.ExtractInvokeExample(spec)
261+
}
262+
}
263+
264+
// doctorReadmeExists returns a readmeExists closure for ResolveAfterDeploy.
265+
// Only canonical "README.md" casing is checked to match rendered guidance.
266+
func doctorReadmeExists(ctx context.Context, azdClient *azdext.AzdClient) func(string) bool {
267+
projectRoot := resolveProjectPath(ctx, azdClient)
268+
return func(relativePath string) bool {
269+
if projectRoot == "" {
270+
return false
271+
}
272+
readmePath, err := paths.JoinAllowRoot(projectRoot, relativePath, "README.md")
273+
if err != nil {
274+
return false
275+
}
276+
_, err = os.Stat(readmePath)
277+
return err == nil
278+
}
279+
}

0 commit comments

Comments
 (0)