feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes) by Cataldir · Pull Request #1103 · Azure-Samples/holiday-peak-hub

Cataldir · 2026-05-13T13:53:24Z

Summary

Pilot end-to-end Foundry V3 Hosted Agents for inventory-health-check, validated by a successful HTTP 200 invocation against the public Responses endpoint in the aipholidaris project — for both stream=false (curl ping) and stream=true (Foundry portal Playground, after 881b49a8).

This PR also lands the eight live-deployment fixes discovered while running the pilot against the platform, all of which now have regression tests. Fixes #1–#5 came from the initial activation track; #6 (the PORT reserved-name regression) was found in the previous session; #7 (the streaming-protocol contract) was found this session when the Foundry portal Playground surfaced an 'async for' requires __aiter__, got coroutine TypeError that the original ping test (stream=false) had not exercised. #8 codifies the Foundry User role auto-grant in scripts/ops/deploy_hosted_agent.py (closing #1107) so the manual az role assignment create runbook step is no longer required.
Fix 8 (new)
Bug: The Foundry SDK deploy path (`scripts/ops/deploy_hosted_agent.py`) did not grant the `Foundry User` role to the per-version managed identity minted by `create_version`. Without that role, the container ran fine but the Foundry runtime returned 401 on `POST .../storage/responses` and the Playground surfaced a generic 'internal error storing the response' toast. Manual `az role assignment create` was the workaround.
Fix: `deploy_hosted_agent_version` now auto-resolves the per-version `instance_identity.principal_id`, derives the project ARM scope from `project_endpoint` via `az resource list`, and calls `az role assignment create --assignee-principal-type ServicePrincipal --role 'Foundry User'`. Idempotent on `RoleAssignmentExists`. New CLI flags: `--no-auto-grant-foundry-user`, `--foundry-role-name`, `--project-scope`. Failure does NOT mask a successful version activation -- it is recorded in `result.extras['role_grant']`. Closes #1107.
Regression tests: `test_deploy_auto_grants_foundry_user_after_active`, `test_deploy_skips_grant_when_auto_grant_disabled`, `test_deploy_records_already_exists_when_granter_returns_none`, `test_deploy_surfaces_grant_failure_without_breaking_active`, `test_deploy_records_skipped_when_principal_id_missing`, `test_deploy_uses_explicit_project_scope_override`, `test_grant_role_via_az_treats_already_exists_as_idempotent`, `test_grant_role_via_az_raises_on_real_failure`, `test_grant_role_via_az_parses_assignment_id_on_success`, `test_extract_principal_id_from_instance_identity`, `test_extract_principal_id_from_managed_identity_alias`, `test_extract_principal_id_from_mapping`

Status as of 2026-05-18: inventory-health-check v20 is active and answers /responses with HTTP 200. Non-streaming evidence below. Streaming-mode invocations now also succeed locally (12/12 hosted-adapter tests + 1360 lib tests + 3 pilot tests) and will be re-verified against Foundry as soon as a new image (v21+) ships the 881b49a8 adapter fix. Two follow-up issues filed for the operational hardening discovered during root-cause analysis: #1107 (auto-grant Foundry User in the SDK deploy path) and #1108 (codify ACR prerequisites in IaC).

End-to-end invocation evidence (final state)

POST /api/projects/aipholidaris/agents/inventory-health-check/endpoint/protocols/openai/v1/responses
{ "model": "inventory-health-check", "input": "ping" }

HTTP/1.1 200 OK
{
  "id": "caresp_adff33150146c05d00THJqzmj3Im0sk6nwZoZcSzyRKmh7LQpc",
  "status": "completed",
  "output": [{ "type": "message", "role": "assistant", "content": [{
    "type": "output_text",
    "text": "{\"error\": \"sku is required\", \"hint\": \"Provide a SKU id in the prompt, e.g. 'check health for SKU-1234'.\", \"input\": \"ping\"}"
  }]}]
}

App Insights trace for the same invocation:

DefaultAzureCredential acquired a token from ManagedIdentityCredential
Foundry storage POST .../storage/responses?api-version=v1 -> 201 (66.9ms)
Response caresp_… completed: status=completed output_count=1
Inbound POST /responses completed with status 200 in 555.8ms

This proves the full V3 hosted-agent lifecycle: deploy → activate → invoke → container → agent code → Foundry storage → response → client — all healthy in the stream=false path. The streaming path is validated by the new hosted-adapter unit tests (test_hosted_run_adapter_streams_single_update and test_hosted_run_adapter_non_streaming_returns_awaitable) and will be re-verified against Foundry once v21+ is deployed with 881b49a8.
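For readers who want to check the same evidence programmatically, here is a minimal sketch of pulling the agent's structured reply out of a Responses-style body. The helper name `extract_output_text` is hypothetical, and the sample payload is a shortened copy of the body shown above (the `id` is a placeholder); it is not the contents of `.tmp/invoke-hosted-agent.py`.

```python
import json
from typing import Any


def extract_output_text(response_body: dict[str, Any]) -> list[str]:
    """Collect every `output_text` string from a Responses-style payload."""
    texts: list[str] = []
    for item in response_body.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message output items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                texts.append(part.get("text", ""))
    return texts


# Shortened copy of the evidence body; the id is a placeholder.
sample = {
    "id": "caresp_example",
    "status": "completed",
    "output": [{"type": "message", "role": "assistant", "content": [{
        "type": "output_text",
        "text": "{\"error\": \"sku is required\", \"input\": \"ping\"}",
    }]}],
}

# The agent returns JSON inside the text part, so it is decoded a second time.
agent_payload = json.loads(extract_output_text(sample)[0])
print(sample["status"], agent_payload["error"])
```

The double decode matters: the outer body is the platform envelope, while the inner JSON string is the agent's own domain answer.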

GET /api/projects/aipholidaris/agents/inventory-health-check/versions?api-version=v1

version	status	image
1–3	active	`holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v3` (digest `sha256:5b9d8601…`)
15–19	failed	`ImageError` — root-caused to `azureADAuthenticationAsArmPolicy=disabled` on canonical ACR
20	active	`holidaypeakhub405devacr.azurecr.io/inventory-health-check@sha256:d4775cdf…` (tag `foundry-v6`, build run `cj28`) — invoked end-to-end (non-streaming)
21+	pending	will carry the `881b49a8` streaming-protocol fix for Playground/SSE invocations

Live fixes landed in this PR

#	Bug	Fix	Regression test
1	`AIProjectClient` rejected hosted manifests with HTTP 400 because the SDK requires preview opt-in.	`_build_project_client` constructs `AIProjectClient(..., allow_preview=True)`; agent IDs are looked up by name through `agents.get_version(...)` (legacy ID URL is gone in V3).	`test_build_project_client_passes_allow_preview`
2	Pilot manifest was not loaded by the registration script (only `agent.yaml` / `agent.manifest.yaml` were tried).	`manifest.py` loader now probes `agent.manifest.yaml` -> `agent.hosted.yaml` -> `agent.yaml` so a hosted-only manifest can sit alongside the metadata-only `agent.yaml` without changing the portal-tracking contract.	covered by existing manifest loader tests
3	`agent.hosted.yaml` declared the protocol version field that V3 rejects.	Manifest now follows the canonical `template.kind: hosted` shape with `protocols: [{protocol: responses, version: "1.0.0"}]` and `container.cpu/memory` instead of nested `definition.*`.	manifest snapshot test
4	Container env names matching `FOUNDRY_*` / `AGENT_*` were rejected by `create_version` with `ValidationError: ... reserved per container-image-spec`. The platform reserves the entire `FOUNDRY_*` and `AGENT_*` namespaces, not just the six documented platform-injected names.	Manifest renamed to `HPH_AGENT_ID_FAST` / `HPH_AGENT_ID_RICH` (and `HPH_AGENT_NAME_*`). `holiday_peak_lib.app_factory_components.foundry_lifecycle.build_foundry_config` now reads the `HPH_` prefix first and falls back to the legacy `FOUNDRY_AGENT_ID_*` / `FOUNDRY_AGENT_NAME_*` for AKS deploys (back-compat).	`test_build_foundry_config_prefers_hph_agent_id_over_foundry_agent_id`, `test_build_foundry_config_hph_agent_name_takes_precedence`
5	Poll loop never recognised terminal states: SDK 2.1.0 deserialises `"status": "failed"` into an `AgentVersionStatus` enum whose `str()` returns `"AgentVersionStatus.FAILED"`. The previous `str(status).lower()` produced `"agentversionstatus.failed"`, which did not match `_TERMINAL_STATUSES = {"active","failed","deleted"}`, so the script timed out instead of raising `RuntimeError`.	Added `_normalize_status` in `lib/src/holiday_peak_lib/foundry_hosting/deploy.py` that prefers the enum `.value` field and falls back to stripping any `Enum.MEMBER` dotted prefix.	`test_normalize_status_handles_enum_value`, `test_normalize_status_strips_dotted_enum_repr`, `test_normalize_status_plain_string`
6	Foundry V3 rejected the previously-accepted `PORT=8088` declaration with `invalid_payload: Environment variable 'PORT' is reserved for platform use`. The reserved namespace expanded between the time we wrote the pilot manifest and the final activation.	`apps/inventory-health-check/agent.hosted.yaml` no longer declares `PORT`. The existing Dockerfile CMD (`${PORT:-${UVICORN_PORT:-8088}}`) picks up the platform-injected value first, then falls back to `UVICORN_PORT` (still declared) for local docker-run / AKS dev parity.	manifest snapshot test
7 (new)	Foundry's `ResponsesHostServer` (preview SDK `agent-framework-foundry-hosting==1.0.0a260507`) calls `agent.run` with two distinct contracts depending on `stream`: `await agent.run(stream=False, ...)` expects a coroutine returning `AgentResponse`; `async for update in agent.run(stream=True, ...):` expects an async iterator of `AgentResponseUpdate` items. Our `_HostedAgentRunAdapter.run` was marked `async def`, so it always returned a coroutine. The Playground (which defaults to `stream=true`) triggered `TypeError: 'async for' requires an object with __aiter__, got coroutine` at upstream `_responses.py:341`, which cascaded into a 401 on the persistence write because the response object had already been created server-side. The `ping` test passed because it used the non-streaming path.	`lib/src/holiday_peak_lib/agents/hosted.py` reshapes `_HostedAgentRunAdapter.run` into a synchronous dispatcher that returns either a coroutine (`_run_once` → `AgentResponse`) or an async iterator (`_run_streaming` yielding one `AgentResponseUpdate(contents=[Content(type='text', text=…)])`) based on the `stream` flag. The shared `_invoke_handle` helper keeps the translation/dispatch/extraction logic in one place. Per-token streaming via `invoke_model_stream` remains a follow-up.	`test_hosted_run_adapter_streams_single_update` (pins async-iterator contract), `test_hosted_run_adapter_non_streaming_returns_awaitable` (pins awaitable contract)

Three operational findings (project- and ACR-side — not in this PR's code; followed up in #1107 and #1108)

These are environment-level prerequisites discovered while activating v20. They are already applied to the live dev environment and documented in memories/session/foundry-v3-pilot-status.md. They are out of scope for this PR but tracked for codification:

#	Finding	Why it matters	Followed up by
O1	ACR `policies.azureADAuthenticationAsArmPolicy.status` must be `enabled` on the canonical registry. If disabled, the platform rejects the ARM→ACR token exchange and surfaces a generic `ImageError` with zero pull attempts recorded on the ACR. This was the single root cause of v15–v19 failures.	Foundry hosted-agents pull via AAD-as-ARM token exchange. The canonical ACR was created out-of-band with this policy disabled by default; the test ACR was created with it enabled and worked first try.	#1108
O2	The Foundry AI-account system MI (`351cdb70-…`) needs `AcrPull` and `Container Registry Repository Reader` on the canonical ACR. The docs name only the project MI; the live behaviour requires both.	Pulls happen under more than one identity context during hosted-agent provisioning.	#1108
O3	When deploying via the SDK path (`scripts/ops/deploy_hosted_agent.py`), the per-version agent MI minted by `create_version` does not get `Foundry User` on the project automatically (the `azd` / VS-Code extension path does). Without it, the container runs and the agent code succeeds, but storage `POST /storage/responses` returns 401 and the public call surfaces as HTTP 500 (or, in the Playground, "An internal error occurred while storing the response").	This was the final unblock for v20, and the same condition will need to be re-applied to v21 (carrying the streaming fix) until #1107 lands. Manual `az role assignment create` after `create_version` is the runbook today.	#1107 -- closed by this PR (fix #8 above)

How to verify in this branch (updated)

# 1. Build image into canonical ACR via git remote-context (avoids Windows-client upload hangs):
az acr build --registry holidaypeakhub405devacr `
  --image "inventory-health-check:foundry-v7" `
  --file "apps/inventory-health-check/src/Dockerfile" --target prod --no-logs `
  "https://github.com/Azure-Samples/holiday-peak-hub.git#feature/foundry-hosted-agents-pilot"

# 2. Confirm the three operational prerequisites are in place (#1107, #1108 will automate these):
az acr config authentication-as-arm show --registry holidaypeakhub405devacr   # status=enabled
az role assignment list --scope <acr-id> --assignee 351cdb70-0600-4c8c-b7f2-c6bf92ae1089
# Expect AcrPull + Container Registry Repository Reader.

# 3. Register a hosted version (v21):
$env:PROJECT_ENDPOINT = "https://holidaypeakhub405devais.services.ai.azure.com/api/projects/aipholidaris"
$env:MODEL_DEPLOYMENT_NAME_FAST = "gpt-5-nano"
$env:MODEL_DEPLOYMENT_NAME_RICH = "gpt-5"
$env:HPH_AGENT_ID_FAST = "ecommerce-catalog-search-fast"
$env:HPH_AGENT_ID_RICH = "product-management-assortment-optimization-rich"

python scripts/ops/deploy_hosted_agent.py `
  --agent-yaml apps/inventory-health-check/agent.hosted.yaml `
  --image-uri "holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v7" `
  --project-endpoint $env:PROJECT_ENDPOINT --json

# 4. (Automated by this PR) The deploy script now auto-grants `Foundry User` on the
#    per-version managed identity after `create_version` reaches `active`.
#    Pass `--no-auto-grant-foundry-user` to opt out, or `--project-scope` to override.

# 5. Invoke (both paths now succeed with the v21 image):
python .tmp/invoke-hosted-agent.py   # stream=false; expect status=completed
#   plus open the Foundry portal Playground for the agent and send "hi there";
#   the reply must render (stream=true) without an "internal error storing the response" toast.

Expected: status=active from step 3, status=completed from step 5 with the structured agent response in both the non-streaming (curl) and streaming (Playground) paths.

Out of scope (tracked elsewhere)

[P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py #1107 — deploy_hosted_agent.py should auto-grant Foundry User on the new per-version MI after create_version, eliminating the manual step in verify-step 4.
[P2] foundry-hosting: codify ACR prerequisites for Foundry V3 hosted-agents in IaC #1108 — Codify the ACR prerequisites (azureADAuthenticationAsArmPolicy=enabled, AI-account MI AcrPull + Container Registry Repository Reader) in IaC so they survive a registry rebuild.
Per-token streaming via invoke_model_stream (currently the streaming path emits one AgentResponseUpdate chunk carrying the full reply; the SSE tracker handles it correctly, but richer token-by-token rendering remains a follow-up).
Wiring the other 25 agents to the same hosted manifest pattern (will follow as a separate PR per cluster).
Reproducing this Terraform-side as IaC (handled in #1108 for the ACR side; agent registration in IaC is a separate epic).
Replacing the holiday-peak-db/enterprise-memory Cosmos containers with hosted-agent specific containers (current pilot reuses existing infra).

Implements end-to-end support for Azure AI Foundry V3 Hosted Agents (preview).

Framework (lib/):
- holiday_peak_lib.foundry_hosting: manifest loader (Pydantic v2), env-var resolver, deploy wrapper over AIProjectClient.agents.create_version with terminal-status polling and async helper. Azure SDK imports are lazy.
- mount_hosted_agent: default prefix flipped to '' (Foundry gateway adds /openai/v1/ externally and forwards to container '/responses').
- BaseRetailAgent.serve_hosted: default prefix updated.

Product (apps/inventory-health-check/):
- agent.hosted.yaml NEW: V3 registration manifest (kind: hosted, responses 1.0.0, 19 env vars, gpt-5-nano + gpt-5 model resources).
- agent.yaml: tracking shape preserved, doc-only update referencing sibling.
- Dockerfile: --port ${UVICORN_PORT:-8000} for hosted-runtime portability.
- main.py: docstring updated to /responses.

Ops:
- scripts/ops/deploy_hosted_agent.py NEW: CLI runbook entry point.

Tests:
- 24 new tests across manifest loader, deploy wrapper, and hosted mount.
- Fleet-wide tests/ops/test_foundry_portal_tracking_manifests.py guardrail preserved (27/27 pass).
- Targeted suite: 56 passed in 6.40s.
- pylint 9.78/10 on new module + CLI.

…ples for portal visibility

Four bugs identified by line-by-line comparison against the official MS Learn `Deploy a hosted agent` doc and the `Microsoft/foundry-samples` repository (after the scaffolding-only initial PR did not produce a visible agent):

1. `AIProjectClient` now built with `allow_preview=True` so the `agents.create_version` V3 surface is actually exposed. Without the flag the SDK silently routes to legacy assistants and the new agent never materializes in the New Foundry portal.
2. Terminal status set tightened from `{active,ready,succeeded,failed,error}` to the documented terminal set `{active,failed,deleted}` with `active` as the only success terminal; `deleting` is correctly treated as transient.
3. `apps/inventory-health-check/agent.hosted.yaml` no longer redeclares the platform-injected `APPLICATIONINSIGHTS_CONNECTION_STRING`. The full forbidden list is now documented inline: `FOUNDRY_PROJECT_ENDPOINT`, `FOUNDRY_PROJECT_ARM_ID`, `FOUNDRY_AGENT_NAME`, `FOUNDRY_AGENT_VERSION`, `FOUNDRY_AGENT_SESSION_ID`, `APPLICATIONINSIGHTS_CONNECTION_STRING`. Collisions on any of these cause `create_version` to reject the manifest.
4. Pilot manifest renamed `FOUNDRY_AGENT_NAME_FAST` / `_RICH` to `FOUNDRY_AGENT_ID_FAST` / `_RICH` to match the runtime contract in ADR-010 / `holiday_peak_lib.config._build_foundry_config`.

Loader (`load_manifest`) also probes `agent.manifest.yaml` first (the canonical name used by `foundry-samples` and `azd ai agent init -m`) before falling back to `agent.hosted.yaml` and `agent.yaml`, so future services may adopt either name without changes to the loader.

Tests: 18 hosting tests pass; one new test covers `deleted` status, one verifies `allow_preview=True` is passed to `AIProjectClient`, two cover the loader filename-priority ordering. Pylint 9.78/10.
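The filename-priority ordering described in this commit can be sketched as a first-match probe. This is only an illustration under stated assumptions: `find_manifest` and `MANIFEST_CANDIDATES` are hypothetical names, and the real `load_manifest` also parses and validates the file, which is omitted here.

```python
from pathlib import Path
from typing import Optional

# Probe order taken from the commit message: canonical name first,
# then the hosted-only manifest, then the metadata-only tracking file.
MANIFEST_CANDIDATES = ("agent.manifest.yaml", "agent.hosted.yaml", "agent.yaml")


def find_manifest(service_dir: Path) -> Optional[Path]:
    """Return the first manifest file that exists in `service_dir`, if any."""
    for name in MANIFEST_CANDIDATES:
        candidate = service_dir / name
        if candidate.is_file():
            return candidate
    return None
```

With this ordering a service can keep its metadata-only `agent.yaml` for portal tracking and add a hosted manifest next to it; the hosted file wins whenever both are present.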

…ationError)

Foundry V3 hosted-agents platform reserves the entire FOUNDRY_*/AGENT_* env-var namespaces (per container-image-spec). Live create_version returned: 'Environment variable FOUNDRY_AGENT_ID_FAST is reserved for platform use.'

Rename in-container env vars to HPH_AGENT_ID_FAST/_RICH (and matching HPH_AGENT_NAME_*). build_foundry_config now reads HPH_ first with FOUNDRY_AGENT_ID_* fallback so AKS deploys remain back-compat.

Operator env contract unchanged: external ${FOUNDRY_AGENT_ID_FAST} is mapped to HPH_AGENT_ID_FAST inside the container via manifest placeholder substitution.
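The prefix precedence described here (HPH_ first, legacy FOUNDRY_ as a fallback) might look roughly like the sketch below. `resolve_agent_id` is a hypothetical helper written for illustration; it is not the body of `build_foundry_config`, and treating empty strings as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_agent_id(tier: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Look up the agent id for a tier ("fast" or "rich").

    The HPH_ name is checked first; the legacy FOUNDRY_ name is only used
    when the HPH_ one is missing or empty (assumed behaviour).
    """
    env = os.environ if env is None else env
    suffix = tier.upper()
    for name in (f"HPH_AGENT_ID_{suffix}", f"FOUNDRY_AGENT_ID_{suffix}"):
        value = env.get(name)
        if value:
            return value
    return None
```

Passing the environment as a mapping keeps the lookup easy to test without mutating `os.environ`.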

…ctive'

Foundry SDK 2.1.0 returns status as AgentVersionStatus enum whose str() is 'AgentVersionStatus.FAILED'. Previous str().lower() produced 'agentversionstatus.failed' which never matched terminal sets.

Add _normalize_status helper that prefers enum .value and falls back to stripping dotted Enum.MEMBER prefix. Three new tests cover all paths.
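A minimal sketch of the normalisation idea, assuming a plain `Enum` stand-in for the SDK type. The local `AgentVersionStatus` class and the `normalize_status` name are illustrative only; the actual helper lives in `deploy.py` and may differ in detail.

```python
from enum import Enum
from typing import Any


class AgentVersionStatus(Enum):
    """Local stand-in for the SDK enum, used only for this example."""
    ACTIVE = "active"
    FAILED = "failed"


TERMINAL_STATUSES = {"active", "failed", "deleted"}


def normalize_status(status: Any) -> str:
    """Reduce an enum member, an enum repr string, or a plain string to lowercase."""
    value = getattr(status, "value", None)
    if isinstance(value, str):
        return value.lower()  # preferred path: the enum's own value
    # Fallback: strip a dotted "Enum.MEMBER" prefix if one is present.
    return str(status).rsplit(".", 1)[-1].lower()


# The naive conversion misses the terminal set, the normalised one does not.
naive = str(AgentVersionStatus.FAILED).lower()
fixed = normalize_status(AgentVersionStatus.FAILED)
print(naive in TERMINAL_STATUSES, fixed in TERMINAL_STATUSES)
```

The three branches line up with the three regression tests named in the fix table: enum value, dotted repr, and plain string.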

…convention

The azure-ai-agentserver-core framework reads the canonical PORT env var (default 8088) via resolve_port(), but our containers were only listening on UVICORN_PORT. This caused Foundry V3 hosted-agent invocations to return 424 session_not_ready because the gateway probed PORT=8088 while uvicorn was bound to UVICORN_PORT.

Changes:

apps/inventory-health-check/src/Dockerfile
- CMD now reads ${PORT:-${UVICORN_PORT:-8088}} so Foundry V3 PORT takes precedence, AKS keeps UVICORN_PORT=8000 as legacy override, and the framework default of 8088 is the fallback.

apps/inventory-health-check/agent.hosted.yaml
- Add PORT=8088 (canonical Foundry V3), UVICORN_PORT=8088 (alignment), and WEB_CONCURRENCY=1 to keep startup under readiness deadline.

lib/src/holiday_peak_lib/app_factory.py
- _service_lifespan now emits six explicit lifespan_* log lines for trace correlation in App Insights.

lib/src/holiday_peak_lib/foundry_hosting/deploy.py
- _extract handles collections.abc.Mapping (AgentVersionDetails is a MutableMapping, not a dict, and exposes fields via __getitem__).
- Add _pick_latest_version tolerant of v3, 3, 3.1.0 label shapes.

lib/tests/test_foundry_hosting_deploy.py
- 479 new lines covering Mapping branch, picker, and re-fetch path.

memories/session/foundry-v3-pilot-status.md
- Resume-state notes: PORT root cause, namespace collision, lifespan-mount behavior, ACR drift correction.

Refs #990. PR #1103.
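One way a picker could tolerate the `v3`, `3`, and `3.1.0` label shapes is to compare the numeric components as tuples. This is a sketch under that assumption; `version_key` and `pick_latest_version` are hypothetical names, and the real `_pick_latest_version` may rank labels differently.

```python
import re
from typing import Iterable, Optional, Tuple


def version_key(label: str) -> Tuple[int, ...]:
    """Turn "v3", "3" or "3.1.0" into a comparable tuple of integers."""
    return tuple(int(number) for number in re.findall(r"\d+", label))


def pick_latest_version(labels: Iterable[str]) -> Optional[str]:
    """Return the label with the highest numeric key, ignoring unparseable ones."""
    parseable = [label for label in labels if version_key(label)]
    return max(parseable, key=version_key) if parseable else None
```

Comparing tuples rather than strings avoids the usual lexical trap where "9" sorts after "20".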

Foundry V3 hosted-agents reject "PORT" with "invalid_payload: Environment variable 'PORT' is reserved for platform use". The Dockerfile CMD already reads the platform-injected value first, then UVICORN_PORT, then 8088, so removing PORT here lets the platform inject its own value automatically. Keep UVICORN_PORT for local docker-run / AKS dev parity.

Also refresh memories/session/foundry-v3-pilot-status.md with the runbook proven during the pilot:

1. ACR azureADAuthenticationAsArmPolicy must be enabled.
2. AI-account system MI needs AcrPull + Container Registry Repository Reader on the canonical ACR (not only the project MI).
3. Per-version agent MI and blueprint MI need Foundry User on the project when deploying via the SDK (azd auto-handles this; SDK path does not).

v20 of inventory-health-check is now active and returns 200 from /responses with a structured domain answer.

Refs: #990
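The fallback chain in the Dockerfile CMD, `${PORT:-${UVICORN_PORT:-8088}}`, can be restated in Python to make the precedence explicit. `resolve_listen_port` is a hypothetical helper for illustration and is not the framework's `resolve_port()`; like the shell `:-` operator, it treats an empty value as unset.

```python
import os
from typing import Mapping, Optional


def resolve_listen_port(env: Optional[Mapping[str, str]] = None, default: int = 8088) -> int:
    """Mirror ${PORT:-${UVICORN_PORT:-8088}}: PORT, then UVICORN_PORT, then the default."""
    env = os.environ if env is None else env
    for name in ("PORT", "UVICORN_PORT"):
        raw = env.get(name, "")
        if raw:  # unset and empty both fall through, as with ":-" in the shell
            return int(raw)
    return default
```

Because the manifest no longer declares `PORT`, the first branch is only taken when the platform injects a value; local and AKS runs land on `UVICORN_PORT` or the default.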

Cataldir · 2026-05-18T09:59:39Z

Follow-up issues filed for the operational findings discovered during root-cause analysis of v15-v19 ImageErrors:

#1107 - foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py (P1)
#1108 - foundry-hosting: codify ACR prerequisites for Foundry V3 hosted-agents in IaC (P2)

PR body updated with the new findings and end-to-end invocation evidence (v20 active, status=completed, Foundry storage POST -> 201). The three operational fixes are already applied to the live dev environment so the pilot is green; the issues track codifying them so future hosted-agent deploys do not need manual operator RBAC steps.

…treaming

Foundry's ResponsesHostServer (agent-framework-foundry-hosting==1.0.0a260507) calls agent.run with two distinct contracts depending on stream:

stream=False -> response = await agent.run(stream=False, ...)  # coroutine
stream=True -> async for update in agent.run(stream=True, ...):  # iterator

Our adapter was marked async def run, so it always returned a coroutine. When the Foundry portal Playground (which always sets stream=True) hit the adapter, the framework tried to async-iterate the coroutine and crashed with: 'async for' requires an object with __aiter__, got coroutine.

Fix: reshape run() into a synchronous dispatcher that returns either a coroutine (_run_once -> AgentResponse) or an async iterator (_run_streaming -> AgentResponseUpdate) based on the stream flag. The streaming path emits a single AgentResponseUpdate carrying one text content -- sufficient for the SSE tracker to render and terminate the stream cleanly. Per-token streaming via invoke_model_stream remains a follow-up.

Tests:
- Replaced test_hosted_run_adapter_refuses_streaming with test_hosted_run_adapter_streams_single_update (pins the async-iterator contract)
- Added test_hosted_run_adapter_non_streaming_returns_awaitable to pin the awaitable contract for stream=False
- All 12 hosted-adapter tests pass; 1360 lib tests pass; 3 pilot tests pass

Refs: PR #1103

Cataldir · 2026-05-18T10:35:07Z

Streaming-protocol fix landed: `881b49a8`

Found a second-order code bug while diagnosing the "agent doesn't reply in the UI" report: the v20 ping test only exercised the stream=false path, but the Foundry portal Playground sets stream=true by default. The streaming path crashed with:

TypeError: 'async for' requires an object with __aiter__ method, got coroutine
  at agent_framework_foundry_hosting/_responses.py:341

…which cascaded into a 401 on the persistence write because the response object had already been created server-side. Symptom in the portal:

"An internal error occurred while storing the response. Subsequent retrieval is not guaranteed."

Root cause

_HostedAgentRunAdapter.run was marked async def, so it always returned a coroutine. The upstream framework's contract is polymorphic:

stream=False → response = await agent.run(stream=False, ...) (coroutine returning AgentResponse) ✅
stream=True → async for update in agent.run(stream=True, ...): (async iterator of AgentResponseUpdate) ❌

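The same run method must therefore return either a coroutine or an async iterator, depending on stream. A single async def cannot do both: without a yield it always returns a coroutine, and with a yield it always returns an async generator.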

Fix (`881b49a8`)

Reshaped run into a synchronous dispatcher in lib/src/holiday_peak_lib/agents/hosted.py:

def run(self, messages=None, *, stream=False, session=None, **kwargs):
    if stream:
        return self._run_streaming(messages)  # async iterator
    return self._run_once(messages)            # coroutine

async def _run_once(self, messages) -> AgentResponse: ...
async def _run_streaming(self, messages) -> AsyncIterator[AgentResponseUpdate]:
    reply_text = await self._invoke_handle(messages)
    yield AgentResponseUpdate(
        contents=[Content(type="text", text=reply_text)],
        role="assistant",
        agent_id=self.id,
    )

The retail agents' handle() is unary, so the streaming path emits one well-formed AgentResponseUpdate chunk carrying the full reply. The SSE tracker on the host side renders it correctly and terminates the stream. Per-token streaming via invoke_model_stream remains a follow-up.
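To see the dispatcher pattern in isolation, the following self-contained sketch reproduces both call shapes with stand-in types. `Reply`, `ReplyUpdate`, and `RunAdapter` are hypothetical substitutes for `AgentResponse`, `AgentResponseUpdate`, and `_HostedAgentRunAdapter`; no Foundry SDK is involved.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, List, Tuple, Union


@dataclass
class Reply:
    """Stand-in for AgentResponse."""
    text: str


@dataclass
class ReplyUpdate:
    """Stand-in for AgentResponseUpdate."""
    text: str


class RunAdapter:
    # Plain "def": the caller decides whether to await or to async-iterate.
    def run(self, prompt: str, *, stream: bool = False) -> Union[Awaitable[Reply], AsyncIterator[ReplyUpdate]]:
        if stream:
            return self._run_streaming(prompt)  # async generator object
        return self._run_once(prompt)           # coroutine object

    async def _handle(self, prompt: str) -> str:
        return f"echo: {prompt}"  # placeholder for the unary agent handler

    async def _run_once(self, prompt: str) -> Reply:
        return Reply(text=await self._handle(prompt))

    async def _run_streaming(self, prompt: str) -> AsyncIterator[ReplyUpdate]:
        # One chunk carrying the full reply, as in the fix described above.
        yield ReplyUpdate(text=await self._handle(prompt))


async def demo() -> Tuple[Reply, List[ReplyUpdate]]:
    adapter = RunAdapter()
    reply = await adapter.run("ping", stream=False)
    updates = [update async for update in adapter.run("ping", stream=True)]
    return reply, updates


reply, updates = asyncio.run(demo())
print(reply.text, len(updates))
```

If `run` were declared `async def` instead, the `async for` in `demo` would raise the same `TypeError` reported by the Playground, because calling it would return a coroutine rather than an object with `__aiter__`.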

Tests

test_hosted_run_adapter_streams_single_update — pins the async-iterator contract (hasattr(iterator, "__aiter__"))
test_hosted_run_adapter_non_streaming_returns_awaitable — pins the awaitable contract
12/12 hosted-adapter tests pass; 1360 lib tests pass; 3 pilot tests pass; pre-push gates green

Operator next steps

Once this PR merges (or now, on the pilot branch):

1. Build v21: az acr build --registry holidaypeakhub405devacr --image inventory-health-check:foundry-v7 --file apps/inventory-health-check/src/Dockerfile --target prod --no-logs "https://github.com/Azure-Samples/holiday-peak-hub.git#feature/foundry-hosted-agents-pilot"
2. Deploy v21: python scripts/ops/deploy_hosted_agent.py --agent-yaml apps/inventory-health-check/agent.hosted.yaml --image-uri holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v7 --project-endpoint <endpoint> --json
3. Manually grant Foundry User on the new per-version MI (this is the manual step #1107 will automate; see operational finding O3 in the PR body).
4. Verify in Playground: open the agent in the Foundry portal and send any prompt; the reply must render without the "internal error storing the response" toast.

Step 3 is still required for v21 because #1107 hasn't landed yet. The reconciliation plan in .tmp/reconciliation-plan.md will be updated to reflect this new sequencing: v21 deploy + manual RBAC grant + Playground verification all precede the merge of #1103.

Consolidate the manual `az role assignment create` runbook step into `deploy_hosted_agent_version`.

The per-version managed identity minted by `AIProjectClient.agents.create_version` does NOT receive the Foundry User role on the project scope automatically, so every Playground / Responses invocation fails 401 on the storage POST:

    Foundry storage POST .../storage/responses?api-version=v1 -> 401
    Principal does not have access to API/Operation.

The `azd` and VS-Code extension deploy paths grant it implicitly; the SDK path (this module) did not, leaving operators to remember a manual step that was easy to skip. This change closes the loop in code.

Implementation
--------------
* `deploy_hosted_agent_version` accepts `auto_grant_role` (default True), `foundry_role_name` ("Foundry User"), `project_scope` (optional override), `role_granter` + `scope_resolver` (test seams).
* On reaching `active`, the helper resolves the per-version principal id from `version_obj.instance_identity.principal_id` (with two preview-era field aliases), derives the project ARM scope from `project_endpoint` via `az resource list`, and calls `az role assignment create` with the `--assignee-principal-type ServicePrincipal` flag, matching the manual runbook one-for-one.
* The grant is idempotent: `RoleAssignmentExists` from the Azure CLI is treated as success and recorded as `status=already_exists`.
* A failed grant does NOT mask a successful version activation. The failure is captured in `result.extras["role_grant"]` with `status=failed` and `error=<stderr>` so operators can re-run or escalate.
* `scripts/ops/deploy_hosted_agent.py` exposes `--no-auto-grant-foundry-user`, `--foundry-role-name`, and `--project-scope` CLI flags. The JSON output now includes the `role_grant` payload.

Tests
-----
* +12 tests in `lib/tests/test_foundry_hosting_deploy.py`:
  - principal-id extraction (3 shapes: `instance_identity`, `managed_identity`, `Mapping`) + missing-id null path
  - scope derivation: resolver test seam, malformed endpoint, no-account
  - integration: granted / skipped / already-exists / failure / no-principal / explicit-scope-override
  - default `_grant_role_via_az`: success (parses assignment id), already-exists (idempotent), real-failure (raises)
* All 1376 lib tests + 108 pilot tests pass.

Refs: #1107, runbook docs/ops/foundry-hosted-agents.md
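The idempotent grant can be sketched as a thin wrapper around the Azure CLI with an injectable runner, which is roughly what a "test seam" means here. `grant_role` and `_run_az` are hypothetical names, the returned dictionary shape is an assumption, and matching on the `RoleAssignmentExists` string in stderr follows the commit message rather than any documented CLI contract.

```python
import json
import subprocess
from typing import Callable, Dict, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def _run_az(argv: Sequence[str]) -> subprocess.CompletedProcess:
    """Default runner: invoke the Azure CLI and capture its output as text."""
    return subprocess.run(list(argv), capture_output=True, text=True, check=False)


def grant_role(principal_id: str, scope: str, role: str = "Foundry User",
               runner: Runner = _run_az) -> Dict[str, str]:
    """Create the role assignment; an existing assignment counts as success."""
    argv = [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
        "--output", "json",
    ]
    result = runner(argv)
    if result.returncode == 0:
        assignment = json.loads(result.stdout or "{}")
        return {"status": "granted", "assignment_id": assignment.get("id", "")}
    if "RoleAssignmentExists" in (result.stderr or ""):
        return {"status": "already_exists"}  # idempotent re-run
    raise RuntimeError((result.stderr or "").strip())
```

In a deploy flow the caller would catch the `RuntimeError` and record it alongside the activation result, so a failed grant is visible without hiding a version that did reach `active`.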

Cataldir · 2026-05-18T11:25:52Z

Fix #8 landed: Foundry User role auto-grant now in code (#1107 closed by this PR).

582f443e adds the per-version managed-identity role grant to
deploy_hosted_agent_version, consolidating the manual
az role assignment create runbook step into the deploy path.

What changed

lib/src/holiday_peak_lib/foundry_hosting/deploy.py
- New helpers: _extract_principal_id (probes instance_identity /
  managed_identity / identity field aliases), _derive_project_scope_from_endpoint
  (parses https://{a}.services.ai.azure.com/api/projects/{p} → ARM scope via
  az resource list), _grant_role_via_az (idempotent
  az role assignment create --assignee-principal-type ServicePrincipal --role 'Foundry User'),
  _ensure_foundry_user_grant, _maybe_grant_foundry_user.
- deploy_hosted_agent_version accepts:
  auto_grant_role: bool = True, foundry_role_name: str = "Foundry User",
  project_scope: str | None = None, role_granter & scope_resolver
  (test seams).
- Failed grant does NOT mask a successful version activation — recorded
  under result.extras["role_grant"] with status=failed and stderr.
- RoleAssignmentExists is treated as success
  (status=already_exists).
scripts/ops/deploy_hosted_agent.py
- New flags: --no-auto-grant-foundry-user, --foundry-role-name,
  --project-scope. JSON output now includes the role_grant payload.
lib/tests/test_foundry_hosting_deploy.py
- +12 unit tests covering granted / already-exists / failed / skipped /
  no-principal / explicit-scope paths, principal-id extraction across
  3 SDK field-name aliases, and the default _grant_role_via_az
  success/error/idempotent matrix.
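The first half of the scope derivation, parsing `https://{a}.services.ai.azure.com/api/projects/{p}` into its account and project parts, could be sketched as below. `parse_project_endpoint` and `ProjectRef` are hypothetical names; the follow-on ARM lookup via `az resource list` is deliberately left out because its exact query is not shown in this PR.

```python
import re
from typing import NamedTuple


class ProjectRef(NamedTuple):
    account: str
    project: str


# Pattern taken from the endpoint shape quoted above; a trailing slash is tolerated.
_ENDPOINT_RE = re.compile(
    r"^https://(?P<account>[^./]+)\.services\.ai\.azure\.com"
    r"/api/projects/(?P<project>[^/?#]+)/?$"
)


def parse_project_endpoint(endpoint: str) -> ProjectRef:
    """Split a Foundry project endpoint into (account, project) or raise ValueError."""
    match = _ENDPOINT_RE.match(endpoint.strip())
    if match is None:
        raise ValueError(f"not a recognised project endpoint: {endpoint!r}")
    return ProjectRef(account=match["account"], project=match["project"])
```

Failing loudly on a malformed endpoint matches the "malformed endpoint" case listed in the test summary, and gives `--project-scope` a clear reason to exist as an explicit override.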

Verification

pytest lib/tests/test_foundry_hosting_deploy.py → 37 passed (12 new + 25 existing).
pytest lib/tests → 1376 passed.
pytest tests → 108 passed, 11 skipped (pre-existing).
black + isort clean.
pylint --fail-on=E,F clean (only pre-existing R/C warnings).

Behaviour after merge

Operators no longer need to follow verify-step 4 from the original PR body.
The standard python scripts/ops/deploy_hosted_agent.py … invocation now
handles the grant. Backwards-compatibility for environments where role
assignment is managed out of band: pass --no-auto-grant-foundry-user.

This unblocks the v21 (foundry-v7) image build + deploy that closes the
final piece of the live-Playground regression.

Cataldir · 2026-05-18T17:44:43Z

#1107 live validation update: hosted Redis/Event Hub isolation

Pushed commit d5d7d214 to feature/foundry-hosted-agents-pilot.

What changed

Added framework runtime flags in lib/src/holiday_peak_lib/app_factory.py:
- HOLIDAY_PEAK_HOT_MEMORY_ENABLED=false detaches optional Redis hot memory from hosted request handling.
- HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED=false skips Event Hub subscriber startup for hosted containers outside the private AKS VNet.
Added bounded Redis socket/connect timeouts and fail-open hot-memory detach when Key Vault Redis secret resolution fails.
Updated apps/inventory-health-check/agent.hosted.yaml to disable hot memory and Event Hub subscribers for the Foundry hosted pilot.
Documented the runtime isolation contract in docs/governance/backend-governance.md while preserving ADR-007/ADR-032 three-tier memory and ADR-006 Event Hubs as canonical for product services.
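
A minimal sketch of the flag handling and fail-open detach described above, under stated assumptions: the environment variable names are from this comment, but `env_flag`, `resolve_hot_memory`, the `connect` callable, and the set of spellings treated as false are hypothetical and not taken from app_factory.py.

```python
import os
from typing import Callable, Mapping, Optional

# Assumption: these spellings disable a flag; the real parser may differ.
_FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = True,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag; unset or blank values fall back to the default."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() not in _FALSE_VALUES


def resolve_hot_memory(connect: Callable[[], object],
                       env: Optional[Mapping[str, str]] = None) -> Optional[object]:
    """Return a hot-memory client, or None when disabled or unreachable."""
    if not env_flag("HOLIDAY_PEAK_HOT_MEMORY_ENABLED", env=env):
        return None  # explicitly detached for hosted containers
    try:
        return connect()
    except Exception:
        # Fail open: a secret-resolution or connection error detaches hot
        # memory instead of failing the request path.
        return None
```

Returning None in both the disabled and the failed case lets the caller treat hot memory as strictly optional, which matches the "detach" wording above.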

Live deployment

Built ACR image: holidaypeakhub405devacr.azurecr.io/inventory-health-check@sha256:b143209842ea322adfd2af99069614db7e9d82bc23088cdeed106d83cf9304a0
Deployed Foundry hosted version: 24
Status: active
Role grant: Foundry User granted to hosted MI at project scope
Version metadata verified: digest-pinned image, HOLIDAY_PEAK_HOT_MEMORY_ENABLED=false, HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED=false

Live Responses API validation

Non-streaming probe: HTTP 200, v24, health_status: healthy for SKU-1234.
Streaming probe with store:false: HTTP 200, v24, final response.completed, .done events present, health_status: healthy for SKU-1234.
Default streaming probe: the initial curl saw a transient connection reset; an immediate diagnostic retry returned HTTP 200, v24, final response.completed, .done events present, health_status: healthy for SKU-1234.

The prior Playground-style failure mode (HTTP 200 SSE starts, then hangs without completion until timeout) did not reproduce on v24.
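
The pass criterion used in the streaming probes (a final response.completed event with .done events present) could be checked with something like the sketch below. It assumes the captured response body is standard server-sent events with an `event:` field line per event; `sse_event_names` and `stream_completed` are hypothetical helpers, not part of this PR.

```python
def sse_event_names(body: str) -> list[str]:
    """Collect the value of every 'event:' field in a captured SSE body."""
    names = []
    for line in body.splitlines():
        if line.startswith("event:"):
            names.append(line[len("event:"):].strip())
    return names


def stream_completed(body: str) -> bool:
    """True when the stream ends on response.completed and saw a .done event."""
    names = sse_event_names(body)
    if not names or names[-1] != "response.completed":
        return False  # a stream that hangs never emits the terminal event
    return any(name.endswith(".done") for name in names)
```

A stream that opens with HTTP 200 and then stalls would fail this check because the last event seen is not response.completed.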

Validation gates

Focused hosted/runtime regressions: 65 passed.
Full pre-push gate passed:
- python -m isort --check-only lib apps
- python -m black --check lib apps
- python -m pylint --fail-on=E,F ...
- python -m mypy ...
- python scripts/ops/check_markdown_links.py --roots docs/governance docs/architecture
- python scripts/ops/check_event_schema_contracts.py
- pytest lib/tests --maxfail=1: 1400 passed, 5 skipped
- app tests excluding UI: 705 passed

    return _translate
+
+
+_HostedAgentRunAdapter = _ResponsesAgentRunAdapter


github-actions · 2026-05-21T18:19:29Z

UI route-segment bundle budgets

UI route-segment bundle-budget report (gzipped JS, kilobytes):

route           size      limit     source                      status
--------------------------------------------------------------------------------
/               167.6     150       floor (root+polyfill)       OVER
/retailers      485.4     200       retailers.html              OVER
/builders       485.4     200       builders.html               OVER
/deploy         485.1     250       deploy.html                 OVER

Advisory at v1 (does not block PRs). Strict mode activates after the F1 cleanup follow-up trims dead-weight deps from the global path.

Budgets live in apps/ui/budgets.json. Gate spec: docs/ui/a11y-perf.md.
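
For illustration, the comparison behind the status column can be reproduced from the figures in the report. The sketch hard-codes the four rows shown above rather than reading apps/ui/budgets.json, whose schema is not shown here; `overage_kb` and the "OK" label for within-budget routes are assumptions.

```python
# (route, gzipped size in kB, limit in kB), copied from the report above.
ROWS = [
    ("/", 167.6, 150),
    ("/retailers", 485.4, 200),
    ("/builders", 485.4, 200),
    ("/deploy", 485.1, 250),
]


def overage_kb(size_kb: float, limit_kb: float) -> float:
    """Kilobytes over budget, rounded to one decimal; 0.0 when within budget."""
    return round(max(size_kb - limit_kb, 0.0), 1)


for route, size, limit in ROWS:
    over = overage_kb(size, limit)
    status = "OVER" if over > 0 else "OK"
    print(f"{route:<12} {size:>7.1f} {limit:>5} {status:<5} +{over}")
```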

Cataldir added 2 commits May 13, 2026 10:46

Cataldir changed the title ~~feat(#990): Foundry V3 hosted-agents end-to-end scaffolding~~ feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes) May 13, 2026

Cataldir added 4 commits May 13, 2026 12:40

feat: add final status documentation for Foundry V3 hosted-agents pilot

03f0640

Cataldir had a problem deploying to dev May 18, 2026 08:45 — with GitHub Actions Failure

This was referenced May 18, 2026

[P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py #1107

Open

[P2] foundry-hosting: codify ACR prerequisites for Foundry V3 hosted-agents in IaC #1108

Open

Cataldir added 3 commits May 18, 2026 09:33

fix: resolve Azure CLI for hosted-agent role grants

cd48f46

fix: preserve hosted-agent response input text

1c38020

fix: extract hosted response enum roles

02b912d

Cataldir force-pushed the feature/foundry-hosted-agents-pilot branch from 233016e to 02b912d Compare May 18, 2026 14:27

fix(#1107): isolate hosted private-network dependencies

d5d7d21

fix(#1107): run responses adapter on aks

ae0201b

github-code-quality Bot found potential problems May 18, 2026

Comment thread lib/src/holiday_peak_lib/agents/hosted.py Fixed

Cataldir had a problem deploying to dev May 18, 2026 21:29 — with GitHub Actions Failure

Cataldir temporarily deployed to branch May 19, 2026 05:13 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 05:15 — with GitHub Actions Inactive

Pin CRUD HelmRelease to valid preview image

08115e1

Cataldir temporarily deployed to branch May 19, 2026 05:51 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 05:55 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 05:57 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 06:00 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 06:01 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 06:03 — with GitHub Actions Inactive

Cataldir had a problem deploying to branch May 19, 2026 06:10 — with GitHub Actions Failure

Cataldir temporarily deployed to branch May 19, 2026 06:14 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 06:15 — with GitHub Actions Inactive

Cataldir temporarily deployed to branch May 19, 2026 06:20 — with GitHub Actions Inactive

Cataldir added 11 commits May 19, 2026 03:44

Bound CRUD readiness dependency checks

07e3185

Pin CRUD HelmRelease to readiness image

dcd9f70

Fix APIM policy expression quoting

571026e

Pin CRUD desired state to APIM policy fix

4faa363

Fix APIM backend policy contract

d876f6b

Clarify hosted agent terminology

3d1a259

Fix APIM readiness smoke route

bc53764

Fix CRUD AGC readiness route

a2447e1

Retry APIM CORS smoke validation

83bf3ca

Normalize APIM CORS smoke headers

8d41be2

feat(#990): register Foundry agent surfaces

2bb1711

github-code-quality Bot found potential problems May 21, 2026

Comment thread lib/src/holiday_peak_lib/agents/hosted.py

return _translate

_HostedAgentRunAdapter = _ResponsesAgentRunAdapter

Cataldir added 3 commits May 21, 2026 13:22

fix(#990): keep Foundry surface CI green

379347e

fix(#990): align uv prerelease lock checks

00ce652

fix(#990): publish agent UI corrections

2362871

Merge branch 'main' into feature/foundry-hosted-agents-pilot

8a81ef1

Cataldir wants to merge 34 commits into main from feature/foundry-hosted-agents-pilot
