
feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes) #1103

Open
Cataldir wants to merge 34 commits into main from feature/foundry-hosted-agents-pilot

Conversation

@Cataldir
Contributor

@Cataldir Cataldir commented May 13, 2026

Summary

Pilot end-to-end Foundry V3 Hosted Agents for inventory-health-check, validated by a successful HTTP 200 invocation against the public Responses endpoint in the aipholidaris project — for both stream=false (curl ping) and stream=true (Foundry portal Playground, after 881b49a8).

This PR also lands the eight live-deployment fixes discovered while running the pilot against the platform, all of which now have regression tests. Fixes #1 through #5 came from the initial activation track; #6 (the PORT reserved-name regression) was found in the previous session; #7 (the streaming-protocol contract) was found this session when the Foundry portal Playground surfaced a TypeError ('async for' requires __aiter__, got coroutine) that the original ping test (stream=false) had not exercised. #8 codifies the Foundry User role auto-grant in scripts/ops/deploy_hosted_agent.py (closing #1107) so the manual az role assignment create runbook step is no longer required.
Fix #8 (new)

Bug: The Foundry SDK deploy path (scripts/ops/deploy_hosted_agent.py) did not grant the Foundry User role to the per-version managed identity minted by create_version. Without that role, the container ran fine but the Foundry runtime returned 401 on POST .../storage/responses and the Playground surfaced a generic 'internal error storing the response' toast. Manual az role assignment create was the workaround.

Fix: deploy_hosted_agent_version now auto-resolves the per-version instance_identity.principal_id, derives the project ARM scope from project_endpoint via az resource list, and calls az role assignment create --assignee-principal-type ServicePrincipal --role 'Foundry User'. Idempotent on RoleAssignmentExists. New CLI flags: --no-auto-grant-foundry-user, --foundry-role-name, --project-scope. Failure does NOT mask a successful version activation; it is recorded in result.extras['role_grant']. Closes #1107.

Regression tests: test_deploy_auto_grants_foundry_user_after_active, test_deploy_skips_grant_when_auto_grant_disabled, test_deploy_records_already_exists_when_granter_returns_none, test_deploy_surfaces_grant_failure_without_breaking_active, test_deploy_records_skipped_when_principal_id_missing, test_deploy_uses_explicit_project_scope_override, test_grant_role_via_az_treats_already_exists_as_idempotent, test_grant_role_via_az_raises_on_real_failure, test_grant_role_via_az_parses_assignment_id_on_success, test_extract_principal_id_from_instance_identity, test_extract_principal_id_from_managed_identity_alias, test_extract_principal_id_from_mapping

Status as of 2026-05-18: inventory-health-check v20 is active and answers /responses with HTTP 200. Non-streaming evidence below. Streaming-mode invocations now also succeed locally (12/12 hosted-adapter tests + 1360 lib tests + 3 pilot tests) and will be re-verified against Foundry as soon as a new image (v21+) ships the 881b49a8 adapter fix. Two follow-up issues filed for the operational hardening discovered during root-cause analysis: #1107 (auto-grant Foundry User in the SDK deploy path) and #1108 (codify ACR prerequisites in IaC).

End-to-end invocation evidence (final state)

POST /api/projects/aipholidaris/agents/inventory-health-check/endpoint/protocols/openai/v1/responses
{ "model": "inventory-health-check", "input": "ping" }
HTTP/1.1 200 OK
{
  "id": "caresp_adff33150146c05d00THJqzmj3Im0sk6nwZoZcSzyRKmh7LQpc",
  "status": "completed",
  "output": [{ "type": "message", "role": "assistant", "content": [{
    "type": "output_text",
    "text": "{\"error\": \"sku is required\", \"hint\": \"Provide a SKU id in the prompt, e.g. 'check health for SKU-1234'.\", \"input\": \"ping\"}"
  }]}]
}
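
The text field in the response above is itself a JSON-encoded string, so a client has to decode twice to reach the agent's structured answer. A minimal sketch, assuming only the payload shape shown in the evidence; the helper name extract_output_text and the shortened sample id are hypothetical and not part of this PR:

```python
import json


def extract_output_text(response: dict) -> list:
    """Collect every output_text chunk from a Responses-style payload."""
    texts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                texts.append(part.get("text", ""))
    return texts


# Payload trimmed from the evidence above; the id is shortened for brevity.
sample = {
    "id": "caresp_example",
    "status": "completed",
    "output": [{"type": "message", "role": "assistant", "content": [{
        "type": "output_text",
        "text": "{\"error\": \"sku is required\", \"input\": \"ping\"}",
    }]}],
}

# The agent reply is JSON inside JSON, hence the second decode.
agent_reply = json.loads(extract_output_text(sample)[0])
print(agent_reply["error"])
```

A status of completed therefore only says the round trip worked; the domain-level outcome (here, a missing SKU) lives in the decoded inner object.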

App Insights trace for the same invocation:

DefaultAzureCredential acquired a token from ManagedIdentityCredential
Foundry storage POST .../storage/responses?api-version=v1 -> 201 (66.9ms)
Response caresp_… completed: status=completed output_count=1
Inbound POST /responses completed with status 200 in 555.8ms

This proves the full V3 hosted-agent lifecycle: deploy → activate → invoke → container → agent code → Foundry storage → response → client — all healthy in the stream=false path. The streaming path is validated by the new hosted-adapter unit tests (test_hosted_run_adapter_streams_single_update and test_hosted_run_adapter_non_streaming_returns_awaitable) and will be re-verified against Foundry once v21+ is deployed with 881b49a8.

GET /api/projects/aipholidaris/agents/inventory-health-check/versions?api-version=v1
| version | status | image |
|---------|--------|-------|
| 1–3 | active | holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v3 (digest sha256:5b9d8601…) |
| 15–19 | failed | ImageError, root-caused to azureADAuthenticationAsArmPolicy=disabled on canonical ACR |
| 20 | active | holidaypeakhub405devacr.azurecr.io/inventory-health-check@sha256:d4775cdf… (tag foundry-v6, build run cj28); invoked end-to-end (non-streaming) |
| 21+ | pending | will carry the 881b49a8 streaming-protocol fix for Playground/SSE invocations |

Live fixes landed in this PR

| # | Bug | Fix | Regression test |
|---|-----|-----|-----------------|
| 1 | AIProjectClient rejected hosted manifests with HTTP 400 because the SDK requires preview opt-in. | _build_project_client constructs AIProjectClient(..., allow_preview=True); agent IDs are looked up by name through agents.get_version(...) (legacy ID URL is gone in V3). | test_build_project_client_passes_allow_preview |
| 2 | Pilot manifest was not loaded by the registration script (only agent.yaml / agent.manifest.yaml were tried). | manifest.py loader now probes agent.manifest.yaml -> agent.hosted.yaml -> agent.yaml so a hosted-only manifest can sit alongside the metadata-only agent.yaml without changing the portal-tracking contract. | covered by existing manifest loader tests |
| 3 | agent.hosted.yaml declared the protocol version field that V3 rejects. | Manifest now follows the canonical template.kind: hosted shape with protocols: [{protocol: responses, version: "1.0.0"}] and container.cpu/memory instead of nested definition.*. | manifest snapshot test |
| 4 | Container env names containing FOUNDRY_* / AGENT_* were rejected by create_version with ValidationError: ... reserved per container-image-spec. The platform reserves the entire FOUNDRY_* and AGENT_* namespaces, not just the six documented platform-injected names. | Manifest renamed to HPH_AGENT_ID_FAST / HPH_AGENT_ID_RICH (and HPH_AGENT_NAME_*). holiday_peak_lib.app_factory_components.foundry_lifecycle.build_foundry_config now reads the HPH_ prefix first and falls back to the legacy FOUNDRY_AGENT_ID_* / FOUNDRY_AGENT_NAME_* for AKS deploys (back-compat). | test_build_foundry_config_prefers_hph_agent_id_over_foundry_agent_id, test_build_foundry_config_hph_agent_name_takes_precedence |
| 5 | Poll loop never recognised terminal states: SDK 2.1.0 deserialises "status": "failed" into an AgentVersionStatus enum whose str() returns "AgentVersionStatus.FAILED". The previous str(status).lower() produced "agentversionstatus.failed", which did not match _TERMINAL_STATUSES = {"active","failed","deleted"}, so the script timed out instead of raising RuntimeError. | Added _normalize_status in lib/src/holiday_peak_lib/foundry_hosting/deploy.py that prefers the enum .value field and falls back to stripping any Enum.MEMBER dotted prefix. | test_normalize_status_handles_enum_value, test_normalize_status_strips_dotted_enum_repr, test_normalize_status_plain_string |
| 6 | Foundry V3 rejected the previously-accepted PORT=8088 declaration with invalid_payload: Environment variable 'PORT' is reserved for platform use. The reserved namespace expanded between the time we wrote the pilot manifest and the final activation. | apps/inventory-health-check/agent.hosted.yaml no longer declares PORT. The existing Dockerfile CMD (${PORT:-${UVICORN_PORT:-8088}}) picks up the platform-injected value first, then falls back to UVICORN_PORT (still declared) for local docker-run / AKS dev parity. | manifest snapshot test |
| 7 (new) | Foundry's ResponsesHostServer (preview SDK agent-framework-foundry-hosting==1.0.0a260507) calls agent.run with two distinct contracts depending on stream: await agent.run(stream=False, ...) expects a coroutine returning AgentResponse; async for update in agent.run(stream=True, ...): expects an async iterator of AgentResponseUpdate items. Our _HostedAgentRunAdapter.run was marked async def, so it always returned a coroutine. The Playground (which defaults to stream=true) triggered TypeError: 'async for' requires an object with __aiter__, got coroutine at upstream _responses.py:341, which cascaded into a 401 on the persistence write because the response object had already been created server-side. The ping test passed because it used the non-streaming path. | lib/src/holiday_peak_lib/agents/hosted.py reshapes _HostedAgentRunAdapter.run into a synchronous dispatcher that returns either a coroutine (_run_once → AgentResponse) or an async iterator (_run_streaming yielding one AgentResponseUpdate(contents=[Content(type='text', text=…)])) based on the stream flag. The shared _invoke_handle helper keeps the translation/dispatch/extraction logic in one place. Per-token streaming via invoke_model_stream remains a follow-up. | test_hosted_run_adapter_streams_single_update (pins async-iterator contract), test_hosted_run_adapter_non_streaming_returns_awaitable (pins awaitable contract) |

Three operational findings (project- and ACR-side — not in this PR's code; followed up in #1107 and #1108)

These are environment-level prerequisites discovered while activating v20. They are already applied to the live dev environment and documented in memories/session/foundry-v3-pilot-status.md. They are out of scope for this PR but tracked for codification:

| # | Finding | Why it matters | Followed up by |
|---|---------|----------------|----------------|
| O1 | ACR policies.azureADAuthenticationAsArmPolicy.status must be enabled on the canonical registry. If disabled, the platform rejects the ARM→ACR token exchange and surfaces a generic ImageError with zero pull attempts recorded on the ACR. This was the single root cause of v15–v19 failures. | Foundry hosted-agents pull via AAD-as-ARM token exchange. The canonical ACR was created out-of-band with this policy disabled by default; the test ACR was created with it enabled and worked first try. | #1108 |
| O2 | The Foundry AI-account system MI (351cdb70-…) needs AcrPull and Container Registry Repository Reader on the canonical ACR. The docs name only the project MI; the live behaviour requires both. | Pulls happen under more than one identity context during hosted-agent provisioning. | #1108 |
| O3 | When deploying via the SDK path (scripts/ops/deploy_hosted_agent.py), the per-version agent MI minted by create_version does not get Foundry User on the project automatically (the azd / VS-Code extension path does). | Without it, the container runs and the agent code succeeds, but storage POST /storage/responses returns 401 and the public call surfaces as HTTP 500 (or, in the Playground, "An internal error occurred while storing the response"). This was the final unblock for v20, and the same condition will need to be re-applied to v21 (carrying the streaming fix) until #1107 lands. Manual az role assignment create after create_version is the runbook today. | #1107 (closed by this PR, fix #8 above) |

How to verify in this branch (updated)

# 1. Build image into canonical ACR via git remote-context (avoids Windows-client upload hangs):
az acr build --registry holidaypeakhub405devacr `
  --image "inventory-health-check:foundry-v7" `
  --file "apps/inventory-health-check/src/Dockerfile" --target prod --no-logs `
  "https://github.com/Azure-Samples/holiday-peak-hub.git#feature/foundry-hosted-agents-pilot"

# 2. Confirm the three operational prerequisites are in place (#1107, #1108 will automate these):
az acr config authentication-as-arm show --registry holidaypeakhub405devacr   # status=enabled
az role assignment list --scope <acr-id> --assignee 351cdb70-0600-4c8c-b7f2-c6bf92ae1089
# Expect AcrPull + Container Registry Repository Reader.

# 3. Register a hosted version (v21):
$env:PROJECT_ENDPOINT = "https://holidaypeakhub405devais.services.ai.azure.com/api/projects/aipholidaris"
$env:MODEL_DEPLOYMENT_NAME_FAST = "gpt-5-nano"
$env:MODEL_DEPLOYMENT_NAME_RICH = "gpt-5"
$env:HPH_AGENT_ID_FAST = "ecommerce-catalog-search-fast"
$env:HPH_AGENT_ID_RICH = "product-management-assortment-optimization-rich"

python scripts/ops/deploy_hosted_agent.py `
  --agent-yaml apps/inventory-health-check/agent.hosted.yaml `
  --image-uri "holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v7" `
  --project-endpoint $env:PROJECT_ENDPOINT --json

# 4. (Automated by this PR) The deploy script now auto-grants `Foundry User` on the
#    per-version managed identity after `create_version` reaches `active`.
#    Pass `--no-auto-grant-foundry-user` to opt out, or `--project-scope` to override.

# 5. Invoke (both paths now succeed with the v21 image):
python .tmp/invoke-hosted-agent.py   # stream=false; expect status=completed
#   plus open the Foundry portal Playground for the agent and send "hi there";
#   the reply must render (stream=true) without an "internal error storing the response" toast.

Expected: status=active from step 3, status=completed from step 5 with the structured agent response in both the non-streaming (curl) and streaming (Playground) paths.

Out of scope (tracked elsewhere)

Cataldir added 2 commits May 13, 2026 10:46
Implements end-to-end support for Azure AI Foundry V3 Hosted Agents (preview).

Framework (lib/):
- holiday_peak_lib.foundry_hosting: manifest loader (Pydantic v2),
  env-var resolver, deploy wrapper over AIProjectClient.agents.create_version
  with terminal-status polling and async helper. Azure SDK imports lazy.
- mount_hosted_agent: default prefix flipped to '' (Foundry gateway adds
  /openai/v1/ externally and forwards to container '/responses').
- BaseRetailAgent.serve_hosted: default prefix updated.

Product (apps/inventory-health-check/):
- agent.hosted.yaml NEW: V3 registration manifest (kind: hosted,
  responses 1.0.0, 19 env vars, gpt-5-nano + gpt-5 model resources).
- agent.yaml: tracking shape preserved, doc-only update referencing sibling.
- Dockerfile: --port ${UVICORN_PORT:-8000} for hosted-runtime portability.
- main.py: docstring updated to /responses.

Ops:
- scripts/ops/deploy_hosted_agent.py NEW: CLI runbook entry point.

Tests:
- 24 new tests across manifest loader, deploy wrapper, and hosted mount.
- Fleet-wide tests/ops/test_foundry_portal_tracking_manifests.py guardrail
  preserved (27/27 pass).
- Targeted suite: 56 passed in 6.40s.
- pylint 9.78/10 on new module + CLI.
…ples for portal visibility

Four bugs identified by line-by-line comparison against the official MS Learn
`Deploy a hosted agent` doc and the `Microsoft/foundry-samples` repository
(after the scaffolding-only initial PR did not produce a visible agent):

1. `AIProjectClient` now built with `allow_preview=True` so the
   `agents.create_version` V3 surface is actually exposed. Without the flag
   the SDK silently routes to legacy assistants and the new agent never
   materializes in the New Foundry portal.

2. Terminal status set tightened from
   `{active,ready,succeeded,failed,error}` to the documented terminal set
   `{active,failed,deleted}` with `active` as the only success terminal;
   `deleting` is correctly treated as transient.

3. `apps/inventory-health-check/agent.hosted.yaml` no longer redeclares
   the platform-injected `APPLICATIONINSIGHTS_CONNECTION_STRING`. The full
   forbidden list is now documented inline:
   `FOUNDRY_PROJECT_ENDPOINT`, `FOUNDRY_PROJECT_ARM_ID`,
   `FOUNDRY_AGENT_NAME`, `FOUNDRY_AGENT_VERSION`,
   `FOUNDRY_AGENT_SESSION_ID`, `APPLICATIONINSIGHTS_CONNECTION_STRING`.
   Collisions on any of these cause `create_version` to reject the manifest.

4. Pilot manifest renamed `FOUNDRY_AGENT_NAME_FAST` / `_RICH` to
   `FOUNDRY_AGENT_ID_FAST` / `_RICH` to match the runtime contract in
   ADR-010 / `holiday_peak_lib.config._build_foundry_config`.
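
A pre-flight check along these lines could catch a collision before `create_version` rejects the manifest. This is only a sketch: the function name `reserved_collisions` is hypothetical, and the set holds just the six names documented in item 3 (later commits in this PR found that the platform also rejects other `FOUNDRY_*` / `AGENT_*` names and `PORT`).

```python
# The six platform-injected names listed in item 3 above.
DOCUMENTED_RESERVED = frozenset({
    "FOUNDRY_PROJECT_ENDPOINT",
    "FOUNDRY_PROJECT_ARM_ID",
    "FOUNDRY_AGENT_NAME",
    "FOUNDRY_AGENT_VERSION",
    "FOUNDRY_AGENT_SESSION_ID",
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
})


def reserved_collisions(env_names, reserved=DOCUMENTED_RESERVED):
    """Return the declared env names that collide with reserved ones, sorted."""
    return sorted(set(env_names) & set(reserved))


# Hypothetical manifest env list, for illustration only.
declared = [
    "UVICORN_PORT",
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
    "MODEL_DEPLOYMENT_NAME_FAST",
]
print(reserved_collisions(declared))
```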

Loader (`load_manifest`) also probes `agent.manifest.yaml` first (the
canonical name used by `foundry-samples` and `azd ai agent init -m`)
before falling back to `agent.hosted.yaml` and `agent.yaml`, so future
services may adopt either name without changes to the loader.
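
The filename priority can be sketched as a first-match probe. This is an illustration under stated assumptions, not the actual `load_manifest` code: `find_manifest` and `CANDIDATES` are hypothetical names, and the real loader also parses and validates the file.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Probe order described in the paragraph above.
CANDIDATES = ("agent.manifest.yaml", "agent.hosted.yaml", "agent.yaml")


def find_manifest(service_dir: Path) -> Optional[Path]:
    """Return the first candidate file that exists, or None."""
    for name in CANDIDATES:
        candidate = service_dir / name
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A service that ships both the tracking file and the hosted manifest.
    (root / "agent.yaml").write_text("kind: tracking\n")
    (root / "agent.hosted.yaml").write_text("kind: hosted\n")
    chosen_name = find_manifest(root).name

print(chosen_name)
```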

Tests: 18 hosting tests pass; one new test covers `deleted` status, one
verifies `allow_preview=True` is passed to `AIProjectClient`, two cover
the loader filename-priority ordering. Pylint 9.78/10.
@Cataldir Cataldir changed the title from "feat(#990): Foundry V3 hosted-agents end-to-end scaffolding" to "feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes)" May 13, 2026
Cataldir added 4 commits May 13, 2026 12:40
…ationError)

Foundry V3 hosted-agents platform reserves the entire FOUNDRY_*/AGENT_* env-var namespaces (per container-image-spec). Live create_version returned: 'Environment variable FOUNDRY_AGENT_ID_FAST is reserved for platform use.' Rename in-container env vars to HPH_AGENT_ID_FAST/_RICH (and matching HPH_AGENT_NAME_*). build_foundry_config now reads HPH_ first with FOUNDRY_AGENT_ID_* fallback so AKS deploys remain back-compat. Operator env contract unchanged: external ${FOUNDRY_AGENT_ID_FAST} is mapped to HPH_AGENT_ID_FAST inside the container via manifest placeholder substitution.
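
The read order described above (HPH_ first, legacy FOUNDRY_ name as fallback) might look roughly like this. A sketch only: `resolve_agent_id` is a hypothetical helper, not the body of build_foundry_config, and treating an empty value as unset is an assumption.

```python
import os


def resolve_agent_id(role, env=None):
    """Prefer HPH_AGENT_ID_<ROLE>, then fall back to FOUNDRY_AGENT_ID_<ROLE>."""
    env = os.environ if env is None else env
    suffix = role.upper()
    for name in (f"HPH_AGENT_ID_{suffix}", f"FOUNDRY_AGENT_ID_{suffix}"):
        value = env.get(name)
        if value:  # assumption: empty strings count as "not set"
            return value
    return None


# Hosted container: only the HPH_ name is declared.
print(resolve_agent_id("fast", {"HPH_AGENT_ID_FAST": "ecommerce-catalog-search-fast"}))
# AKS deploy: only the legacy name is present.
print(resolve_agent_id("fast", {"FOUNDRY_AGENT_ID_FAST": "legacy-id"}))
```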
…ctive'

Foundry SDK 2.1.0 returns status as AgentVersionStatus enum whose str() is 'AgentVersionStatus.FAILED'. Previous str().lower() produced 'agentversionstatus.failed' which never matched terminal sets. Add _normalize_status helper that prefers enum .value and falls back to stripping dotted Enum.MEMBER prefix. Three new tests cover all paths.
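
The mismatch and the normalisation can be reproduced with a stand-in enum. This is a sketch of the idea, not the helper in deploy.py: the `AgentVersionStatus` class below is a local stand-in for the SDK type, and `normalize_status` is written from the description above.

```python
from enum import Enum


class AgentVersionStatus(Enum):
    """Local stand-in for the SDK enum; member values are assumptions."""
    ACTIVE = "active"
    FAILED = "failed"


TERMINAL_STATUSES = {"active", "failed", "deleted"}


def normalize_status(status) -> str:
    """Prefer the enum .value; otherwise strip a dotted Enum.MEMBER prefix."""
    value = getattr(status, "value", None)
    if isinstance(value, str):
        return value.lower()
    return str(status).rsplit(".", 1)[-1].lower()


# The old approach never matched the terminal set:
print(str(AgentVersionStatus.FAILED).lower() in TERMINAL_STATUSES)
# The normalised form does:
print(normalize_status(AgentVersionStatus.FAILED) in TERMINAL_STATUSES)
```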
…convention

The azure-ai-agentserver-core framework reads the canonical PORT env var
(default 8088) via resolve_port(), but our containers were only listening
on UVICORN_PORT. This caused Foundry V3 hosted-agent invocations to
return 424 session_not_ready because the gateway probed PORT=8088 while
uvicorn was bound to UVICORN_PORT.

Changes:

apps/inventory-health-check/src/Dockerfile
  - CMD now reads ${PORT:-${UVICORN_PORT:-8088}} so Foundry V3 PORT
    takes precedence, AKS keeps UVICORN_PORT=8000 as legacy override,
    and the framework default of 8088 is the fallback.

apps/inventory-health-check/agent.hosted.yaml
  - Add PORT=8088 (canonical Foundry V3), UVICORN_PORT=8088 (alignment),
    and WEB_CONCURRENCY=1 to keep startup under readiness deadline.

lib/src/holiday_peak_lib/app_factory.py
  - _service_lifespan now emits six explicit lifespan_* log lines for
    trace correlation in App Insights.

lib/src/holiday_peak_lib/foundry_hosting/deploy.py
  - _extract handles collections.abc.Mapping (AgentVersionDetails is a
    MutableMapping, not a dict, and exposes fields via __getitem__).
  - Add _pick_latest_version tolerant of v3, 3, 3.1.0 label shapes.

lib/tests/test_foundry_hosting_deploy.py
  - 479 new lines covering Mapping branch, picker, and re-fetch path.

memories/session/foundry-v3-pilot-status.md
  - Resume-state notes: PORT root cause, namespace collision,
    lifespan-mount behavior, ACR drift correction.
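
One way to compare the v3 / 3 / 3.1.0 label shapes mentioned above is to reduce each label to a tuple of integers. A sketch under assumptions: `version_key` and `pick_latest` are hypothetical names, and the real _pick_latest_version may order labels differently (for example around pre-release suffixes, which this ignores).

```python
import re


def version_key(label):
    """Turn 'v3', '3', 3 or '3.1.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", str(label)))


def pick_latest(labels):
    """Return the label with the highest numeric key, or None if empty."""
    labels = list(labels)
    return max(labels, key=version_key) if labels else None


print(pick_latest(["v3", "3.1.0", "v20", "19"]))
```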

Refs #990. PR #1103.
Foundry V3 hosted-agents reject "PORT" with "invalid_payload: Environment
variable 'PORT' is reserved for platform use". The Dockerfile CMD already
reads the platform-injected value first, then UVICORN_PORT, then 8088, so
removing PORT here lets the platform inject its own value automatically.
Keep UVICORN_PORT for local docker-run / AKS dev parity.
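
For readers less familiar with the shell expansion, the precedence in `${PORT:-${UVICORN_PORT:-8088}}` can be mirrored in Python. The function name `effective_port` is hypothetical; like `:-`, the `or` chain treats an empty value the same as an unset one.

```python
def effective_port(env) -> int:
    """PORT wins, then UVICORN_PORT, then the 8088 default."""
    raw = env.get("PORT") or env.get("UVICORN_PORT") or "8088"
    return int(raw)


# Foundry V3: the platform injects PORT, so the manifest value is ignored.
print(effective_port({"PORT": "8088", "UVICORN_PORT": "8000"}))
# AKS / local docker run: no PORT, so UVICORN_PORT applies.
print(effective_port({"UVICORN_PORT": "8000"}))
```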

Also refresh memories/session/foundry-v3-pilot-status.md with the runbook
proven during the pilot:
  1. ACR azureAdAuthenticationAsArmPolicy must be enabled
  2. AI-account system MI needs AcrPull + Container Registry Repository
     Reader on the canonical ACR (not only the project MI)
  3. Per-version agent MI and blueprint MI need Foundry User on the
     project when deploying via the SDK (azd auto-handles this; SDK
     path does not)

v20 of inventory-health-check is now active and returns 200 from
/responses with a structured domain answer.

Refs: #990
@Cataldir
Contributor Author

Follow-up issues filed for the operational findings discovered during root-cause analysis of v15-v19 ImageErrors:

  • #1107 (auto-grant Foundry User in the SDK deploy path)
  • #1108 (codify ACR prerequisites in IaC)

PR body updated with the new findings and end-to-end invocation evidence (v20 active, status=completed, Foundry storage POST -> 201). The three operational fixes are already applied to the live dev environment so the pilot is green; the issues track codifying them so future hosted-agent deploys do not need manual operator RBAC steps.

…treaming

Foundry's ResponsesHostServer (agent-framework-foundry-hosting==1.0.0a260507)
calls agent.run with two distinct contracts depending on stream:

  stream=False -> response = await agent.run(stream=False, ...)  # coroutine
  stream=True  -> async for update in agent.run(stream=True, ...): # iterator

Our adapter was marked async def run, so it always returned a coroutine.
When the Foundry portal Playground (which always sets stream=True) hit
the adapter, the framework tried to async-iterate the coroutine and
crashed with: 'async for' requires an object with __aiter__, got coroutine.

Fix: reshape run() into a synchronous dispatcher that returns either a
coroutine (_run_once -> AgentResponse) or an async iterator
(_run_streaming -> AgentResponseUpdate) based on the stream flag. The
streaming path emits a single AgentResponseUpdate carrying one text
content -- sufficient for the SSE tracker to render and terminate the
stream cleanly. Per-token streaming via invoke_model_stream remains a
follow-up.

Tests:

- Replaced test_hosted_run_adapter_refuses_streaming with
  test_hosted_run_adapter_streams_single_update (pins the async-iterator
  contract)
- Added test_hosted_run_adapter_non_streaming_returns_awaitable to pin
  the awaitable contract for stream=False
- All 12 hosted-adapter tests pass; 1360 lib tests pass; 3 pilot tests pass

Refs: PR #1103
@Cataldir
Contributor Author

Streaming-protocol fix landed: 881b49a8

Found a second-order code bug while diagnosing the "agent doesn't reply in the UI" report: the v20 ping test only exercised the stream=false path, but the Foundry portal Playground sets stream=true by default. The streaming path crashed with:

TypeError: 'async for' requires an object with __aiter__ method, got coroutine
  at agent_framework_foundry_hosting/_responses.py:341

…which cascaded into a 401 on the persistence write because the response object had already been created server-side. Symptom in the portal:

"An internal error occurred while storing the response. Subsequent retrieval is not guaranteed."

Root cause

_HostedAgentRunAdapter.run was marked async def, so it always returned a coroutine. The upstream framework's contract is polymorphic:

  • stream=False → response = await agent.run(stream=False, ...) (coroutine returning AgentResponse) ✅
  • stream=True → async for update in agent.run(stream=True, ...): (async iterator of AgentResponseUpdate) ❌

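The same method, run, must return either a coroutine or an async iterator depending on stream. Calling a plain async def (one with no yield in its body) always produces a coroutine, so it can only satisfy the first contract.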

Fix (881b49a8)

Reshaped run into a synchronous dispatcher in lib/src/holiday_peak_lib/agents/hosted.py:

def run(self, messages=None, *, stream=False, session=None, **kwargs):
    if stream:
        return self._run_streaming(messages)  # async iterator
    return self._run_once(messages)            # coroutine

async def _run_once(self, messages) -> AgentResponse: ...
async def _run_streaming(self, messages) -> AsyncIterator[AgentResponseUpdate]:
    reply_text = await self._invoke_handle(messages)
    yield AgentResponseUpdate(
        contents=[Content(type="text", text=reply_text)],
        role="assistant",
        agent_id=self.id,
    )

The retail agents' handle() is unary, so the streaming path emits one well-formed AgentResponseUpdate chunk carrying the full reply. The SSE tracker on the host side renders it correctly and terminates the stream. Per-token streaming via invoke_model_stream remains a follow-up.
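
The dispatcher pattern can be exercised in isolation with stand-in types. This is a minimal sketch, not the adapter in hosted.py: `EchoAdapter`, `Update`, and `_handle` are hypothetical, and `Update` only mimics the role AgentResponseUpdate plays here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Update:
    """Hypothetical stand-in for AgentResponseUpdate."""
    text: str


class EchoAdapter:
    def run(self, message, *, stream=False):
        # Plain def: the call returns the right kind of object immediately.
        if stream:
            return self._run_streaming(message)  # async generator
        return self._run_once(message)           # coroutine

    async def _handle(self, message):
        await asyncio.sleep(0)  # placeholder for the unary agent call
        return f"echo: {message}"

    async def _run_once(self, message):
        return await self._handle(message)

    async def _run_streaming(self, message):
        # One chunk carrying the full reply, as in the fix described above.
        yield Update(text=await self._handle(message))


async def main():
    adapter = EchoAdapter()
    once = await adapter.run("hi")                              # awaitable contract
    chunks = [u async for u in adapter.run("hi", stream=True)]  # iterator contract
    return once, chunks


once, chunks = asyncio.run(main())
print(once, chunks)
```

Because `_run_streaming` contains a yield, calling it produces an async generator rather than a coroutine, which is what lets `async for` accept it.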

Tests

  • test_hosted_run_adapter_streams_single_update — pins the async-iterator contract (hasattr(iterator, "__aiter__"))
  • test_hosted_run_adapter_non_streaming_returns_awaitable — pins the awaitable contract
  • 12/12 hosted-adapter tests pass; 1360 lib tests pass; 3 pilot tests pass; pre-push gates green

Operator next steps

Once this PR merges (or now, on the pilot branch):

  1. Build v21: az acr build --registry holidaypeakhub405devacr --image inventory-health-check:foundry-v7 --file apps/inventory-health-check/src/Dockerfile --target prod --no-logs "https://github.com/Azure-Samples/holiday-peak-hub.git#feature/foundry-hosted-agents-pilot"
  2. Deploy v21: python scripts/ops/deploy_hosted_agent.py --agent-yaml apps/inventory-health-check/agent.hosted.yaml --image-uri holidaypeakhub405devacr.azurecr.io/inventory-health-check:foundry-v7 --project-endpoint <endpoint> --json
  3. Manually grant Foundry User on the new per-version MI (this is the manual step that #1107, "[P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py", will automate; see operational finding O3 in the PR body)
  4. Verify in Playground: open the agent in the Foundry portal, send any prompt, the reply must render without the "internal error storing the response" toast.

Step 3 is still required for v21 because #1107 hasn't landed yet. The reconciliation plan in .tmp/reconciliation-plan.md will be updated to reflect this new sequencing: v21 deploy + manual RBAC grant + Playground verification all precede the merge of #1103.

Consolidate the manual `az role assignment create` runbook step into
`deploy_hosted_agent_version`. The per-version managed identity minted by
`AIProjectClient.agents.create_version` does NOT receive the Foundry User
role on the project scope automatically, so every Playground / Responses
invocation fails 401 on the storage POST:

    Foundry storage POST .../storage/responses?api-version=v1 -> 401
    Principal does not have access to API/Operation.

The `azd` and VS-Code extension deploy paths grant it implicitly; the SDK
path (this module) did not, leaving operators to remember a manual step
that was easy to skip. This change closes the loop in code.

Implementation
--------------

* `deploy_hosted_agent_version` accepts `auto_grant_role` (default True),
  `foundry_role_name` ("Foundry User"), `project_scope` (optional override),
  `role_granter` + `scope_resolver` (test seams).
* On reaching `active`, the helper resolves the per-version principal id
  from `version_obj.instance_identity.principal_id` (with two preview-era
  field aliases), derives the project ARM scope from `project_endpoint`
  via `az resource list`, and calls `az role assignment create` with the
  `--assignee-principal-type ServicePrincipal` flag — matching the manual
  runbook one-for-one.
* The grant is idempotent: `RoleAssignmentExists` from the Azure CLI is
  treated as success and recorded as `status=already_exists`.
* A failed grant does NOT mask a successful version activation. The
  failure is captured in `result.extras["role_grant"]` with `status=failed`
  and `error=<stderr>` so operators can re-run or escalate.
* `scripts/ops/deploy_hosted_agent.py` exposes `--no-auto-grant-foundry-user`,
  `--foundry-role-name`, and `--project-scope` CLI flags. The JSON output
  now includes the `role_grant` payload.
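
The idempotency rule can be sketched with an injectable runner so that it is testable without the Azure CLI. This is not the module's `_grant_role_via_az` (which, per the tests below, raises on real failure); here the outcome is returned as a dict for brevity. `grant_role`, the fake runner, its stderr text, and the use of `--assignee-object-id` / `--scope` to pass the principal and scope are assumptions for illustration.

```python
import subprocess


def grant_role(principal_id, scope, role="Foundry User", runner=subprocess.run):
    """Run `az role assignment create`; treat an existing assignment as success."""
    argv = [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
    ]
    proc = runner(argv, capture_output=True, text=True, check=False)
    if proc.returncode == 0:
        return {"status": "granted"}
    if "RoleAssignmentExists" in (proc.stderr or ""):
        return {"status": "already_exists"}
    return {"status": "failed", "error": (proc.stderr or "").strip()}


def fake_az_already_exists(argv, **kwargs):
    """Test double; the stderr text is illustrative, not captured CLI output."""
    return subprocess.CompletedProcess(
        argv, 1, stdout="", stderr="RoleAssignmentExists: already assigned"
    )


outcome = grant_role(
    "00000000-0000-0000-0000-000000000000",  # placeholder principal id
    "/subscriptions/example",                # placeholder scope
    runner=fake_az_already_exists,
)
print(outcome)
```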

Tests
-----

* +12 tests in `lib/tests/test_foundry_hosting_deploy.py`:
  - principal-id extraction (3 shapes: `instance_identity`, `managed_identity`,
    `Mapping`) + missing-id null path
  - scope derivation: resolver test seam, malformed endpoint, no-account
  - integration: granted / skipped / already-exists / failure / no-principal
    / explicit-scope-override
  - default `_grant_role_via_az`: success (parses assignment id),
    already-exists (idempotent), real-failure (raises)
* All 1376 lib tests + 108 pilot tests pass.

Refs: #1107, runbook docs/ops/foundry-hosted-agents.md
@Cataldir
Contributor Author

Fix #8 landed: Foundry User role auto-grant now in code (#1107 closed by this PR).

582f443e adds the per-version managed-identity role grant to
deploy_hosted_agent_version, consolidating the manual
az role assignment create runbook step into the deploy path.

What changed

  • lib/src/holiday_peak_lib/foundry_hosting/deploy.py
    • New helpers: _extract_principal_id (probes instance_identity /
      managed_identity / identity field aliases), _derive_project_scope_from_endpoint
      (parses https://{a}.services.ai.azure.com/api/projects/{p} → ARM scope via
      az resource list), _grant_role_via_az (idempotent
      az role assignment create --assignee-principal-type ServicePrincipal --role 'Foundry User'),
      _ensure_foundry_user_grant, _maybe_grant_foundry_user.
    • deploy_hosted_agent_version accepts:
      auto_grant_role: bool = True, foundry_role_name: str = "Foundry User",
      project_scope: str | None = None, role_granter & scope_resolver
      (test seams).
    • Failed grant does NOT mask a successful version activation — recorded
      under result.extras["role_grant"] with status=failed and stderr.
    • RoleAssignmentExists is treated as success
      (status=already_exists).
  • scripts/ops/deploy_hosted_agent.py
    • New flags: --no-auto-grant-foundry-user, --foundry-role-name,
      --project-scope. JSON output now includes the role_grant payload.
  • lib/tests/test_foundry_hosting_deploy.py
    • +12 unit tests covering granted / already-exists / failed / skipped /
      no-principal / explicit-scope paths, principal-id extraction across
      3 SDK field-name aliases, and the default _grant_role_via_az
      success/error/idempotent matrix.
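
The endpoint-parsing half of _derive_project_scope_from_endpoint can be sketched on its own; the az resource list lookup that turns the names into an ARM scope is deliberately left out. `parse_project_endpoint` is a hypothetical name, and the validation rules are assumptions based on the URL shape quoted above.

```python
import re
from urllib.parse import urlparse

_HOST_SUFFIX = ".services.ai.azure.com"


def parse_project_endpoint(endpoint):
    """Return (account_name, project_name) or raise ValueError."""
    parsed = urlparse(endpoint)
    host = parsed.hostname or ""
    match = re.fullmatch(r"/api/projects/([^/]+)/?", parsed.path)
    account = host[: -len(_HOST_SUFFIX)] if host.endswith(_HOST_SUFFIX) else ""
    if parsed.scheme != "https" or not account or match is None:
        raise ValueError(f"not a Foundry project endpoint: {endpoint!r}")
    return account, match.group(1)


print(parse_project_endpoint(
    "https://holidaypeakhub405devais.services.ai.azure.com/api/projects/aipholidaris"
))
```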

Verification

  • pytest lib/tests/test_foundry_hosting_deploy.py → 37 passed (12 new + 25 existing).
  • pytest lib/tests → 1376 passed.
  • pytest tests → 108 passed, 11 skipped (pre-existing).
  • black + isort clean.
  • pylint --fail-on=E,F clean (only pre-existing R/C warnings).

Behaviour after merge

Operators no longer need to follow verify-step 4 from the original PR body.
The standard python scripts/ops/deploy_hosted_agent.py … invocation now
handles the grant. Backwards-compatibility for environments where role
assignment is managed out of band: pass --no-auto-grant-foundry-user.

This unblocks the v21 (foundry-v7) image build + deploy that closes the
final piece of the live-Playground regression.

@Cataldir Cataldir force-pushed the feature/foundry-hosted-agents-pilot branch from 233016e to 02b912d May 18, 2026 14:27
@Cataldir
Contributor Author

#1107 live validation update: hosted Redis/Event Hub isolation

Pushed commit d5d7d214 to feature/foundry-hosted-agents-pilot.

What changed

  • Added framework runtime flags in lib/src/holiday_peak_lib/app_factory.py:
    • HOLIDAY_PEAK_HOT_MEMORY_ENABLED=false detaches optional Redis hot memory from hosted request handling.
    • HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED=false skips Event Hub subscriber startup for hosted containers outside the private AKS VNet.
  • Added bounded Redis socket/connect timeouts and fail-open hot-memory detach when Key Vault Redis secret resolution fails.
  • Updated apps/inventory-health-check/agent.hosted.yaml to disable hot memory and Event Hub subscribers for the Foundry hosted pilot.
  • Documented the runtime isolation contract in docs/governance/backend-governance.md while preserving ADR-007/ADR-032 three-tier memory and ADR-006 Event Hubs as canonical for product services.
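
Reading the two runtime flags could look like the sketch below. It is an assumption-laden illustration, not the code in app_factory.py: `env_flag` is a hypothetical helper, the default of enabled-when-unset is assumed, and the accepted "off" spellings other than false are guesses.

```python
def env_flag(env, name, default=True) -> bool:
    """Interpret an env var as a boolean; unset or blank keeps the default."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() not in {"0", "false", "no", "off"}


# Values as set for the hosted pilot in agent.hosted.yaml.
hosted_env = {
    "HOLIDAY_PEAK_HOT_MEMORY_ENABLED": "false",
    "HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED": "false",
}
print(env_flag(hosted_env, "HOLIDAY_PEAK_HOT_MEMORY_ENABLED"))
print(env_flag({}, "HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED"))
```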

Live deployment

  • Built ACR image: holidaypeakhub405devacr.azurecr.io/inventory-health-check@sha256:b143209842ea322adfd2af99069614db7e9d82bc23088cdeed106d83cf9304a0
  • Deployed Foundry hosted version: 24
  • Status: active
  • Role grant: Foundry User granted to hosted MI at project scope
  • Version metadata verified: digest-pinned image, HOLIDAY_PEAK_HOT_MEMORY_ENABLED=false, HOLIDAY_PEAK_EVENTHUB_SUBSCRIBERS_ENABLED=false

Live Responses API validation

  • Non-streaming probe: HTTP 200, v24, health_status: healthy for SKU-1234.
  • Streaming probe with store:false: HTTP 200, v24, final response.completed, .done events present, health_status: healthy for SKU-1234.
  • Default streaming probe: the initial curl saw a transient connection reset; an immediate diagnostic retry returned HTTP 200, v24, final response.completed, .done events present, health_status: healthy for SKU-1234.

The prior Playground-style failure mode (HTTP 200 SSE starts, then hangs without completion until timeout) did not reproduce on v24.
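
The completion check used by the streaming probes can be approximated with a small Server-Sent Events parser. This is a sketch only: the function names iter_sse_events and stream_completed are hypothetical, and the pass criterion (last event is response.completed and at least one event name ends in .done) is an assumption derived from the probe results listed above.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event = "message"  # default event type when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def stream_completed(lines: Iterable[str]) -> bool:
    """True if the stream ended with response.completed after a .done event."""
    names = [name for name, _ in iter_sse_events(lines)]
    if not names or names[-1] != "response.completed":
        return False
    return any(name.endswith(".done") for name in names)
```

A stream that returns HTTP 200 and then hangs would yield events without a trailing response.completed, so stream_completed returns False for it; a probe still needs its own read timeout to detect the hang in the first place.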

Validation gates

  • Focused hosted/runtime regressions: 65 passed.
  • Full pre-push gate passed:
    • python -m isort --check-only lib apps
    • python -m black --check lib apps
    • python -m pylint --fail-on=E,F ...
    • python -m mypy ...
    • python scripts/ops/check_markdown_links.py --roots docs/governance docs/architecture
    • python scripts/ops/check_event_schema_contracts.py
    • pytest lib/tests --maxfail=1: 1400 passed, 5 skipped
    • app tests excluding UI: 705 passed

Comment thread lib/src/holiday_peak_lib/agents/hosted.py Fixed
return _translate


_HostedAgentRunAdapter = _ResponsesAgentRunAdapter
@github-actions

github-actions Bot commented May 21, 2026

UI route-segment bundle budgets

UI route-segment bundle-budget report (gzipped JS, kilobytes):

route           size      limit     source                      status
--------------------------------------------------------------------------------
/               167.6     150       floor (root+polyfill)       OVER
/retailers      485.4     200       retailers.html              OVER
/builders       485.4     200       builders.html               OVER
/deploy         485.1     250       deploy.html                 OVER

Advisory at v1 (does not block PRs). Strict mode activates after the F1 cleanup follow-up trims dead-weight deps from the global path.

Budgets live in apps/ui/budgets.json. Gate spec: docs/ui/a11y-perf.md.

Development

Successfully merging this pull request may close these issues.

[P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py
