Problem statement
Bringing the canonical ACR holidaypeakhub405devacr into a working state for Foundry V3 hosted-agents required three operational changes that today exist only on the live resource (no Bicep / Terraform definition in infra/). If the registry is ever rebuilt, deleted, or replicated, every hosted-agent in the project will fail again with the same generic ImageError: Failed to pull container image … that took five inventory-health-check versions to root-cause during PR #1103.
These are the three operational changes that must be codified:
Microsoft.ContainerRegistry/registries/properties/policies/azureADAuthenticationAsArmPolicy.status must be enabled. Foundry hosted-agents pull images by exchanging an ARM-audience AAD token for an ACR data-plane token. When this policy is disabled (the default in some ACR provisioning paths), the exchange is rejected and the platform reports the generic ImageError with zero pull attempts recorded on the ACR, which is what made it so hard to diagnose. Confirmed as the root cause in PR #1103 (feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes)): flipping this policy was the single change that moved inventory-health-check from failed/ImageError to active in 21 seconds.
The Foundry AI-account system-assigned managed identity (holidaypeakhub405devais / principal 351cdb70-0600-4c8c-b7f2-c6bf92ae1089) needs AcrPull and Container Registry Repository Reader on the registry, in addition to the project MI. The docs name only the project MI, but the live behaviour requires both; the test ACR hphtestacr95876 had the same grant by accident, which is why it worked while the canonical one failed.
(Documentation only, not an ACR property): The per-version agent MI and blueprint MI need Foundry User on the project. That side is handled by sibling issue #1107 ([P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py), where the deploy script auto-grants the role, so this issue's scope is ACR only.
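Current behavior evidence
PR #1103 (feat(#990): Foundry V3 hosted-agents pilot end-to-end), branch feature/foundry-hosted-agents-pilot.
Policy state on the live registry:
az acr config authentication-as-arm show --registry holidaypeakhub405devacr
# { "status": "enabled" }
infra/ Bicep: no Microsoft.ContainerRegistry/registries resource exists. infra/deploy-portal/ contains only main.bicep + modules/portal.bicep, both unrelated to the canonical ACR.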
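The policy check above covers only the first change. The role grants on the registry can be inspected in a similar read-only way. The following is a minimal sketch, not a tested tool: the function name missing_acr_roles is hypothetical, and it assumes the caller is logged in with az and has permission to list role assignments on the registry.

```shell
# Read-only check: print each required role that the principal lacks on the registry.
# Role names come from this issue; the function name is hypothetical.
missing_acr_roles() {
  local registry="$1" principal="$2" scope assigned role
  # Resolve the registry's ARM resource ID to use as the assignment scope.
  scope="$(az acr show --name "$registry" --query id -o tsv)" || return 2
  # One role name per line for everything assigned to the principal at that scope.
  assigned="$(az role assignment list --assignee "$principal" --scope "$scope" \
    --query '[].roleDefinitionName' -o tsv)" || return 2
  for role in "AcrPull" "Container Registry Repository Reader"; do
    # -x matches the whole line, -F treats the role name as a literal string.
    printf '%s\n' "$assigned" | grep -qxF "$role" || echo "$role"
  done
}
```

Called as missing_acr_roles holidaypeakhub405devacr 351cdb70-0600-4c8c-b7f2-c6bf92ae1089, it prints nothing when both roles are present. Running it against the test and canonical registries side by side would show the kind of difference in grants described in the problem statement.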
Required change
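When the canonical ACR is brought under IaC management (separately scoped, may already be tracked by an area:infra epic), the resource definition must set policies.azureADAuthenticationAsArmPolicy.status to 'enabled' and declare the AcrPull and Container Registry Repository Reader role assignments for the Foundry account MI on the registry scope.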
Until the ACR is in IaC, the same enforcement should be implemented as a post-deploy idempotent check script in scripts/ops/ that:
runs in CI on the dev environment branch policy already in place,
reads the canonical ACR name from azd env get-values,
calls az acr config authentication-as-arm update --status enabled if not already enabled,
ensures both role assignments exist for the Foundry account MI,
fails the workflow if any required state is missing and --enforce is passed.
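The interim script described by the steps above could be built around functions like the following. This is a sketch under stated assumptions, not the final scripts/ops/ implementation: the function names are hypothetical, the registry name and the Foundry account MI principal ID are passed in by the caller, and the handling of --enforce is left to a wrapper because the issue does not pin down its exact semantics.

```shell
#!/usr/bin/env bash
# Sketch of the core of an idempotent ACR prerequisite check (function names are hypothetical).
# Assumption: the caller supplies the registry name and the Foundry account MI principal ID.

# Enable ARM-audience token authentication if it is not already enabled.
ensure_arm_auth() {
  local registry="$1" status
  status="$(az acr config authentication-as-arm show \
    --registry "$registry" --query status -o tsv)" || return 1
  if [ "$status" = "enabled" ]; then
    echo "arm-auth: already enabled"
    return 0
  fi
  az acr config authentication-as-arm update \
    --registry "$registry" --status enabled >/dev/null || return 1
  echo "arm-auth: enabled (was ${status:-unknown})"
}

# Create one role assignment on the registry scope unless it already exists.
ensure_role() {
  local principal="$1" scope="$2" role="$3" count
  count="$(az role assignment list --assignee "$principal" --scope "$scope" \
    --role "$role" --query 'length(@)' -o tsv)" || return 1
  if [ "${count:-0}" -gt 0 ]; then
    echo "role: '$role' already assigned"
    return 0
  fi
  az role assignment create --assignee-object-id "$principal" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" --scope "$scope" >/dev/null || return 1
  echo "role: '$role' assigned"
}

# Apply both prerequisites; returns non-zero if any step fails.
ensure_prereqs() {
  local registry="$1" principal="$2" scope rc=0
  scope="$(az acr show --name "$registry" --query id -o tsv)" || return 1
  ensure_arm_auth "$registry" || rc=1
  ensure_role "$principal" "$scope" "AcrPull" || rc=1
  ensure_role "$principal" "$scope" "Container Registry Repository Reader" || rc=1
  return "$rc"
}
```

A wrapper would read the registry name from azd env get-values (the key name is not stated in this issue), call ensure_prereqs with that name and the principal ID, and exit non-zero on failure when --enforce is passed. Because each function checks the current state before changing anything, a second run against an already-compliant registry makes no update or create calls.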
Acceptance criteria
infra/ (or the equivalent IaC module that creates holidaypeakhub405devacr) declares policies.azureADAuthenticationAsArmPolicy.status = 'enabled' on the registry resource.
infra/ declares role assignments granting AcrPull and Container Registry Repository Reader to the Foundry account system MI on the ACR scope.
OR: a scripts/ops/ensure_acr_foundry_prereqs.{ps1,sh} exists, is documented in docs/governance/hosted-agents.md, and runs idempotently in the deploy-azd-* workflow before agent registration.
A regression test (Bicep what-if snapshot or script unit test) asserts both states are enforced.
memories/session/foundry-v3-pilot-status.md is updated to point at the codified path and the manual az acr config authentication-as-arm update step is removed from the runbook.
Risks and dependencies
Risk: Canonical ACR was created out-of-band; importing it into IaC may require an azd import / terraform import flow that doesn't disturb running deployments. Coordinate with the team that owns the original portal deployment.
Risk: Flipping azureADAuthenticationAsArmPolicy is a no-op for non-Foundry consumers (App Service / ACI / AKS use a different token exchange), but Azure Policy may flag the change. Coordinate with platform security.
Dependency: Any future epic that brings the canonical ACR into IaC. If that epic doesn't exist yet, the script-based interim path satisfies AC.
ADR impact
No existing ADR is invalidated. This work is operational hardening of the runtime prerequisites for ADR-038 (continuous agent evaluation) / ADR-046 (Foundry V3 hosted-agents) — recording an operational annex under docs/architecture/adrs/ (or a docs/architecture/runtime-prereqs/ note) would be appropriate but not blocking.
Labels
priority: medium
type:devops
area:infra
foundry
tech-debt
BPMN process
%%{init: {'theme':'base', 'themeVariables': {
'primaryColor':'#FFB3BA',
'primaryTextColor':'#000',
'primaryBorderColor':'#FF8B94',
'lineColor':'#BAE1FF',
'secondaryColor':'#BAE1FF',
'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
A[Confirm ACR IaC Scope] --> B[Design Bicep or Script Path]
B --> C[Implement on Issue Branch]
C --> D[Add Snapshot Test]
D --> E[Open PR]
E --> F[CI What-If or Script Validate]
F --> G[Merge to Main]
G --> H[Update Runbook + Close Issue]
Filed as a follow-up to PR #1103. The three operational fixes described here are already applied to the live canonical ACR; this issue tracks codifying them so they survive a registry rebuild.