
[P2] foundry-hosting: codify ACR prerequisites for Foundry V3 hosted-agents in IaC #1108

@Cataldir

Description


Problem statement

Bringing the canonical ACR holidaypeakhub405devacr into a working state for Foundry V3 hosted-agents required three operational changes that today exist only on the live resource (no Bicep / Terraform definition in infra/). If the registry is ever rebuilt, deleted, or replicated, every hosted-agent in the project will fail again with the same generic ImageError: Failed to pull container image … that took five inventory-health-check versions to root-cause during PR #1103.

The three operational changes are listed below. The first two must be codified under this issue; the third is tracked separately:

  1. Microsoft.ContainerRegistry/registries/properties/policies/azureADAuthenticationAsArmPolicy.status must be enabled. Foundry hosted-agents pull images by exchanging an ARM-audience AAD token for an ACR data-plane token. When this policy is disabled (the default in some ACR provisioning paths) the exchange is rejected and the platform reports the generic ImageError with zero pull attempts recorded on the ACR, which is what made it so hard to diagnose. The root cause was confirmed in PR #1103 (feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes)): flipping this policy was the single change that moved inventory-health-check from failed/ImageError to active in 21 seconds.

  2. The Foundry AI-account system-assigned managed identity (holidaypeakhub405devais / principal 351cdb70-0600-4c8c-b7f2-c6bf92ae1089) needs AcrPull and Container Registry Repository Reader on the registry — not only the project MI. The docs name only the project MI, but the live behaviour requires both; the test ACR hphtestacr95876 had the same grant by accident, which is why it worked while the canonical one failed.

  3. (Documentation only, not an ACR property): The per-version agent MI and blueprint MI need Foundry User on the project. That side is handled by sibling issue #1107 ([P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py), where the deploy script auto-grants the role, so this issue's scope is ACR only.

Current behavior evidence

Required change

When the canonical ACR is brought under IaC management (separately scoped, may already be tracked by an area:infra epic), the resource definition must include:

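// Parameters assumed to be supplied by the calling module.
param acrName string
param acrLocation string = resourceGroup().location
param foundryAccountPrincipalId string

resource acr 'Microsoft.ContainerRegistry/registries@2023-11-01-preview' = {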
  name: acrName
  location: acrLocation
  sku: { name: 'Premium' }
  properties: {
    adminUserEnabled: false
    publicNetworkAccess: 'Enabled'
    policies: {
      azureADAuthenticationAsArmPolicy: { status: 'enabled' }
    }
  }
}

// Foundry AI-account system MI needs AcrPull + Container Registry Repository Reader
resource acrPullForAccount 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, foundryAccountPrincipalId, 'AcrPull')
  scope: acr
  properties: {
    principalId: foundryAccountPrincipalId
    principalType: 'ServicePrincipal'
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d' // AcrPull
    )
  }
}
resource acrRepoReaderForAccount 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, foundryAccountPrincipalId, 'ContainerRegistryRepositoryReader')
  scope: acr
  properties: {
    principalId: foundryAccountPrincipalId
    principalType: 'ServicePrincipal'
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      'b93aa761-3e63-49ed-ac28-beffa264f7ac' // Container Registry Repository Reader
    )
  }
}

Until the ACR is in IaC, the same enforcement should be implemented as a post-deploy idempotent check script in scripts/ops/ that:

  • runs in CI under the branch policy already in place for the dev environment,
  • reads the canonical ACR name from azd env get-values,
  • calls az acr config authentication-as-arm update --status enabled if not already enabled,
  • ensures both role assignments exist for the Foundry account MI,
  • fails the workflow if any required state is missing and --enforce is passed.
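
The steps above could be sketched as follows. This is a minimal illustration, not the final script: it is written in Python for brevity although the acceptance criteria name a .ps1/.sh script, the environment keys AZURE_CONTAINER_REGISTRY_NAME and FOUNDRY_ACCOUNT_PRINCIPAL_ID are hypothetical, and it assumes one reading of --enforce (report and fail instead of remediating).

```python
import shlex
import subprocess

REQUIRED_ROLES = ("AcrPull", "Container Registry Repository Reader")


def parse_env_values(text):
    """Parse `azd env get-values` output (KEY="value" lines) into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # strips the quotes around the value
        values[key.strip()] = parts[0] if parts else ""
    return values


def missing_prereqs(policy_status, assigned_roles):
    """Return the prerequisites that are not yet satisfied."""
    missing = []
    if policy_status.strip().lower() != "enabled":
        missing.append("authentication-as-arm")
    for role in REQUIRED_ROLES:
        if role not in assigned_roles:
            missing.append(role)
    return missing


def remediation_commands(acr_name, acr_id, principal_id, missing):
    """Build the az CLI argument lists that would fix each missing item."""
    commands = []
    for item in missing:
        if item == "authentication-as-arm":
            commands.append(["az", "acr", "config", "authentication-as-arm",
                             "update", "--registry", acr_name,
                             "--status", "enabled"])
        else:
            commands.append(["az", "role", "assignment", "create",
                             "--assignee-object-id", principal_id,
                             "--assignee-principal-type", "ServicePrincipal",
                             "--role", item, "--scope", acr_id])
    return commands


def run(*args):
    """Run an az/azd command and return its stdout as text."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


def main(argv):
    enforce = "--enforce" in argv
    env = parse_env_values(run("azd", "env", "get-values"))
    acr_name = env["AZURE_CONTAINER_REGISTRY_NAME"]     # hypothetical key
    principal_id = env["FOUNDRY_ACCOUNT_PRINCIPAL_ID"]  # hypothetical key
    acr_id = run("az", "acr", "show", "--name", acr_name,
                 "--query", "id", "-o", "tsv").strip()
    status = run("az", "acr", "config", "authentication-as-arm", "show",
                 "--registry", acr_name, "--query", "status", "-o", "tsv")
    roles = run("az", "role", "assignment", "list",
                "--assignee", principal_id, "--scope", acr_id,
                "--query", "[].roleDefinitionName", "-o", "tsv").splitlines()
    missing = missing_prereqs(status, roles)
    if missing and enforce:
        print("missing prerequisites: " + ", ".join(missing))
        return 1
    for command in remediation_commands(acr_name, acr_id, principal_id, missing):
        run(*command)
    return 0
```

An entry point would call sys.exit(main(sys.argv[1:])). Keeping the state comparison and command construction in pure functions means a second run with nothing missing issues no az calls, and the logic can be unit-tested without Azure access.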

Acceptance criteria

  • infra/ (or the equivalent IaC module that creates holidaypeakhub405devacr) declares policies.azureADAuthenticationAsArmPolicy.status = 'enabled' on the registry resource.
  • infra/ declares role assignments granting AcrPull and Container Registry Repository Reader to the Foundry account system MI on the ACR scope.
  • OR: a scripts/ops/ensure_acr_foundry_prereqs.{ps1,sh} exists, is documented in docs/governance/hosted-agents.md, and runs idempotently in the deploy-azd-* workflow before agent registration.
  • A regression test (Bicep what-if snapshot or script unit test) asserts both states are enforced.
  • memories/session/foundry-v3-pilot-status.md is updated to point at the codified path and the manual az acr config authentication-as-arm update step is removed from the runbook.
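
For the regression-test criterion, one possible shape is a check over the compiled ARM template (for example the JSON produced by az bicep build). The sketch below is an assumption about how such a test might look, not an existing test in the repository: check_template is a hypothetical helper, it inspects only top-level resources (nested module deployments are not followed), and it does not verify which principal the role assignments target.

```python
ACR_TYPE = "Microsoft.ContainerRegistry/registries"
ROLE_TYPE = "Microsoft.Authorization/roleAssignments"

# Built-in role definition GUIDs, as used in the Bicep in this issue.
REQUIRED_ROLE_IDS = {
    "AcrPull": "7f951dda-4ed3-4680-a7ca-43fe172d538d",
    "Container Registry Repository Reader": "b93aa761-3e63-49ed-ac28-beffa264f7ac",
}


def iter_resources(template):
    """Return top-level resources whether they are stored as a list or a dict."""
    resources = template.get("resources", [])
    if isinstance(resources, dict):
        return list(resources.values())
    return list(resources)


def check_template(template):
    """Return a list of problems; an empty list means both states are declared."""
    problems = []
    resources = iter_resources(template)

    registries = [r for r in resources if r.get("type") == ACR_TYPE]
    if not registries:
        problems.append("no registry resource declared")
    for registry in registries:
        policies = registry.get("properties", {}).get("policies", {})
        policy = policies.get("azureADAuthenticationAsArmPolicy", {})
        if policy.get("status") != "enabled":
            problems.append("azureADAuthenticationAsArmPolicy is not enabled")

    # roleDefinitionId is usually an ARM expression string, so match the GUID
    # as a substring rather than comparing the whole value.
    declared = " ".join(
        str(r.get("properties", {}).get("roleDefinitionId", ""))
        for r in resources
        if r.get("type") == ROLE_TYPE
    )
    for name, role_id in REQUIRED_ROLE_IDS.items():
        if role_id not in declared:
            problems.append("missing role assignment: " + name)
    return problems
```

A test would load the compiled template with json.load and assert that check_template returns an empty list.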

Risks and dependencies

ADR impact

No existing ADR is invalidated. This work is operational hardening of the runtime prerequisites for ADR-038 (continuous agent evaluation) / ADR-046 (Foundry V3 hosted-agents) — recording an operational annex under docs/architecture/adrs/ (or a docs/architecture/runtime-prereqs/ note) would be appropriate but not blocking.

Labels

  • priority: medium
  • type:devops
  • area:infra
  • foundry
  • tech-debt

BPMN process

%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#FFB3BA',
  'primaryTextColor':'#000',
  'primaryBorderColor':'#FF8B94',
  'lineColor':'#BAE1FF',
  'secondaryColor':'#BAE1FF',
  'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
  A[Confirm ACR IaC Scope] --> B[Design Bicep or Script Path]
  B --> C[Implement on Issue Branch]
  C --> D[Add Snapshot Test]
  D --> E[Open PR]
  E --> F[CI What-If or Script Validate]
  F --> G[Merge to Main]
  G --> H[Update Runbook + Close Issue]

Filed as a follow-up to PR #1103. The three operational fixes described here are already applied to the live canonical ACR; this issue tracks codifying them so they survive a registry rebuild.
