
[P1] foundry-hosting: auto-grant Foundry User to per-version MI in deploy_hosted_agent.py #1107

@Cataldir

Problem statement

When a Foundry V3 hosted-agent version is created via the SDK path (lib/src/holiday_peak_lib/foundry_hosting/deploy.py → client.agents.create_version), the platform mints a new per-version managed identity (instance_identity.principal_id) without granting it the Foundry User role on the project. As a result, the container starts, accepts requests, and runs the agent handler successfully, but then fails to persist the response: every POST /storage/responses from the container returns HTTP 401, and the public /responses call surfaces as HTTP 500 to the caller, even though the agent code is healthy.

The azd up / VS Code Foundry-extension deploy path does perform this RBAC grant automatically. The SDK path does not, so any agent deployed by scripts/ops/deploy_hosted_agent.py requires a manual operator step before it can serve traffic. That manual step blocked PR #1103 for ~40 minutes and is the only remaining wart in the otherwise fully working pilot.

Once granted, the role is permanent for that specific per-version MI, but a fresh MI is minted on every create_version, so the grant must be applied on each new version. The blueprint MI (blueprint.principal_id) is stable across versions of the same agent name and only needs to be granted once.

Current behavior evidence

  • lib/src/holiday_peak_lib/foundry_hosting/deploy.py: deploy_hosted_agent_version (line ~430) calls client.agents.create_version, then _resolve_latest_version, then optionally polls until terminal. No role-assignment step exists.
  • scripts/ops/deploy_hosted_agent.py: the CLI wrapper around the library function. Same behaviour.
  • App Insights trace from inventory-health-check v15–v19 (before the manual grant; pilot logs from PR #1103, "feat(#990): Foundry V3 hosted-agents pilot end-to-end (with portal-visibility fixes)"):
    Foundry storage POST .../storage/responses?api-version=v1 -> 401
    Inbound POST /responses completed with status 500
    
  • App Insights trace from v20 (after manual grant of Foundry User to the per-version MI):
    DefaultAzureCredential acquired a token from ManagedIdentityCredential
    Foundry storage POST .../storage/responses?api-version=v1 -> 201
    Response caresp_ completed: status=completed output_count=1
    Inbound POST /responses completed with status 200
    
  • Manual fix that unblocked PR #1103:
    # Bash syntax; replace the <...> placeholders before running.
    projectScope="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<account>/projects/<project>"
    az role assignment create \
      --assignee-object-id "<instance_identity.principal_id>" \
      --assignee-principal-type ServicePrincipal \
      --role "Foundry User" \
      --scope "$projectScope"
  • Runbook reference: memories/session/foundry-v3-pilot-status.md §"Three root causes (in order of discovery)", case 3.

Required change

deploy_hosted_agent_version must, after _resolve_latest_version returns and before entering the poll loop:

  1. Extract instance_identity.principal_id and blueprint.principal_id from version_obj (both are nested objects on HostedAgentVersion).
  2. Resolve the project ARM resource ID from the project endpoint (https://<account>.services.ai.azure.com/api/projects/<project> → /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<account>/projects/<project>). The account-name and project-name segments come from the endpoint URL; the subscription and resource group can come from the AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP env vars or from an az resource show lookup on the account.
  3. Use azure.mgmt.authorization.AuthorizationManagementClient.role_assignments.create to assign Foundry User (the role definition ID is stable, so record it as a module constant). Use a deterministic name=guid(scope, principal_id, role_def_id) so the call is idempotent (HTTP 201 on the first call, HTTP 409 on subsequent calls; treat 409 as success).
  4. Apply the same grant to the blueprint MI (idempotent; usually a no-op after the first version of a given agent name).
  5. Emit a structured log foundry_hosted_agent_rbac_granted capturing principal_id, role, scope, and the assignment ID.
  6. Fail-soft behaviour: if the calling identity does not have Microsoft.Authorization/roleAssignments/write on the project scope (HTTP 403 from the API), log a clear warning explaining the manual grant needed and surface a RoleAssignmentSkipped field on HostedAgentDeploymentResult. Do not fail the overall deploy, because in CI scenarios the caller may legitimately delegate RBAC to a separate workflow.
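
Steps 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the helper names project_scope_from_endpoint and deterministic_assignment_name are hypothetical, the endpoint shape is taken from step 2, and uuid5 is used as a stand-in for a deterministic GUID (it is not the same algorithm as the ARM template guid() function).

```python
import uuid
from urllib.parse import urlparse


def project_scope_from_endpoint(endpoint: str, subscription_id: str, resource_group: str) -> str:
    """Build the project ARM resource ID from a project endpoint URL.

    Assumes the endpoint looks like
    https://<account>.services.ai.azure.com/api/projects/<project>.
    """
    parsed = urlparse(endpoint)
    host = parsed.hostname or ""
    account = host.split(".")[0]
    parts = [segment for segment in parsed.path.split("/") if segment]
    # Expected path segments: ["api", "projects", "<project>"]
    if not account or len(parts) < 3 or parts[:2] != ["api", "projects"]:
        raise ValueError(f"Cannot infer project scope from endpoint: {endpoint!r}")
    project = parts[2]
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.CognitiveServices/accounts/{account}"
        f"/projects/{project}"
    )


def deterministic_assignment_name(scope: str, principal_id: str, role_def_id: str) -> str:
    """Derive a stable GUID from (scope, principal, role) so retries reuse one name."""
    key = "|".join(part.lower() for part in (scope, principal_id, role_def_id))
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```

Because the name depends only on the three inputs, re-running the deploy for the same version produces the same assignment name, which is what lets the repeat call be treated as a no-op.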

Suggested API surface

Add a grant_rbac: bool = True flag and an optional rbac_credential: TokenCredential | None = None (defaults to the same credential used for the project client). New module constant:

_FOUNDRY_USER_ROLE_DEFINITION_ID = (
    "/providers/Microsoft.Authorization/roleDefinitions/"
    "<role-id-for-Foundry User>"
)

(Confirm the exact role definition GUID with az role definition list --name "Foundry User" during implementation. Do not hard-code an unverified GUID.)

CLI flag wiring

Expose --no-grant-rbac on scripts/ops/deploy_hosted_agent.py for CI flows that already delegate RBAC, and --rbac-scope <arm-id> for projects whose scope cannot be inferred from the endpoint.
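
One possible way to wire these two flags with argparse is sketched below. The build_parser function and the help strings are illustrative assumptions; the script's actual parser layout is not shown in this issue.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser fragment covering only the two new RBAC flags."""
    parser = argparse.ArgumentParser(prog="deploy_hosted_agent.py")
    parser.add_argument(
        "--no-grant-rbac",
        dest="grant_rbac",
        action="store_false",  # grant_rbac defaults to True when the flag is absent
        help="Skip the Foundry User role grants (for CI flows that delegate RBAC).",
    )
    parser.add_argument(
        "--rbac-scope",
        metavar="ARM_ID",
        default=None,
        help="Explicit project ARM resource ID when it cannot be inferred from the endpoint.",
    )
    return parser
```

Using store_false with dest="grant_rbac" keeps the parsed value aligned with the proposed grant_rbac: bool = True library argument, so it can be passed through unchanged.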

Acceptance criteria

  • deploy_hosted_agent_version performs the two role grants (instance MI + blueprint MI) by default after _resolve_latest_version.
  • Grant is idempotent (deterministic assignment name; HTTP 409 treated as success).
  • On HTTP 403 from the role-assignment API, a warning is logged and HostedAgentDeploymentResult.extras["rbac_skipped"] = True; the deploy still returns the version result.
  • grant_rbac=False / --no-grant-rbac flag fully skips the section and emits a structured log noting the skip.
  • Unit tests cover: happy path (201), idempotent path (409), permission-denied path (403 → warn + extras flag), and skip path (grant_rbac=False).
  • Integration test (live, opt-in via env var) deploys an agent, asserts the version is active, and asserts an invocation against the public endpoint returns HTTP 200 with Foundry storage POST … -> 201 in the App Insights trace.
  • memories/session/foundry-v3-pilot-status.md is updated to drop case 3 from the manual runbook and reference this PR/issue.
  • scripts/ops/deploy_hosted_agent.py CLI help documents the new flags.
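
The four unit-test paths in the criteria above map onto a small piece of control flow, sketched here with the standard library only. Everything in it is an assumption for illustration: RoleAssignmentHttpError is a hypothetical stand-in for whatever error the Azure SDK raises, create_assignment is an injected callable that would wrap the real role_assignments.create call, and the foundry_hosted_agent_rbac_skipped log name is invented by analogy with foundry_hosted_agent_rbac_granted.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("foundry_hosting.rbac")


class RoleAssignmentHttpError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def grant_role_fail_soft(
    create_assignment: Callable[[], str],
    extras: dict,
    *,
    grant_rbac: bool = True,
) -> Optional[str]:
    """Run one role grant; return the assignment ID, or None if nothing was created."""
    if not grant_rbac:
        # Skip path: the caller opted out, so only record the skip.
        logger.info("foundry_hosted_agent_rbac_skipped reason=grant_rbac_false")
        return None
    try:
        assignment_id = create_assignment()
    except RoleAssignmentHttpError as exc:
        if exc.status_code == 409:
            # Idempotent path: the assignment already exists, treat as success.
            logger.info("foundry_hosted_agent_rbac_granted already_exists=true")
            return None
        if exc.status_code == 403:
            # Permission-denied path: warn and flag, but let the deploy continue.
            logger.warning("RBAC grant skipped (HTTP 403); a manual Foundry User grant is needed")
            extras["rbac_skipped"] = True
            return None
        raise  # Any other status is unexpected and should surface to the caller.
    logger.info("foundry_hosted_agent_rbac_granted assignment_id=%s", assignment_id)
    return assignment_id
```

Injecting the callable keeps the branching testable without a live subscription; a real implementation would call this once for the instance MI and once for the blueprint MI.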

Risks and dependencies

ADR impact

No ADR is invalidated. This implements the operator-facing contract for ADR-046 (Foundry V3 hosted-agents). It is recommended to add a short addendum to ADR-046 capturing "SDK deploy path auto-grants Foundry User on per-version MI; manual azd flow continues to rely on azd auth permissions".

Labels

  • priority: high
  • type:backend
  • type:devops
  • foundry

BPMN process

%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#FFB3BA',
  'primaryTextColor':'#000',
  'primaryBorderColor':'#FF8B94',
  'lineColor':'#BAE1FF',
  'secondaryColor':'#BAE1FF',
  'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
  A[Confirm Foundry User Role GUID] --> B[Design RBAC Step]
  B --> C[Implement on Issue Branch]
  C --> D[Unit + Integration Tests]
  D --> E[Open PR]
  E --> F[Validation and Fixes]
  F --> G[Merge to Main]
  G --> H[Update Runbook + Close Issue]

Filed as a follow-up to PR #1103. The manual grant described here was applied during the pilot and proved the design works; this issue tracks moving the grant into the SDK path so future hosted-agent deployments don't require operator RBAC steps.
