Problem statement
When a Foundry V3 hosted-agent version is created via the SDK path (lib/src/holiday_peak_lib/foundry_hosting/deploy.py → client.agents.create_version), the platform mints a new per-version managed identity (instance_identity.principal_id) without granting it the Foundry User role on the project. As a result, the container starts, accepts requests, runs the agent handler successfully, and then fails to persist the response: every POST /storage/responses from the container returns HTTP 401, and the public /responses call surfaces as HTTP 500 to the caller even though the agent code is healthy.
The azd up / VS Code Foundry-extension deploy path does perform this RBAC grant automatically. The SDK path does not, so any agent deployed by scripts/ops/deploy_hosted_agent.py requires a manual operator step before it can serve traffic. That manual step blocked PR #1103 for ~40 minutes and is the only remaining wart in the otherwise fully working pilot.
Once granted, the role is permanent for that specific per-version MI, but a fresh MI is minted on every create_version, so the grant must be applied on each new version. The blueprint MI (blueprint.principal_id) is stable across versions of the same agent name and only needs to be granted once.
Current behavior evidence
lib/src/holiday_peak_lib/foundry_hosting/deploy.py: deploy_hosted_agent_version (line ~430) calls client.agents.create_version, then _resolve_latest_version, then optionally polls until terminal. No role-assignment step exists.
App Insights trace from inventory-health-check v15–v19 (before manual grant; PR #1103 pilot logs):
Foundry storage POST .../storage/responses?api-version=v1 -> 401
Inbound POST /responses completed with status 500
App Insights trace from v20 (after manual grant of Foundry User to the per-version MI):
DefaultAzureCredential acquired a token from ManagedIdentityCredential
Foundry storage POST .../storage/responses?api-version=v1 -> 201
Response caresp_ completed: status=completed output_count=1
Inbound POST /responses completed with status 200
Required change
deploy_hosted_agent_version must, after _resolve_latest_version returns and before entering the poll loop:
Extract instance_identity.principal_id and blueprint.principal_id from version_obj (both are nested objects on HostedAgentVersion).
Resolve the project ARM resource ID from the project endpoint (https://<account>.services.ai.azure.com/api/projects/<project> → /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<account>/projects/<project>). The account-name / project-name segments come from the endpoint URL; the subscription + RG can come from AZURE_SUBSCRIPTION_ID + AZURE_RESOURCE_GROUP env vars or from an az resource show lookup on the account.
Use azure.mgmt.authorization.AuthorizationManagementClient.role_assignments.create to assign Foundry User (role definition ID is stable; record it as a module constant). Use a deterministic name=guid(scope, principal_id, role_def_id) so the call is idempotent (HTTP 201 on first call, HTTP 409 on subsequent calls; treat 409 as success).
Apply the same grant to the blueprint MI (idempotent; usually a no-op after the first version of a given agent name).
Emit a structured log foundry_hosted_agent_rbac_granted capturing principal_id, role, scope, and the assignment ID.
Fail-soft behaviour: if the calling identity does not have Microsoft.Authorization/roleAssignments/write on the project scope (HTTP 403 from the API), log a clear warning explaining the manual grant needed and surface a RoleAssignmentSkipped field on HostedAgentDeploymentResult. Do not fail the overall deploy, because in CI scenarios the caller may legitimately delegate RBAC to a separate workflow.
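The steps above can be sketched roughly as follows. This is a minimal illustration, not the intended implementation: the helper names (project_scope_from_endpoint, assignment_name, grant_role) are hypothetical, uuid5 stands in for the guid(...) function mentioned above, and the create argument is an assumed adapter that would wrap AuthorizationManagementClient.role_assignments.create. The sketch also assumes the raised error exposes a status_code attribute, which should be verified against the SDK during implementation.

```python
import logging
import uuid
from urllib.parse import urlparse

logger = logging.getLogger("foundry_hosting.rbac")  # hypothetical logger name


def project_scope_from_endpoint(endpoint: str, subscription_id: str, resource_group: str) -> str:
    """Build the project ARM resource ID from the project endpoint URL."""
    parsed = urlparse(endpoint)
    account = (parsed.hostname or "").split(".")[0]
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path shape: /api/projects/<project>
    if not account or len(parts) < 3 or parts[:2] != ["api", "projects"]:
        raise ValueError(f"Unrecognised project endpoint: {endpoint!r}")
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.CognitiveServices/accounts/{account}/projects/{parts[2]}"
    )


def assignment_name(scope: str, principal_id: str, role_definition_id: str) -> str:
    """Derive a deterministic assignment name so repeat calls target the same resource."""
    key = "|".join((scope.lower(), principal_id.lower(), role_definition_id.lower()))
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def grant_role(create, scope: str, principal_id: str, role_definition_id: str) -> str:
    """Return 'granted', 'exists' (HTTP 409), or 'skipped' (HTTP 403).

    `create` is an assumed adapter with the shape
    create(scope, name, principal_id, role_definition_id).
    """
    name = assignment_name(scope, principal_id, role_definition_id)
    try:
        create(scope, name, principal_id, role_definition_id)
    except Exception as exc:
        status = getattr(exc, "status_code", None)  # assumption: error carries status_code
        if status == 409:
            return "exists"  # already assigned: treat as success
        if status == 403:
            logger.warning(
                "Caller lacks roleAssignments/write on %s; grant the role to %s manually",
                scope,
                principal_id,
            )
            return "skipped"  # fail-soft: the deploy itself continues
        raise
    logger.info(
        "foundry_hosted_agent_rbac_granted",
        extra={
            "principal_id": principal_id,
            "role": role_definition_id,
            "scope": scope,
            "assignment_id": f"{scope}/providers/Microsoft.Authorization/roleAssignments/{name}",
        },
    )
    return "granted"
```

The same grant_role call would run once for the instance MI and once for the blueprint MI; a 'skipped' outcome is what the deploy result would report instead of raising.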
Suggested API surface
Add a grant_rbac: bool = True flag and an optional rbac_credential: TokenCredential | None = None (defaults to the same credential used for the project client). New module constant:
(Confirm the exact role definition GUID with az role definition list --name "Foundry User" during implementation; do not hard-code an unverified GUID.)
CLI flag wiring
Expose --no-grant-rbac on scripts/ops/deploy_hosted_agent.py for CI flows that already delegate RBAC, and --rbac-scope <arm-id> for projects whose scope cannot be inferred from the endpoint.
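One possible way to wire these two flags with argparse is shown here. It is only a sketch: the build_parser helper and the help text are illustrative, and the real script's existing arguments are not reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the two new RBAC flags."""
    parser = argparse.ArgumentParser(prog="deploy_hosted_agent.py")
    parser.add_argument(
        "--no-grant-rbac",
        dest="grant_rbac",
        action="store_false",  # grant_rbac defaults to True when the flag is absent
        help="Skip the Foundry User role grants (for CI flows that delegate RBAC).",
    )
    parser.add_argument(
        "--rbac-scope",
        metavar="ARM_ID",
        default=None,
        help="Explicit ARM scope when it cannot be inferred from the project endpoint.",
    )
    return parser
```

Using store_false with dest="grant_rbac" keeps the CLI default aligned with the grant_rbac: bool = True default proposed for the library function.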
Acceptance criteria
deploy_hosted_agent_version performs the two role grants (instance MI + blueprint MI) by default after _resolve_latest_version.
Grant is idempotent (deterministic assignment name; HTTP 409 treated as success).
On HTTP 403 from the role-assignment API, a warning is logged and HostedAgentDeploymentResult.extras["rbac_skipped"] = True; the deploy still returns the version result.
grant_rbac=False / --no-grant-rbac flag fully skips the section and emits a structured log noting the skip.
Integration test (live, opt-in via env var) deploys an agent, asserts the version is active, and asserts an invocation against the public endpoint returns HTTP 200 with Foundry storage POST ... -> 201 in the App Insights trace.
scripts/ops/deploy_hosted_agent.py CLI help documents the new flags.
Risks and dependencies
Risk: Hard-coding the Foundry User role definition GUID. Mitigation: look it up dynamically once on first call and cache per-process; record the resolved GUID in the structured log.
Risk: azure-mgmt-authorization adds a transitive dependency. Mitigation: it is already in the dependency closure of azure-identity / azure-mgmt-core; verify and add an explicit pin in lib/pyproject.toml.
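Risk: Calling-identity RBAC mismatch in CI. Mitigation: the fail-soft behaviour above + the --no-grant-rbac flag let workflows opt out cleanly.
Dependency: the _pick_latest_version + _normalize_status patches, which this issue's implementation will build on.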
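The "look it up once and cache per-process" mitigation for the role GUID could look something like this. The make_cached_role_lookup helper is hypothetical, and the lookup callable stands in for whatever query resolves the role name to its definition ID; no real GUID is assumed.

```python
from functools import lru_cache
from typing import Callable


def make_cached_role_lookup(lookup: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a role-name lookup so each name is resolved at most once per process."""

    @lru_cache(maxsize=None)
    def cached(role_name: str) -> str:
        # The first call for a given name hits `lookup`; later calls reuse the result.
        return lookup(role_name)

    return cached
```

Because lru_cache does not cache raised exceptions, a failed lookup is retried on the next call instead of poisoning the cache.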
ADR impact
No ADR is invalidated. This implements the operator-facing contract for ADR-046 (Foundry V3 hosted-agents). Recommended: add a short addendum to ADR-046 capturing "SDK deploy path auto-grants Foundry User on per-version MI; manual azd flow continues to rely on azd auth permissions".
Labels
priority: high
type:backend
type:devops
foundry
BPMN process
%%{init: {'theme':'base', 'themeVariables': {
'primaryColor':'#FFB3BA',
'primaryTextColor':'#000',
'primaryBorderColor':'#FF8B94',
'lineColor':'#BAE1FF',
'secondaryColor':'#BAE1FF',
'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
A[Confirm Foundry User Role GUID] --> B[Design RBAC Step]
B --> C[Implement on Issue Branch]
C --> D[Unit + Integration Tests]
D --> E[Open PR]
E --> F[Validation and Fixes]
F --> G[Merge to Main]
G --> H[Update Runbook + Close Issue]
Filed as a follow-up to PR #1103. The manual grant described here was applied during the pilot and proved the design works; this issue tracks moving the grant into the SDK path so future hosted-agent deployments don't require operator RBAC steps.