Retry transient ARM DeploymentNotFound (404) on subscription/resource-group deployments#8551
Copilot wants to merge 2 commits into
Co-authored-by: JeffreyCA <9157833+JeffreyCA@users.noreply.github.com>
/azp run azure-dev - cli
Azure Pipelines successfully started running 1 pipeline(s).
Intermittent `DeploymentNotFound` failures on `azd up`/`provision` and `azd down` for subscription-scoped Bicep deployments (commonly with `infra.layers`), even though ARM ultimately reports the deployment as `Succeeded` (matching `correlationId`/`TraceID`).

Root cause
Both failing paths go through `StandardDeployments.DeployToSubscription` → `BeginCreateOrUpdateAtSubscriptionScope` + `PollUntilDone` (the `azd down` void path deploys an empty template through the same method). ARM can briefly return HTTP 404 for a just-submitted subscription-scoped deployment (read-after-write inconsistency). The Azure SDK does not treat 404 as retryable, so the poller fails terminally even though the deployment succeeds. Running multiple sequential subscription-scoped deployments per `infra.layers` run makes this low-probability race more visible.

Changes
- `pkg/azapi/standard_deployments.go`: `withDeploymentRetry(ctx)`, which augments the default ARM-retryable status codes with `404`, scoped to the deploy operation and propagated to the LRO poller.
- Applied to `DeployToSubscription` and `DeployToResourceGroup`. Not applied to `Get*Deployment`, so genuine not-found checks stay fast. Bounded retries (5 attempts, 3s→15s backoff) ensure a truly missing deployment still surfaces.
- `pkg/azsdk/zip_deploy_client.go`.
- `pkg/azapi/standard_deployments_coverage3_test.go`: transient 404 retried → succeeds (subscription + resource group); persistent 404 still fails after retries.
- `CHANGELOG.md`: bug-fix entry.

Notes for reviewers
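As a reviewer aid, the behavior listed under Changes (a transient 404 is retried and succeeds, a persistent 404 still fails after a bounded number of attempts) can be illustrated with a small standard-library sketch. This is not the code in this PR, which relies on the Azure SDK's retry policy; `pollWithRetry`, `fakeARM`, and `simulate` are hypothetical names, and the fixed millisecond delay stands in for the real backoff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// retryableStatus is the set of status codes this sketch retries: a common
// transient set plus 404, the addition described in this PR.
var retryableStatus = map[int]bool{
	http.StatusRequestTimeout:      true,
	http.StatusTooManyRequests:     true,
	http.StatusInternalServerError: true,
	http.StatusBadGateway:          true,
	http.StatusServiceUnavailable:  true,
	http.StatusGatewayTimeout:      true,
	http.StatusNotFound:            true, // transient DeploymentNotFound
}

var errGaveUp = errors.New("retries exhausted")

// pollWithRetry (hypothetical) calls poll at most maxAttempts times and waits
// delay between attempts. It returns as soon as poll yields a status that is
// not in retryableStatus, and an error once the attempts are used up.
func pollWithRetry(ctx context.Context, maxAttempts int, delay time.Duration, poll func() int) (int, error) {
	status := 0
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status = poll()
		if !retryableStatus[status] {
			return status, nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return status, ctx.Err()
		case <-time.After(delay):
		}
	}
	return status, fmt.Errorf("%w: last status %d", errGaveUp, status)
}

// fakeARM (hypothetical) answers 404 for the first notFoundCalls polls and
// 200 afterwards, imitating a read-after-write inconsistency window.
func fakeARM(notFoundCalls int) func() int {
	calls := 0
	return func() int {
		calls++
		if calls <= notFoundCalls {
			return http.StatusNotFound
		}
		return http.StatusOK
	}
}

// simulate runs five attempts against a fake service with a tiny delay.
func simulate(notFoundCalls int) (int, error) {
	return pollWithRetry(context.Background(), 5, time.Millisecond, fakeARM(notFoundCalls))
}

func main() {
	fmt.Println(simulate(2))   // transient: 200 <nil>
	fmt.Println(simulate(100)) // persistent: 404 retries exhausted: last status 404
}
```

Keeping the attempt count bounded is what lets a genuinely missing deployment still surface as an error instead of retrying indefinitely.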
The retry window (5 attempts / 3s→15s) was chosen to cover the ~30s consistency window observed in the reporter's traces; it can be tuned if longer windows occur. The real ARM timing is hard to reproduce deterministically, so manual validation by repeating `azd up`/`azd down` on a multi-layer subscription-scoped project is the most meaningful end-to-end check.
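To make the retry-window arithmetic concrete, the sketch below computes a wait schedule under two stated assumptions: the delay doubles from 3s and is capped at 15s, and "5 attempts" means four waits between attempts. The Azure SDK's actual delay calculation may differ (for example by adding jitter), so treat the total as an approximation. `backoffSchedule` and `totalWait` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 3 * time.Second  // first wait, from the "3s" in the notes
	maxDelay  = 15 * time.Second // cap, from the "15s" in the notes
)

// backoffSchedule returns the waits between attempts, doubling from base and
// capping each wait at limit. Plain doubling without jitter is an assumption.
func backoffSchedule(base, limit time.Duration, waits int) []time.Duration {
	out := make([]time.Duration, 0, waits)
	d := base
	for i := 0; i < waits; i++ {
		if d > limit {
			d = limit
		}
		out = append(out, d)
		d *= 2
	}
	return out
}

// totalWait sums a schedule to give the overall time spent waiting.
func totalWait(schedule []time.Duration) time.Duration {
	var sum time.Duration
	for _, d := range schedule {
		sum += d
	}
	return sum
}

func main() {
	s := backoffSchedule(baseDelay, maxDelay, 4) // 5 attempts, 4 waits
	fmt.Println(s, totalWait(s))                 // [3s 6s 12s 15s] 36s
}
```

Under these assumptions the waits add up to 36s, which is consistent with covering the ~30s window mentioned above; raising the attempt count or the cap would extend it.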