Is there an existing issue for this?
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave comments along the lines of "+1", "me too" or "any updates", they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment and review the contribution guide to help.
Terraform Version
1.14.0
AzureRM Provider Version
4.64.0
Affected Resource(s)/Data Source(s)
azurerm_kubernetes_cluster_node_pool
Terraform Configuration Files
# Two AKS clusters in different regions/resource groups with same-named subnets.
# Node pools are created sequentially instead of in parallel.
module "cluster_east" {
  source              = "./modules/aks"
  location            = "eastus"
  resource_group_name = "rg-east"
  vnet_name           = "vnet-east"
  subnet_name         = "aks-nodes" # Same name as cluster_west
  pod_subnet_name     = "aks-pods"  # Same name as cluster_west
  cluster_name        = "aks-east"
  nodepool_name       = "workload"
}

module "cluster_west" {
  source              = "./modules/aks"
  location            = "westus2"
  resource_group_name = "rg-west"
  vnet_name           = "vnet-west"
  subnet_name         = "aks-nodes" # Same name as cluster_east
  pod_subnet_name     = "aks-pods"  # Same name as cluster_east
  cluster_name        = "aks-west"
  nodepool_name       = "workload"
}

# Inside the module: azurerm_kubernetes_cluster_node_pool with vnet_subnet_id and pod_subnet_id
Debug Output/Panic Output
Note: the output below is the expected debug output, reconstructed from the log statements in internal/locks/mutexkv.go rather than captured from an actual run. The infrastructure has already been applied, and re-applying solely to capture debug logs would mean sitting through another full node pool creation.
# TF_LOG=DEBUG terraform apply 2>&1 | grep -E 'Lock|Unlock'
# Shows sequential lock acquisition on the same mutex key despite different subnets:
[DEBUG] provider.terraform-provider-azurerm: Locking "azurerm_subnet.aks-nodes"
[DEBUG] provider.terraform-provider-azurerm: Locked "azurerm_subnet.aks-nodes"
# ... Nodepool A creation takes 10-15 minutes ...
[DEBUG] provider.terraform-provider-azurerm: Unlocking "azurerm_subnet.aks-nodes"
[DEBUG] provider.terraform-provider-azurerm: Unlocked "azurerm_subnet.aks-nodes"
# Only AFTER Nodepool A completes, Nodepool B starts:
[DEBUG] provider.terraform-provider-azurerm: Locking "azurerm_subnet.aks-nodes"
[DEBUG] provider.terraform-provider-azurerm: Locked "azurerm_subnet.aks-nodes"
# ... Nodepool B creation takes 10-15 minutes ...
[DEBUG] provider.terraform-provider-azurerm: Unlocking "azurerm_subnet.aks-nodes"
[DEBUG] provider.terraform-provider-azurerm: Unlocked "azurerm_subnet.aks-nodes"
Expected Behaviour
When two AKS clusters exist in different VNets, different resource groups, and different regions, their nodepool creation operations should run in parallel, even if the subnets happen to share the same name (e.g., aks-nodes). The subnets are completely independent Azure resources with different resource IDs.
Actual Behaviour
Nodepool creation is serialized: the second nodepool blocks on a provider-internal mutex until the first nodepool creation fully completes (including Azure API polling). This adds 10-15 minutes of unnecessary wait time to every terraform apply.
The root cause is that the provider's internal locking mechanism uses only the subnet name (e.g., aks-nodes) as the mutex key, without any qualification by VNet, resource group, subscription, or region. Two completely unrelated subnets that happen to share the same name collide on the same sync.Mutex.
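To illustrate the collision, here is a minimal, self-contained sketch of a string-keyed mutex map in the spirit of internal/locks/mutexkv.go. The type and method names here are assumptions for illustration, not the provider's actual API; the point is only that keying by bare subnet name makes the two clusters share one mutex, while keying by full resource ID does not:

```go
package main

import (
	"fmt"
	"sync"
)

// mutexKV is an illustrative string-keyed mutex map (assumed shape,
// not the provider's actual implementation).
type mutexKV struct {
	mu    sync.Mutex
	store map[string]*sync.Mutex
}

func newMutexKV() *mutexKV {
	return &mutexKV{store: make(map[string]*sync.Mutex)}
}

// get lazily creates and returns the mutex for a given key.
func (m *mutexKV) get(key string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.store[key]; !ok {
		m.store[key] = &sync.Mutex{}
	}
	return m.store[key]
}

func (m *mutexKV) Lock(key string)   { m.get(key).Lock() }
func (m *mutexKV) Unlock(key string) { m.get(key).Unlock() }

func main() {
	kv := newMutexKV()

	// Keying by bare subnet name: both clusters map to the same key,
	// so the second Lock blocks until the first Unlock.
	eastKey := "aks-nodes" // subnet in vnet-east
	westKey := "aks-nodes" // subnet in vnet-west
	fmt.Println("same key:", eastKey == westKey) // true -> serialized

	// Keying by full resource ID: distinct mutexes, no false serialization.
	eastID := "/subscriptions/sub/resourceGroups/rg-east/providers/Microsoft.Network/virtualNetworks/vnet-east/subnets/aks-nodes"
	westID := "/subscriptions/sub/resourceGroups/rg-west/providers/Microsoft.Network/virtualNetworks/vnet-west/subnets/aks-nodes"

	kv.Lock(eastID)
	locked := kv.get(westID).TryLock() // succeeds: different mutex
	fmt.Println("parallel possible:", locked) // true
	if locked {
		kv.Unlock(westID)
	}
	kv.Unlock(eastID)
}
```

With name-only keys, the 10-15 minute Azure polling loop for nodepool A is spent holding the mutex that nodepool B needs, which is exactly the serialization visible in the debug log above.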
Steps to Reproduce
- Create two AKS clusters in different regions/resource groups, each with its own VNet
- Use the same subnet name (e.g., aks-nodes) in both VNets -- this is a very common naming convention
- Add a node pool to each cluster (either via azurerm_kubernetes_cluster_node_pool or via default_node_pool with additional pools)
- Run terraform apply
- Observe in the Azure Portal or the debug logs that the second node pool waits for the first to complete before starting
Important Factoids
No response
References
- azurerm_kubernetes_cluster_node_pool - lock subnet ID instead of subnet name (#26939) -- The exact same fix (ByName -> ByID) was proposed and closed without merge in Sep 2024. It was closed because Azure claimed to have fixed the underlying race condition ([BUG] Create multiple node pools with the same vnet subnet ID throws SetVNetOwnershipFailed, Azure/AKS#4522). However, the race condition re-emerged with pod subnets, leading to locks being re-added in PR #29537 using ByName again.
- azurerm_kubernetes_cluster_node_pool - prevent race by polling pod subnet provisioning state during node pool creation (#29537) -- Re-introduced subnet-name locks (Aug 2025) to fix the pod subnet race condition. This PR used locks.MultipleByName with the subnet name only, re-introducing the false serialization bug.
- azurerm_kubernetes_cluster - remove subnet lock (#27583) -- Removed all subnet locks (Oct 2024) based on Azure's claimed fix, later found to be incomplete.
- azurerm_container_group: fix parallel provision failure given the same network_profile_id (#15098) -- container_group_resource.go precedent for locks.ByID(subnet.ID()).
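Following the locks.ByID precedent from the container group fix, the lock key could be derived from the full, case-normalised subnet resource ID instead of the bare name. The helper below is a hypothetical sketch (subnetLockKey is not a provider function); it only demonstrates that qualified keys for two same-named subnets never collide:

```go
package main

import (
	"fmt"
	"strings"
)

// subnetLockKey is an illustrative helper, not the provider's actual API:
// it derives a mutex key from the full subnet resource ID so the key is
// qualified by subscription, resource group, and VNet.
func subnetLockKey(subnetID string) (string, error) {
	parts := strings.Split(strings.Trim(subnetID, "/"), "/")
	// Expected shape: subscriptions/<sub>/resourceGroups/<rg>/providers/
	// Microsoft.Network/virtualNetworks/<vnet>/subnets/<name>
	if len(parts) != 10 || !strings.EqualFold(parts[6], "virtualNetworks") {
		return "", fmt.Errorf("unexpected subnet ID: %q", subnetID)
	}
	// Azure resource IDs are case-insensitive, so normalise before keying.
	return strings.ToLower(subnetID), nil
}

func main() {
	east := "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-east/providers/Microsoft.Network/virtualNetworks/vnet-east/subnets/aks-nodes"
	west := "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-west/providers/Microsoft.Network/virtualNetworks/vnet-west/subnets/aks-nodes"

	ek, _ := subnetLockKey(east)
	wk, _ := subnetLockKey(west)
	// Same subnet name, but the qualified keys differ -> independent locks.
	fmt.Println("keys collide:", ek == wk) // false
}
```

Since the node pool resource already parses vnet_subnet_id and pod_subnet_id into typed resource IDs, keying the mutex on those IDs (rather than on the name extracted from them) would preserve the race protection within a VNet while restoring parallelism across unrelated VNets.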