azurerm_firewall_policy_rule_collection_group - fix timeout expiring while waiting for policy lock #32081
Conversation
When multiple firewall policy rule collection groups target the same firewall policy, Terraform processes them in parallel but Azure enforces serial processing (responding with `409 AnotherOperationInProgress`). The provider already serializes these operations using `locks.ByName()`, but the timeout context was created before the lock was acquired. This meant time spent waiting for the lock consumed the 30-minute timeout budget, causing later operations to fail with `context.DeadlineExceeded`.

Move lock acquisition before the timeout context creation in both CreateUpdate and Delete so that each operation gets its full timeout budget regardless of how long it waited for the lock.
|
Thanks for the PR, however, the timeouts should apply to the create/update operation, not to a specific API call. If you need to increase the timeout for your resource specifically, see https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/firewall_policy_rule_collection_group#timeouts |
|
@mstroob I see your point. However, Azure does not allow concurrent updates to firewall resources, yet the azurerm provider tries to concurrently create or update rule collection groups on the same firewall policy, for example. Lately the updates for rule collection groups have been taking longer and longer, causing us to hit the timeout. The current locking mechanism seems like a workaround for this. Because the timeout context is created before the lock is acquired, the timeout effectively covers all preceding rule collection group updates as well. Increasing the timeout is not a real fix for the problem. We now often see updates taking 10–15 minutes, so a policy with, say, 10 rule collection groups that all require an update would force me to set a 150-minute timeout to work around the fact that the provider attempts concurrent operations on a resource that does not allow them. My PR was an attempt to "fix" this, but it would indeed still cause create/update operations to "hang", now without a timeout. Do you see another way to fix this? The real fix, in my opinion, would be for the provider not to attempt parallel create/update/delete operations on the same firewall policy ID to begin with. In that case the lock here could be removed as well, because the locking would be done at a higher level. |
|
@mstroob here is an alternate implementation, in draft PR #32094. Though technically very similar, the key difference is this: the timeout still covers the full API operation (create/update/delete + polling), it just starts after the mutex is acquired, so lock contention across concurrent rule collection groups on the same policy doesn't eat into the timeout budget. |
|
Hi @sanderaernouts I would suggest checking the issues to see if there is a similar one, or opening a new one and including the required details, especially a way to reproduce the issue, as there may be another underlying problem that causes the very slow updates that could be fixed, rather than just removing the timeout. |
Summary
Move lock acquisition before timeout context creation in the `azurerm_firewall_policy_rule_collection_group` CreateUpdate and Delete functions.

Problem

We recently started seeing timeouts when deploying multiple `azurerm_firewall_policy_rule_collection_group` resources that target the same firewall policy. Terraform processes them in parallel, but Azure enforces serial processing on the same firewall policy, responding with `409 AnotherOperationInProgress` for concurrent requests.

The provider already serializes these operations using `locks.ByName()`, which prevents the 409 errors. However, the timeout context (`context.WithTimeout` with a 30-minute deadline) was created before the lock was acquired. When N goroutines compete for the same lock, all N timers start simultaneously. Goroutines that wait for the lock have their timeout budget consumed by wait time, eventually causing `context.DeadlineExceeded` errors that manifest as operation timeouts.

Fix
Move lock acquisition before the timeout context creation, so each operation gets its full timeout regardless of how long it waited for the lock:
Applied to both `resourceFirewallPolicyRuleCollectionGroupCreateUpdate` and `resourceFirewallPolicyRuleCollectionGroupDelete`.

Side effects

- The existing read (`client.Get`) in CreateUpdate is now inside the lock scope

Note
All 6 firewall resource files have this same timeout-before-lock pattern. This PR only fixes `firewall_policy_rule_collection_group_resource.go`. The same fix can be applied to the other resources separately if needed.