| 1 | +--- |
| 2 | +title: Troubleshooting Reconciliation Issues |
| 3 | +weight: 3 |
| 4 | +--- |
| 5 | +# Troubleshooting Reconciliation Issues |
| 6 | + |
| 7 | +This document provides guidance for troubleshooting common reconciliation and performance issues with the Terraform provider, based on real-world scenarios and their resolutions. |
| 8 | + |
| 9 | +## Reconciliations Blocking Unexpectedly |
| 10 | + |
| 11 | +### Problem Description |
| 12 | + |
| 13 | +Reconciliations can back up behind long-running `terraform apply` or `terraform destroy` operations, so that workspaces appear "stuck" even when CPU resources are available. This manifests as: |
| 14 | + |
| 15 | +- Reconciliations not making progress despite available CPU capacity |
| 16 | +- Long queues of pending reconciliations |
| 17 | +- Underutilization of configured `--max-reconcile-rate` settings |
| 18 | + |
| 19 | +### Root Cause |
| 20 | + |
| 21 | +This issue is caused by the locking mechanism the provider uses to prevent corruption of the Terraform plugin cache: |
| 22 | + |
| 23 | +1. **Read-Write Lock Behavior**: The provider guards workspace operations with a read-write mutex (`RWMutex`): |
| 24 | +   - Multiple `terraform plan` operations can run concurrently (each holds an RLock) |
| 25 | +   - Only one `terraform init` operation can run at a time (it holds the exclusive Lock) |
| 26 | +   - A pending Lock request blocks all new RLock requests until the Lock has been acquired and released |
| 27 | + |
| 28 | +2. **Blocking Scenario**: When a new Workspace is created and requires `terraform init`: |
| 29 | +   - The Lock request waits for all in-flight RLock (plan) operations to finish |
| 30 | +   - Meanwhile, all new RLock requests are blocked behind the pending Lock |
| 31 | +   - The provider is effectively single-threaded until the init completes |
| 32 | + |
| 33 | +### Solutions and Workarounds |
| 34 | + |
| 35 | +#### 1. Use Persistent Storage (Recommended) |
| 36 | + |
| 37 | +Mount a persistent volume at `/tf` so that workspaces survive pod restarts and `terraform init` runs far less often: |
| 38 | + |
| 39 | +```yaml |
| 40 | +apiVersion: pkg.crossplane.io/v1beta1 |
| 41 | +kind: DeploymentRuntimeConfig |
| 42 | +metadata: |
| 43 | +  name: provider-terraform-with-pv |
| 44 | +spec: |
| 45 | +  deploymentTemplate: |
| 46 | +    spec: |
| 47 | +      template: |
| 48 | +        spec: |
| 49 | +          containers: |
| 50 | +            - name: package-runtime |
| 51 | +              volumeMounts: |
| 52 | +                - name: tf-workspace |
| 53 | +                  mountPath: /tf |
| 54 | +          volumes: |
| 55 | +            - name: tf-workspace |
| 56 | +              persistentVolumeClaim: |
| 57 | +                claimName: provider-terraform-pvc |
| 58 | +``` |
| 59 | + |
| 60 | +**Benefits:** |
| 61 | +- Workspaces persist across pod restarts |
| 62 | +- Eliminates need to re-run `terraform init` on restart |
| 63 | +- Reduces plugin download traffic |
| 64 | +- Significantly improves performance with many workspaces |
| 65 | + |
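The `DeploymentRuntimeConfig` above refers to a claim named `provider-terraform-pvc` and only takes effect once a Provider points at it. The following is a minimal sketch of both pieces, not a definitive manifest. The namespace is taken from the monitoring queries in this document. The storage size, access mode, Provider name, and the `vX.Y.Z` version tag are assumptions to adapt to your environment.

```yaml
# Assumed PVC backing the /tf mount; size and access mode are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: provider-terraform-pvc        # must match claimName in the runtime config
  namespace: crossplane-system        # namespace where the provider pod runs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                    # assumption: size for workspaces and plugins
---
# Provider that opts in to the runtime config by name.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-terraform    # hypothetical name
spec:
  package: xpkg.upbound.io/upbound/provider-terraform:vX.Y.Z  # placeholder version
  runtimeConfigRef:
    name: provider-terraform-with-pv
```

A `ReadWriteOnce` claim can be mounted on one node at a time, which fits a single provider pod. Running more than one replica would need a different access mode.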
| 66 | +#### 2. Disable Plugin Cache for High Concurrency |
| 67 | + |
| 68 | +If persistent storage is not available, consider disabling the Terraform plugin cache to avoid the locking entirely: |
| 69 | + |
| 70 | +```yaml |
| 71 | +apiVersion: pkg.crossplane.io/v1alpha1 |
| 72 | +kind: ControllerConfig |
| 73 | +metadata: |
| 74 | +  name: provider-terraform-no-cache |
| 75 | +spec: |
| 76 | +  args: |
| 77 | +    - --debug |
| 78 | +  env: |
| 79 | +    - name: TF_PLUGIN_CACHE_DIR |
| 80 | +      value: "" |
| 81 | +``` |
| 82 | + |
| 83 | +**Trade-offs:** |
| 84 | +- Eliminates blocking issues |
| 85 | +- Increases network traffic (providers downloaded per workspace) |
| 86 | +- Higher NAT gateway costs in cloud environments |
| 87 | +- Usually still faster than reconciliations serialized behind `terraform init` |
| 88 | + |
| 89 | +#### 3. Optimize Concurrency Settings |
| 90 | + |
| 91 | +Align your `--max-reconcile-rate` with available CPU resources: |
| 92 | + |
| 93 | +```yaml |
| 94 | +apiVersion: pkg.crossplane.io/v1alpha1 |
| 95 | +kind: ControllerConfig |
| 96 | +metadata: |
| 97 | +  name: provider-terraform-optimized |
| 98 | +spec: |
| 99 | +  args: |
| 100 | +    - --max-reconcile-rate=4 # Match your CPU allocation |
| 101 | +  resources: |
| 102 | +    requests: |
| 103 | +      cpu: 4 |
| 104 | +    limits: |
| 105 | +      cpu: 4 |
| 106 | +``` |
| 107 | + |
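`ControllerConfig` is deprecated in recent Crossplane releases in favor of `DeploymentRuntimeConfig`. The same concurrency settings can be sketched in the newer API as shown below. This is an assumed equivalent rather than a tested manifest. The empty `selector` follows the pattern used in Crossplane's published examples, and the resource name is reused from the example above for illustration only.

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-optimized
spec:
  deploymentTemplate:
    spec:
      selector: {}                       # left empty; assumed to be filled in by Crossplane
      template:
        spec:
          containers:
            - name: package-runtime      # the provider's runtime container
              args:
                - --max-reconcile-rate=4 # match your CPU allocation
              resources:
                requests:
                  cpu: "4"
                limits:
                  cpu: "4"
```

A Provider selects this configuration through `spec.runtimeConfigRef.name`, whereas a `ControllerConfig` is selected through `spec.controllerConfigRef.name`.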
| 108 | +### Monitoring and Diagnosis |
| 109 | + |
| 110 | +Use these Prometheus queries to monitor reconciliation performance: |
| 111 | + |
| 112 | +```promql |
| 113 | +# Maximum concurrent reconciles configured |
| 114 | +sum by (controller)(controller_runtime_max_concurrent_reconciles{controller="managed/workspace.tf.upbound.io"}) |
| 115 | + |
| 116 | +# Active workers currently processing |
| 117 | +sum by (controller)(controller_runtime_active_workers{controller="managed/workspace.tf.upbound.io"}) |
| 118 | + |
| 119 | +# Reconciliation rate |
| 120 | +sum by (controller)(rate(controller_runtime_reconcile_total{controller="managed/workspace.tf.upbound.io"}[5m])) |
| 121 | + |
| 122 | +# CPU usage |
| 123 | +sum by ()(rate(container_cpu_usage_seconds_total{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"}[5m])) |
| 124 | + |
| 125 | +# Memory usage |
| 126 | +sum by ()(container_memory_working_set_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"}) |
| 127 | +``` |
| 128 | + |
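The queries above can be combined into an alert for the "stuck" symptom: workers are busy, but no reconciliation has completed recently. The following is a rough sketch that assumes the Prometheus Operator is installed (it provides the `PrometheusRule` kind). The rule name, alert name, `for` duration, and severity label are hypothetical and should be tuned to how long your longest `terraform apply` normally takes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: provider-terraform-reconcile     # hypothetical name
  namespace: crossplane-system
spec:
  groups:
    - name: provider-terraform
      rules:
        - alert: TerraformReconcilesStalled   # hypothetical alert name
          # Workers are active, yet no reconcile has finished in the last 5m.
          expr: |
            sum by (controller)(controller_runtime_active_workers{controller="managed/workspace.tf.upbound.io"}) > 0
            and
            sum by (controller)(rate(controller_runtime_reconcile_total{controller="managed/workspace.tf.upbound.io"}[5m])) == 0
          for: 15m                            # assumption: tune to your workloads
          labels:
            severity: warning
          annotations:
            summary: Workspace reconciliations appear blocked
```

Both sides of the `and` are aggregated by `controller`, so the label sets match and the alert fires per controller.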
| 129 | +## Remote Git Repository Issues |
| 130 | + |
| 131 | +### Problem Description |
| 132 | + |
| 133 | +When using remote git repositories as workspace sources, you may experience: |
| 134 | + |
| 135 | +- Excessive network traffic |
| 136 | +- Providers being re-downloaded on every reconciliation |
| 137 | +- "text file busy" errors even with persistent volumes |
| 138 | + |
| 139 | +### Root Cause |
| 140 | + |
| 141 | +When a workspace uses a remote repository, the provider removes and recreates the entire workspace directory on every reconciliation because of limitations in the go-getter library. |
| 142 | + |
| 143 | +### Current Limitations |
| 144 | + |
| 145 | +- Remote repositories are re-cloned on every reconciliation |
| 146 | +- `terraform init` runs on every reconciliation for git-backed workspaces |
| 147 | +- Plugin cache conflicts can still occur during rapid workspace creation |
| 148 | + |
| 149 | +### Recommendations |
| 150 | + |
| 151 | +1. **Use Inline Workspaces**: When possible, embed the Terraform configuration directly in the Workspace spec rather than referencing a remote repository. |
| 152 | +2. **Disable Plugin Cache**: For remote repositories with high reconciliation rates, disable the plugin cache to avoid conflicts. |
| 153 | +3. **Monitor Traffic Costs**: Be aware of increased network egress costs when using remote repositories with disabled plugin cache. |
| 154 | + |
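As an illustration of the first recommendation, an inline Workspace carries its Terraform configuration in `spec.forProvider.module`, so there is no repository to clone on each reconciliation. The sketch below is a minimal example. The resource name, the `default` ProviderConfig, and the module contents are placeholders chosen for illustration.

```yaml
apiVersion: tf.upbound.io/v1beta1
kind: Workspace
metadata:
  name: example-inline            # hypothetical name
spec:
  providerConfigRef:
    name: default                 # assumes a ProviderConfig named "default" exists
  forProvider:
    source: Inline                # configuration is embedded, not fetched remotely
    module: |
      # Placeholder Terraform configuration
      resource "random_id" "example" {
        byte_length = 4
      }

      output "id" {
        value = random_id.example.hex
      }
```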
| 155 | +## Error Messages and Recovery |
| 156 | + |
| 157 | +### "text file busy" Errors |
| 158 | + |
| 159 | +``` |
| 160 | +Error: Failed to install provider |
| 161 | +Error while installing hashicorp/aws v5.44.0: open |
| 162 | +/tf/plugin-cache/registry.terraform.io/hashicorp/aws/5.44.0/linux_arm64/terraform-provider-aws_v5.44.0_x5: |
| 163 | +text file busy |
| 164 | +``` |
| 165 | + |
| 166 | +**Resolution**: These errors typically resolve on their own through the built-in retry logic, but they indicate plugin cache conflicts. Consider: |
| 167 | +- Using persistent volumes with the plugin cache disabled |
| 168 | +- Reducing `--max-reconcile-rate` during initial workspace creation |
| 169 | + |
| 170 | +### CLI Configuration Warnings |
| 171 | + |
| 172 | +``` |
| 173 | +Warning: Unable to open CLI configuration file |
| 174 | +The CLI configuration file at "./.terraformrc" does not exist. |
| 175 | +``` |
| 176 | + |
| 177 | +**Resolution**: This is typically harmless but can be resolved by: |
| 178 | +- Mounting a custom `.terraformrc` configuration |
| 179 | +- Setting the relevant Terraform CLI environment variables |
| 180 | + |
| 181 | +## Best Practices |
| 182 | + |
| 183 | +1. **Start Conservative**: Begin with `--max-reconcile-rate=1` and increase gradually while monitoring performance. |
| 184 | +2. **Match Resources**: Ensure CPU requests/limits align with your concurrency settings. |
| 185 | +3. **Use Persistent Storage**: Always use persistent volumes in production environments with multiple workspaces. |
| 186 | +4. **Monitor Actively**: Set up monitoring for reconciliation rates, error rates, and resource utilization. |
| 187 | +5. **Plan for Scale**: Consider the total number of workspaces and their reconciliation patterns when designing your deployment. |