Commit 39cc9ea

docs: Add troubleshooting docs
1 parent d08ab7b commit 39cc9ea

1 file changed: +187 -0 lines changed

docs/monolith/Troubleshooting.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
---
title: Troubleshooting Reconciliation Issues
weight: 3
---
# Troubleshooting Reconciliation Issues

This document provides guidance for troubleshooting common reconciliation and performance issues with the Terraform provider, based on real-world scenarios and their resolutions.

## Reconciliations Blocking Unexpectedly

### Problem Description

Reconciliations can back up behind long-running `terraform apply` or `terraform destroy` operations, and workspaces appear to get "stuck" even when CPU resources are available. This manifests as:

- Reconciliations not making progress despite available CPU capacity
- Long queues of pending reconciliations
- Underutilization of configured `--max-reconcile-rate` settings

### Root Cause

This issue is caused by the locking mechanism the provider uses to prevent Terraform plugin cache corruption:

1. **Read-Write Lock Behavior**: The provider uses an `RWMutex` for workspace operations:
   - Multiple `terraform plan` operations can run concurrently (`RLock`)
   - Only one `terraform init` operation can run at a time (`Lock`)
   - Once a `Lock` is requested, all new `RLock` requests block until that `Lock` has been acquired and released

2. **Blocking Scenario**: When a new Workspace is created and requires `terraform init`:
   - The `Lock` request waits for all current `RLock` (plan) operations to finish
   - Meanwhile, all new `RLock` requests are blocked
   - This effectively makes the provider single-threaded until the init completes
33+
### Solutions and Workarounds
34+
35+
#### 1. Use Persistent Storage (Recommended)
36+
37+
Mount a persistent volume to `/tf` to eliminate the need for frequent `terraform init` operations:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-terraform-with-pv
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime
              volumeMounts:
                - name: tf-workspace
                  mountPath: /tf
          volumes:
            - name: tf-workspace
              persistentVolumeClaim:
                claimName: provider-terraform-pvc
```

**Benefits:**
- Workspaces persist across pod restarts
- Eliminates the need to re-run `terraform init` on restart
- Reduces plugin download traffic
- Significantly improves performance with many workspaces

#### 2. Disable Plugin Cache for High Concurrency

If persistent storage is not available, consider disabling the Terraform plugin cache to avoid locking entirely:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-terraform-no-cache
spec:
  args:
    - --debug
  env:
    - name: TF_PLUGIN_CACHE_DIR
      value: ""
```

**Trade-offs:**
- Eliminates blocking issues
- Increases network traffic (providers downloaded per workspace)
- Higher NAT gateway costs in cloud environments
- Still better than single-threaded performance

#### 3. Optimize Concurrency Settings

Align your `--max-reconcile-rate` with available CPU resources:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: provider-terraform-optimized
spec:
  args:
    - --max-reconcile-rate=4 # Match your CPU allocation
  resources:
    requests:
      cpu: 4
    limits:
      cpu: 4
```

### Monitoring and Diagnosis

Use these Prometheus queries to monitor reconciliation performance:

```promql
# Maximum concurrent reconciles configured
sum by (controller)(controller_runtime_max_concurrent_reconciles{controller="managed/workspace.tf.upbound.io"})

# Active workers currently processing
sum by (controller)(controller_runtime_active_workers{controller="managed/workspace.tf.upbound.io"})

# Reconciliation rate
sum by (controller)(rate(controller_runtime_reconcile_total{controller="managed/workspace.tf.upbound.io"}[5m]))

# CPU usage
sum by ()(rate(container_cpu_usage_seconds_total{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"}[5m]))

# Memory usage
sum by ()(container_memory_working_set_bytes{container!="",namespace="crossplane-system",pod=~"upbound-provider-terraform.*"})
```

## Remote Git Repository Issues

### Problem Description

When using remote Git repositories as workspace sources, you may experience:

- Excessive network traffic
- Providers being re-downloaded on every reconciliation
- "text file busy" errors even with persistent volumes

### Root Cause

When a workspace uses a remote repository, the provider removes and recreates the entire workspace directory on every reconciliation, due to limitations in the go-getter library.
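
To see why removing and recreating the directory forces a fresh `terraform init`, consider this minimal sketch. It assumes the usual Terraform layout, in which `terraform init` stores its results in a `.terraform` directory inside the working directory. The helpers `recreateWorkspace`, `initialized`, and `initSurvivesRecreate` are hypothetical names for illustration and do not reproduce the provider's or go-getter's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// recreateWorkspace is a hypothetical stand-in for the remove-and-recreate
// step: it deletes the workspace directory and creates it again, empty.
func recreateWorkspace(dir string) error {
	if err := os.RemoveAll(dir); err != nil {
		return err
	}
	return os.MkdirAll(dir, 0o755)
}

// initialized reports whether dir contains a .terraform directory.
func initialized(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, ".terraform"))
	return err == nil && info.IsDir()
}

// initSurvivesRecreate builds a throwaway workspace that looks initialized,
// recreates it, and reports the initialized state before and after.
func initSurvivesRecreate() (bool, bool, error) {
	root, err := os.MkdirTemp("", "workspace-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	dir := filepath.Join(root, "ws")
	if err := os.MkdirAll(filepath.Join(dir, ".terraform"), 0o755); err != nil {
		return false, false, err
	}
	before := initialized(dir)

	if err := recreateWorkspace(dir); err != nil {
		return false, false, err
	}
	return before, initialized(dir), nil
}

func main() {
	before, after, err := initSurvivesRecreate()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("initialized before:", before, "after:", after)
}
```

Because the init state lives inside the directory that gets removed, a persistent volume alone cannot preserve it for Git-backed workspaces.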

### Current Limitations

- Remote repositories are re-cloned on every reconciliation
- `terraform init` runs on every reconciliation for Git-backed workspaces
- Plugin cache conflicts can still occur during rapid workspace creation

### Recommendations

1. **Use Inline Workspaces**: When possible, embed the Terraform configuration directly in the Workspace spec rather than referencing remote repositories.
2. **Disable Plugin Cache**: For remote repositories with high reconciliation rates, disable the plugin cache to avoid conflicts.
3. **Monitor Traffic Costs**: Be aware of increased network egress costs when using remote repositories with the plugin cache disabled.

## Error Messages and Recovery

### "text file busy" Errors

```
Error: Failed to install provider
Error while installing hashicorp/aws v5.44.0: open
/tf/plugin-cache/registry.terraform.io/hashicorp/aws/5.44.0/linux_arm64/terraform-provider-aws_v5.44.0_x5:
text file busy
```

**Resolution**: These errors typically resolve on their own thanks to built-in retry logic, but they indicate plugin cache conflicts. Consider:
- Using persistent volumes with plugin cache disabled
- Reducing `--max-reconcile-rate` during initial workspace creation

### CLI Configuration Warnings

```
Warning: Unable to open CLI configuration file
The CLI configuration file at "./.terraformrc" does not exist.
```

**Resolution**: This warning is typically harmless, but it can be resolved by:
- Mounting a custom `.terraformrc` configuration
- Setting appropriate Terraform CLI environment variables

## Best Practices

1. **Start Conservative**: Begin with `--max-reconcile-rate=1` and increase gradually while monitoring performance.
2. **Match Resources**: Ensure CPU requests/limits align with your concurrency settings.
3. **Use Persistent Storage**: Always use persistent volumes in production environments with multiple workspaces.
4. **Monitor Actively**: Set up monitoring for reconciliation rates, error rates, and resource utilization.
5. **Plan for Scale**: Consider the total number of workspaces and their reconciliation patterns when designing your deployment.
