Commit 7f673a6

Add investigation documentation for canary + machine checks timeout
Documents detailed investigation into why TestFlyDeploy_DeployMachinesCheckCanary times out after 15+ minutes in CI.

Key findings:
- Issue is canary-specific, not BuildKit-specific
- Likely cause: curl in test machine has no timeout
- When test machine can't connect, it hangs indefinitely
- Test machine won't auto-destroy until curl exits

Includes:
- Detailed code flow analysis
- Hypothesis of root cause
- Recommended short, medium, and long-term fixes
- Reproduction steps
- Code locations for future investigation
1 parent f693d01 commit 7f673a6

1 file changed

+141
-0
lines changed

@@ -0,0 +1,141 @@
# Canary + Machine Checks Timeout Investigation

## Problem Statement

Tests that use the canary deployment strategy with machine checks time out after 15+ minutes in CI:
- `TestFlyDeploy_DeployMachinesCheckCanary` - TIMES OUT
- `TestFlyDeploy_CreateBuilderWDeployToken` - TIMES OUT (suspected)

Similar tests WITHOUT the canary strategy pass in ~60 seconds:
- `TestFlyDeploy_DeployMachinesCheck` - PASSES in ~60s

## Test Scenario

Both failing tests follow this pattern:
1. `fly launch --strategy canary` - Creates app + deploys (1 machine, no machine checks yet)
2. Add `[[http_service.machine_checks]]` to fly.toml
3. `fly deploy --buildkit --remote-only` - **HANGS HERE**

## Code Flow Analysis

### Normal Deploy with Machine Checks (Rolling Strategy)
```
deploy.go

Build image with BuildKit

Update machines (rolling)

For each machine update:
  - Run machine checks (test machines)
  - Wait for checks to pass

Done (~ 60 seconds)
```

### Canary Deploy with Machine Checks
```
deploy.go

Build image with BuildKit

deployCanaryMachines() [machines_deploymachinesapp.go:262]
  ├─ Create temporary canary machine (nginx)
  ├─ runTestMachines() [machinebasedtest.go:44]
  │   ├─ createTestMachine() - Create curl container
  │   ├─ Wait for test machine to START (5min timeout)
  │   ├─ Wait for test machine to be DESTROYED (5min timeout)
  │   └─ Check exit code
  └─ Destroy temporary canary machine

Actual canary rollout
  ├─ Update first production machine
  └─ Update rest of machines

Done (SHOULD be ~2-3 minutes, but HANGS for 15+ minutes)
```

## Hypothesis

The hang occurs during `deployCanaryMachines()`, specifically in `runTestMachines()`.

Possible causes:

### 1. Test Machine Never Starts
- Test machine (curl container) fails to create or start
- But there's a 5-minute timeout for the START state
- Should return an error, not hang indefinitely

### 2. Test Machine Never Auto-Destroys
- Test machines are configured with `AutoDestroy: true`
- They should auto-destroy after running the curl command
- But there's a 5-minute timeout for the DESTROYED state
- Should return an error, not hang indefinitely

### 3. Network Routing Issue
- Test machine can't reach the canary machine's private IP
- Curl hangs indefinitely (no timeout in the curl command)
- Test machine never exits
- **This could explain the indefinite hang!**

### 4. Private IP Not Populated
- Canary machine's `PrivateIP` field is empty/null
- Test machine gets `FLY_TEST_MACHINE_IP=""`
- Curl tries to connect to an invalid address
- Hangs or fails in an unexpected way
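
Cause 4 can be ruled in or out cheaply by validating the address before a test machine is ever created. With an empty variable, the check URL `http://[$FLY_TEST_MACHINE_IP]:80` expands to `http://[]:80`. The sketch below is a minimal illustration in Go, not code from flyctl: `checkURL` is a hypothetical helper, and the sample address is made up.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strconv"
)

// checkURL is a hypothetical helper (not part of flyctl). It refuses to
// build a probe URL from an empty or unparsable address, so a missing
// private IP surfaces as an immediate error instead of a stuck probe.
func checkURL(privateIP string, port uint16) (string, error) {
	addr, err := netip.ParseAddr(privateIP)
	if err != nil {
		return "", fmt.Errorf("invalid test machine IP %q: %w", privateIP, err)
	}
	// JoinHostPort adds the brackets that IPv6 literals need in a URL.
	hostPort := net.JoinHostPort(addr.String(), strconv.Itoa(int(port)))
	return "http://" + hostPort, nil
}

func main() {
	// The first address is an invented example; the second mimics an
	// unpopulated PrivateIP field.
	for _, ip := range []string{"fdaa:0:1::3", ""} {
		url, err := checkURL(ip, 80)
		fmt.Printf("ip=%q url=%q err=%v\n", ip, url, err)
	}
}
```

Failing fast here would turn an empty `PrivateIP` into a clear deploy error rather than leaving it to curl's handling of a malformed URL.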

## Most Likely Cause: Curl Has No Timeout

The test machine runs:
```bash
curl http://[$FLY_TEST_MACHINE_IP]:80
```

If the curl command can't connect (network issue, wrong IP, firewall, etc.), it will hang until one of the following happens:
- curl gives up on its own (no overall time limit is set by default, so this can take several minutes or never happen)
- The machine is killed externally

The test machine won't auto-destroy until the command exits.

## Recommended Fixes

### Short-term: Add Timeout to Curl
Modify the test to include a timeout:
```toml
[[http_service.machine_checks]]
image = "curlimages/curl"
entrypoint = ["/bin/sh", "-c"]
command = ["curl --max-time 30 --connect-timeout 10 http://[$FLY_TEST_MACHINE_IP]:80"]
```

### Medium-term: Add Overall Timeout to Test Machine Execution
In `machinebasedtest.go`, add a context timeout around the entire test machine execution.

### Long-term: Investigate Why Curl Can't Connect in Canary Scenario
- Check if temporary canary machines have different network configuration
- Verify PrivateIP is populated correctly
- Check if machine-to-machine connectivity works in CI environment
- Look for differences between temporary canary machines and regular machines

## Reproduction Steps

To reproduce locally:
```bash
cd test/preflight
# Enable the commented-out test
# Run just that test
go test -tags=integration -v -timeout=20m -run TestFlyDeploy_DeployMachinesCheckCanary .
```

## Related Code Locations

- `internal/command/deploy/machines_deploymachinesapp.go:260-320` - deployCanaryMachines()
- `internal/command/deploy/machinebasedtest.go:44-150` - runTestMachines()
- `internal/appconfig/machines.go:108-220` - ToTestMachineConfig()
- `test/preflight/fly_deploy_test.go:394-417` - The failing test

## Timeline

- 2025-12-17: Issue discovered during CI runs on use-buildkit branch
- 2025-12-17: Tests commented out to unblock CI
- Investigation documented in this file
