|
| 1 | +# Canary + Machine Checks Timeout Investigation |
| 2 | + |
| 3 | +## Problem Statement |
| 4 | + |
| 5 | +Tests that use the canary deployment strategy with machine checks time out after 15+ minutes in CI:
| 6 | +- `TestFlyDeploy_DeployMachinesCheckCanary` - TIMES OUT |
| 7 | +- `TestFlyDeploy_CreateBuilderWDeployToken` - TIMES OUT (suspected) |
| 8 | + |
| 9 | +Similar tests WITHOUT canary strategy pass in ~60 seconds: |
| 10 | +- `TestFlyDeploy_DeployMachinesCheck` - PASSES in ~60s |
| 11 | + |
| 12 | +## Test Scenario |
| 13 | + |
| 14 | +Both failing tests follow this pattern: |
| 15 | +1. `fly launch --strategy canary` - Creates app + deploys (1 machine, no machine checks yet) |
| 16 | +2. Add `[[http_service.machine_checks]]` to fly.toml |
| 17 | +3. `fly deploy --buildkit --remote-only` - **HANGS HERE** |
| 18 | + |
| 19 | +## Code Flow Analysis |
| 20 | + |
| 21 | +### Normal Deploy with Machine Checks (Rolling Strategy) |
| 22 | +``` |
| 23 | +deploy.go |
| 24 | + ↓ |
| 25 | +Build image with BuildKit |
| 26 | + ↓ |
| 27 | +Update machines (rolling) |
| 28 | + ↓ |
| 29 | +For each machine update: |
| 30 | + - Run machine checks (test machines) |
| 31 | + - Wait for checks to pass |
| 32 | + ↓ |
| 33 | +Done (~ 60 seconds) |
| 34 | +``` |
| 35 | + |
| 36 | +### Canary Deploy with Machine Checks |
| 37 | +``` |
| 38 | +deploy.go |
| 39 | + ↓ |
| 40 | +Build image with BuildKit |
| 41 | + ↓ |
| 42 | +deployCanaryMachines() [machines_deploymachinesapp.go:262] |
| 43 | + ├─ Create temporary canary machine (nginx) |
| 44 | + ├─ runTestMachines() [machinebasedtest.go:44] |
| 45 | + │ ├─ createTestMachine() - Create curl container |
| 46 | + │ ├─ Wait for test machine to START (5min timeout) |
| 47 | + │ ├─ Wait for test machine to be DESTROYED (5min timeout) |
| 48 | + │ └─ Check exit code |
| 49 | + └─ Destroy temporary canary machine |
| 50 | + ↓ |
| 51 | +Actual canary rollout |
| 52 | + ├─ Update first production machine |
| 53 | + └─ Update rest of machines |
| 54 | + ↓ |
| 55 | +Done (SHOULD be ~2-3 minutes, but HANGS for 15+ minutes) |
| 56 | +``` |
| 57 | + |
| 58 | +## Hypothesis |
| 59 | + |
| 60 | +The hang occurs during `deployCanaryMachines()`, specifically in `runTestMachines()`. |
| 61 | + |
| 62 | +Possible causes: |
| 63 | + |
| 64 | +### 1. Test Machine Never Starts |
| 65 | +- Test machine (curl container) fails to create or start |
| 66 | +- But there's a 5-minute timeout for START state |
| 67 | +- Should return error, not hang indefinitely |
| 68 | + |
| 69 | +### 2. Test Machine Never Auto-Destroys |
| 70 | +- Test machines are configured with `AutoDestroy: true` |
| 71 | +- They should auto-destroy after running the curl command |
| 72 | +- But there's a 5-minute timeout for DESTROYED state |
| 73 | +- Should return error, not hang indefinitely |
| 74 | + |
| 75 | +### 3. Network Routing Issue |
| 76 | +- Test machine can't reach canary machine's private IP |
| 77 | +- Curl hangs indefinitely (no timeout in curl command) |
| 78 | +- Test machine never exits |
| 79 | +- **This could explain the indefinite hang!** |
| 80 | + |
| 81 | +### 4. Private IP Not Populated |
| 82 | +- Canary machine's `PrivateIP` field is empty/null |
| 83 | +- Test machine gets `FLY_TEST_MACHINE_IP=""` |
| 84 | +- Curl tries to connect to an invalid address
| 85 | +- Hangs or fails in an unexpected way
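One way to rule out hypothesis 4 early is to validate the address before a test machine is created. The following is a minimal sketch, not flyctl code: `checkTargetIP` is a hypothetical helper, and the sample address is made up for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// checkTargetIP is a hypothetical guard (not part of flyctl). It rejects an
// empty or malformed value before it is handed to a test machine as
// FLY_TEST_MACHINE_IP.
func checkTargetIP(raw string) (netip.Addr, error) {
	if raw == "" {
		return netip.Addr{}, fmt.Errorf("FLY_TEST_MACHINE_IP is empty")
	}
	addr, err := netip.ParseAddr(raw)
	if err != nil {
		return netip.Addr{}, fmt.Errorf("FLY_TEST_MACHINE_IP %q is not an IP address: %w", raw, err)
	}
	return addr, nil
}

func main() {
	// The second value is an illustrative private IPv6 address, not a real machine.
	for _, raw := range []string{"", "fdaa:0:1::3"} {
		if _, err := checkTargetIP(raw); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", raw)
	}
}
```

Failing fast here would turn a silent hang into an immediate, descriptive error if the canary machine's `PrivateIP` really is empty.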
| 86 | + |
| 87 | +## Most Likely Cause: Curl Has No Timeout |
| 88 | + |
| 89 | +The test machine runs: |
| 90 | +```bash |
| 91 | +curl http://[$FLY_TEST_MACHINE_IP]:80 |
| 92 | +``` |
| 93 | + |
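| 94 | +If the curl command can't connect (network issue, wrong IP, firewall, etc.), it will hang until one of the following happens:
| 95 | +- curl's default connect timeout expires (about 5 minutes; there is no default limit on total transfer time, so a connection that opens but never responds can hang indefinitely)
| 96 | +- The machine is killed externally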
| 97 | + |
| 98 | +The test machine won't auto-destroy until the command exits. |
| 99 | + |
| 100 | +## Recommended Fixes |
| 101 | + |
| 102 | +### Short-term: Add Timeout to Curl |
| 103 | +Modify the test to include a timeout: |
| 104 | +```toml |
| 105 | +[[http_service.machine_checks]] |
| 106 | + image = "curlimages/curl" |
| 107 | + entrypoint = ["/bin/sh", "-c"] |
| 108 | + command = ["curl --max-time 30 --connect-timeout 10 http://[$FLY_TEST_MACHINE_IP]:80"] |
| 109 | +``` |
| 110 | + |
| 111 | +### Medium-term: Add Overall Timeout to Test Machine Execution |
| 112 | +In `machinebasedtest.go`, add a context timeout around the entire test machine execution. |
| 113 | + |
| 114 | +### Long-term: Investigate Why Curl Can't Connect in Canary Scenario |
| 115 | +- Check if temporary canary machines have different network configuration |
| 116 | +- Verify PrivateIP is populated correctly |
| 117 | +- Check if machine-to-machine connectivity works in CI environment |
| 118 | +- Look for differences between temporary canary machines and regular machines |
| 119 | + |
| 120 | +## Reproduction Steps |
| 121 | + |
| 122 | +To reproduce locally:
| 123 | +```bash |
| 124 | +cd test/preflight |
| 125 | +# Enable the commented-out test |
| 126 | +# Run just that test |
| 127 | +go test -tags=integration -v -timeout=20m -run TestFlyDeploy_DeployMachinesCheckCanary . |
| 128 | +``` |
| 129 | + |
| 130 | +## Related Code Locations |
| 131 | + |
| 132 | +- `internal/command/deploy/machines_deploymachinesapp.go:260-320` - deployCanaryMachines() |
| 133 | +- `internal/command/deploy/machinebasedtest.go:44-150` - runTestMachines() |
| 134 | +- `internal/appconfig/machines.go:108-220` - ToTestMachineConfig() |
| 135 | +- `test/preflight/fly_deploy_test.go:394-417` - The failing test |
| 136 | + |
| 137 | +## Timeline |
| 138 | + |
| 139 | +- 2025-12-17: Issue discovered during CI runs on use-buildkit branch |
| 140 | +- 2025-12-17: Tests commented out to unblock CI |
| 141 | +- Investigation documented in this file |