Commit 73fab33

dkit retry
1 parent 09fefa2 commit 73fab33

1 file changed: +339 -0 lines changed

internal/cmd/retry/AGENTS.md

Lines changed: 339 additions & 0 deletions
@@ -0,0 +1,339 @@
# retry Command

## Purpose
Execute commands with automatic retry logic, exponential backoff, and failure recovery strategies. Makes flaky commands reliable and reduces manual intervention in CI/CD pipelines.

## Command Signature
```bash
dkit retry [flags] $ARGS
```

## Flags
- `-n, --attempts <number>` - Maximum number of attempts, including the first run (default: 3)
- `-d, --delay <duration>` - Initial delay between retries (default: 1s)
- `--max-delay <duration>` - Maximum delay for exponential backoff (default: 60s)
- `--backoff <linear|exponential|constant>` - Backoff strategy (default: exponential)
- `--backoff-multiplier <float>` - Multiplier for exponential backoff (default: 2.0)
- `--jitter` - Add random jitter to delays to prevent thundering herd
- `--on-exit <codes>` - Comma-separated exit codes to retry (default: all non-zero)
- `--skip-exit <codes>` - Comma-separated exit codes to NOT retry
- `--on-stderr <pattern>` - Retry if stderr matches regex pattern
- `--skip-stderr <pattern>` - Do NOT retry if stderr matches regex pattern
- `--timeout <duration>` - Timeout for each attempt (no timeout by default)
- `--verbose` - Show detailed retry information
- `-w, --workspace` - Execute in project root directory (auto-detected via git)

## Core Behavior

### Execution Flow
1. **Execute Command**: Run the provided command
2. **Check Result**: Evaluate exit code and output
3. **Determine Retry**: Apply retry conditions
4. **Wait**: Apply backoff delay if retrying
5. **Repeat**: Continue until success or max attempts reached

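The flow above can be sketched as a small shell function. This is a minimal illustration, not dkit's implementation: `retry_sketch` and its positional arguments are hypothetical, the delay is constant and given in whole seconds, and the only retry condition is a non-zero exit code.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   retry_sketch <max_attempts> <delay_seconds> <command> [args...]
retry_sketch() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 status=0
  while :; do
    # Steps 1-2: run the command and record its exit code.
    "$@"
    status=$?
    # Step 3: stop on success or once the attempts are exhausted.
    if [ "$status" -eq 0 ] || [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    # Step 4: wait before the next attempt (constant delay in this sketch).
    sleep "$delay"
    # Step 5: repeat.
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_sketch 5 2 npm install` would run `npm install` up to five times with a 2-second pause between attempts, returning the exit code of the last attempt.
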
### Retry Conditions

Commands are retried when:
- Exit code is non-zero (configurable with `--on-exit`)
- Exit code is NOT in skip list (configurable with `--skip-exit`)
- stderr matches retry pattern (if `--on-stderr` specified)
- stderr does NOT match skip pattern (if `--skip-stderr` specified)

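One way to read these conditions is as a single decision function. The sketch below assumes that every condition that was specified must hold (a logical AND), which the list above does not state explicitly. `should_retry` and its argument order are hypothetical, and an empty string stands for "flag not given".

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   should_retry <exit_code> <stderr_text> <on_exit> <skip_exit> <on_stderr> <skip_stderr>
# Returns 0 when the attempt should be retried, 1 otherwise.
should_retry() {
  local code=$1 err=$2 on_exit=$3 skip_exit=$4 on_stderr=$5 skip_stderr=$6
  # A successful command is never retried.
  [ "$code" -ne 0 ] || return 1
  # --on-exit: when given, the exit code must appear in the list.
  if [ -n "$on_exit" ]; then
    case ",$on_exit," in *",$code,"*) ;; *) return 1 ;; esac
  fi
  # --skip-exit: the exit code must not appear in the list.
  if [ -n "$skip_exit" ]; then
    case ",$skip_exit," in *",$code,"*) return 1 ;; esac
  fi
  # --on-stderr: when given, stderr must match the regex.
  if [ -n "$on_stderr" ] && ! [[ $err =~ $on_stderr ]]; then return 1; fi
  # --skip-stderr: when given, stderr must not match the regex.
  if [ -n "$skip_stderr" ] && [[ $err =~ $skip_stderr ]]; then return 1; fi
  return 0
}
```

Under the AND assumption, `should_retry 28 "" "7,28,56" "" "" ""` succeeds, while `should_retry 1 "permission denied" "" "" "" "permission denied"` does not.
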
### Backoff Strategies

#### Constant Backoff
Wait the same amount of time between each retry.
```bash
dkit retry --backoff constant --delay 5s -- flaky-command
# Delays: 5s, 5s, 5s
```

#### Linear Backoff
Increase delay linearly with each retry.
```bash
dkit retry --backoff linear --delay 2s -- flaky-command
# Delays: 2s, 4s, 6s
```

#### Exponential Backoff (default)
Multiply delay by backoff multiplier with each retry.
```bash
dkit retry --backoff exponential --delay 1s --backoff-multiplier 2 -- flaky-command
# Delays: 1s, 2s, 4s, 8s, 16s...
```

#### With Jitter
Add randomness to prevent synchronized retries.
```bash
dkit retry --jitter --delay 1s -- flaky-command
# Example delays: 1.2s, 2.3s, 3.8s (base delays 1s, 2s, 4s with ±20% random variation)
```

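The delay sequences in the examples above follow from simple formulas. A minimal sketch, assuming delays are tracked in milliseconds and that `--max-delay` caps every strategy (the flag description only mentions exponential backoff). The helper names `backoff_delay_ms` and `apply_jitter_ms` are hypothetical, and `$RANDOM` is used only for brevity; it is not a cryptographically secure source.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   backoff_delay_ms <strategy> <base_ms> <multiplier> <retry_number> <max_ms>
# retry_number is 1 for the wait after the first failed attempt.
backoff_delay_ms() {
  local strategy=$1 base=$2 mult=$3 n=$4 max=$5
  awk -v s="$strategy" -v b="$base" -v m="$mult" -v n="$n" -v max="$max" 'BEGIN {
    if (s == "constant")    d = b                  # same delay every time
    else if (s == "linear") d = b * n              # base, 2*base, 3*base, ...
    else                    d = b * m ^ (n - 1)    # exponential: base * mult^(n-1)
    if (d > max) d = max                           # cap at the maximum delay
    printf "%d\n", d
  }'
}

# Hypothetical helper: scale a delay by a random factor between 0.8 and 1.2 (±20%).
apply_jitter_ms() {
  local delay=$1
  local pct=$((80 + RANDOM % 41))   # integer percentage in 80..120
  echo $((delay * pct / 100))
}
```

Under these assumptions, `backoff_delay_ms exponential 1000 2.0 3 60000` prints `4000`, the third delay in the exponential example above.
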
### Timeout Handling
```bash
# Timeout each attempt at 30 seconds
dkit retry --timeout 30s -- long-running-command

# If any attempt exceeds 30s, it's killed and retried
```

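A per-attempt timeout can be approximated in shell with the GNU coreutils `timeout` command, which exits with status 124 when the limit is reached. The sketch below is an illustration under that assumption. `run_attempt` is a hypothetical name, and an empty limit means "no timeout".

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   run_attempt <limit-or-empty> <command> [args...]
# Assumes GNU coreutils `timeout` is installed.
run_attempt() {
  local limit=$1
  shift
  if [ -n "$limit" ]; then
    timeout "$limit" "$@"   # kills the command once the limit is exceeded
  else
    "$@"                    # no timeout configured
  fi
  local status=$?
  # 124 is the status GNU timeout reports when the limit was hit.
  if [ "$status" -eq 124 ]; then
    echo "[sketch] attempt timed out after $limit" >&2
  fi
  return "$status"
}
```

A retry loop can treat status 124 like any other non-zero exit code and start the next attempt.
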
## Exit Codes
- `0` - Command succeeded (on any attempt)
- `N` - Command failed after all retries (original exit code from last attempt)
- `124` - Command timed out on all attempts
- `130` - Interrupted by user (Ctrl+C)

## Output Format

### Default (concise)
```
[dkit retry] Attempt 1/3...
[dkit retry] ✗ Failed with exit code 1
[dkit retry] Waiting 1s before retry...

[dkit retry] Attempt 2/3...
[dkit retry] ✗ Failed with exit code 1
[dkit retry] Waiting 2s before retry...

[dkit retry] Attempt 3/3...
[dkit retry] ✓ Success!
```

### Verbose
```
[dkit retry] Configuration:
  Command: npm install
  Max attempts: 3
  Backoff: exponential (2.0x, max 60s)
  Retry on: all non-zero exits
  Timeout: 30s per attempt

[dkit retry] Attempt 1/3 started at 2025-12-23 10:30:00
[dkit retry] Running: npm install
npm ERR! network timeout
[dkit retry] ✗ Failed after 12.3s with exit code 1
[dkit retry] Error output: network timeout
[dkit retry] Retry condition met: exit code 1 (non-zero)
[dkit retry] Waiting 1s before retry...

[dkit retry] Attempt 2/3 started at 2025-12-23 10:30:13
[dkit retry] Running: npm install
added 234 packages in 5.2s
[dkit retry] ✓ Success after 5.2s
[dkit retry] Total time: 18.5s (2 attempts)
```

### On Final Failure
```
[dkit retry] Attempt 3/3...
[dkit retry] ✗ Failed with exit code 1

[dkit retry] All retry attempts exhausted
[dkit retry] Command failed after 3 attempts
[dkit retry] Total time: 45.2s
[dkit retry] Last exit code: 1
```

## Use Cases

### Flaky Network Commands
```bash
# Retry package installation on network failures
dkit retry --attempts 5 --delay 2s -- npm install

# Retry with exponential backoff for API rate limits
# (--fail makes curl exit non-zero on HTTP errors such as 429)
dkit retry --delay 1s --max-delay 30s -- curl --fail https://api.example.com/data
```

### CI/CD Pipelines
```bash
# Retry flaky tests
dkit retry --attempts 3 -- npm test

# Retry docker pulls
dkit retry --timeout 60s --attempts 5 -- docker pull image:tag

# Retry deployments
dkit retry --delay 10s --attempts 3 -- kubectl apply -f deploy.yaml
```

### Conditional Retries
```bash
# Only retry on specific exit codes (network errors)
dkit retry --on-exit 7,28,56 -- curl https://example.com

# Don't retry on authentication failures
# (exit codes are limited to 0-255, so an HTTP status such as 401 cannot be one;
# 77 is a placeholder for whatever code the command uses for auth errors)
dkit retry --skip-exit 77 -- some-api-command

# Retry only if stderr contains "timeout"
dkit retry --on-stderr "timeout|timed out" -- flaky-command

# Don't retry if stderr contains "permission denied"
dkit retry --skip-stderr "permission denied|unauthorized" -- secure-command
```

### Database Operations
```bash
# Retry database migrations with backoff
dkit retry --delay 5s --attempts 10 -- db-migrate up

# Retry connection with linear backoff
dkit retry --backoff linear --delay 2s -- psql -c "SELECT 1"
```

### File Downloads
```bash
# Retry large file download with jitter to avoid server overload
dkit retry --jitter --delay 1s --attempts 10 -- wget https://example.com/large-file.zip

# Retry with timeout per attempt
dkit retry --timeout 120s --attempts 5 -- rsync -av remote:/data ./data
```

## Advanced Examples

### Kubernetes Deployment
```bash
# Wait for deployment with progressive backoff
dkit retry \
  --attempts 20 \
  --delay 5s \
  --max-delay 60s \
  --timeout 30s \
  -- kubectl wait --for=condition=ready pod -l app=myapp
```

### Multi-Region Fallback
```bash
# Try primary region, then fallback
dkit retry --attempts 2 --delay 0s -- deploy-to-us-east-1 || \
dkit retry --attempts 2 --delay 0s -- deploy-to-us-west-2 || \
dkit retry --attempts 2 --delay 0s -- deploy-to-eu-west-1
```

### Smart Test Retries
```bash
# Retry failed tests only
# (pytest exits with code 1 when tests fail and writes its report to stdout,
# so the exit code is matched rather than stderr)
dkit retry \
  --attempts 3 \
  --on-exit 1 \
  -- pytest --last-failed
```

### Rate Limited API
```bash
# Respect rate limits with exponential backoff
dkit retry \
  --attempts 10 \
  --delay 1s \
  --max-delay 300s \
  --on-stderr "rate limit|429" \
  --jitter \
  -- api-client fetch-data
```

## Integration with `dkit run`

`dkit retry` can be combined with `dkit run` for persistent logging:
```bash
dkit run -- dkit retry --attempts 3 -- npm test

# Logs stored in .dkit/ with full retry history
```

## Implementation Requirements

### Core
- Must capture and preserve stdout/stderr separately
- Must maintain exit code semantics
- Must handle signals properly (forward to child, cleanup on SIGINT)
- Must support all shell syntax when command contains pipes/redirects
- Should not buffer output excessively (stream in real-time when possible)

### Timing
- Must accurately track attempt duration
- Must respect timeout per attempt (not total timeout)
- Must implement backoff strategies correctly
- Should add jitter using cryptographically secure random

### Conditionals
- Must support exit code matching (exact values and ranges)
- Must support regex pattern matching on stderr
- Must handle edge cases (empty stderr, no output, etc.)
- Should compile regex patterns once for performance

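Exit code matching with exact values and ranges could look like the sketch below. The range syntax `50-60` is an assumption, since the flag descriptions only show comma-separated values, and `code_matches` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   code_matches <exit_code> <spec>
# <spec> is a comma-separated list of exact codes and inclusive ranges,
# e.g. "7,28,50-60". Returns 0 on a match, 1 otherwise.
code_matches() {
  local code=$1 spec=$2 item lo hi
  local IFS=,                      # split the spec on commas
  for item in $spec; do
    case $item in
      *-*)                         # inclusive range "lo-hi"
        lo=${item%-*}
        hi=${item#*-}
        if [ "$code" -ge "$lo" ] && [ "$code" -le "$hi" ]; then return 0; fi
        ;;
      *)                           # exact value
        if [ "$code" -eq "$item" ]; then return 0; fi
        ;;
    esac
  done
  return 1
}
```

With this sketch, `code_matches 56 "7,28,50-60"` succeeds because 56 falls inside the range, while `code_matches 2 "7,28,50-60"` fails.
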
### Output
- Must clearly indicate which attempt is running
- Must show why retry was triggered (in verbose mode)
- Must display total time and attempt count on completion
- Should provide actionable information on final failure

### Safety
- Must not retry indefinitely (the attempt count is always finite: the `--attempts` value, or the default of 3)
- Must respect max delay to prevent excessive waiting
- Must allow user interruption (Ctrl+C) at any time
- Should warn if backoff delay exceeds reasonable limits

## Error Handling

### Command Not Found
```
[dkit retry] ERROR: Command not found: nonexistent-command
[dkit retry] No retries will be attempted for command not found errors
```
Exit code: 127

### Invalid Configuration
```
[dkit retry] ERROR: Invalid delay duration: "abc"
[dkit retry] Expected format: 1s, 500ms, 1m30s
```
Exit code: 2

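Validating a duration such as `1s`, `500ms`, or `1m30s` might be sketched as follows. `parse_duration_ms` is a hypothetical helper, and it only accepts whole-number components with the units `ms`, `s`, `m`, and `h`, which is an assumption about the accepted format. It returns status 2 on invalid input, mirroring the exit code shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkit):
#   parse_duration_ms <text>
# Prints the duration in milliseconds, or returns 2 if the text is invalid.
parse_duration_ms() {
  local rest=$1 total=0 num factor
  [ -n "$rest" ] || return 2
  while [ -n "$rest" ]; do
    num=${rest%%[!0-9]*}           # leading digits of the next component
    [ -n "$num" ] || return 2      # a component must start with a number
    rest=${rest#"$num"}
    case $rest in                  # "ms" is checked before "m" and "s"
      ms*) factor=1;       rest=${rest#ms} ;;
      s*)  factor=1000;    rest=${rest#s} ;;
      m*)  factor=60000;   rest=${rest#m} ;;
      h*)  factor=3600000; rest=${rest#h} ;;
      *)   return 2 ;;             # missing or unknown unit
    esac
    total=$((total + 10#$num * factor))   # 10# avoids octal parsing of "08"
  done
  echo "$total"
}
```

For instance, `parse_duration_ms 1m30s` prints `90000`, and `parse_duration_ms abc` returns status 2 without printing anything.
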
### Timeout on All Attempts
```
[dkit retry] Attempt 3/3...
[dkit retry] ✗ Timeout after 30s
[dkit retry] All attempts timed out
[dkit retry] Total time: 93s (3 × 30s + 3s backoff)
```
Exit code: 124

### User Interruption
```
[dkit retry] Attempt 2/3...
[dkit retry] Waiting 5s before retry...
^C
[dkit retry] Interrupted by user
[dkit retry] Command did not complete (0 successes, 2 failures)
```
Exit code: 130

### Invalid Regex Pattern
```
[dkit retry] ERROR: Invalid regex pattern in --on-stderr: "[invalid"
[dkit retry] Error: unclosed character class
```
Exit code: 2

## Design Principles

- **Reliable**: Make flaky commands succeed without manual intervention
- **Transparent**: Clear visibility into what's happening and why
- **Configurable**: Flexible retry strategies for different scenarios
- **Safe**: Sensible defaults, prevent infinite retries
- **Efficient**: Minimal overhead, smart backoff strategies
- **Composable**: Works well with other dkit commands and shell tools

## Future Enhancements

- Circuit breaker pattern (stop retrying if too many consecutive failures)
- Success rate tracking and reporting
- Retry budget enforcement (max total time across all attempts)
- Adaptive backoff based on error type
- Integration with monitoring/alerting systems
- Retry statistics export (JSON format)
- Support for retry policies via config file
- Multi-command retry (try alternative commands on failure)
