added support for custom exit codes on retry #792 #902

kriyanshii · 2025-04-17T10:21:41Z

Add Custom Exit Code Support for Retry Policy

Description

This PR adds support for specifying custom exit codes that should trigger task retries. Previously, the retry policy would retry on any non-zero exit code. Now users can specify exactly which exit codes should trigger retries, while other non-zero exit codes will cause the task to fail immediately.

Changes

- Added exitCode field to the retry policy schema in schemas/dag.schema.json
- Updated RetryPolicy struct in internal/digraph/step.go to include ExitCodes field
- Added ExitCode field to retryPolicyDef struct in internal/digraph/spec.go
- Updated builder in internal/digraph/builder.go to handle the new exitCode field
- Enhanced exit code handling in internal/digraph/scheduler/scheduler.go:
  - Added robust exit code extraction from different error types
  - Added support for parsing exit codes from error strings
  - Added handling for signal terminations
  - Added detailed debug logging for better troubleshooting
  - Improved retry policy decision logic

Example Usage

steps:
  - name: retryable task
    command: main.sh
    retryPolicy:
      limit: 3
      intervalSec: 5
      exitCode: [1, 2]  # Retries if exit code is 1 or 2; otherwise fails immediately

Backward Compatibility

If no exitCode is specified, the behavior remains unchanged (retries on any non-zero exit code)
All existing retry policy configurations will continue to work as before

Testing

Added comprehensive error handling for different types of process terminations
Added detailed debug logging to track exit code determination and retry decisions
Maintained backward compatibility with existing retry policy configurations

Benefits

More precise control over which failures should trigger retries
Better error handling and logging for troubleshooting
Maintains backward compatibility with existing configurations

yottahmd

Thanks, it looks great! I've left a few comments.

yottahmd · 2025-04-19T06:54:46Z

internal/digraph/scheduler/scheduler.go

+							var exitCode int
+							var exitCodeFound bool
+
+							// Try to extract exit code from different error types
+							if exitErr, ok := execErr.(*exec.ExitError); ok {
+								exitCode = exitErr.ExitCode()
+								exitCodeFound = true
+								logger.Debug(ctx, "Found ExitError", "error", execErr, "exitCode", exitCode)
+							} else {
+								// Try to parse exit code from error string
+								errStr := execErr.Error()
+								if strings.Contains(errStr, "exit status") {
+									// Parse "exit status N" format
+									parts := strings.Split(errStr, " ")
+									if len(parts) > 2 {
+										if code, err := strconv.Atoi(parts[2]); err == nil {
+											exitCode = code
+											exitCodeFound = true
+											logger.Debug(ctx, "Parsed exit code from error string", "error", errStr, "exitCode", exitCode)
+										}
+									}
+								} else if strings.Contains(errStr, "signal:") {
+									// Handle signal termination
+									exitCode = -1
+									exitCodeFound = true
+									logger.Debug(ctx, "Process terminated by signal", "error", errStr)
+								}
+							}
+
+							if !exitCodeFound {
+								logger.Debug(ctx, "Could not determine exit code", "error", execErr, "errorType", fmt.Sprintf("%T", execErr))
+								// Default to exit code 1 if we can't determine the actual code
+								exitCode = 1
+							}


Can we define a separate method for this?

yottahmd · 2025-04-19T06:57:54Z

internal/digraph/scheduler/scheduler.go

+							shouldRetry := false
+							if len(node.retryPolicy.ExitCodes) > 0 {
+								// If exit codes are specified, only retry for those codes
+								for _, code := range node.retryPolicy.ExitCodes {
+									if exitCode == code {
+										shouldRetry = true
+										break
+									}
+								}
+								logger.Debug(ctx, "Checking retry policy", "exitCode", exitCode, "allowedCodes", node.retryPolicy.ExitCodes, "shouldRetry", shouldRetry)
+							} else {
+								// If no exit codes specified, retry for any non-zero exit code
+								shouldRetry = exitCode != 0
+								logger.Debug(ctx, "Using default retry policy", "exitCode", exitCode, "shouldRetry", shouldRetry)
+							}


Can we define a ShouldRetry(exitCode int) method on RetryPolicy struct to handle this?

Will try it
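
A rough sketch of how the suggested method could look, assuming a trimmed-down `RetryPolicy` that carries only the fields discussed in this thread (the real struct in internal/digraph/step.go has more, and the merged implementation may differ):

```go
package main

import "fmt"

// RetryPolicy is a simplified stand-in for the real struct; only the
// fields relevant to this discussion are included.
type RetryPolicy struct {
	Limit     int
	ExitCodes []int
}

// ShouldRetry reports whether a step that exited with exitCode should be retried.
// With ExitCodes set, only the listed codes trigger a retry.
// With ExitCodes empty, any non-zero code triggers a retry (the previous behavior).
func (p RetryPolicy) ShouldRetry(exitCode int) bool {
	if len(p.ExitCodes) > 0 {
		for _, code := range p.ExitCodes {
			if code == exitCode {
				return true
			}
		}
		return false
	}
	return exitCode != 0
}

func main() {
	restricted := RetryPolicy{Limit: 3, ExitCodes: []int{1, 2}}
	unrestricted := RetryPolicy{Limit: 3}

	fmt.Println(restricted.ShouldRetry(2))    // listed code
	fmt.Println(restricted.ShouldRetry(42))   // unlisted code
	fmt.Println(unrestricted.ShouldRetry(42)) // default policy, non-zero code
}
```

Keeping the decision on the struct lets the scheduler call `node.retryPolicy.ShouldRetry(exitCode)` and log the result in one place.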

internal/digraph/scheduler/scheduler.go

kriyanshii · 2025-04-19T15:40:08Z

@yottahmd resolved all the comments.

yottahmd · 2025-04-19T15:46:36Z

internal/digraph/scheduler/scheduler.go

+	if !shouldRetry {
+		// finish the node with error
+		node.data.SetStatus(NodeStatusError)
+		node.data.MarkError(execErr)
+		sc.setLastError(execErr)
+		return false
+	}


@kriyanshii
Would it be possible to reuse the same logic in the switch statement below? In some cases, a step should be marked as successful even if it fails.

dagu/internal/digraph/scheduler/scheduler.go

Lines 203 to 214 in 2ab2fab

	default:
		// finish the node
		node.data.SetStatus(NodeStatusError)
		if node.shouldMarkSuccess(ctx) {
			// mark as success if the node should be marked as success
			// i.e. continueOn.markSuccess is set to true
			node.data.SetStatus(NodeStatusSuccess)
		} else {
			node.data.MarkError(execErr)
			sc.setLastError(execErr)
		}
	}

Perhaps we could check the return value of the handleNodeRetry function and use fallthrough when it returns false. What do you think?

Something like this:

 	case node.retryPolicy.Limit > node.data.GetRetryCount():
 		if sc.handleNodeRetry(ctx, node, execErr) {
 			continue ExecRepeat
 		}
+		fallthrough

Yes, that's a good observation! We can reuse the success marking logic by using fallthrough when handleNodeRetry returns false. This would make the code more DRY (Don't Repeat Yourself) and ensure consistent behavior.
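
To make the control flow concrete, here is a small self-contained sketch of the fallthrough idea. The function `finish` and its boolean parameters are hypothetical stand-ins for the scheduler state (`retriesLeft` for the retry-limit check, `retryAccepted` for the return value of handleNodeRetry, `markSuccess` for shouldMarkSuccess); it is not the scheduler code itself.

```go
package main

import "fmt"

// finish mimics the shape of the scheduler's switch with plain booleans.
// All names here are assumptions made for illustration.
func finish(retriesLeft, retryAccepted, markSuccess bool) string {
	switch {
	case retriesLeft:
		if retryAccepted {
			return "retry"
		}
		// The retry was declined (for example, the exit code is not in the
		// configured list), so reuse the default branch instead of
		// duplicating its logic here.
		fallthrough
	default:
		if markSuccess {
			return "success"
		}
		return "error"
	}
}

func main() {
	fmt.Println(finish(true, true, false))   // retry is scheduled
	fmt.Println(finish(true, false, true))   // retry declined, marked as success
	fmt.Println(finish(true, false, false))  // retry declined, marked as error
	fmt.Println(finish(false, false, false)) // no retries left, marked as error
}
```

Note that Go's `fallthrough` always transfers control to the next clause in source order, so this only works while the retry case sits directly above `default`.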

yottahmd

It looks great! I've added some nit comments.

yottahmd · 2025-04-19T15:50:21Z

internal/digraph/scheduler/scheduler.go

+	if !exitCodeFound {
+		logger.Debug(ctx, "Could not determine exit code", "error", execErr, "errorType", fmt.Sprintf("%T", execErr))
+		// Default to exit code 1 if we can't determine the actual code
+		exitCode = 1


I think we could initialize exitCode to 1 at declaration. WDYT?

yottahmd · 2025-04-29T02:53:06Z

@kriyanshii Would you mind updating the documentation to reflect this change? Thanks a lot!

kriyanshii · 2025-04-29T15:04:38Z

Hey, can you check it? I have updated the docs. Do I need to update anything else?

yottahmd

Thanks for updating the document and JSON schema! LGTM 🚀🚀🚀

codecov · 2025-04-29T15:48:05Z

Codecov Report

Attention: Patch coverage is 47.82609% with 36 lines in your changes missing coverage. Please review.

Project coverage is 55.41%. Comparing base (b4961b0) to head (363710d).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/digraph/scheduler/scheduler.go	54.00%	19 Missing and 4 partials ⚠️
internal/digraph/scheduler/node.go	37.50%	8 Missing and 2 partials ⚠️
internal/digraph/builder.go	0.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
- Coverage   55.48%   55.41%   -0.08%     
==========================================
  Files          75       75              
  Lines        8240     8301      +61     
==========================================
+ Hits         4572     4600      +28     
- Misses       3281     3308      +27     
- Partials      387      393       +6

Files with missing lines	Coverage Δ
internal/digraph/step.go	`91.66% <ø> (ø)`
internal/digraph/builder.go	`57.63% <0.00%> (-0.31%)`	⬇️
internal/digraph/scheduler/node.go	`65.94% <37.50%> (-1.06%)`	⬇️
internal/digraph/scheduler/scheduler.go	`79.38% <54.00%> (-3.69%)`	⬇️

... and 1 file with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

added support for custom exit codes on retry dagu-org#792

f7c8ea7

yottahmd reviewed Apr 19, 2025

kriyanshii and others added 2 commits April 19, 2025 17:20

Merge branch 'dagu-org:main' into main

5021e29

improved support for custom exit codes on retry dagu-org#792

2ab2fab

yottahmd reviewed Apr 19, 2025

kriyanshii force-pushed the main branch from d620a36 to 2ab2fab on April 20, 2025 06:40

yottahmd added this to the v1.16.8 milestone Apr 20, 2025

updated documentation regarding retry on custom codes.

363710d

yottahmd approved these changes Apr 29, 2025

yottahmd merged commit 426539e into dagu-org:main Apr 29, 2025
4 checks passed
