added support for custom exit codes on retry #792 #902

Merged (4 commits) on Apr 29, 2025

kriyanshii
Contributor

Add Custom Exit Code Support for Retry Policy

Description

This PR adds support for specifying custom exit codes that should trigger task retries. Previously, the retry policy would retry on any non-zero exit code. Now users can specify exactly which exit codes should trigger retries, while other non-zero exit codes will cause the task to fail immediately.

Changes

  1. Added exitCode field to the retry policy schema in schemas/dag.schema.json
  2. Updated RetryPolicy struct in internal/digraph/step.go to include ExitCodes field
  3. Added ExitCode field to retryPolicyDef struct in internal/digraph/spec.go
  4. Updated builder in internal/digraph/builder.go to handle the new exitCode field
  5. Enhanced exit code handling in internal/digraph/scheduler/scheduler.go:
    • Added robust exit code extraction from different error types
    • Added support for parsing exit codes from error strings
    • Added handling for signal terminations
    • Added detailed debug logging for better troubleshooting
    • Improved retry policy decision logic

Example Usage

steps:
  - name: retryable task
    command: main.sh
    retryPolicy:
      limit: 3
      intervalSec: 5
      exitCode: [1, 2]  # Retries if exit code is 1 or 2; otherwise fails immediately

Backward Compatibility

  • If no exitCode is specified, the behavior remains unchanged (retries on any non-zero exit code)
  • All existing retry policy configurations will continue to work as before

Testing

  • Added comprehensive error handling for different types of process terminations
  • Added detailed debug logging to track exit code determination and retry decisions
  • Maintained backward compatibility with existing retry policy configurations

Benefits

  • More precise control over which failures should trigger retries
  • Better error handling and logging for troubleshooting
  • Maintains backward compatibility with existing configurations

Collaborator

@yottahmd left a comment

Thanks, it looks great! I've left a few comments.

Comment on lines 200 to 233
var exitCode int
var exitCodeFound bool

// Try to extract exit code from different error types
if exitErr, ok := execErr.(*exec.ExitError); ok {
	exitCode = exitErr.ExitCode()
	exitCodeFound = true
	logger.Debug(ctx, "Found ExitError", "error", execErr, "exitCode", exitCode)
} else {
	// Try to parse exit code from error string
	errStr := execErr.Error()
	if strings.Contains(errStr, "exit status") {
		// Parse "exit status N" format
		parts := strings.Split(errStr, " ")
		if len(parts) > 2 {
			if code, err := strconv.Atoi(parts[2]); err == nil {
				exitCode = code
				exitCodeFound = true
				logger.Debug(ctx, "Parsed exit code from error string", "error", errStr, "exitCode", exitCode)
			}
		}
	} else if strings.Contains(errStr, "signal:") {
		// Handle signal termination
		exitCode = -1
		exitCodeFound = true
		logger.Debug(ctx, "Process terminated by signal", "error", errStr)
	}
}

if !exitCodeFound {
	logger.Debug(ctx, "Could not determine exit code", "error", execErr, "errorType", fmt.Sprintf("%T", execErr))
	// Default to exit code 1 if we can't determine the actual code
	exitCode = 1
}
Collaborator

Can we define a separate method for this?

Contributor Author

Yes, sure.

Comment on lines 235 to 249
shouldRetry := false
if len(node.retryPolicy.ExitCodes) > 0 {
// If exit codes are specified, only retry for those codes
for _, code := range node.retryPolicy.ExitCodes {
if exitCode == code {
shouldRetry = true
break
}
}
logger.Debug(ctx, "Checking retry policy", "exitCode", exitCode, "allowedCodes", node.retryPolicy.ExitCodes, "shouldRetry", shouldRetry)
} else {
// If no exit codes specified, retry for any non-zero exit code
shouldRetry = exitCode != 0
logger.Debug(ctx, "Using default retry policy", "exitCode", exitCode, "shouldRetry", shouldRetry)
}
Collaborator

@yottahmd, Apr 19, 2025

Can we define a ShouldRetry(exitCode int) method on RetryPolicy struct to handle this?

Contributor Author

Will try it
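
A minimal sketch of what such a method might look like, assuming `RetryPolicy` carries `Limit` and `ExitCodes` as plain `int` and `[]int` fields (the real struct in internal/digraph/step.go may have more fields or different types). It follows the decision logic quoted above: with no codes configured, any non-zero exit code is retryable; otherwise only the listed codes are.

```go
package main

import "fmt"

// RetryPolicy mirrors only the fields mentioned in this PR.
// The field types are assumptions made for this sketch.
type RetryPolicy struct {
	Limit     int
	ExitCodes []int
}

// ShouldRetry reports whether a step that exited with exitCode
// is eligible for a retry under this policy.
func (r RetryPolicy) ShouldRetry(exitCode int) bool {
	if len(r.ExitCodes) == 0 {
		// No codes configured: keep the old behavior and
		// retry on any non-zero exit code.
		return exitCode != 0
	}
	for _, code := range r.ExitCodes {
		if code == exitCode {
			return true
		}
	}
	return false
}

func main() {
	selective := RetryPolicy{Limit: 3, ExitCodes: []int{1, 2}}
	fmt.Println(selective.ShouldRetry(1), selective.ShouldRetry(3))

	legacy := RetryPolicy{Limit: 3}
	fmt.Println(legacy.ShouldRetry(42), legacy.ShouldRetry(0))
}
```

Keeping the decision on the policy type makes it testable without a scheduler or a real process.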

@kriyanshii
Contributor Author

@yottahmd resolved all the comments.

Comment on lines +631 to +637
if !shouldRetry {
	// finish the node with error
	node.data.SetStatus(NodeStatusError)
	node.data.MarkError(execErr)
	sc.setLastError(execErr)
	return false
}
Collaborator

@yottahmd, Apr 19, 2025

@kriyanshii
Would it be possible to reuse the same logic in the switch statement below? In some cases, a step should be marked as successful even if it fails.

default:
	// finish the node
	node.data.SetStatus(NodeStatusError)
	if node.shouldMarkSuccess(ctx) {
		// mark as success if the node should be marked as success
		// i.e. continueOn.markSuccess is set to true
		node.data.SetStatus(NodeStatusSuccess)
	} else {
		node.data.MarkError(execErr)
		sc.setLastError(execErr)
	}
}

Perhaps we could check the return value from the handleNodeRetry function and use fallthrough when the return value is false. What do you think?

Something like this:

case node.retryPolicy.Limit > node.data.GetRetryCount():
	if sc.handleNodeRetry(ctx, node, execErr) {
		continue ExecRepeat
	}
+	fallthrough

Contributor Author

Yes, that's a good observation! We can reuse the success marking logic by using fallthrough when handleNodeRetry returns false. This would make the code more DRY (Don't Repeat Yourself) and ensure consistent behavior.

Collaborator

@yottahmd left a comment

It looks great! I've added some nit comments.

if !exitCodeFound {
	logger.Debug(ctx, "Could not determine exit code", "error", execErr, "errorType", fmt.Sprintf("%T", execErr))
	// Default to exit code 1 if we can't determine the actual code
	exitCode = 1
Collaborator

@yottahmd, Apr 19, 2025

I think we could initialize exitCode to 1 at declaration. WDYT?

@yottahmd
Collaborator

@kriyanshii Would you mind updating the documents to reflect this change? Thanks a lot!

@kriyanshii
Contributor Author

Hey, can you check it? I have updated the docs. Do I need to update anything else?

Collaborator

@yottahmd left a comment

Thanks for updating the document and JSON schema! LGTM 🚀🚀🚀

@yottahmd merged commit 426539e into dagu-org:main on Apr 29, 2025
4 checks passed

codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 47.82609% with 36 lines in your changes missing coverage. Please review.

Project coverage is 55.41%. Comparing base (b4961b0) to head (363710d).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/digraph/scheduler/scheduler.go | 54.00% | 19 Missing and 4 partials ⚠️ |
| internal/digraph/scheduler/node.go | 37.50% | 8 Missing and 2 partials ⚠️ |
| internal/digraph/builder.go | 0.00% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
- Coverage   55.48%   55.41%   -0.08%     
==========================================
  Files          75       75              
  Lines        8240     8301      +61     
==========================================
+ Hits         4572     4600      +28     
- Misses       3281     3308      +27     
- Partials      387      393       +6     

| Files with missing lines | Coverage Δ |
|---|---|
| internal/digraph/step.go | 91.66% <ø> (ø) |
| internal/digraph/builder.go | 57.63% <0.00%> (-0.31%) ⬇️ |
| internal/digraph/scheduler/node.go | 65.94% <37.50%> (-1.06%) ⬇️ |
| internal/digraph/scheduler/scheduler.go | 79.38% <54.00%> (-3.69%) ⬇️ |

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
