
fix(server): handle transient errors before ToStatusError loses type …#15996

Open
HsiuChuanHsu wants to merge 1 commit into argoproj:main from HsiuChuanHsu:fix/14106

Conversation


HsiuChuanHsu commented Apr 19, 2026

Motivation

Kubernetes sometimes returns an HTTP 409 Conflict when a ResourceQuota has not fully synchronized. This is a temporary race condition, and retrying usually resolves it.

However, the server converts the error into a gRPC status, which permanently discards the original error type. The CLI therefore cannot recognize the conflict and treats it as a permanent failure instead of retrying.

Modifications

To resolve this, retry logic is implemented directly in the server's CreateWorkflow and SubmitWorkflow functions:

  • Use errorsutil.IsTransientErr to identify the conflict while the original error type is still available.
  • Apply waitutil.Backoff with exponential backoff.

Fixes #14106

Summary by CodeRabbit

  • Bug Fixes
    • Workflow submission operations are now more resilient to transient failures with automatic retry and backoff mechanisms, reducing errors during temporary service disruptions.
    • Enhanced error recovery logic improves the success rate of workflow creation requests when facing temporary network or service availability issues.

HsiuChuanHsu marked this pull request as draft April 19, 2026 15:45
…info

Signed-off-by: HsiuChuanHsu <hchsu2106@gmail.com>
HsiuChuanHsu marked this pull request as ready for review April 20, 2026 23:14
isubasinghe (Member) commented:

@coderabbitai review


coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

Modified workflow creation operations in CreateWorkflow and SubmitWorkflow to wrap Kubernetes API calls with a retry/backoff mechanism that conditionally retries on transient errors, assigning the result within the retry closure and maintaining existing error status mappings.

Changes

Cohort / File(s) Summary
Workflow Server Retry Logic
server/workflow/workflow_server.go
Added retry/backoff wrapper around Kubernetes Workflows(...).Create(...) calls in CreateWorkflow and SubmitWorkflow functions. Integrated transient error detection via errorsutil.IsTransientErr() to determine retry eligibility. Moved workflow result assignment inside retry closure and preserved error handling status conversions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding retry logic to handle transient errors before they are converted to gRPC status, preventing loss of error type information.
Description check ✅ Passed The description covers motivation, modifications, and includes issue reference, though verification and documentation sections are not explicitly addressed.
Linked Issues check ✅ Passed The PR implements retry logic in CreateWorkflow/SubmitWorkflow using errorsutil.IsTransientErr and waitutil.Backoff to handle transient ResourceQuota conflicts [#14106], directly addressing the issue's requirement.
Out of Scope Changes check ✅ Passed The changes are focused on adding retry logic to handle transient errors in workflow creation functions, which directly addresses the linked issue and contains no out-of-scope modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/workflow/workflow_server.go (1)

141-156: ⚠️ Potential issue | 🟡 Minor

SubmitWorkflow should use codes.Internal instead of codes.InvalidArgument for retry exhaustion.

The retry logic in CreateWorkflow (lines 142–156) correctly maps transient errors to appropriate status codes, but SubmitWorkflow (lines 867–874) uses codes.InvalidArgument for all retry exhaustion cases. Since errorsutil.IsTransientErr includes quota conflicts, service unavailability, timeouts, and network errors—conditions that are not client-side validation failures—returning codes.InvalidArgument to the CLI is misleading. Use codes.Internal like CreateWorkflow does, or add the apierr.IsServerTimeout hint block (lines 149–153) to SubmitWorkflow as well for consistency.

Additionally, for CreateWorkflow: the apierr.IsServerTimeout check at line 149 will correctly detect ServerTimeout errors even after waitutil.Backoff wraps them with fmt.Errorf("%w: %w", ...), since the k8s library implementation of apierr.IsServerTimeout uses errors.As which properly unwraps chained errors.

The idempotency concern with GenerateName—where retrying transient network/timeout errors can create duplicate workflows—remains valid but is partially acknowledged by the existing hint. If scope allows, consider narrowing the retry predicate to only quota/429 cases that are server-rejected before state change, as suggested in the original comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/workflow/workflow_server.go` around lines 141 - 156, The
SubmitWorkflow retry-exhaustion handling should use codes.Internal (or mirror
the CreateWorkflow hint logic) instead of returning codes.InvalidArgument;
update the SubmitWorkflow error branch to map retry-backoff exhaustion to
sutils.ToStatusError(errWithHint, codes.Internal) and include the same
apierr.IsServerTimeout + GenerateName/Name hint pattern used in CreateWorkflow
(see CreateWorkflow, errorsutil.IsTransientErr, apierr.IsServerTimeout,
logger.WithError) so timeouts surface a helpful message about possible existing
workflows while non-client transient failures return Internal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/workflow/workflow_server.go`:
- Around line 867-874: The post-retry error return in the SubmitWorkflow path
incorrectly maps retry-exhausted transient API errors to codes.InvalidArgument;
update the error mapping to match CreateWorkflow: convert the final create error
to an Internal gRPC status via sutils.ToStatusError(err, codes.Internal) and, if
apierr.IsServerTimeout(err) is true, wrap/augment the status with a
DeadlineExceeded hint as CreateWorkflow does. Locate the retry block around
wfClient.ArgoprojV1alpha1().Workflows(...).Create in workflow_server.go (the
SubmitWorkflow handler) and replace the final return that uses
codes.InvalidArgument with the same error translation logic used by
CreateWorkflow (including the apierr.IsServerTimeout check).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b0653bfb-87b5-41c8-b5a4-bb21a2c3527a

📥 Commits

Reviewing files that changed from the base of the PR and between 30c3c9d and c08dcd7.

📒 Files selected for processing (1)
  • server/workflow/workflow_server.go

Comment on lines +867 to 874
err = waitutil.Backoff(retry.DefaultRetry(ctx), func() (bool, error) {
var createErr error
wf, createErr = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
return !errorsutil.IsTransientErr(ctx, createErr), createErr
})
if err != nil {
return nil, sutils.ToStatusError(err, codes.InvalidArgument)
}

⚠️ Potential issue | 🟠 Major

Wrong gRPC status code after retry exhaustion; also inconsistent with CreateWorkflow.

On retry exhaustion, err is a wrapped transient API error (e.g., quota conflict, service unavailable, server timeout). Mapping it to codes.InvalidArgument is misleading — the request body wasn't invalid — and it's inconsistent with the sibling CreateWorkflow path which maps create failures to codes.Internal and additionally emits a DeadlineExceeded hint on apierr.IsServerTimeout. CLIs acting on the status code (including the CLI retry behavior this PR aims to enable) will be confused by InvalidArgument here.

Recommend aligning both paths:

🔧 Proposed fix for SubmitWorkflow post-retry error handling
 	err = waitutil.Backoff(retry.DefaultRetry(ctx), func() (bool, error) {
 		var createErr error
 		wf, createErr = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
 		return !errorsutil.IsTransientErr(ctx, createErr), createErr
 	})
+	logger := logging.RequireLoggerFromContext(ctx)
 	if err != nil {
-		return nil, sutils.ToStatusError(err, codes.InvalidArgument)
+		if apierr.IsServerTimeout(err) && wf.GenerateName != "" && wf.Name != "" {
+			errWithHint := fmt.Errorf(`submit request failed due to timeout, but it's possible that workflow "%s" already exists. Original error: %w`, wf.Name, err)
+			logger.WithError(err).Error(ctx, errWithHint.Error())
+			return nil, sutils.ToStatusError(errWithHint, codes.DeadlineExceeded)
+		}
+		logger.WithError(err).Error(ctx, "Submit request failed")
+		return nil, sutils.ToStatusError(err, codes.Internal)
 	}
 	return wf, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/workflow/workflow_server.go` around lines 867 - 874, The post-retry
error return in the SubmitWorkflow path incorrectly maps retry-exhausted
transient API errors to codes.InvalidArgument; update the error mapping to match
CreateWorkflow: convert the final create error to an Internal gRPC status via
sutils.ToStatusError(err, codes.Internal) and, if apierr.IsServerTimeout(err)
is true, wrap/augment the status with a
CreateWorkflow does. Locate the retry block around
wfClient.ArgoprojV1alpha1().Workflows(...).Create in workflow_server.go (the
SubmitWorkflow handler) and replace the final return that uses
codes.InvalidArgument with the same error translation logic used by
CreateWorkflow (including the apierr.IsServerTimeout check).



Development

Successfully merging this pull request may close these issues.

argo cli submit needs retries
