fix(server): handle transient errors before ToStatusError loses type …#15996
fix(server): handle transient errors before ToStatusError loses type …#15996HsiuChuanHsu wants to merge 1 commit intoargoproj:mainfrom
Conversation
…info Signed-off-by: HsiuChuanHsu <hchsu2106@gmail.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughModified workflow creation operations in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/workflow/workflow_server.go (1)
141-156:⚠️ Potential issue | 🟡 MinorSubmitWorkflow should use
codes.Internalinstead ofcodes.InvalidArgumentfor retry exhaustion.The retry logic in CreateWorkflow (lines 142–156) correctly maps transient errors to appropriate status codes, but SubmitWorkflow (lines 867–874) uses
codes.InvalidArgumentfor all retry exhaustion cases. Sinceerrorsutil.IsTransientErrincludes quota conflicts, service unavailability, timeouts, and network errors—conditions that are not client-side validation failures—returningcodes.InvalidArgumentto the CLI is misleading. Usecodes.Internallike CreateWorkflow does, or add theapierr.IsServerTimeouthint block (lines 149–153) to SubmitWorkflow as well for consistency.Additionally, for CreateWorkflow: the
apierr.IsServerTimeoutcheck at line 149 will correctly detectServerTimeouterrors even afterwaitutil.Backoffwraps them withfmt.Errorf("%w: %w", ...), since the k8s library implementation ofapierr.IsServerTimeoutuseserrors.Aswhich properly unwraps chained errors.The idempotency concern with
GenerateName—where retrying transient network/timeout errors can create duplicate workflows—remains valid but is partially acknowledged by the existing hint. If scope allows, consider narrowing the retry predicate to only quota/429 cases that are server-rejected before state change, as suggested in the original comment.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/workflow/workflow_server.go` around lines 141 - 156, The SubmitWorkflow retry-exhaustion handling should use codes.Internal (or mirror the CreateWorkflow hint logic) instead of returning codes.InvalidArgument; update the SubmitWorkflow error branch to map retry-backoff exhaustion to sutils.ToStatusError(errWithHint, codes.Internal) and include the same apierr.IsServerTimeout + GenerateName/Name hint pattern used in CreateWorkflow (see CreateWorkflow, errorsutil.IsTransientErr, apierr.IsServerTimeout, logger.WithError) so timeouts surface a helpful message about possible existing workflows while non-client transient failures return Internal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/workflow/workflow_server.go`:
- Around line 867-874: The post-retry error return in the SubmitWorkflow path
incorrectly maps retry-exhausted transient API errors to codes.InvalidArgument;
update the error mapping to match CreateWorkflow: convert the final create error
to an Internal gRPC status via sutils.ToStatusError(err, codes.Internal) and, if
apierr.IsServerTimeout(ctx, err) is true, wrap/augment the status with a
DeadlineExceeded hint as CreateWorkflow does. Locate the retry block around
wfClient.ArgoprojV1alpha1().Workflows(...).Create in workflow_server.go (the
SubmitWorkflow handler) and replace the final return that uses
codes.InvalidArgument with the same error translation logic used by
CreateWorkflow (including the apierr.IsServerTimeout check).
---
Outside diff comments:
In `@server/workflow/workflow_server.go`:
- Around line 141-156: The SubmitWorkflow retry-exhaustion handling should use
codes.Internal (or mirror the CreateWorkflow hint logic) instead of returning
codes.InvalidArgument; update the SubmitWorkflow error branch to map
retry-backoff exhaustion to sutils.ToStatusError(errWithHint, codes.Internal)
and include the same apierr.IsServerTimeout + GenerateName/Name hint pattern
used in CreateWorkflow (see CreateWorkflow, errorsutil.IsTransientErr,
apierr.IsServerTimeout, logger.WithError) so timeouts surface a helpful message
about possible existing workflows while non-client transient failures return
Internal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b0653bfb-87b5-41c8-b5a4-bb21a2c3527a
📒 Files selected for processing (1)
server/workflow/workflow_server.go
| err = waitutil.Backoff(retry.DefaultRetry(ctx), func() (bool, error) { | ||
| var createErr error | ||
| wf, createErr = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{}) | ||
| return !errorsutil.IsTransientErr(ctx, createErr), createErr | ||
| }) | ||
| if err != nil { | ||
| return nil, sutils.ToStatusError(err, codes.InvalidArgument) | ||
| } |
There was a problem hiding this comment.
Wrong gRPC status code after retry exhaustion; also inconsistent with CreateWorkflow.
On retry exhaustion, err is a wrapped transient API error (e.g., quota conflict, service unavailable, server timeout). Mapping it to codes.InvalidArgument is misleading — the request body wasn't invalid — and it's inconsistent with the sibling CreateWorkflow path which maps create failures to codes.Internal and additionally emits a DeadlineExceeded hint on apierr.IsServerTimeout. CLIs acting on the status code (including the CLI retry behavior this PR aims to enable) will be confused by InvalidArgument here.
Recommend aligning both paths:
🔧 Proposed fix for SubmitWorkflow post-retry error handling
err = waitutil.Backoff(retry.DefaultRetry(ctx), func() (bool, error) {
var createErr error
wf, createErr = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Create(ctx, wf, metav1.CreateOptions{})
return !errorsutil.IsTransientErr(ctx, createErr), createErr
})
+ logger := logging.RequireLoggerFromContext(ctx)
if err != nil {
- return nil, sutils.ToStatusError(err, codes.InvalidArgument)
+ if apierr.IsServerTimeout(err) && wf.GenerateName != "" && wf.Name != "" {
+ errWithHint := fmt.Errorf(`submit request failed due to timeout, but it's possible that workflow "%s" already exists. Original error: %w`, wf.Name, err)
+ logger.WithError(err).Error(ctx, errWithHint.Error())
+ return nil, sutils.ToStatusError(errWithHint, codes.DeadlineExceeded)
+ }
+ logger.WithError(err).Error(ctx, "Submit request failed")
+ return nil, sutils.ToStatusError(err, codes.Internal)
}
return wf, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/workflow/workflow_server.go` around lines 867 - 874, The post-retry
error return in the SubmitWorkflow path incorrectly maps retry-exhausted
transient API errors to codes.InvalidArgument; update the error mapping to match
CreateWorkflow: convert the final create error to an Internal gRPC status via
sutils.ToStatusError(err, codes.Internal) and, if apierr.IsServerTimeout(ctx,
err) is true, wrap/augment the status with a DeadlineExceeded hint as
CreateWorkflow does. Locate the retry block around
wfClient.ArgoprojV1alpha1().Workflows(...).Create in workflow_server.go (the
SubmitWorkflow handler) and replace the final return that uses
codes.InvalidArgument with the same error translation logic used by
CreateWorkflow (including the apierr.IsServerTimeout check).
Motivation
Kubernetes sometimes returns an HTTP 409 Conflict error when the ResourceQuota hasn't fully synchronized. This is a temporary race condition, and retrying usually fixes it.
However, the server converts this error into a gRPC status, which causes the original error type to be permanently lost. Because of this, the CLI cannot recognize the conflict and incorrectly treats it as a permanent failure instead of retrying.
Modifications
To resolve this, retry logic is implemented directly in the server's CreateWorkflow and SubmitWorkflow functions:
errorsutil.IsTransientErrwhile the original error type is still available to identify the conflict.waitutil.Backoffwith exponential backoff.Fixes #14106
Summary by CodeRabbit