Apply workflow resiliency #206

@sicoyle

Description

Extract and isolate resiliency logic (retry/backoff, error handling, etc.) from the linked issue: dapr/dapr-agents#167

Evaluate where resiliency should be applied in our DurableAgent and orchestrator workflows, and define a smart resiliency policy (i.e. one that knows when not to retry based on error type, not just blanket backoffs).

This applies to:

  • DurableAgent workflow activities
  • Orchestrator workflow activities
  • External calls, especially LLM provider integrations
  • Other potential areas where resiliency might be required

We need a "smart" resiliency policy that:

  • Applies retry/backoff logic for transient failures (e.g. network timeouts, intermittent service outages)
  • Does not retry for non-transient failures (e.g. invalid credentials, “out of credits”, malformed request)
  • Logs or surfaces the classification of each error so it’s clear when a retry was attempted vs. when we aborted due to a non-recoverable error
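
To make the three requirements above concrete, here is a minimal sketch of what such a policy could look like. It uses only the Python standard library. The exception classes, `classify`, and `call_with_resiliency` are hypothetical names invented for illustration; they are not part of dapr-agents or any LLM provider SDK. Treating unknown errors as non-transient is also an assumption that the real policy may want to revisit.

```python
import logging
import random
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("resiliency")


class ErrorClass(Enum):
    TRANSIENT = "transient"
    NON_TRANSIENT = "non_transient"


# Hypothetical stand-ins for provider-specific errors (not real SDK types).
class AuthenticationError(Exception): ...
class QuotaExceededError(Exception): ...
class MalformedRequestError(Exception): ...


NON_TRANSIENT_ERRORS = (AuthenticationError, QuotaExceededError, MalformedRequestError)
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)  # built-in exception types


def classify(exc: BaseException) -> ErrorClass:
    """Decide whether an error is worth retrying."""
    if isinstance(exc, NON_TRANSIENT_ERRORS):
        return ErrorClass.NON_TRANSIENT
    if isinstance(exc, TRANSIENT_ERRORS):
        return ErrorClass.TRANSIENT
    # Assumption: unknown errors fail fast rather than being retried blindly.
    return ErrorClass.NON_TRANSIENT


def call_with_resiliency(
    fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only transient errors with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            kind = classify(exc)
            if kind is ErrorClass.NON_TRANSIENT:
                logger.error("aborting, non-recoverable error: %r", exc)
                raise
            if attempt == max_attempts:
                logger.error("giving up after %d attempts: %r", attempt, exc)
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            logger.warning(
                "transient error on attempt %d/%d, retrying in %.2fs: %r",
                attempt, max_attempts, delay, exc,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

A wrapper like this would presumably sit inside an activity around the external call, for example `call_with_resiliency(lambda: client.generate(prompt))`, where `client.generate` is a placeholder for whatever LLM call the activity makes. The log lines distinguish "retrying" from "aborting", which is one way to surface the classification.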

Acceptance Criteria

  • Identify and specify where resiliency should be added in the DurableAgent workflow activities.
  • Identify and specify where resiliency should be added in the orchestrator workflow activities.
  • Define and introduce the concept of smart resiliency for external calls such as LLM providers, i.e. classify errors and decide when to retry vs. abort.
