Problem Statement
The LiteLLMProvider currently crashes whenever it hits a rate limit or a timeout from an LLM provider. This makes agents fragile during long tasks because they cannot recover from minor network issues or provider instability.
Proposed Solution
Add a retry mechanism with exponential backoff to the LiteLLMProvider. Instead of failing immediately, the provider should wait and retry the request when the error is transient (e.g., a rate limit or timeout), increasing the delay between attempts.
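As a rough sketch of that behavior (the names `call_with_backoff` and `TransientError` are hypothetical stand-ins for the provider's request call and the transient error types, not actual code from this repo), a retry loop with exponential backoff looks like:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for transient provider errors (rate limit, timeout)."""


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, max_delay=60.0):
    """Call fn, retrying transient errors with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            # 2s, 4s, 8s, ... capped at max_delay, plus a little jitter so
            # concurrent agents don't all retry at the same instant.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Non-transient exceptions fall through unchanged, which keeps genuine bugs visible instead of silently retrying them.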
Alternatives Considered
I considered adding a simple try/except block with a single retry, but that would not handle repeated failures or scale the wait time between attempts the way a dedicated retry library does.
Additional Context
I have already implemented a fix using tenacity and verified that it correctly handles 429 and 503 error codes through manual testing.
Implementation Ideas
Use tenacity to wrap the litellm.completion call.
Set up an exponential backoff strategy (e.g., 2s to 60s).
Retry only specific transient errors, such as RateLimitError and Timeout, and let other exceptions propagate unchanged.
Update dependencies in pyproject.toml and requirements.txt.