Problem Statement
The LiteLLMProvider currently crashes whenever it hits a rate limit or a timeout from an LLM provider. This makes agents fragile during long tasks because they cannot recover from minor network issues or provider instability.
Proposed Solution
Add a retry mechanism with exponential backoff to the LiteLLMProvider. Instead of failing immediately, the provider should wait and retry the request when the error is transient (e.g., a rate limit or timeout), increasing the delay between attempts.
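As a rough sketch of that behavior (the names `call_with_backoff` and `TransientError` are hypothetical stand-ins for the provider's request call and the transient error types, not actual code from this repo), a retry loop with exponential backoff looks like:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for transient provider errors (rate limit, timeout)."""


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, max_delay=60.0):
    """Call fn, retrying transient errors with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries; surface the error to the caller
            # 2s, 4s, 8s, ... capped at max_delay, plus a little jitter so
            # concurrent agents don't all retry at the same instant.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Non-transient exceptions fall through unchanged, which keeps genuine bugs visible instead of silently retrying them.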
Alternatives Considered
I considered adding a simple try/except block with a single retry, but that would not handle repeated failures or scale the wait time between attempts the way a dedicated retry library does.
Additional Context
I have already implemented a fix using tenacity and verified that it correctly handles 429 and 503 error codes through manual testing.
Implementation Ideas
Use tenacity to wrap the litellm.completion call.
Set up an exponential backoff strategy (e.g., 2s to 60s).
Retry only specific transient errors, such as RateLimitError and Timeout, and let other exceptions propagate unchanged.
Update dependencies in pyproject.toml and requirements.txt.