fix: tolerate recoverable cloud API failures in APIBasedLLM#378
fix: tolerate recoverable cloud API failures in APIBasedLLM#378officialasishkumar wants to merge 2 commits intokubeedge:mainfrom
Conversation
Return an empty structured response for content-filter and retryable provider errors so a single bad sample does not abort the entire benchmarking run. Add focused unit coverage for the fallback path while preserving fail-fast behavior for invalid request configuration errors. Fixes kubeedge#356 Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
|
@officialasishkumar: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Welcome @officialasishkumar! It looks like this is your first PR to kubeedge/ianvs 🎉 |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: officialasishkumar The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces robust API error handling to the APIBasedLLM class, defining recoverable status codes and content filter error types. It adds methods to extract error details, determine if an error is recoverable, and construct an empty response with error information for such cases. The _infer method's exception handling is updated to utilize this new logic, logging warnings for recoverable errors and re-raising others. New unit tests are also included to validate this behavior. Feedback indicates a significant issue with the inverted retry logic, where transient errors are not retried and permanent errors are, suggesting a refactoring to correct this. Additionally, it's recommended to preserve the original exception's traceback when re-raising RuntimeError for better debugging.
| throughput = 0 | ||
|
|
||
| except Exception as e: | ||
| if self._is_recoverable_api_error(e): |
There was a problem hiding this comment.
The current implementation inverts the expected retry logic for transient vs. permanent errors.
- Transient Errors (e.g., 503, 429): These are included in
RECOVERABLE_API_STATUS_CODES. Because they are caught inside the_infermethod (which is decorated with@retry), the decorator will see a successful return value (the error response) and will not perform any retries. This reduces the robustness of the benchmark against temporary network or provider issues. - Permanent Errors (e.g., 401, 404): These are not in the recoverable set, so they fall through to line 193 and raise a
RuntimeError. The@retrydecorator will catch this and retry the request 3 times, which is unnecessary and adds latency for errors that will never succeed on retry.
Consider refactoring to ensure transient errors are retried before being tolerated, and that permanent errors fail fast. A common pattern is to use a wrapper method for the toleration logic while keeping the retry logic on the actual API call.
There was a problem hiding this comment.
Updated in 118baf1: transient 408/429/5xx failures are now retried before being downgraded to an empty per-sample response, while content-filter 400s are still tolerated immediately and invalid requests fail fast.
| message, | ||
| ) | ||
| return self._build_error_response(e) | ||
| raise RuntimeError(f"Error during API inference: {e}") |
There was a problem hiding this comment.
When re-raising the exception as a RuntimeError, it is best practice to use from e to preserve the original exception's traceback. This is crucial for debugging the root cause of the failure.
| raise RuntimeError(f"Error during API inference: {e}") | |
| raise RuntimeError(f"Error during API inference: {e}") from e |
There was a problem hiding this comment.
Also handled in 118baf1: the fail-fast path now raises RuntimeError(... ) from e, and the unit test covers the non-retry invalid-request case.
|
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What type of PR is this?
/kind bug
/kind test
What this PR does / why we need it:
This PR prevents a single recoverable cloud API failure from aborting the cloud-edge collaborative LLM benchmark.
Which issue(s) this PR fixes:
Fixes #356