fix: tolerate recoverable cloud API failures in APIBasedLLM#378

Open
officialasishkumar wants to merge 2 commits into kubeedge:main from officialasishkumar:fix/joint-inference-cloud-api-failures

Conversation

@officialasishkumar

What type of PR is this?

/kind bug
/kind test

What this PR does / why we need it:

This PR prevents a single recoverable cloud API failure from aborting the cloud-edge collaborative LLM benchmark.

  • treat content-filter 400s and retryable provider status failures as per-sample failures
  • return an empty structured response with error metadata so downstream parsing and metrics can continue
  • preserve fail-fast behavior for invalid request/configuration errors
  • add focused unit coverage for the recoverable and fail-fast paths

Which issue(s) this PR fixes:
Fixes #356

Return an empty structured response for content-filter and retryable
provider errors so a single bad sample does not abort the entire
benchmarking run.

Add focused unit coverage for the fallback path while preserving
fail-fast behavior for invalid request configuration errors.

Fixes kubeedge#356

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@kubeedge-bot kubeedge-bot added the kind/bug label (Categorizes issue or PR as related to a bug.) Apr 9, 2026
@kubeedge-bot
Collaborator

@officialasishkumar: The label(s) kind/test cannot be applied, because the repository doesn't have them


In response to this:

/kind bug
/kind test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubeedge-bot kubeedge-bot added the do-not-merge/invalid-commit-message label (Indicates that a PR should not merge because it has an invalid commit message.) Apr 9, 2026
@kubeedge-bot kubeedge-bot requested review from Poorunga and hsj576 April 9, 2026 17:20
@kubeedge-bot
Collaborator

Welcome @officialasishkumar! It looks like this is your first PR to kubeedge/ianvs 🎉

@kubeedge-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: officialasishkumar
To complete the pull request process, please assign jaypume after the PR has been reviewed.
You can assign the PR to them by writing /assign @jaypume in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Apr 9, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces robust API error handling to the APIBasedLLM class, defining recoverable status codes and content filter error types. It adds methods to extract error details, determine if an error is recoverable, and construct an empty response with error information for such cases. The _infer method's exception handling is updated to utilize this new logic, logging warnings for recoverable errors and re-raising others. New unit tests are also included to validate this behavior. Feedback indicates a significant issue with the inverted retry logic, where transient errors are not retried and permanent errors are, suggesting a refactoring to correct this. Additionally, it's recommended to preserve the original exception's traceback when re-raising RuntimeError for better debugging.

            throughput = 0

        except Exception as e:
            if self._is_recoverable_api_error(e):


high

The current implementation inverts the expected retry logic for transient vs. permanent errors.

  1. Transient Errors (e.g., 503, 429): These are included in RECOVERABLE_API_STATUS_CODES. Because they are caught inside the _infer method (which is decorated with @retry), the decorator will see a successful return value (the error response) and will not perform any retries. This reduces the robustness of the benchmark against temporary network or provider issues.
  2. Permanent Errors (e.g., 401, 404): These are not in the recoverable set, so they fall through to line 193 and raise a RuntimeError. The @retry decorator will catch this and retry the request 3 times, which is unnecessary and adds latency for errors that will never succeed on retry.

Consider refactoring to ensure transient errors are retried before being tolerated, and that permanent errors fail fast. A common pattern is to use a wrapper method for the toleration logic while keeping the retry logic on the actual API call.
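The suggested pattern can be sketched as below. The helper name `call_with_retry_then_tolerate` and its arguments are illustrative, not the actual ianvs code, and exponential backoff is elided; the point is that transient failures are retried first, recoverable ones are downgraded only after retries are exhausted, and permanent errors fail fast without retries.

```python
# Sketch of the retry-then-tolerate wrapper suggested above.
# Names and the status-code set are illustrative assumptions.
TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}
MAX_ATTEMPTS = 3

def call_with_retry_then_tolerate(call, is_recoverable, build_error_response):
    """Retry transient provider failures, then tolerate recoverable ones
    per sample; permanent errors (401, 404, ...) fail fast with no retry."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status in TRANSIENT_STATUS_CODES and attempt < MAX_ATTEMPTS:
                continue  # exponential backoff elided in this sketch
            if is_recoverable(e):
                # Retries exhausted (or a content-filter 400): downgrade to
                # an empty per-sample response instead of aborting the run.
                return build_error_response(e)
            raise RuntimeError(f"Error during API inference: {e}") from e
```

Keeping toleration in the wrapper rather than inside the retried call means a retry decorator on the raw API call never mistakes the downgraded error response for a success.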

Author


Updated in 118baf1: transient 408/429/5xx failures are now retried before being downgraded to an empty per-sample response, while content-filter 400s are still tolerated immediately and invalid requests fail fast.

                    message,
                )
                return self._build_error_response(e)
            raise RuntimeError(f"Error during API inference: {e}")


medium

When re-raising the exception as a RuntimeError, it is best practice to use from e to preserve the original exception's traceback. This is crucial for debugging the root cause of the failure.

Suggested change
raise RuntimeError(f"Error during API inference: {e}")
raise RuntimeError(f"Error during API inference: {e}") from e
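
A minimal self-contained demonstration of what `from e` buys: the wrapped exception keeps the original as `__cause__`, so the root-cause traceback survives in logs (the `risky` function here is a stand-in for the provider call).

```python
def risky():
    # Stand-in for the provider call that fails with a permanent error
    raise ValueError("provider rejected the request")

try:
    try:
        risky()
    except Exception as e:
        # `from e` chains the exceptions explicitly: tracebacks show both
        # the RuntimeError and the original ValueError that caused it.
        raise RuntimeError(f"Error during API inference: {e}") from e
except RuntimeError as wrapped:
    assert isinstance(wrapped.__cause__, ValueError)
```

Without `from e`, Python still records the original as `__context__`, but explicit chaining marks it as the direct cause rather than an incidental error during handling.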

Author


Also handled in 118baf1: the fail-fast path now raises RuntimeError(... ) from e, and the unit test covers the non-retry invalid-request case.

@kubeedge-bot
Collaborator

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 0edbcd9 fix: tolerate recoverable cloud API failures in APIBasedLLM



Labels

  • do-not-merge/invalid-commit-message: Indicates that a PR should not merge because it has an invalid commit message.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Joint inference pipeline crashes when cloud LLM returns HTTP 400 during API inference

2 participants