feat(llm): 重构并优化模型请求的重试与降级补偿机制#195
Open
Rat0323 wants to merge 3 commits into
Open
Conversation
Contributor
Reviewer's GuideImplements an LLM provider fallback path so that when a specifically configured provider exhausts its retries, the system performs one final attempt using the default/session provider instead of abandoning the task. Sequence diagram for LLM provider fallback on retry exhaustionsequenceDiagram
participant Caller
participant _invoke_llm
participant GlobalRateLimiter
participant get_provider_id_with_fallback
participant context
participant _call_provider_stream
Caller->>_invoke_llm: _invoke_llm(pid)
loop retries on provider_id
_invoke_llm->>GlobalRateLimiter: get_instance().semaphore
_invoke_llm->>context: llm_generate(chat_provider_id=provider_id)
context-->>_invoke_llm: [response or exception]
end
alt [all retries failed]
_invoke_llm->>get_provider_id_with_fallback: get_provider_id_with_fallback(context, config_manager, None, umo)
get_provider_id_with_fallback-->>_invoke_llm: final_fallback_id
alt [final_fallback_id valid and != provider_id]
_invoke_llm->>GlobalRateLimiter: get_instance().semaphore
alt [enable_streaming_llm_call]
_invoke_llm->>_call_provider_stream: _call_provider_stream(context, final_fallback_id, llm_kwargs)
_call_provider_stream-->>_invoke_llm: stream_response
else [non streaming]
_invoke_llm->>context: llm_generate(chat_provider_id=final_fallback_id)
context-->>_invoke_llm: fallback_response
end
_invoke_llm-->>Caller: response
else [no valid fallback]
_invoke_llm-->>Caller: None
end
else [retry succeeds]
_invoke_llm-->>Caller: response
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The fallback logic duplicates much of the original call path (building
llm_kwargs, handling streaming vs non-streaming, and grabbing the semaphore); consider extracting this into a shared helper to reduce divergence and keep behavior consistent between the primary and fallback flows. - In the fallback
except Exception as ebranch, usinglogger.exception(or otherwise logging the stack trace) would make debugging issues in the final fallback path easier than logging only the message.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The fallback logic duplicates much of the original call path (building `llm_kwargs`, handling streaming vs non-streaming, and grabbing the semaphore); consider extracting this into a shared helper to reduce divergence and keep behavior consistent between the primary and fallback flows.
- In the fallback `except Exception as e` branch, using `logger.exception` (or otherwise logging the stack trace) would make debugging issues in the final fallback path easier than logging only the message.
## Individual Comments
### Comment 1
<location path="src/infrastructure/analysis/utils/llm_utils.py" line_range="370-371" />
<code_context>
- # 最终仍失败,记录错误并返回 None 由调用方处理降级,避免抛出异常
- logger.error(f"LLM请求全部重试失败: {last_exc}")
+ # 最终仍失败,记录错误
+ if last_exc:
+ logger.error(f"LLM请求全部重试失败: {last_exc}")
+
+ # 策略增强: 如果指定了特定 Provider 且失败了,尝试回退到系统默认 Provider 进行最后一次尝试
</code_context>
<issue_to_address>
**issue:** Avoid logging a hard error before attempting the fallback provider, to prevent misleading logs when fallback succeeds.
This log will fire even when the fallback provider later succeeds, causing successful requests to appear as errors in monitoring. Please either move this error log to after the fallback attempt fails, or log a warning before fallback and only emit an error if both primary and fallback providers fail.
</issue_to_address>
### Comment 2
<location path="src/infrastructure/analysis/utils/llm_utils.py" line_range="377-386" />
<code_context>
+ try:
</code_context>
<issue_to_address>
**suggestion:** Use logger.exception or include traceback for the fallback failure to aid debugging.
Since this `except Exception` block only logs `e`, the stack trace is lost, making failures in this compensation path hard to diagnose. Please use `logger.exception` or pass `exc_info=True` so we capture the full traceback.
Suggested implementation:
```python
logger.warning(f"[降级补偿] 指定 Provider ({provider_id_key}) 已耗尽重试机会。尝试使用默认链路进行最终救赎...")
try:
# 传入 None 强制 get_provider_id_with_fallback 走默认/会话路径
final_fallback_id = await get_provider_id_with_fallback(
context, config_manager, None, umo
)
if final_fallback_id and final_fallback_id != provider_id:
logger.info(f"[降级补偿] 发现可用默认 Provider: {final_fallback_id},正在执行最终尝试...")
async with GlobalRateLimiter.get_instance().semaphore:
llm_kwargs: dict[str, object] = {
"chat_provider_id": final_fallback_id,
```
```python
except Exception:
# 使用 logger.exception 以便记录完整堆栈,便于诊断降级补偿路径中的失败
logger.exception("[降级补偿] 使用默认 Provider 最终尝试失败")
```
If this file contains other broad `except Exception as e:` handlers that log only `e` for important failure paths (especially other fallback/compensation logic), consider updating them similarly to use `logger.exception(...)` or `logger.error(..., exc_info=True)` to ensure stack traces are captured consistently.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
b6aea89 to
793ddb7
Compare
Author
|
非常抱歉,@SXP-Simon !由于我本地 Git 环境之前的配置错误,导致这个 PR 中的提交记录被错误地署名为您的账号。这完全是我的疏忽,给您带来了困扰,深表歉意。 |
65c34c7 to
b191b83
Compare
Owner
|
感谢贡献,等我考完期末来 review 一下,最近没时间,抱歉 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
问题描述
当在配置中指定了专属的
topic_provider_id等模型时,如果该模型虽然存在、但运行时不可用(如遇到 429 限流、401 鉴权失败、服务端无容量或 Key 冷却),原始逻辑会在该模型上耗尽所有重试次数并放弃分析,不会继续降级到默认模型。修复与重构方案(基于统一尝试队列)
本次更新对重试逻辑进行了彻底重构,放弃了初版的补丁式改法。
逻辑说明:
将
[指定模型重试N次]和[降级默认模型重试N次]组合成一个统一的attempt_queue队列。并将核心的并发限流、断路器、请求组装代码提取为独立的闭包_execute_llm_request。优势与深度优化:
429限流时的抗拥塞恢复能力,防止惊群效应。response_format触发异常后,会将该参数永久置空,导致后续降级模型(即使支持 JSON)也永久丢失强制结构化输出能力的问题。现已确保在切换 Provider 时精准重置格式约束状态。response_format、但去掉 schema 后可正常响应时,不再把第一次兼容性异常计入熔断器失败,避免把可用 Provider 误判为故障 Provider。llm_backoff的配置提示已同步为指数退避与随机抖动,避免 UI 仍描述为旧的线性退避策略。dict[str, object]规范化为dict[str, Any]。💡 未来优化探讨 (关于 401/404 等致命错误)
目前的队列降级是“无差别尝试”(哪怕是配错了 Key 导致 401,也会执行退避等待后重试)。
理想状态下,可以引入基于错误分类的智能路由:例如捕获到鉴权失败或模型不存在时,直接跳过当前模型剩余的重试额度,秒切默认模型以节省时间。但考虑到目前各大 Provider 抛出的异常结构各不相同,为了保证当前底层逻辑的绝对稳定,本次 PR 选择使用最稳健的队列逐一尝试法。如果未来框架层提供了标准化的统一错误码对象,可在此思路上继续优化。