Skip to content

feat(llm): 重构并优化模型请求的重试与降级补偿机制#195

Open
Rat0323 wants to merge 3 commits into
SXP-Simon:mainfrom
Rat0323:feat/llm-fallback
Open

feat(llm): 重构并优化模型请求的重试与降级补偿机制#195
Rat0323 wants to merge 3 commits into
SXP-Simon:mainfrom
Rat0323:feat/llm-fallback

Conversation

@Rat0323

@Rat0323 Rat0323 commented Jun 7, 2026

Copy link
Copy Markdown

问题描述

当在配置中指定了专属的 topic_provider_id 等模型时,如果该模型虽然存在、但运行时不可用(如遇到 429 限流、401 鉴权失败、服务端无容量或 Key 冷却),原始逻辑会在该模型上耗尽所有重试次数并放弃分析,不会继续降级到默认模型。

修复与重构方案(基于统一尝试队列)

本次更新对重试逻辑进行了彻底重构,放弃了初版的补丁式改法。

逻辑说明:
[指定模型重试N次][降级默认模型重试N次] 组合成一个统一的 attempt_queue 队列。并将核心的并发限流、断路器、请求组装代码提取为独立的闭包 _execute_llm_request

优势与深度优化:

  1. 彻底消除重复代码:符合 DRY 原则,主循环极其干净。
  2. 指数退避与抖动 (Exponential Backoff with Jitter):将原有的线性重试等待升级为带有随机抖动的指数退避算法,极大增强了在面临高并发 API 429 限流时的抗拥塞恢复能力,防止惊群效应。
  3. 修复状态污染 (State Bleeding):修复了当首个模型因不支持 response_format 触发异常后,会将该参数永久置空,导致后续降级模型(即使支持 JSON)也永久丢失强制结构化输出能力的问题。现已确保在切换 Provider 时精准重置格式约束状态。
  4. 避免误触发熔断器:当 Provider 仅是不兼容 response_format、但去掉 schema 后可正常响应时,不再把第一次兼容性异常计入熔断器失败,避免把可用 Provider 误判为故障 Provider。
  5. 同步配置说明llm_backoff 的配置提示已同步为指数退避与随机抖动,避免 UI 仍描述为旧的线性退避策略。
  6. 规范化类型提示:将相关函数的参数类型从 dict[str, object] 规范化为 dict[str, Any]
  7. 保证任务连贯性:即使指定模型彻底瘫痪,也能安全地滑落到系统默认模型进行兜底,确保当天的群分析报告正常生成。

💡 未来优化探讨 (关于 401/404 等致命错误)

目前的队列降级是“无差别尝试”(哪怕是配错了 Key 导致 401,也会执行退避等待后重试)。

理想状态下,可以引入基于错误分类的智能路由:例如捕获到鉴权失败或模型不存在时,直接跳过当前模型剩余的重试额度,秒切默认模型以节省时间。但考虑到目前各大 Provider 抛出的异常结构各不相同,为了保证当前底层逻辑的绝对稳定,本次 PR 选择使用最稳健的队列逐一尝试法。如果未来框架层提供了标准化的统一错误码对象,可在此思路上继续优化。

@sourcery-ai

sourcery-ai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Reviewer's Guide

Implements an LLM provider fallback path so that when a specifically configured provider exhausts its retries, the system performs one final attempt using the default/session provider instead of abandoning the task.

Sequence diagram for LLM provider fallback on retry exhaustion

sequenceDiagram
    participant Caller
    participant _invoke_llm
    participant GlobalRateLimiter
    participant get_provider_id_with_fallback
    participant context
    participant _call_provider_stream

    Caller->>_invoke_llm: _invoke_llm(pid)
    loop retries on provider_id
        _invoke_llm->>GlobalRateLimiter: get_instance().semaphore
        _invoke_llm->>context: llm_generate(chat_provider_id=provider_id)
        context-->>_invoke_llm: [response or exception]
    end

    alt [all retries failed]
        _invoke_llm->>get_provider_id_with_fallback: get_provider_id_with_fallback(context, config_manager, None, umo)
        get_provider_id_with_fallback-->>_invoke_llm: final_fallback_id
        alt [final_fallback_id valid and != provider_id]
            _invoke_llm->>GlobalRateLimiter: get_instance().semaphore
            alt [enable_streaming_llm_call]
                _invoke_llm->>_call_provider_stream: _call_provider_stream(context, final_fallback_id, llm_kwargs)
                _call_provider_stream-->>_invoke_llm: stream_response
            else [non streaming]
                _invoke_llm->>context: llm_generate(chat_provider_id=final_fallback_id)
                context-->>_invoke_llm: fallback_response
            end
            _invoke_llm-->>Caller: response
        else [no valid fallback]
            _invoke_llm-->>Caller: None
        end
    else [retry succeeds]
        _invoke_llm-->>Caller: response
    end
Loading

File-Level Changes

Change Details Files
Add a final fallback call to the session/default provider when the explicitly configured provider exhausts all retries and still fails.
  • Keep logging of the last exception but only if it exists, to avoid logging None.
  • When a specific provider_id was used (provider_id_key is not None) and all retries fail, log a warning and resolve a new provider via get_provider_id_with_fallback by passing None to force default/session resolution.
  • If the resolved fallback provider differs from the original provider and is non-empty, acquire the GlobalRateLimiter semaphore and perform a final LLM call using the fallback provider with the same prompt/system_prompt/response_format arguments, supporting both streaming and non-streaming paths.
  • If the fallback provider is invalid or identical to the original, log an error and skip the compensation attempt.
  • Wrap the fallback logic in a try/except to ensure any errors during the final attempt are logged but do not propagate, and in all failure cases return None as before.
src/infrastructure/analysis/utils/llm_utils.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The fallback logic duplicates much of the original call path (building llm_kwargs, handling streaming vs non-streaming, and grabbing the semaphore); consider extracting this into a shared helper to reduce divergence and keep behavior consistent between the primary and fallback flows.
  • In the fallback except Exception as e branch, using logger.exception (or otherwise logging the stack trace) would make debugging issues in the final fallback path easier than logging only the message.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The fallback logic duplicates much of the original call path (building `llm_kwargs`, handling streaming vs non-streaming, and grabbing the semaphore); consider extracting this into a shared helper to reduce divergence and keep behavior consistent between the primary and fallback flows.
- In the fallback `except Exception as e` branch, using `logger.exception` (or otherwise logging the stack trace) would make debugging issues in the final fallback path easier than logging only the message.

## Individual Comments

### Comment 1
<location path="src/infrastructure/analysis/utils/llm_utils.py" line_range="370-371" />
<code_context>

-    # 最终仍失败,记录错误并返回 None 由调用方处理降级,避免抛出异常
-    logger.error(f"LLM请求全部重试失败: {last_exc}")
+    # 最终仍失败,记录错误
+    if last_exc:
+        logger.error(f"LLM请求全部重试失败: {last_exc}")
+    
+    # 策略增强: 如果指定了特定 Provider 且失败了,尝试回退到系统默认 Provider 进行最后一次尝试
</code_context>
<issue_to_address>
**issue:** Avoid logging a hard error before attempting the fallback provider, to prevent misleading logs when fallback succeeds.

This log will fire even when the fallback provider later succeeds, causing successful requests to appear as errors in monitoring. Please either move this error log to after the fallback attempt fails, or log a warning before fallback and only emit an error if both primary and fallback providers fail.
</issue_to_address>

### Comment 2
<location path="src/infrastructure/analysis/utils/llm_utils.py" line_range="377-386" />
<code_context>
+        try:
</code_context>
<issue_to_address>
**suggestion:** Use logger.exception or include traceback for the fallback failure to aid debugging.

Since this `except Exception` block only logs `e`, the stack trace is lost, making failures in this compensation path hard to diagnose. Please use `logger.exception` or pass `exc_info=True` so we capture the full traceback.

Suggested implementation:

```python
        logger.warning(f"[降级补偿] 指定 Provider ({provider_id_key}) 已耗尽重试机会。尝试使用默认链路进行最终救赎...")
        try:
            # 传入 None 强制 get_provider_id_with_fallback 走默认/会话路径
            final_fallback_id = await get_provider_id_with_fallback(
                context, config_manager, None, umo
            )

            if final_fallback_id and final_fallback_id != provider_id:
                logger.info(f"[降级补偿] 发现可用默认 Provider: {final_fallback_id},正在执行最终尝试...")
                async with GlobalRateLimiter.get_instance().semaphore:
                    llm_kwargs: dict[str, object] = {
                        "chat_provider_id": final_fallback_id,

```

```python
        except Exception:
            # 使用 logger.exception 以便记录完整堆栈,便于诊断降级补偿路径中的失败
            logger.exception("[降级补偿] 使用默认 Provider 最终尝试失败")

```

If this file contains other broad `except Exception as e:` handlers that log only `e` for important failure paths (especially other fallback/compensation logic), consider updating them similarly to use `logger.exception(...)` or `logger.error(..., exc_info=True)` to ensure stack traces are captured consistently.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/infrastructure/analysis/utils/llm_utils.py Outdated
Comment thread src/infrastructure/analysis/utils/llm_utils.py Outdated
@Rat0323 Rat0323 force-pushed the feat/llm-fallback branch from b6aea89 to 793ddb7 Compare June 7, 2026 20:46
@Rat0323

Rat0323 commented Jun 7, 2026

Copy link
Copy Markdown
Author

非常抱歉,@SXP-Simon !由于我本地 Git 环境之前的配置错误,导致这个 PR 中的提交记录被错误地署名为您的账号。这完全是我的疏忽,给您带来了困扰,深表歉意。

@Rat0323 Rat0323 force-pushed the feat/llm-fallback branch from 65c34c7 to b191b83 Compare June 8, 2026 16:51
@Rat0323 Rat0323 changed the title feat: implement LLM fallback mechanism to prevent task abandonment on 429 errors feat(llm): 重构并优化模型请求的重试与降级补偿机制 Jun 8, 2026
@SXP-Simon

Copy link
Copy Markdown
Owner

感谢贡献,等我考完期末来 review 一下,最近没时间,抱歉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants