fix(config): use flash-lite for utility model configs to preserve quota #25684

Open
kazukinakai wants to merge 5 commits into google-gemini:main from kazukinakai:fix/utility-flash-lite-fallback
@kazukinakai kazukinakai commented Apr 20, 2026

Fixes #23397
Fixes #18059
Related to #24937 (capacity/429 tracking issue)

Problem

When gemini-3-flash-preview quota is exhausted (100%), the CLI becomes completely unusable even if the user explicitly switches to gemini-3.1-flash-lite-preview. The "Usage limit reached for gemini-3-flash-preview" error keeps firing regardless of the selected model.

Root cause: all six internal utility configs are hardcoded to gemini-3-flash-base, which targets gemini-3-flash-preview, so they consume Flash quota independently of the user's model selection:

| Config key | Role | Before | After |
| --- | --- | --- | --- |
| loop-detection | UTILITY_LOOP_DETECTOR | Flash | Flash Lite ✓ |
| llm-edit-fixer | UTILITY_EDIT_CORRECTOR | Flash | Flash Lite ✓ |
| next-speaker-checker | UTILITY_NEXT_SPEAKER | Flash | Flash Lite ✓ |
| web-fetch-fallback | fallback path (no tools) | Flash | Flash Lite ✓ |
| web-search | Grounding with Google Search | Flash | Flash Lite ✓ |
| web-fetch | URL context tool | Flash | Flash Lite ✓ |

Fix

Add gemini-3-flash-lite-base targeting gemini-3.1-flash-lite-preview and switch all six configs to use it.
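
As a rough illustration of the change, the sketch below models alias configs that inherit a model through an `extends` field. The `AliasConfig` shape and the `resolveModel` helper are hypothetical and written only for this example; they are not the actual gemini-cli schema or API. Only the config keys and model names come from this PR.

```typescript
// Hypothetical config shape, for illustration only (not the real gemini-cli schema).
interface AliasConfig {
  extends?: string; // name of a parent alias to inherit from
  model?: string; // concrete model name, if this alias sets one
}

const aliases: Record<string, AliasConfig> = {
  "gemini-3-flash-base": { model: "gemini-3-flash-preview" },
  // New base described in this PR.
  "gemini-3-flash-lite-base": { model: "gemini-3.1-flash-lite-preview" },
  // The six utility configs, now extending the Flash Lite base.
  "loop-detection": { extends: "gemini-3-flash-lite-base" },
  "llm-edit-fixer": { extends: "gemini-3-flash-lite-base" },
  "next-speaker-checker": { extends: "gemini-3-flash-lite-base" },
  "web-fetch-fallback": { extends: "gemini-3-flash-lite-base" },
  "web-search": { extends: "gemini-3-flash-lite-base" },
  "web-fetch": { extends: "gemini-3-flash-lite-base" },
};

// Walk the `extends` chain until an alias with a concrete model is found.
function resolveModel(key: string, seen: Set<string> = new Set()): string {
  if (seen.has(key)) throw new Error(`Cyclic extends at ${key}`);
  seen.add(key);
  const cfg = aliases[key];
  if (!cfg) throw new Error(`Unknown config: ${key}`);
  if (cfg.model) return cfg.model;
  if (!cfg.extends) throw new Error(`No model for ${key}`);
  return resolveModel(cfg.extends, seen);
}

const utilityKeys = [
  "loop-detection",
  "llm-edit-fixer",
  "next-speaker-checker",
  "web-fetch-fallback",
  "web-search",
  "web-fetch",
];

for (const key of utilityKeys) {
  console.log(`${key} -> ${resolveModel(key)}`);
}
```

Under these assumptions, every utility key resolves to gemini-3.1-flash-lite-preview, so none of them draws on the Flash quota bucket.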

Why Flash Lite is safe:

  • Utility tasks (loop detection, edit fixing, next-speaker routing): lightweight reasoning — same pattern as edit-corrector, fast-ack-helper, summarizer-*, classifier which already use Flash Lite variants
  • web-search and web-fetch: gemini-3.1-flash-lite-preview officially supports both googleSearch (Grounding) and urlContext tools per Gemini API docs

Impact

  • Users with exhausted Flash but available Flash Lite quota can keep working
  • Reduces Flash consumption for all utility calls, preserving quota for the main model
  • No functional regression

Reproduction (from #23397)

# Set Flash Lite as main model explicitly:
gemini -m 'gemini-3.1-flash-lite-preview'

# Still gets:
Usage limit reached for gemini-3-flash-preview.
/model to switch models.

Loop detection, LLM edit fixer, and next-speaker checker were hardcoded
to gemini-3-flash-preview via gemini-3-flash-base. When the Flash quota
is exhausted (e.g. 100% usage), these internal utility calls fail even
when the user switches to Pro or Flash Lite as their main model, making
the CLI unusable.

These utilities perform simple reasoning tasks that do not require Flash's
full capabilities. Switch them to a new gemini-3-flash-lite-base that
targets gemini-3.1-flash-lite-preview, which has a separate quota bucket
and is well-suited for lightweight inference tasks.

web-search and web-fetch remain on Flash because they rely on googleSearch
and urlContext tool support which requires Flash.

Fixes: utility_loop_detector, utility_tool, and next-speaker failures
when gemini-3-flash-preview quota is exhausted.
@kazukinakai kazukinakai requested a review from a team as a code owner April 20, 2026 07:17
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical usability issue where internal utility tasks were hardcoded to use the standard Flash model, causing failures when that specific quota was exhausted. By migrating these lightweight utility tasks to a Flash Lite base configuration, the system now effectively manages quota consumption and prevents unnecessary service interruptions for users.

Highlights

  • New Base Configuration: Introduced 'gemini-3-flash-lite-base' to target the 'gemini-3.1-flash-lite-preview' model.
  • Utility Model Migration: Updated 'loop-detection', 'llm-edit-fixer', and 'next-speaker-checker' to use the new Flash Lite base configuration.
  • Quota Management: Ensures utility tasks continue to function even when the primary Flash quota is exhausted by leveraging available Flash Lite capacity.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

google-cla bot commented Apr 20, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-cli gemini-cli bot added the status/need-issue label (Pull requests that need to have an associated issue.) Apr 20, 2026

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new model configuration, gemini-3-flash-lite-base, which utilizes the gemini-3.1-flash-lite-preview model. Additionally, it updates the loop-detection, llm-edit-fixer, and next-speaker-checker configurations to extend this new base instead of gemini-3-flash-base. I have no feedback to provide.

web-fetch-fallback extends gemini-3-flash-base but does not configure
any tools (no urlContext), so it is just a plain model call. Flash Lite
is equally capable and avoids consuming Flash quota.

gemini-3.1-flash-lite-preview officially supports both googleSearch
(Grounding with Google Search) and urlContext tools per the Gemini API
docs. Switching these configs to gemini-3-flash-lite-base reduces Flash
quota consumption and keeps web tools functional when Flash is exhausted.
@kazukinakai
Author

Ran npm run preflight locally (Node 20, per .nvmrc): all checks passed (clean → npm ci → format → build → lint → typecheck → test).

Auto (Gemini 3) mode used Pro → Flash with Flash as isLastResort.
When Flash quota is exhausted, the CLI had nowhere to fall back to.

Add Flash Lite (gemini-3.1-flash-lite-preview when Gemini 3.1 is enabled,
gemini-2.5-flash-lite otherwise) as the new isLastResort, demoting Flash
to an intermediate step. Users no longer need to manually switch models
when Flash is exhausted — Auto mode will silently continue on Flash Lite.
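
The fallback order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `ChainEntry` type, `buildPreviewChain`, `pickModel`, and the `PRO_PLACEHOLDER` name are hypothetical, and the rule "return the last-resort entry when everything is exhausted" is an assumed reading of `isLastResort`, not the confirmed gemini-cli behavior. The Flash and Flash Lite model names come from this PR.

```typescript
// Hypothetical chain entry shape, for illustration only.
interface ChainEntry {
  model: string;
  isLastResort?: boolean;
}

// Placeholder: the Pro model name is not given in this PR.
const PRO_PLACEHOLDER = "pro-model-placeholder";

// Build the Auto (Gemini 3) chain: Pro -> Flash -> Flash Lite (last resort).
function buildPreviewChain(gemini31Enabled: boolean): ChainEntry[] {
  const flashLite = gemini31Enabled
    ? "gemini-3.1-flash-lite-preview"
    : "gemini-2.5-flash-lite";
  return [
    { model: PRO_PLACEHOLDER },
    { model: "gemini-3-flash-preview" }, // now an intermediate step
    { model: flashLite, isLastResort: true },
  ];
}

// Return the first model with remaining quota. If every model is exhausted,
// return the last-resort entry (assumed semantics) so the caller can report
// the quota error against it.
function pickModel(chain: ChainEntry[], exhausted: Set<string>): string {
  for (const entry of chain) {
    if (!exhausted.has(entry.model)) return entry.model;
  }
  const lastResort = chain.find((e) => e.isLastResort);
  if (!lastResort) throw new Error("Chain has no last-resort model");
  return lastResort.model;
}

// Pro and Flash are both exhausted, so selection lands on Flash Lite.
const exhausted = new Set<string>([PRO_PLACEHOLDER, "gemini-3-flash-preview"]);
console.log(pickModel(buildPreviewChain(true), exhausted));
console.log(pickModel(buildPreviewChain(false), exhausted));
```

With Flash no longer marked as the last resort, an exhausted Flash bucket leaves one more entry to try instead of ending the chain.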
@gemini-cli gemini-cli bot added the area/platform label (Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt) and removed the status/need-issue label (Pull requests that need to have an associated issue.) Apr 20, 2026
…n preview chain

Flash and Flash Lite policies in the preview chain were using DEFAULT_ACTIONS
(which prompts the user on quota exhaustion). This caused an unwanted dialog
when Flash quota was hit during Auto (Gemini 3) mode.

Use SILENT_ACTIONS for both Flash and Flash Lite so the fallback from
Flash→Flash Lite happens automatically without user intervention, matching
the behavior of FLASH_LITE_CHAIN.
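
To make the difference between the two action sets concrete, here is a small sketch. The names `DEFAULT_ACTIONS` and `SILENT_ACTIONS` come from the commit message above, but their contents, the `PolicyActions` and `ModelPolicy` shapes, and the `onQuotaExhausted` helper are assumptions made for this example and may not match the real implementation.

```typescript
// Hypothetical action set: what to do when a model's quota runs out.
interface PolicyActions {
  onQuotaExhausted: "prompt" | "silent";
}

// Assumed contents; the PR only states that DEFAULT prompts and SILENT does not.
const DEFAULT_ACTIONS: PolicyActions = { onQuotaExhausted: "prompt" };
const SILENT_ACTIONS: PolicyActions = { onQuotaExhausted: "silent" };

interface ModelPolicy {
  model: string;
  actions: PolicyActions;
}

interface FallbackDecision {
  nextModel: string | null; // null when the chain has no further entry
  promptUser: boolean; // whether a dialog would be shown first
}

// Decide what happens when the model at `index` hits its quota.
function onQuotaExhausted(chain: ModelPolicy[], index: number): FallbackDecision {
  const current = chain[index];
  if (!current) throw new Error(`No policy at index ${index}`);
  const next = chain[index + 1];
  return {
    nextModel: next ? next.model : null,
    promptUser: current.actions.onQuotaExhausted === "prompt",
  };
}

// Before this commit: Flash used DEFAULT_ACTIONS, so a dialog appeared.
const before: ModelPolicy[] = [
  { model: "gemini-3-flash-preview", actions: DEFAULT_ACTIONS },
  { model: "gemini-3.1-flash-lite-preview", actions: DEFAULT_ACTIONS },
];

// After this commit: both use SILENT_ACTIONS, so fallback is automatic.
const after: ModelPolicy[] = [
  { model: "gemini-3-flash-preview", actions: SILENT_ACTIONS },
  { model: "gemini-3.1-flash-lite-preview", actions: SILENT_ACTIONS },
];

console.log(onQuotaExhausted(before, 0));
console.log(onQuotaExhausted(after, 0));
```

In this model the next model is the same in both cases; only the `promptUser` flag changes, which is the behavior difference the commit targets.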

Development

Successfully merging this pull request may close these issues.

[BUG] Gemini CLI not respecting the set model Run out of all pro models (using flash)

1 participant