GoogleLLMService: apply model-aware default thinking config for Gemini 3+ #3993

@yydrift-code

Description

pipecat version

0.0.104

Python version

3.12

Operating System

macOS

Issue description

GoogleLLMService currently applies a default thinking override only for gemini-2.5-flash* via _maybe_unset_thinking_budget() in pipecat/services/google/llm.py.

That works for Gemini 2.5 Flash, but Gemini 3+ uses a different thinking control surface (thinking_level rather than thinking_budget). If a caller uses GoogleLLMService or GoogleVertexLLMService with a Gemini 3 Flash model and does not pass an explicit params.thinking, Pipecat leaves thinking_config unset and the model falls back to provider defaults.

For latency-sensitive voice use cases this is surprising because:

  • gemini-2.5-flash* gets an automatic low-latency default (thinking_budget=0)
  • gemini-3* gets no automatic default at all, even though the model family requires a different control surface

This is especially easy to hit now that newer Gemini 3 / 3.1 Flash models are becoming default choices.

Related Google docs:

Reproduction steps

  1. Initialize GoogleLLMService or GoogleVertexLLMService with a Gemini 3 Flash model, for example gemini-3.1-flash-lite-preview.
  2. Do not pass params.thinking.
  3. Inspect the generation params or run a normal call.
  4. Observe that Pipecat does not populate thinking_config at all for Gemini 3 models.

Minimal example:

from pipecat.services.google.llm_vertex import GoogleVertexLLMService

llm = GoogleVertexLLMService(
    credentials_path="...",
    project_id="...",
    location="europe-west2",
    model="gemini-3.1-flash-lite-preview",
    system_instruction="You are a voice assistant.",
)

params = llm._build_generation_params(system_instruction="x", tools=[], tool_config=None)
llm._maybe_unset_thinking_budget(params)
print(params)
# No thinking_config is added for Gemini 3 models.

Expected behavior

If the caller did not specify params.thinking, Pipecat should apply model-aware defaults:

  • gemini-2.5-flash* -> thinking_budget=0, include_thoughts=False
  • gemini-3-flash* / gemini-3.1-flash* -> thinking_level="minimal", include_thoughts=False
  • do not override explicit user-provided thinking config
  • leave Pro-family models untouched unless a clearly safe default exists
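The expected defaults above can be sketched as a pure mapping from model name to thinking config. This is illustrative only: the function name and the dict shapes are hypothetical stand-ins for whatever types pipecat actually uses (e.g. google.genai's ThinkingConfig).

```python
def default_thinking_config(model: str):
    """Return the proposed default thinking config for a model, or None.

    Hypothetical sketch; not pipecat's actual API.
    """
    name = model.lower()
    if name.startswith("gemini-2.5-flash"):
        # 2.5 Flash exposes a numeric thinking budget; 0 disables thinking.
        return {"thinking_budget": 0, "include_thoughts": False}
    if name.startswith(("gemini-3-flash", "gemini-3.1-flash")):
        # Gemini 3+ exposes thinking_level rather than thinking_budget.
        return {"thinking_level": "minimal", "include_thoughts": False}
    # Pro-family and unknown models: no default override.
    return None
```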

Actual behavior

Only the Gemini 2.5 Flash family gets a default low-thinking config. Gemini 3+ models get provider-default thinking behavior unless every application remembers to configure it explicitly.

Suggested fix

Replace _maybe_unset_thinking_budget() with a model-aware default-thinking helper, for example:

  • detect the model family
  • apply the appropriate control surface for that family
  • skip entirely when thinking_config is already present

This would make latency behavior much more predictable across Gemini model upgrades.
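The three bullets above could be combined into a single helper, sketched below. The name apply_default_thinking and the plain-dict params are assumptions for illustration; a real implementation would operate on pipecat's generation params object and google.genai config types.

```python
def apply_default_thinking(params: dict, model: str) -> dict:
    """Apply a model-aware low-latency thinking default unless one is set.

    Hypothetical replacement for _maybe_unset_thinking_budget().
    """
    # Skip entirely when a thinking config is already present: never
    # override an explicit user-provided value.
    if params.get("thinking_config") is not None:
        return params
    name = model.lower()
    if name.startswith("gemini-2.5-flash"):
        # 2.5 Flash family: numeric budget, 0 disables thinking.
        params["thinking_config"] = {"thinking_budget": 0, "include_thoughts": False}
    elif name.startswith(("gemini-3-flash", "gemini-3.1-flash")):
        # Gemini 3+ family: level-based control surface.
        params["thinking_config"] = {"thinking_level": "minimal", "include_thoughts": False}
    # Pro-family and unknown models are left untouched.
    return params
```

Because the helper keys off the model family rather than a single prefix, upgrading from a 2.5 Flash model to a 3.x Flash model would preserve the low-latency default automatically.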
