# update docs on how to disable thinking for gemini models #323

Merged · 1 commit · May 12, 2025
`integrations/llms/gemini.mdx`: 198 changes (187 additions, 11 deletions)

@@ -238,21 +238,197 @@ Grounding is invoked by passing the `google_search` tool (for newer models like
If you mix regular tools with grounding tools, Vertex might throw an error saying only one tool can be used at a time.
</Warning>

## Thinking Models

Thinking models such as `gemini-2.0-flash-thinking-exp` return a chain-of-thought response along with the actual inference text. This is not OpenAI-compatible, so Portkey supports it by joining the two responses with a `\r\n\r\n` separator. You can split the response on this pattern to recover the chain of thought and the inference text separately.
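
For example, here is a minimal sketch of separating the two parts, assuming `response_text` holds the combined completion text returned with strict OpenAI compliance disabled:

```py
# A minimal sketch: split a combined completion on the \r\n\r\n separator.
# Assumes `response_text` holds the full text of the model's reply.
thought, sep, answer = response_text.partition("\r\n\r\n")
if not sep:
    # No separator found; the whole text is the inference response.
    thought, answer = "", response_text
print("Chain of thought:", thought)
print("Inference text:", answer)
```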

If you require the chain-of-thought response along with the actual inference text, pass the [strict OpenAI compliance flag](/product/ai-gateway/strict-open-ai-compliance) as `false` in the request.

If you want the inference text only, pass the [strict OpenAI compliance flag](/product/ai-gateway/strict-open-ai-compliance) as `true` in the request.
<CodeGroup>
```py Python
from portkey_ai import Portkey

# Initialize the Portkey client
portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # Replace with your Portkey API key
    virtual_key="VIRTUAL_KEY",  # Add your provider's virtual key
    strict_open_ai_compliance=False
)

# Create the request
response = portkey.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    max_tokens=3000,
    thinking={
        "type": "enabled",
        "budget_tokens": 2030
    },
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                }
            ]
        }
    ]
)
print(response)
```
```ts NodeJS
import Portkey from 'portkey-ai';

// Initialize the Portkey client
const portkey = new Portkey({
    apiKey: "PORTKEY_API_KEY", // Replace with your Portkey API key
    virtualKey: "VIRTUAL_KEY", // your vertex-ai virtual key
    strictOpenAiCompliance: false
});

// Generate a chat completion
async function getChatCompletionFunctions() {
    const response = await portkey.chat.completions.create({
        model: "gemini-2.5-flash-preview-04-17",
        max_tokens: 3000,
        thinking: {
            type: "enabled",
            budget_tokens: 2030
        },
        stream: true,
        messages: [
            {
                role: "user",
                content: [
                    {
                        type: "text",
                        text: "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                    }
                ]
            }
        ]
    });
    console.log(response);
}

// Call the function
getChatCompletionFunctions();
```
```js OpenAI NodeJS
import OpenAI from 'openai'; // We're using the v4 SDK
import { PORTKEY_GATEWAY_URL, createHeaders } from 'portkey-ai'

const openai = new OpenAI({
    apiKey: 'VERTEX_API_KEY', // defaults to process.env["OPENAI_API_KEY"]
    baseURL: PORTKEY_GATEWAY_URL,
    defaultHeaders: createHeaders({
        provider: "vertex-ai",
        apiKey: "PORTKEY_API_KEY", // defaults to process.env["PORTKEY_API_KEY"]
        strictOpenAiCompliance: false
    })
});

// Generate a chat completion with streaming
async function getChatCompletionFunctions() {
    const response = await openai.chat.completions.create({
        model: "gemini-2.5-flash-preview-04-17",
        max_tokens: 3000,
        thinking: {
            type: "enabled",
            budget_tokens: 2030
        },
        stream: true,
        messages: [
            {
                role: "user",
                content: [
                    {
                        type: "text",
                        text: "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                    }
                ]
            }
        ]
    });

    console.log(response);
}
await getChatCompletionFunctions();
```
```py OpenAI Python
from openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL, createHeaders

openai = OpenAI(
    api_key='VERTEX_API_KEY',
    base_url=PORTKEY_GATEWAY_URL,
    default_headers=createHeaders(
        provider="vertex-ai",
        api_key="PORTKEY_API_KEY",
        strict_open_ai_compliance=False
    )
)

response = openai.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    max_tokens=3000,
    thinking={
        "type": "enabled",
        "budget_tokens": 2030
    },
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                }
            ]
        }
    ]
)

print(response)
```
```sh cURL
curl "https://api.portkey.ai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "x-portkey-api-key: $PORTKEY_API_KEY" \
-H "x-portkey-provider: vertex-ai" \
-H "x-api-key: $VERTEX_API_KEY" \
-H "x-portkey-strict-open-ai-compliance: false" \
-d '{
"model": "gemini-2.5-flash-preview-04-17",
"max_tokens": 3000,
"thinking": {
"type": "enabled",
"budget_tokens": 2030
},
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
}
]
}
]
}'
```
</CodeGroup>

<Note>
To disable thinking for Gemini models like `google.gemini-2.5-flash-preview-04-17`, you must explicitly set `budget_tokens` to `0` and `type` to `disabled`.
```json
"thinking": {
"type": "disabled",
"budget_tokens": 0
}
```
</Note>
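
For instance, here is a minimal sketch using the Portkey Python client from the examples above (the prompt is illustrative):

```py
# A minimal sketch: the same chat request with thinking explicitly disabled.
# Assumes `portkey` is the client configured in the examples above.
response = portkey.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    max_tokens=3000,
    thinking={
        "type": "disabled",  # explicitly disable thinking
        "budget_tokens": 0   # must be 0 when type is "disabled"
    },
    messages=[
        {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}
    ]
)
print(response.choices[0].message.content)
```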

<Info>
Gemini grounding mode may not work via Portkey SDK. Contact [email protected] for assistance.
</Info>

## Managing Google Gemini Prompts

You can manage all prompts to Google Gemini in the [Prompt Library](/product/prompt-library). All the current models of Google Gemini are supported and you can easily start testing different prompts.

Once you're ready with your prompt, you can use the `portkey.prompts.completions.create` interface to use the prompt in your application.
`integrations/llms/vertex-ai.mdx`: 25 changes (18 additions, 7 deletions)

@@ -264,9 +264,11 @@ curl --location 'https://api.portkey.ai/v1/chat/completions' \

<Note>
The assistant's thinking response is returned in the `response_chunk.choices[0].delta.content_blocks` array, not the `response.choices[0].message.content` string.

Gemini models do not return their chain-of-thought messages, so `content_blocks` are not required for Gemini models.
</Note>
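
As an illustration, here is a minimal sketch of reading those blocks from a streamed response; it assumes a `portkey` client created with `strict_open_ai_compliance=False` and only inspects `content_blocks` generically:

```py
# A minimal sketch: print thinking content from a streamed response.
# Assumes `portkey` is an initialized Portkey client with
# strict_open_ai_compliance=False, as described above.
stream = portkey.chat.completions.create(
    model="anthropic.claude-3-7-sonnet@20250219",
    max_tokens=3000,
    thinking={"type": "enabled", "budget_tokens": 2030},
    stream=True,
    messages=[{"role": "user", "content": "What is 27 * 453?"}],
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Thinking arrives in content_blocks, not in delta.content
    for block in getattr(delta, "content_blocks", None) or []:
        print(block)
```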

Models like `google.gemini-2.5-flash-preview-04-17` and `anthropic.claude-3-7-sonnet@20250219` support [extended thinking](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#claude-3-7-sonnet).
This is similar to OpenAI thinking, but you also get the model's reasoning as it processes the request.

Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-gateway/strict-open-ai-compliance) in the headers to use this feature.
@@ -484,6 +486,16 @@ Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-g
```
</CodeGroup>

<Note>
To disable thinking for Gemini models like `google.gemini-2.5-flash-preview-04-17`, you must explicitly set `budget_tokens` to `0` and `type` to `disabled`.
```json
"thinking": {
"type": "disabled",
"budget_tokens": 0
}
```
</Note>

### Multi turn conversation

<CodeGroup>
@@ -737,19 +749,18 @@ Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-g
```
</CodeGroup>


### Sending `base64` Image

Here, you can send the `base64` image data in the `url` field too:

```json
"url": "data:image/png;base64,UklGRkacAABXRUJQVlA4IDqcAAC....."
```
<Note>
This same message format also works for all other media types: just send your media file in the `url` field, like `"url": "gs://cloud-samples-data/video/animals.mp4"` for Google Cloud URLs and `"url": "https://download.samplelib.com/mp3/sample-3s.mp3"` for public URLs.

Your URL should include the file extension; it is used to infer the `MIME_TYPE`, which is a required parameter for prompting Gemini models with files.
</Note>
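
For example, here is a minimal sketch of constructing such a data URL from a local file (the file name is hypothetical):

```py
import base64

# A minimal sketch: encode a local PNG (hypothetical path) as a data URL
# for the `url` field shown above.
with open("sample.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{encoded}"
```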

## Text Embedding Models
