
Commit a636795

Merge pull request #323 from Portkey-AI/fix/vertex-thinking
update docs on how to disable thinking for gemini models
2 parents: 857c493 + 6038898

File tree: 2 files changed (+205, −18 lines)


integrations/llms/gemini.mdx

Lines changed: 187 additions & 11 deletions
@@ -238,21 +238,197 @@ Grounding is invoked by passing the `google_search` tool (for newer models like
 If you mix regular tools with grounding tools, vertex might throw an error saying only one tool can be used at a time.
 </Warning>
 
-## gemini-2.0-flash-thinking-exp and other thinking models
+## Thinking models
 
-`gemini-2.0-flash-thinking-exp` models return a Chain of Thought response along with the actual inference text,
-this is not openai compatible, however, Portkey supports this by adding a `\r\n\r\n` and appending the two responses together.
-You can split the response along this pattern to get the Chain of Thought response and the actual inference text.
-
-If you require the Chain of Thought response along with the actual inference text, pass the [strict open ai compliance flag](/product/ai-gateway/strict-open-ai-compliance) as `false` in the request.
-
-If you want to get the inference text only, pass the [strict open ai compliance flag](/product/ai-gateway/strict-open-ai-compliance) as `true` in the request.
+<CodeGroup>
+```py Python
+from portkey_ai import Portkey
+
+# Initialize the Portkey client
+portkey = Portkey(
+    api_key="PORTKEY_API_KEY",   # Replace with your Portkey API key
+    virtual_key="VIRTUAL_KEY",   # Add your provider's virtual key
+    strict_open_ai_compliance=False
+)
+
+# Create the request
+response = portkey.chat.completions.create(
+    model="gemini-2.5-flash-preview-04-17",
+    max_tokens=3000,
+    thinking={
+        "type": "enabled",
+        "budget_tokens": 2030
+    },
+    stream=True,
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
+                }
+            ]
+        }
+    ]
+)
+print(response)
+```
+```ts NodeJS
+import Portkey from 'portkey-ai';
+
+// Initialize the Portkey client
+const portkey = new Portkey({
+    apiKey: "PORTKEY_API_KEY",   // Replace with your Portkey API key
+    virtualKey: "VIRTUAL_KEY",   // Your Vertex AI virtual key
+    strictOpenAiCompliance: false
+});
+
+// Generate a chat completion
+async function getChatCompletionFunctions() {
+    const response = await portkey.chat.completions.create({
+        model: "gemini-2.5-flash-preview-04-17",
+        max_tokens: 3000,
+        thinking: {
+            type: "enabled",
+            budget_tokens: 2030
+        },
+        stream: true,
+        messages: [
+            {
+                role: "user",
+                content: [
+                    {
+                        type: "text",
+                        text: "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
+                    }
+                ]
+            }
+        ]
+    });
+    console.log(response);
+}
+
+// Call the function
+getChatCompletionFunctions();
+```
+```js OpenAI NodeJS
+import OpenAI from 'openai'; // We're using the v4 SDK
+import { PORTKEY_GATEWAY_URL, createHeaders } from 'portkey-ai'
+
+const openai = new OpenAI({
+    apiKey: 'VERTEX_API_KEY', // defaults to process.env["OPENAI_API_KEY"]
+    baseURL: PORTKEY_GATEWAY_URL,
+    defaultHeaders: createHeaders({
+        provider: "vertex-ai",
+        apiKey: "PORTKEY_API_KEY", // defaults to process.env["PORTKEY_API_KEY"]
+        strictOpenAiCompliance: false
+    })
+});
+
+// Generate a chat completion with streaming
+async function getChatCompletionFunctions() {
+    const response = await openai.chat.completions.create({
+        model: "gemini-2.5-flash-preview-04-17",
+        max_tokens: 3000,
+        thinking: {
+            type: "enabled",
+            budget_tokens: 2030
+        },
+        stream: true,
+        messages: [
+            {
+                role: "user",
+                content: [
+                    {
+                        type: "text",
+                        text: "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
+                    }
+                ]
+            }
+        ]
+    });
+
+    console.log(response);
+}
+await getChatCompletionFunctions();
+```
+```py OpenAI Python
+from openai import OpenAI
+from portkey_ai import PORTKEY_GATEWAY_URL, createHeaders
+
+openai = OpenAI(
+    api_key='VERTEX_API_KEY',
+    base_url=PORTKEY_GATEWAY_URL,
+    default_headers=createHeaders(
+        provider="vertex-ai",
+        api_key="PORTKEY_API_KEY",
+        strict_open_ai_compliance=False
+    )
+)
 
-## Managing Google Gemini Prompts
 
-You can manage all prompts to Google Gemini in the [Prompt Library](/product/prompt-library). All the current models of Google Gemini are supported and you can easily start testing different prompts.
+response = openai.chat.completions.create(
+    model="gemini-2.5-flash-preview-04-17",
+    max_tokens=3000,
+    thinking={
+        "type": "enabled",
+        "budget_tokens": 2030
+    },
+    stream=True,
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
+                }
+            ]
+        }
+    ]
+)
+
+print(response)
+```
+```sh cURL
+curl "https://api.portkey.ai/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
+  -H "x-portkey-provider: vertex-ai" \
+  -H "x-api-key: $VERTEX_API_KEY" \
+  -H "x-portkey-strict-open-ai-compliance: false" \
+  -d '{
+    "model": "gemini-2.5-flash-preview-04-17",
+    "max_tokens": 3000,
+    "thinking": {
+        "type": "enabled",
+        "budget_tokens": 2030
+    },
+    "stream": true,
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
+                }
+            ]
+        }
+    ]
+}'
+```
+</CodeGroup>
 
-Once you're ready with your prompt, you can use the `portkey.prompts.completions.create` interface to use the prompt in your application.
+<Note>
+To disable thinking for Gemini models like `google.gemini-2.5-flash-preview-04-17`, you must explicitly set `budget_tokens` to `0` and `type` to `disabled`:
+```json
+"thinking": {
+    "type": "disabled",
+    "budget_tokens": 0
+}
+```
+</Note>
 
 <Info>
 Gemini grounding mode may not work via Portkey SDK. Contact support@portkey.ai for assistance.
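
For quick reference, here is a minimal sketch of the disable-thinking request end to end, using the same Portkey Python SDK setup as the examples above. The key values are placeholders, and the plain-string `content` form is an OpenAI-compatible shorthand for the array form shown in the diff:

```python
from portkey_ai import Portkey

# Placeholder credentials, mirroring the examples above
portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="VIRTUAL_KEY",
    strict_open_ai_compliance=False,
)

# Per the note above, disabling thinking on Gemini requires BOTH
# "type": "disabled" AND "budget_tokens": 0.
response = portkey.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    max_tokens=3000,
    thinking={"type": "disabled", "budget_tokens": 0},
    messages=[{"role": "user", "content": "Summarize extended thinking in one sentence."}],
)

print(response.choices[0].message.content)
```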

integrations/llms/vertex-ai.mdx

Lines changed: 18 additions & 7 deletions
@@ -264,9 +264,11 @@ curl --location 'https://api.portkey.ai/v1/chat/completions' \
 
 <Note>
 The assistant's thinking response is returned in the `response_chunk.choices[0].delta.content_blocks` array, not the `response.choices[0].message.content` string.
+
+Gemini models do not return their chain-of-thought messages, so `content_blocks` are not required for Gemini models.
 </Note>
 
-Models like `anthropic.claude-3-7-sonnet@20250219` support [extended thinking](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#claude-3-7-sonnet).
+Models like `google.gemini-2.5-flash-preview-04-17` and `anthropic.claude-3-7-sonnet@20250219` support [extended thinking](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#claude-3-7-sonnet).
 This is similar to openai thinking, but you get the model's reasoning as it processes the request as well.
 
 Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-gateway/strict-open-ai-compliance) in the headers to use this feature.
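
To make that note concrete, here is a minimal sketch of consuming such a stream with the Portkey Python SDK. The routing of thinking output to `delta.content_blocks` (versus the final answer in `delta.content`) comes from the note above; the per-block field names (`type`, `thinking`) are assumptions for illustration and may differ from the actual payload:

```python
from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",         # placeholder
    virtual_key="VERTEX_VIRTUAL_KEY",  # placeholder
    strict_open_ai_compliance=False,
)

stream = portkey.chat.completions.create(
    model="anthropic.claude-3-7-sonnet@20250219",
    max_tokens=3000,
    thinking={"type": "enabled", "budget_tokens": 2030},
    stream=True,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Thinking arrives in `content_blocks` (per the note above); the final
    # answer arrives in the usual `content` field.
    for block in getattr(delta, "content_blocks", None) or []:
        if block.get("type") == "thinking":  # assumed block shape
            print("thinking:", block.get("thinking", ""))
    if getattr(delta, "content", None):
        print("answer:", delta.content)
```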
@@ -484,6 +486,16 @@ Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-g
 ```
 </CodeGroup>
 
+<Note>
+To disable thinking for Gemini models like `google.gemini-2.5-flash-preview-04-17`, you must explicitly set `budget_tokens` to `0` and `type` to `disabled`:
+```json
+"thinking": {
+    "type": "disabled",
+    "budget_tokens": 0
+}
+```
+</Note>
+
 ### Multi turn conversation
 
 <CodeGroup>
@@ -737,19 +749,18 @@ Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-g
 ```
 </CodeGroup>
 
-<Note>
-This same message format also works for all other media types — just send your media file in the `url` field, like `"url": "gs://cloud-samples-data/video/animals.mp4"` for google cloud urls and `"url":"https://download.samplelib.com/mp3/sample-3s.mp3"` for public urls
-
-Your URL should have the file extension, this is used for inferring `MIME_TYPE` which is a required parameter for prompting Gemini models with files
-</Note>
-
 ### Sending `base64` Image
 
 Here, you can send the `base64` image data along with the `url` field too:
 
 ```json
 "url": "data:image/png;base64,UklGRkacAABXRUJQVlA4IDqcAAC....."
 ```
+<Note>
+This same message format also works for all other media types — just send your media file in the `url` field, like `"url": "gs://cloud-samples-data/video/animals.mp4"` for Google Cloud URLs and `"url": "https://download.samplelib.com/mp3/sample-3s.mp3"` for public URLs.
+
+Your URL should include the file extension; it is used to infer the `MIME_TYPE`, which is a required parameter for prompting Gemini models with files.
+</Note>
 
 ## Text Embedding Models
 
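One way to produce that `data:` URL is a small standard-library sketch like the following; the file path and MIME type are placeholders, and the MIME type must match the file, since it determines the `MIME_TYPE` Gemini sees:

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime_type: str) -> str:
    """Build a data URL suitable for the `url` field shown above."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Example usage (placeholder path)
image_url = to_data_url("sample.png", "image/png")
# -> "data:image/png;base64,iVBORw0KGgo..."
```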