117 changes: 117 additions & 0 deletions PR_DESCRIPTION.md
@@ -0,0 +1,117 @@
## Summary

Migrates native OpenAI API calls from the `chat.completions.create` endpoint to the newer `responses.create` endpoint, as recommended by OpenAI for new integrations.

Fixes #11624

## Changes

### Core Changes
1. Updated OpenAI provider in `llm.py` to use `client.responses.create()`
2. Added `extract_responses_api_reasoning()` helper to parse reasoning output (handles both string and array summary formats)
3. Added `extract_responses_api_tool_calls()` helper to parse function calls
4. Added error handling for API and timeout errors (matching the Anthropic provider pattern)
5. Extract system messages to `instructions` parameter (Responses API requirement)

### Parameter Mapping (Chat Completions → Responses API)
1. `messages` → `input` (non-system messages only)
2. System messages → `instructions` parameter
3. `max_completion_tokens` → `max_output_tokens`
4. `response_format={...}` → `text={"format":{...}}`
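
As a rough sketch, the request-side mapping looks like this (mirroring the `responses_params` dict built in `llm.py`; `build_and_send` is a hypothetical helper and the literal token limit is illustrative):

```python
# Illustrative sketch of the Chat Completions -> Responses API request mapping.
import openai


async def build_and_send(client: openai.AsyncOpenAI, model: str, prompt: list[dict]):
    # System messages move into `instructions`; everything else becomes `input`.
    system_messages = [m["content"] for m in prompt if m["role"] == "system"]
    input_messages = [m for m in prompt if m["role"] != "system"]

    params: dict = {
        "model": model,
        "input": input_messages,    # was: messages=prompt
        "max_output_tokens": 1024,  # was: max_completion_tokens=1024
    }
    if system_messages:
        params["instructions"] = " ".join(system_messages)
    # was: response_format={"type": "json_object"}
    params["text"] = {"format": {"type": "json_object"}}

    return await client.responses.create(**params)
```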

### Response Parsing (Chat Completions → Responses API)
1. `choices[0].message.content` → `output_text`
2. `usage.prompt_tokens` → `usage.input_tokens`
3. `usage.completion_tokens` → `usage.output_tokens`
4. `choices[0].message.tool_calls` → `output` items with `type="function_call"`
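
And a hedged sketch of the corresponding response-side reads (the fields match what the new helpers in `llm.py` look at; `summarize_result` is a hypothetical name):

```python
def summarize_result(response) -> dict:
    # Plain text now comes from output_text instead of choices[0].message.content.
    text = response.output_text or ""

    # Tool calls are output items with type == "function_call".
    tool_calls = [
        {
            "id": getattr(item, "call_id", getattr(item, "id", "")),
            "name": item.name,
            "arguments": item.arguments,
        }
        for item in (response.output or [])
        if getattr(item, "type", None) == "function_call"
    ]

    # Usage fields are renamed: prompt/completion -> input/output.
    prompt_tokens = response.usage.input_tokens if response.usage else 0
    completion_tokens = response.usage.output_tokens if response.usage else 0

    return {
        "response": text,
        "tool_calls": tool_calls,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
```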

## Compatibility

### SDK Version
1. **Required:** openai >= 1.66.0 (Responses API added in [v1.66.0](https://github.com/openai/openai-python/releases/tag/v1.66.0))
2. **AutoGPT uses:** ^1.97.1 (COMPATIBLE)

### API Compatibility
1. `llm_call()` function signature - UNCHANGED
2. `LLMResponse` class structure - UNCHANGED
3. Return type and fields - UNCHANGED

### Provider Impact
1. `openai` - YES, modified (Native OpenAI - uses Responses API)
2. `anthropic` - NO (Different SDK entirely)
3. `groq` - NO (Third-party API, Chat Completions compatible)
4. `open_router` - NO (Third-party API, Chat Completions compatible)
5. `llama_api` - NO (Third-party API, Chat Completions compatible)
6. `ollama` - NO (Uses ollama SDK)
7. `aiml_api` - NO (Third-party API, Chat Completions compatible)
8. `v0` - NO (Third-party API, Chat Completions compatible)

### Dependent Blocks Verified
1. `smart_decision_maker.py` (Line 508) - Uses: response, tool_calls, prompt_tokens, completion_tokens, reasoning - COMPATIBLE
2. `ai_condition.py` (Line 113) - Uses: response, prompt_tokens, completion_tokens, prompt - COMPATIBLE
3. `perplexity.py` - Does not use llm_call (uses different API) - NOT AFFECTED

### Streaming Service
`backend/server/v2/chat/service.py` is NOT affected - it uses OpenRouter by default, which requires the Chat Completions API format.

## Testing

### Test File Updates
1. Updated `test_llm.py` mocks to use `output_text` instead of `choices[0].message.content`
2. Updated mocks to use `output` array for tool calls
3. Updated mocks to use `usage.input_tokens` / `usage.output_tokens`

### Verification Performed
1. SDK version compatibility verified (1.97.1 > 1.66.0)
2. Function signature unchanged
3. LLMResponse class unchanged
4. All 7 other providers unchanged
5. Dependent blocks use only public API
6. Streaming service unaffected (uses OpenRouter)
7. Error handling matches Anthropic provider pattern
8. Tool call extraction handles `call_id` with fallback to `id`
9. Reasoning extraction handles both string and array `summary` formats

### Recommended Manual Testing
1. Test with GPT-4o model using native OpenAI API
2. Test with tool/function calling enabled
3. Test with JSON mode (`force_json_output=True`)
4. Verify token counting works correctly
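
A rough outline of such a manual check is below; everything beyond `credentials`, `llm_model`, `prompt`, `force_json_output`, and `max_tokens` is an assumption about `llm_call`'s signature (including the `LlmModel.GPT4O` member name and the remaining parameters having defaults), so adjust to the real API before running:

```python
# Hypothetical manual smoke test; assumed names are flagged in the comments.
import asyncio

import backend.blocks.llm as llm


async def manual_check() -> None:
    response = await llm.llm_call(
        credentials=llm.TEST_CREDENTIALS,   # swap in real OpenAI credentials
        llm_model=llm.LlmModel.GPT4O,       # assumed enum member name
        prompt=[
            {"role": "system", "content": "Reply in JSON."},
            {"role": "user", "content": 'Return {"ok": true}.'},
        ],
        force_json_output=True,             # exercises text={"format": {...}}
        max_tokens=256,
    )
    # Token counts should now be populated from usage.input_tokens / usage.output_tokens.
    print(response.response, response.prompt_tokens, response.completion_tokens)


asyncio.run(manual_check())
```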

## Files Modified

### 1. `autogpt_platform/backend/backend/blocks/llm.py`
1. Added `extract_responses_api_reasoning()` helper
2. Added `extract_responses_api_tool_calls()` helper
3. Updated OpenAI provider section to use `responses.create`
4. Added error handling with try/except
5. Extract system messages to `instructions` parameter

### 2. `autogpt_platform/backend/backend/blocks/test/test_llm.py`
1. Updated mocks for Responses API format

## References

1. [OpenAI Responses API Docs](https://platform.openai.com/docs/api-reference/responses)
2. [OpenAI Function Calling Docs](https://platform.openai.com/docs/guides/function-calling)
3. [OpenAI Reasoning Docs](https://platform.openai.com/docs/guides/reasoning)
4. [Simon Willison's Comparison](https://simonwillison.net/2025/Mar/11/responses-vs-chat-completions/)
5. [OpenAI Python SDK v1.66.0 Release](https://github.com/openai/openai-python/releases/tag/v1.66.0)

## Checklist

### Changes
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
- [x] Updated unit test mocks to use Responses API format
- [x] Verified function signature unchanged
- [x] Verified LLMResponse class unchanged
- [x] Verified dependent blocks compatible
- [x] Verified other providers unchanged

### Code Quality
- [x] My code follows the project's style guidelines
- [x] I have commented my code where necessary
- [x] My changes generate no new warnings
- [x] I have added error handling matching existing patterns
126 changes: 105 additions & 21 deletions autogpt_platform/backend/backend/blocks/llm.py
@@ -362,8 +362,7 @@ def convert_openai_tool_fmt_to_anthropic(


def extract_openai_reasoning(response) -> str | None:
"""Extract reasoning from OpenAI-compatible response if available."""
"""Note: This will likely not working since the reasoning is not present in another Response API"""
"""Extract reasoning from OpenAI Chat Completions response if available."""
reasoning = None
choice = response.choices[0]
if hasattr(choice, "reasoning") and getattr(choice, "reasoning", None):
@@ -378,7 +377,7 @@ def extract_openai_reasoning(response) -> str | None:


def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
"""Extract tool calls from OpenAI-compatible response."""
"""Extract tool calls from OpenAI Chat Completions response."""
if response.choices[0].message.tool_calls:
return [
ToolContentBlock(
@@ -394,9 +393,47 @@ def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
return None


def extract_responses_api_reasoning(response) -> str | None:
"""Extract reasoning from OpenAI Responses API response if available.

The summary field can be either a string or an array of summary items,
so we handle both cases appropriately.
"""
# The Responses API stores reasoning in output items with type "reasoning"
if hasattr(response, "output") and response.output:
for item in response.output:
if hasattr(item, "type") and item.type == "reasoning":
if hasattr(item, "summary") and item.summary:
# Handle both string and array summary formats
if isinstance(item.summary, list):
# Join array items into a single string
return " ".join(str(s) for s in item.summary if s)
return str(item.summary)
return None


def extract_responses_api_tool_calls(response) -> list[ToolContentBlock] | None:
"""Extract tool calls from OpenAI Responses API response."""
tool_calls = []
if hasattr(response, "output") and response.output:
for item in response.output:
if hasattr(item, "type") and item.type == "function_call":
tool_calls.append(
ToolContentBlock(
id=getattr(item, "call_id", getattr(item, "id", "")),
type="function",
function=ToolCall(
name=item.name,
arguments=item.arguments,
),
)
)
return tool_calls if tool_calls else None


def get_parallel_tool_calls_param(
llm_model: LlmModel, parallel_tool_calls: bool | None
):
) -> bool | openai.NotGiven:
"""Get the appropriate parallel_tool_calls parameter for OpenAI-compatible APIs."""
if llm_model.startswith("o") or parallel_tool_calls is None:
return openai.NOT_GIVEN
@@ -454,34 +491,81 @@ async def llm_call(
if provider == "openai":

🟠 HIGH - God Function with 329 lines handling 8 different LLM providers
Agent: architecture

Category: quality

Description:
The llm_call function is a monolithic function with completely separate logic paths for 8 LLM providers. Each provider has its own client instantiation, request building, response parsing, error handling, and token counting. This violates the Single Responsibility Principle.

Suggestion:
Refactor using Strategy pattern: create an abstract LLMProvider base class with provider-specific implementations. Use a factory to instantiate the correct provider based on model metadata.

Why this matters: Framework coupling makes code harder to test and migrate.

Confidence: 75%
Rule: py_separate_business_logic_from_framework
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
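
A rough sketch of the suggested direction, assuming nothing about the final design (class, method, and registry names here are hypothetical):

```python
# Hypothetical Strategy-pattern sketch for the refactor suggested above.
from abc import ABC, abstractmethod
from typing import Any


class LLMProvider(ABC):
    """One strategy per provider; llm_call would delegate to these."""

    @abstractmethod
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        ...


class OpenAIResponsesProvider(LLMProvider):
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        # build responses_params, call client.responses.create, parse output
        ...


class AnthropicProvider(LLMProvider):
    async def call(self, prompt: list[dict], **kwargs: Any) -> Any:
        ...


PROVIDERS: dict[str, LLMProvider] = {
    "openai": OpenAIResponsesProvider(),
    "anthropic": AnthropicProvider(),
}


async def llm_call_refactored(provider: str, prompt: list[dict], **kwargs: Any) -> Any:
    # Factory lookup replaces the long if/elif provider chain.
    return await PROVIDERS[provider].call(prompt, **kwargs)
```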

tools_param = tools if tools else openai.NOT_GIVEN
oai_client = openai.AsyncOpenAI(api_key=credentials.api_key.get_secret_value())
response_format = None

parallel_tool_calls = get_parallel_tool_calls_param(
parallel_tool_calls_param = get_parallel_tool_calls_param(
llm_model, parallel_tool_calls
)

# Extract system messages for instructions parameter
system_messages = [p["content"] for p in prompt if p["role"] == "system"]
instructions = " ".join(system_messages) if system_messages else None

# Filter out system messages for input (Responses API expects them in instructions)
input_messages = [p for p in prompt if p["role"] != "system"]
Comment on lines +500 to +504

🟡 MEDIUM - Repeated list comprehension filters for system messages
Agent: performance

Category: performance

Description:
Lines 500 and 504 both iterate through the prompt list separately with list comprehensions.

Suggestion:
Combine into single loop to extract both system_messages and input_messages in one pass.

Confidence: 65%
Rule: perf_quadratic_loops
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
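
Sketched as a standalone helper (the `split_prompt` name is hypothetical; the variable names mirror the diff):

```python
def split_prompt(prompt: list[dict]) -> tuple[str | None, list[dict]]:
    """Single pass over `prompt` instead of two list comprehensions."""
    system_messages: list[str] = []
    input_messages: list[dict] = []
    for p in prompt:
        if p["role"] == "system":
            system_messages.append(p["content"])
        else:
            input_messages.append(p)
    instructions = " ".join(system_messages) if system_messages else None
    return instructions, input_messages
```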


# Build Responses API parameters
responses_params: dict = {
"model": llm_model.value,
"input": input_messages,
"max_output_tokens": max_tokens,
}

if instructions:
responses_params["instructions"] = instructions

if tools_param is not openai.NOT_GIVEN:
responses_params["tools"] = tools_param
if parallel_tool_calls_param is not openai.NOT_GIVEN:
responses_params["parallel_tool_calls"] = parallel_tool_calls_param

if force_json_output:
response_format = {"type": "json_object"}
responses_params["text"] = {"format": {"type": "json_object"}}

response = await oai_client.chat.completions.create(
model=llm_model.value,
messages=prompt, # type: ignore
response_format=response_format, # type: ignore
max_completion_tokens=max_tokens,
tools=tools_param, # type: ignore
parallel_tool_calls=parallel_tool_calls,
)
try:
response = await oai_client.responses.create(
**responses_params, timeout=600
)
except openai.APIError as e:
error_message = (
f"OpenAI Responses API error for model {llm_model.value}: {str(e)}"
)
logger.error(error_message)
raise ValueError(error_message) from e
except TimeoutError as e:
error_message = f"OpenAI Responses API timeout for model {llm_model.value}"
logger.error(error_message)
raise ValueError(error_message) from e

tool_calls = extract_openai_tool_calls(response)
reasoning = extract_openai_reasoning(response)
tool_calls = extract_responses_api_tool_calls(response)
reasoning = extract_responses_api_reasoning(response)

# Build a message dict for raw_response that matches the expected format
# for conversation history (role, content, and optionally tool_calls)
raw_response_dict: dict = {
"role": "assistant",
"content": response.output_text or "",
}
# Add tool_calls in OpenAI format if present
if tool_calls:
raw_response_dict["tool_calls"] = [
{
"id": tc.id,
"type": "function",
"function": {
"name": tc.function.name,
"arguments": tc.function.arguments,
},
}
for tc in tool_calls
]

return LLMResponse(
raw_response=response.choices[0].message,
raw_response=raw_response_dict,
prompt=prompt,
response=response.choices[0].message.content or "",
response=response.output_text or "",
tool_calls=tool_calls,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
prompt_tokens=response.usage.input_tokens if response.usage else 0,
completion_tokens=response.usage.output_tokens if response.usage else 0,
reasoning=reasoning,
)
elif provider == "anthropic":
46 changes: 12 additions & 34 deletions autogpt_platform/backend/backend/blocks/test/test_llm.py
@@ -13,18 +13,17 @@ async def test_llm_call_returns_token_counts(self):
"""Test that llm_call returns proper token counts in LLMResponse."""
import backend.blocks.llm as llm

# Mock the OpenAI client
# Mock the OpenAI Responses API response
mock_response = MagicMock()
mock_response.choices = [
MagicMock(message=MagicMock(content="Test response", tool_calls=None))
]
mock_response.usage = MagicMock(prompt_tokens=10, completion_tokens=20)
mock_response.output_text = "Test response"
mock_response.output = [] # No tool calls
mock_response.usage = MagicMock(input_tokens=10, output_tokens=20)

# Test with mocked OpenAI response
# Test with mocked OpenAI Responses API
with patch("openai.AsyncOpenAI") as mock_openai:
mock_client = AsyncMock()
mock_openai.return_value = mock_client
mock_client.chat.completions.create = AsyncMock(return_value=mock_response)
mock_client.responses.create = AsyncMock(return_value=mock_response)

response = await llm.llm_call(
credentials=llm.TEST_CREDENTIALS,
@@ -41,8 +40,6 @@ async def test_llm_call_returns_token_counts(self):
@pytest.mark.asyncio
async def test_ai_structured_response_block_tracks_stats(self):
"""Test that AIStructuredResponseGeneratorBlock correctly tracks stats."""
from unittest.mock import patch

import backend.blocks.llm as llm

block = llm.AIStructuredResponseGeneratorBlock()
@@ -255,13 +252,11 @@ async def mock_llm_call(input_data, credentials):
@pytest.mark.asyncio
async def test_ai_text_summarizer_real_llm_call_stats(self):
"""Test AITextSummarizer with real LLM call mocking to verify llm_call_count."""
from unittest.mock import AsyncMock, MagicMock, patch

import backend.blocks.llm as llm

🟡 MEDIUM - Misleading test name suggests live network calls but uses mocks
Agent: microservices

Category: quality

Description:
The test is named 'test_ai_text_summarizer_real_llm_call_stats' which suggests it makes real LLM calls. However, it uses mocks throughout. This naming creates ambiguity about test isolation.

Suggestion:
Rename to 'test_ai_text_summarizer_mocked_llm_call_stats' or 'test_ai_text_summarizer_llm_call_stats' to accurately reflect it uses mocks.

Why this matters: Live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.

Confidence: 65%
Rule: gen_no_live_io_in_unit_tests
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6

block = llm.AITextSummarizerBlock()

# Mock the actual LLM call instead of the llm_call method
# Mock the actual LLM call using Responses API format
call_count = 0

async def mock_create(*args, **kwargs):
@@ -271,30 +266,17 @@ async def mock_create(*args, **kwargs):
mock_response = MagicMock()
# Return different responses for chunk summary vs final summary
if call_count == 1:
mock_response.choices = [
MagicMock(
message=MagicMock(
content='<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>',
tool_calls=None,
)
)
]
mock_response.output_text = '<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>'
else:
mock_response.choices = [
MagicMock(
message=MagicMock(
content='<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>',
tool_calls=None,
)
)
]
mock_response.usage = MagicMock(prompt_tokens=50, completion_tokens=30)
mock_response.output_text = '<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>'
mock_response.output = [] # No tool calls
mock_response.usage = MagicMock(input_tokens=50, output_tokens=30)
return mock_response

with patch("openai.AsyncOpenAI") as mock_openai:
mock_client = AsyncMock()
mock_openai.return_value = mock_client
mock_client.chat.completions.create = mock_create
mock_client.responses.create = mock_create

🟠 HIGH - Inconsistent mock setup for async HTTP calls
Agent: microservices

Category: bug

Description:
In test_ai_text_summarizer_real_llm_call_stats, the async function mock_create is assigned directly to mock_client.responses.create without wrapping in AsyncMock. This is inconsistent with line 26 which uses AsyncMock(return_value=mock_response).

Suggestion:
Change line 283 from 'mock_client.responses.create = mock_create' to 'mock_client.responses.create = AsyncMock(side_effect=mock_create)' for consistent mocking behavior and proper call tracking.

Why this matters: Live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.

Confidence: 70%
Rule: gen_no_live_io_in_unit_tests
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6
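
A self-contained sketch of the suggested wrapping (the surrounding test structure is simplified for illustration only):

```python
# Minimal illustration of AsyncMock(side_effect=...) wrapping an async factory.
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def demo() -> None:
    async def mock_create(*args, **kwargs):
        return MagicMock(
            output_text="ok",
            output=[],
            usage=MagicMock(input_tokens=1, output_tokens=1),
        )

    mock_client = MagicMock()
    # Wrapping keeps call tracking (await_count, assert_awaited) consistent
    # with the AsyncMock(return_value=...) style used in the other test.
    mock_client.responses.create = AsyncMock(side_effect=mock_create)

    response = await mock_client.responses.create()
    assert response.output_text == "ok"
    assert mock_client.responses.create.await_count == 1


asyncio.run(demo())
```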


# Test with very short text (should only need 1 chunk + 1 final summary)
input_data = llm.AITextSummarizerBlock.Input(

🟡 MEDIUM - Debug print statements in test code
Agent: python

Category: quality

Description:
Debug print() statements left in test function. Should be removed or replaced with logging.

Suggestion:
Replace with logger.debug() calls or use pytest's caplog fixture, or remove if no longer needed.

Confidence: 85%
Rule: python_print_debug
Review ID: 856e49d7-82a0-43b7-8634-e05aaca4b5a6

@@ -312,10 +294,6 @@ async def mock_create(*args, **kwargs):
):
outputs[output_name] = output_data

print(f"Actual calls made: {call_count}")
print(f"Block stats: {block.execution_stats}")
print(f"LLM call count: {block.execution_stats.llm_call_count}")

# Should have made 2 calls: 1 for chunk summary + 1 for final summary
assert block.execution_stats.llm_call_count >= 1
assert block.execution_stats.input_token_count > 0