# fix: migrate OpenAI provider to use Responses API #11674
Base branch: `dev`
Changes from all commits: 4a7bc00, 8bdc5d8, c558293, eea40ff, b3c7cb4, e60823a
## Summary

Migrates OpenAI native API calls from the deprecated `chat.completions.create` endpoint to the new `responses.create` endpoint, as recommended by OpenAI.

Fixes #11624

## Changes

### Core Changes
1. Updated OpenAI provider in `llm.py` to use `client.responses.create()`
2. Added `extract_responses_api_reasoning()` helper to parse reasoning output (handles both string and array `summary` formats; see the usage sketch after this list)
3. Added `extract_responses_api_tool_calls()` helper to parse function calls
4. Added error handling for API errors (matching the Anthropic provider pattern)
5. Extract system messages to `instructions` parameter (a Responses API requirement)

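For illustration, here is how the new reasoning helper behaves on the two `summary` shapes it has to support. The objects below are hand-built stand-ins (via `SimpleNamespace`), not real SDK types; the helper itself appears in the `llm.py` diff further down.

```python
from types import SimpleNamespace

# A reasoning item whose summary is a plain string.
string_style = SimpleNamespace(
    output=[SimpleNamespace(type="reasoning", summary="Chose tool A over tool B.")]
)

# A reasoning item whose summary is a list of summary parts.
array_style = SimpleNamespace(
    output=[
        SimpleNamespace(type="reasoning", summary=["Step 1: read input.", "Step 2: decide."])
    ]
)

# extract_responses_api_reasoning(string_style) -> "Chose tool A over tool B."
# extract_responses_api_reasoning(array_style)  -> "Step 1: read input. Step 2: decide."
```
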
### Parameter Mapping (Chat Completions → Responses API)
A request-side sketch follows this list.
1. `messages` → `input` (non-system messages only)
2. System messages → `instructions` parameter
3. `max_completion_tokens` → `max_output_tokens`
4. `response_format={...}` → `text={"format":{...}}`

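A minimal sketch of the request-side mapping, assuming an `openai.AsyncOpenAI` client; the variable names and prompt content here are illustrative, not copied from `llm.py`:

```python
import asyncio
import openai

async def demo() -> None:
    oai_client = openai.AsyncOpenAI(api_key="sk-...")  # placeholder key
    model = "gpt-4o"
    max_tokens = 1024
    prompt = [
        {"role": "system", "content": "You are terse."},
        {"role": "user", "content": "Say hi."},
    ]

    # Responses API: system messages move into `instructions`,
    # everything else becomes `input`.
    system_text = " ".join(p["content"] for p in prompt if p["role"] == "system")
    response = await oai_client.responses.create(
        model=model,
        instructions=system_text or None,
        input=[p for p in prompt if p["role"] != "system"],
        max_output_tokens=max_tokens,
        # text={"format": {"type": "json_object"}},  # only when JSON output is forced
    )
    print(response.output_text)

# The Chat Completions call this replaces was roughly:
#   await oai_client.chat.completions.create(
#       model=model,
#       messages=prompt,
#       max_completion_tokens=max_tokens,
#       response_format={"type": "json_object"},  # when JSON output is forced
#   )

if __name__ == "__main__":
    asyncio.run(demo())
```
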
### Response Parsing (Chat Completions → Responses API)
A parsing sketch follows this list.
1. `choices[0].message.content` → `output_text`
2. `usage.prompt_tokens` → `usage.input_tokens`
3. `usage.completion_tokens` → `usage.output_tokens`
4. `choices[0].message.tool_calls` → `output` items with `type="function_call"`

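A rough sketch of the corresponding response-side access, with the old Chat Completions fields noted in comments; `response` is assumed to come from `client.responses.create(...)`:

```python
def parse_responses_api(response) -> dict:
    """Sketch: pull out the fields mapped in the list above."""
    return {
        # was: response.choices[0].message.content
        "text": response.output_text or "",
        # was: response.usage.prompt_tokens / response.usage.completion_tokens
        "input_tokens": response.usage.input_tokens if response.usage else 0,
        "output_tokens": response.usage.output_tokens if response.usage else 0,
        # was: response.choices[0].message.tool_calls
        "function_calls": [
            item for item in (response.output or []) if item.type == "function_call"
        ],
    }
```
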
## Compatibility

### SDK Version
1. **Required:** openai >= 1.66.0 (Responses API added in [v1.66.0](https://github.com/openai/openai-python/releases/tag/v1.66.0))
2. **AutoGPT uses:** ^1.97.1 (COMPATIBLE; a runtime version-check sketch follows)

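If you want to sanity-check the installed SDK at runtime, a small hedged check (not part of this PR) could look like:

```python
import openai

# Responses API support landed in openai 1.66.0; AutoGPT pins ^1.97.1, so this should pass.
installed = tuple(int(part) for part in openai.__version__.split(".")[:2])
assert installed >= (1, 66), f"openai {openai.__version__} predates the Responses API"
```
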
### API Compatibility
1. `llm_call()` function signature - UNCHANGED
2. `LLMResponse` class structure - UNCHANGED
3. Return type and fields - UNCHANGED

### Provider Impact
1. `openai` - YES, modified (Native OpenAI, uses Responses API)
2. `anthropic` - NO (Different SDK entirely)
3. `groq` - NO (Third-party API, Chat Completions compatible)
4. `open_router` - NO (Third-party API, Chat Completions compatible)
5. `llama_api` - NO (Third-party API, Chat Completions compatible)
6. `ollama` - NO (Uses ollama SDK)
7. `aiml_api` - NO (Third-party API, Chat Completions compatible)
8. `v0` - NO (Third-party API, Chat Completions compatible)

### Dependent Blocks Verified
A sketch of the consumed fields follows this list.
1. `smart_decision_maker.py` (Line 508) - Uses: response, tool_calls, prompt_tokens, completion_tokens, reasoning - COMPATIBLE
2. `ai_condition.py` (Line 113) - Uses: response, prompt_tokens, completion_tokens, prompt - COMPATIBLE
3. `perplexity.py` - Does not use llm_call (uses a different API) - NOT AFFECTED

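For reference, a hypothetical consumer mirroring what those blocks read from an `LLMResponse` (attribute names per the list above; nothing here changes in this PR):

```python
def consume(result) -> dict:
    """Sketch of the LLMResponse fields the dependent blocks rely on."""
    return {
        "text": result.response,
        "tool_calls": result.tool_calls,
        "prompt_tokens": result.prompt_tokens,
        "completion_tokens": result.completion_tokens,
        "reasoning": result.reasoning,
    }
```
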
### Streaming Service
`backend/server/v2/chat/service.py` is NOT affected: it uses OpenRouter by default, which requires the Chat Completions API format.

## Testing

### Test File Updates
A condensed sketch of the new mock shape follows this list.
1. Updated `test_llm.py` mocks to use `output_text` instead of `choices[0].message.content`
2. Updated mocks to use `output` array for tool calls
3. Updated mocks to use `usage.input_tokens` / `usage.output_tokens`

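Condensed, the updated mocks take roughly this shape (the full change is in the `test_llm.py` diff below):

```python
from unittest.mock import AsyncMock, MagicMock, patch

# Responses API fields instead of `choices[...]`.
mock_response = MagicMock()
mock_response.output_text = "Test response"
mock_response.output = []  # no tool calls
mock_response.usage = MagicMock(input_tokens=10, output_tokens=20)

with patch("openai.AsyncOpenAI") as mock_openai:
    mock_client = AsyncMock()
    mock_openai.return_value = mock_client
    mock_client.responses.create = AsyncMock(return_value=mock_response)
```
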
### Verification Performed
1. SDK version compatibility verified (1.97.1 > 1.66.0)
2. Function signature unchanged
3. LLMResponse class unchanged
4. All 7 other providers unchanged
5. Dependent blocks use only the public API
6. Streaming service unaffected (uses OpenRouter)
7. Error handling matches the Anthropic provider pattern
8. Tool call extraction handles `call_id` with fallback to `id`
9. Reasoning extraction handles both string and array `summary` formats

### Recommended Manual Testing
1. Test with a GPT-4o model using the native OpenAI API
2. Test with tool/function calling enabled
3. Test with JSON mode (`force_json_output=True`)
4. Verify token counting works correctly

## Files Modified

### 1. `autogpt_platform/backend/backend/blocks/llm.py`
1. Added `extract_responses_api_reasoning()` helper
2. Added `extract_responses_api_tool_calls()` helper
3. Updated OpenAI provider section to use `responses.create`
4. Added error handling with try/except
5. Extract system messages to `instructions` parameter

### 2. `autogpt_platform/backend/backend/blocks/test/test_llm.py`
1. Updated mocks for Responses API format

## References

1. [OpenAI Responses API Docs](https://platform.openai.com/docs/api-reference/responses)
2. [OpenAI Function Calling Docs](https://platform.openai.com/docs/guides/function-calling)
3. [OpenAI Reasoning Docs](https://platform.openai.com/docs/guides/reasoning)
4. [Simon Willison's Comparison](https://simonwillison.net/2025/Mar/11/responses-vs-chat-completions/)
5. [OpenAI Python SDK v1.66.0 Release](https://github.com/openai/openai-python/releases/tag/v1.66.0)

## Checklist

### Changes
- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Updated unit test mocks to use Responses API format
  - [x] Verified function signature unchanged
  - [x] Verified LLMResponse class unchanged
  - [x] Verified dependent blocks compatible
  - [x] Verified other providers unchanged

### Code Quality
- [x] My code follows the project's style guidelines
- [x] I have commented my code where necessary
- [x] My changes generate no new warnings
- [x] I have added error handling matching existing patterns

## Diff: `autogpt_platform/backend/backend/blocks/llm.py`

```diff
@@ -362,8 +362,7 @@ def convert_openai_tool_fmt_to_anthropic(

 def extract_openai_reasoning(response) -> str | None:
-    """Extract reasoning from OpenAI-compatible response if available."""
-    """Note: This will likely not working since the reasoning is not present in another Response API"""
+    """Extract reasoning from OpenAI Chat Completions response if available."""
     reasoning = None
     choice = response.choices[0]
     if hasattr(choice, "reasoning") and getattr(choice, "reasoning", None):
```

```diff
@@ -378,7 +377,7 @@ def extract_openai_reasoning(response) -> str | None:

 def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
-    """Extract tool calls from OpenAI-compatible response."""
+    """Extract tool calls from OpenAI Chat Completions response."""
     if response.choices[0].message.tool_calls:
         return [
             ToolContentBlock(
```

```diff
@@ -394,9 +393,47 @@ def extract_openai_tool_calls(response) -> list[ToolContentBlock] | None:
     return None


+def extract_responses_api_reasoning(response) -> str | None:
+    """Extract reasoning from OpenAI Responses API response if available.
+
+    The summary field can be either a string or an array of summary items,
+    so we handle both cases appropriately.
+    """
+    # The Responses API stores reasoning in output items with type "reasoning"
+    if hasattr(response, "output") and response.output:
+        for item in response.output:
+            if hasattr(item, "type") and item.type == "reasoning":
+                if hasattr(item, "summary") and item.summary:
+                    # Handle both string and array summary formats
+                    if isinstance(item.summary, list):
+                        # Join array items into a single string
+                        return " ".join(str(s) for s in item.summary if s)
+                    return str(item.summary)
+    return None
+
+
+def extract_responses_api_tool_calls(response) -> list[ToolContentBlock] | None:
+    """Extract tool calls from OpenAI Responses API response."""
+    tool_calls = []
+    if hasattr(response, "output") and response.output:
+        for item in response.output:
+            if hasattr(item, "type") and item.type == "function_call":
+                tool_calls.append(
+                    ToolContentBlock(
+                        id=getattr(item, "call_id", getattr(item, "id", "")),
+                        type="function",
+                        function=ToolCall(
+                            name=item.name,
+                            arguments=item.arguments,
+                        ),
+                    )
+                )
+    return tool_calls if tool_calls else None
+
+
 def get_parallel_tool_calls_param(
     llm_model: LlmModel, parallel_tool_calls: bool | None
-):
+) -> bool | openai.NotGiven:
     """Get the appropriate parallel_tool_calls parameter for OpenAI-compatible APIs."""
     if llm_model.startswith("o") or parallel_tool_calls is None:
         return openai.NOT_GIVEN
```

> Reviewer note on `llm_call` (🟠 HIGH, quality, confidence 75%): god function, 329 lines handling 8 different LLM providers; framework coupling makes the code harder to test and migrate.

```diff
@@ -454,34 +491,81 @@ async def llm_call(
     if provider == "openai":
         tools_param = tools if tools else openai.NOT_GIVEN
         oai_client = openai.AsyncOpenAI(api_key=credentials.api_key.get_secret_value())
-        response_format = None

-        parallel_tool_calls = get_parallel_tool_calls_param(
+        parallel_tool_calls_param = get_parallel_tool_calls_param(
             llm_model, parallel_tool_calls
         )

+        # Extract system messages for instructions parameter
+        system_messages = [p["content"] for p in prompt if p["role"] == "system"]
+        instructions = " ".join(system_messages) if system_messages else None
+
+        # Filter out system messages for input (Responses API expects them in instructions)
+        input_messages = [p for p in prompt if p["role"] != "system"]
```

> Reviewer note on lines +500 to +504 (🟡 MEDIUM, performance, confidence 65%): repeated list comprehension filters for system messages.

```diff
+        # Build Responses API parameters
+        responses_params: dict = {
+            "model": llm_model.value,
+            "input": input_messages,
+            "max_output_tokens": max_tokens,
+        }
+
+        if instructions:
+            responses_params["instructions"] = instructions
+
+        if tools_param is not openai.NOT_GIVEN:
+            responses_params["tools"] = tools_param
+        if parallel_tool_calls_param is not openai.NOT_GIVEN:
+            responses_params["parallel_tool_calls"] = parallel_tool_calls_param
+
         if force_json_output:
-            response_format = {"type": "json_object"}
+            responses_params["text"] = {"format": {"type": "json_object"}}

-        response = await oai_client.chat.completions.create(
-            model=llm_model.value,
-            messages=prompt,  # type: ignore
-            response_format=response_format,  # type: ignore
-            max_completion_tokens=max_tokens,
-            tools=tools_param,  # type: ignore
-            parallel_tool_calls=parallel_tool_calls,
-        )
+        try:
+            response = await oai_client.responses.create(
+                **responses_params, timeout=600
+            )
+        except openai.APIError as e:
+            error_message = (
+                f"OpenAI Responses API error for model {llm_model.value}: {str(e)}"
+            )
+            logger.error(error_message)
+            raise ValueError(error_message) from e
+        except TimeoutError as e:
+            error_message = f"OpenAI Responses API timeout for model {llm_model.value}"
+            logger.error(error_message)
+            raise ValueError(error_message) from e

-        tool_calls = extract_openai_tool_calls(response)
-        reasoning = extract_openai_reasoning(response)
+        tool_calls = extract_responses_api_tool_calls(response)
+        reasoning = extract_responses_api_reasoning(response)

+        # Build a message dict for raw_response that matches the expected format
+        # for conversation history (role, content, and optionally tool_calls)
+        raw_response_dict: dict = {
+            "role": "assistant",
+            "content": response.output_text or "",
+        }
+        # Add tool_calls in OpenAI format if present
+        if tool_calls:
+            raw_response_dict["tool_calls"] = [
+                {
+                    "id": tc.id,
+                    "type": "function",
+                    "function": {
+                        "name": tc.function.name,
+                        "arguments": tc.function.arguments,
+                    },
+                }
+                for tc in tool_calls
+            ]
+
         return LLMResponse(
-            raw_response=response.choices[0].message,
+            raw_response=raw_response_dict,
             prompt=prompt,
-            response=response.choices[0].message.content or "",
+            response=response.output_text or "",
             tool_calls=tool_calls,
-            prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
-            completion_tokens=response.usage.completion_tokens if response.usage else 0,
+            prompt_tokens=response.usage.input_tokens if response.usage else 0,
+            completion_tokens=response.usage.output_tokens if response.usage else 0,
             reasoning=reasoning,
         )
     elif provider == "anthropic":
```

## Diff: `autogpt_platform/backend/backend/blocks/test/test_llm.py`

```diff
@@ -13,18 +13,17 @@ async def test_llm_call_returns_token_counts(self):
         """Test that llm_call returns proper token counts in LLMResponse."""
         import backend.blocks.llm as llm

-        # Mock the OpenAI client
+        # Mock the OpenAI Responses API response
         mock_response = MagicMock()
-        mock_response.choices = [
-            MagicMock(message=MagicMock(content="Test response", tool_calls=None))
-        ]
-        mock_response.usage = MagicMock(prompt_tokens=10, completion_tokens=20)
+        mock_response.output_text = "Test response"
+        mock_response.output = []  # No tool calls
+        mock_response.usage = MagicMock(input_tokens=10, output_tokens=20)

-        # Test with mocked OpenAI response
+        # Test with mocked OpenAI Responses API
         with patch("openai.AsyncOpenAI") as mock_openai:
             mock_client = AsyncMock()
             mock_openai.return_value = mock_client
-            mock_client.chat.completions.create = AsyncMock(return_value=mock_response)
+            mock_client.responses.create = AsyncMock(return_value=mock_response)

             response = await llm.llm_call(
                 credentials=llm.TEST_CREDENTIALS,
```

```diff
@@ -41,8 +40,6 @@ async def test_llm_call_returns_token_counts(self):
     @pytest.mark.asyncio
     async def test_ai_structured_response_block_tracks_stats(self):
         """Test that AIStructuredResponseGeneratorBlock correctly tracks stats."""
-        from unittest.mock import patch
-
         import backend.blocks.llm as llm

         block = llm.AIStructuredResponseGeneratorBlock()
```

```diff
@@ -255,13 +252,11 @@ async def mock_llm_call(input_data, credentials):
     @pytest.mark.asyncio
     async def test_ai_text_summarizer_real_llm_call_stats(self):
         """Test AITextSummarizer with real LLM call mocking to verify llm_call_count."""
-        from unittest.mock import AsyncMock, MagicMock, patch
-
         import backend.blocks.llm as llm

         block = llm.AITextSummarizerBlock()

-        # Mock the actual LLM call instead of the llm_call method
+        # Mock the actual LLM call using Responses API format
         call_count = 0

         async def mock_create(*args, **kwargs):
```

> Reviewer note (🟡 MEDIUM, quality, confidence 65%): the test name suggests a live ("real") LLM call, but the test uses mocks.

```diff
@@ -271,30 +266,17 @@ async def mock_create(*args, **kwargs):
             mock_response = MagicMock()
             # Return different responses for chunk summary vs final summary
             if call_count == 1:
-                mock_response.choices = [
-                    MagicMock(
-                        message=MagicMock(
-                            content='<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>',
-                            tool_calls=None,
-                        )
-                    )
-                ]
+                mock_response.output_text = '<json_output id="test123456">{"summary": "Test chunk summary"}</json_output>'
             else:
-                mock_response.choices = [
-                    MagicMock(
-                        message=MagicMock(
-                            content='<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>',
-                            tool_calls=None,
-                        )
-                    )
-                ]
-            mock_response.usage = MagicMock(prompt_tokens=50, completion_tokens=30)
+                mock_response.output_text = '<json_output id="test123456">{"final_summary": "Test final summary"}</json_output>'
+            mock_response.output = []  # No tool calls
+            mock_response.usage = MagicMock(input_tokens=50, output_tokens=30)
             return mock_response

         with patch("openai.AsyncOpenAI") as mock_openai:
             mock_client = AsyncMock()
             mock_openai.return_value = mock_client
-            mock_client.chat.completions.create = mock_create
+            mock_client.responses.create = mock_create

             # Test with very short text (should only need 1 chunk + 1 final summary)
             input_data = llm.AITextSummarizerBlock.Input(
```

> Reviewer note (🟠 HIGH, bug, confidence 70%): inconsistent mock setup for async HTTP calls.

> Reviewer note (🟡 MEDIUM, quality, confidence 85%): debug print statements in test code.

```diff
@@ -312,10 +294,6 @@ async def mock_create(*args, **kwargs):
             ):
                 outputs[output_name] = output_data

-            print(f"Actual calls made: {call_count}")
-            print(f"Block stats: {block.execution_stats}")
-            print(f"LLM call count: {block.execution_stats.llm_call_count}")
-
             # Should have made 2 calls: 1 for chunk summary + 1 for final summary
             assert block.execution_stats.llm_call_count >= 1
             assert block.execution_stats.input_token_count > 0
```