fix: migrate OpenAI provider to use Responses API #11674
Conversation
…vitas#11639)

### Changes 🏗️

Chat should be disabled by default; otherwise it flashes, and if Launch Darkly fails, it is dangerous.

### Checklist 📋

#### For code changes:

- [x] I have clearly listed my changes in the PR description
- [x] I have made a test plan
- [x] I have tested my changes according to the test plan:
  - [x] Run locally with Launch Darkly disabled and test the above
Updates the OpenAI provider in `llm.py` to use the newer Responses API (`responses.create`) instead of the legacy Chat Completions API (`chat.completions.create`).

Changes:
- Replace `chat.completions.create` with `responses.create`
- Use the `input` parameter instead of `messages`
- Use `max_output_tokens` instead of `max_completion_tokens`
- Parse `response.output_text` instead of `choices[0].message.content`
- Use `input_tokens`/`output_tokens` for token usage tracking
- Add helper functions for extracting reasoning and tool calls from the Responses API response format
- Update tests to mock the new API format

The chat service is not migrated, as it uses OpenRouter, which requires the Chat Completions API format.

Closes Significant-Gravitas#11624
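For orientation, a minimal sketch of the before/after call shape this description implies, assuming an `AsyncOpenAI` client; the model name, prompts, and token limit are placeholders, not the block's actual wiring.

```python
from openai import AsyncOpenAI

# Illustrative values; the real block builds these from the node's inputs.
oai_client = AsyncOpenAI(api_key="sk-...")
sys_prompt = "You are a helpful assistant."
user_prompt = "Summarize the release notes."

async def old_call() -> str:
    # Before: legacy Chat Completions API.
    completion = await oai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_completion_tokens=1024,
    )
    # Token usage came from completion.usage.prompt_tokens / completion_tokens.
    return completion.choices[0].message.content or ""

async def new_call() -> str:
    # After: Responses API - the system prompt moves to `instructions`,
    # remaining messages go into `input`, and the cap is `max_output_tokens`.
    response = await oai_client.responses.create(
        model="gpt-4o",
        instructions=sys_prompt,
        input=[{"role": "user", "content": user_prompt}],
        max_output_tokens=1024,
    )
    # Usage now reports response.usage.input_tokens / output_tokens.
    return response.output_text
```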
This wasn't in my first commit; not sure why it's showing up here.
### Changes Summary

This PR migrates the native OpenAI API integration from the deprecated Chat Completions endpoint to the new Responses API (v1.66.0+). It introduces two new helper functions to parse Responses API output, updates the OpenAI provider section to use the new endpoint, and adjusts parameter mapping and response parsing accordingly.

Type: feature

Components affected: `backend.blocks.llm` (OpenAI provider), `test.test_llm` (test mocks), LLM infrastructure

### Files Changed
### Architecture Impact
### Risk Areas

- Response field mapping differences: `output_text` vs `message.content`, `input_tokens`/`output_tokens` vs `prompt_tokens`/`completion_tokens`
- Tool call extraction: handling the `call_id` vs `id` field in Responses API output
- Reasoning extraction: handling both string and array summary formats in reasoning output (see the sketch after this list)
- Error handling: new try/catch for `openai.APIError` (follows the Anthropic pattern)
- System message extraction: moved from `messages` to the `instructions` parameter
- Response format parameter mapping: `response_format` to the `text` parameter
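To make the reasoning-extraction risk area concrete, here is a hedged sketch of handling both summary formats. This is an illustration only, not the PR's actual `extract_responses_api_reasoning()` implementation; the item and attribute shapes are assumed from the Responses API output format.

```python
def extract_reasoning_summary(response) -> str | None:
    # Responses API `output` may contain items of type "reasoning" whose
    # `summary` is either a plain string or a list of summary parts.
    for item in getattr(response, "output", None) or []:
        if getattr(item, "type", None) != "reasoning":
            continue
        summary = getattr(item, "summary", None)
        if isinstance(summary, str):
            return summary
        if isinstance(summary, list):
            return "\n".join(getattr(part, "text", str(part)) for part in summary)
    return None
```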
### Suggestions
### Review Summary

Validated 78 issues: 42 kept, 36 filtered. Issues found: 42. 💬 See 17 individual line comment(s) for details. 📊 23 unique issue type(s) across 42 location(s).

📋 Full issue list:

🔴 CRITICAL - Array bounds check missing for `response.choices[0]` (3 occurrences)
Agent: bugs. Category: bug.
Why this matters: accessing empty arrays throws IndexError/undefined access (see the sketch below).
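A hedged sketch of the kind of guard this issue asks for on paths that still index `choices[0]`; the helper name and arguments are illustrative, not code from the PR.

```python
def first_choice_message(completion, model_name: str):
    # Guard before indexing: an empty `choices` list would otherwise raise IndexError.
    if not getattr(completion, "choices", None):
        raise ValueError(f"{model_name}: provider returned an empty choices list")
    return completion.choices[0].message
```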
🔴 CRITICAL - Redundant Optional usage with union type syntax
Agent: python. Category: quality.
Why this matters: type hints enable IDE autocomplete and catch type errors early.
Description: line 326 uses both `Optional[List[ToolContentBlock]]` and `| None` on the same field, creating redundancy and confusion. `Optional[X]` is equivalent to `X | None`.
Suggestion: change to `tool_calls: list[ToolContentBlock] | None = None`.
Confidence: 98%

🟠 HIGH - Redundant and illogical None comparison (2 occurrences)
Agent: python. Category: bug.
Why this matters: a redundant None check indicates a logic error or dead branch.

🟠 HIGH - Sequential async calls in loop instead of parallel gathering
Agent: performance. Category: performance.
Why this matters: N+1 queries cause severe performance degradation.
Description: the `_run` method in AITextSummarizerBlock processes chunks sequentially, using a for loop with `await` on each chunk. With multiple chunks this becomes O(n) serialized operations when they could be parallelized.
Suggestion: use `asyncio.gather()` or `asyncio.TaskGroup` to summarize all chunks concurrently, reducing wall-clock time from O(n*t) to O(t) (see the sketch below).
Confidence: 88%

🟠 HIGH - Inconsistent mock setup for async HTTP calls (2 occurrences)
Agent: microservices. Category: bug.
Why this matters: live I/O introduces slowness, nondeterminism, and external failures unrelated to the code.
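A hedged sketch of the concurrent-gathering pattern suggested for the sequential-async issue above; `summarize_chunk` is a placeholder for the block's existing per-chunk LLM call, not its real API.

```python
import asyncio

async def summarize_chunk(chunk: str) -> str:
    return chunk[:100]  # placeholder for the existing per-chunk LLM call

async def summarize_all(chunks: list[str]) -> list[str]:
    # Fan the per-chunk summaries out concurrently instead of awaiting them
    # one at a time in a for-loop; wall-clock time drops from O(n*t) toward O(t).
    return list(await asyncio.gather(*(summarize_chunk(c) for c in chunks)))
```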
🟠 HIGH - Error logged without sufficient context for debugging
Agent: bugs. Category: bug.
Why this matters: context-free errors are impossible to trace in production logs.
Description: when catching `openai.APIError`, the error message only includes the error text but lacks context about which model was being called, the input parameters, or the operation context. This makes debugging and monitoring difficult in production logs.
Suggestion: include relevant context in the error message: add the `llm_model` value, a summary of input messages, and operation context. Example: `f'OpenAI Responses API error for model {llm_model.value}: {str(e)}'`.
Confidence: 72%

🟠 HIGH - God Function with 329 lines handling 8 different LLM providers
Agent: architecture. Category: quality.
Why this matters: framework coupling makes code harder to test and migrate.
Description: the `llm_call` function is a monolithic function with completely separate logic paths for 8 LLM providers. Each provider has its own client instantiation, request building, response parsing, error handling, and token counting. This violates the Single Responsibility Principle.
Suggestion: refactor using the Strategy pattern: create an abstract LLMProvider base class with provider-specific implementations, and use a factory to instantiate the correct provider based on model metadata.
Confidence: 75%

🟠 HIGH - Expensive regex substitution on every parse failure
Agent: performance. Category: performance.
Description: line 1048 executes `re.sub(r'[A-Za-z0-9]', '*', response_text)` to censor responses for logging on every parse failure. The regex is recompiled each time and operates on potentially large text.
Suggestion: pre-compile the regex at module level: `CENSOR_PATTERN = re.compile(r'[A-Za-z0-9]')`. Better: only censor a fixed-size snippet instead of the entire response (see the sketch below).
Confidence: 85%

🟠 HIGH - Missing timeout on external service call
Agent: python. Category: bug.
Description: the `oai_client.responses.create()` call at line 525 has no explicit timeout parameter. Compare with the anthropic client at line 575, which properly includes `timeout=600`.
Suggestion: add a timeout parameter to the `responses_params` dictionary before making the call, similar to the Anthropic call.
Confidence: 90%

🟠 HIGH - Incomplete error handling for external service call
Agent: python. Category: bug.
Description: the `oai_client.responses.create()` call only catches `openai.APIError`. Other exceptions like `asyncio.TimeoutError` and `ConnectionError` are not caught.
Suggestion: expand the try-except to catch additional exceptions: `asyncio.TimeoutError`, `ConnectionError`, etc.
Confidence: 80%

🟠 HIGH - Overly broad exception handling
Agent: python. Category: quality.
Description: a bare Exception clause catches all exceptions, including SystemExit and KeyboardInterrupt. It should catch specific exception types.
Suggestion: replace `except Exception as e` with specific exception types like `(openai.APIError, anthropic.APIError, ValueError)`.
Confidence: 90%

🟠 HIGH - Input parameter 'input_data' is modified in-place (2 occurrences)
Agent: python. Category: quality.
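A hedged sketch combining three of the suggestions above (module-level censor pattern, explicit timeout, and model context in the error); the wrapper function and parameter names are assumptions, not the existing code in `llm.py`.

```python
import re

import openai

# Compile once at module import instead of on every parse failure, and only
# censor a bounded snippet rather than the whole response.
CENSOR_PATTERN = re.compile(r"[A-Za-z0-9]")

def censored_snippet(response_text: str, limit: int = 500) -> str:
    return CENSOR_PATTERN.sub("*", response_text[:limit])

async def create_response(oai_client: openai.AsyncOpenAI, model: str, responses_params: dict):
    try:
        # Explicit per-request timeout, mirroring the 600s used for the Anthropic client.
        return await oai_client.responses.create(model=model, timeout=600, **responses_params)
    except (openai.APIError, TimeoutError) as e:
        # Carry the model name so production logs are traceable.
        raise ValueError(f"OpenAI Responses API error for model {model}: {e}") from e
```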
🟠 HIGH - Duplicate import in function body (5 occurrences)
Agent: python. Category: quality.

🟡 MEDIUM - Test mocks internal method - limited integration coverage (2 occurrences)
Agent: testing. Category: quality.
Why this matters: improper mocks make tests brittle and unreliable.

🟡 MEDIUM - Missing return type annotation (2 occurrences)
Agent: python. Category: quality.
Why this matters: type hints prevent whole classes of bugs and improve IDE/refactor support.

🟡 MEDIUM - Double loop through response.content blocks (2 occurrences)
Agent: performance. Category: performance.

🟡 MEDIUM - Debug print statements in test code
Agent: python. Category: quality.
Description: debug `print()` statements were left in a test function. They should be removed or replaced with logging.
Suggestion: replace with `logger.debug()` calls, use pytest's `caplog` fixture, or remove if no longer needed.
Confidence: 85%

🟡 MEDIUM - Test assertion comments contradict expected behavior
Agent: refactoring. Category: quality.
Description: line 187 states 'llm_call_count is only set on success, so it shows 1', but line 190 asserts `llm_call_count == 2`. The comment explanation is misleading; it should actually be retry_count + 1 = 2, as stated in line 190's inline comment.
Suggestion: remove or correct the misleading comment at line 187. The assertion at line 190 with its inline comment 'retry_count + 1 = 1 + 1 = 2' is correct, but line 187 contradicts this.
Confidence: 75%

🟡 MEDIUM - User input formatted into prompts via Jinja2
Agent: security. Category: security.
Description: user-supplied `prompt_values` are formatted into system and user prompts using the Jinja2 SandboxedEnvironment (via `fmt.format_string()`). While sandboxed, `autoescape=False` at initialization allows potential content manipulation in prompts.
Suggestion: consider enabling autoescape for the TextFormatter used in LLM blocks, or validate that `prompt_values` don't contain suspicious patterns. The SandboxedEnvironment mitigates template injection but doesn't prevent prompt manipulation (see the sketch below).
Confidence: 65%

🟡 MEDIUM - User input directly embedded into LLM prompts
Agent: security. Category: security.
Description: user-supplied `input_data.focus` and `input_data.source_data` are directly embedded into prompts using f-strings. While this is common for LLM applications, sensitive data could be sent to external providers.
Suggestion: consider implementing PII detection or data sanitization for sensitive patterns before embedding user input into prompts sent to external LLM providers.
Confidence: 62%

🟡 MEDIUM - Missing Input Validation for max_tokens Parameter (2 occurrences)
Agent: security. Category: security.
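A hedged sketch of the autoescape suggestion above; how the existing TextFormatter actually constructs its environment is an assumption, and whether HTML-escaping is appropriate for LLM prompts is a judgment the reviewer leaves open.

```python
from jinja2.sandbox import SandboxedEnvironment

# SandboxedEnvironment blocks template injection; autoescape additionally
# escapes markup-like user values when they are substituted into templates.
env = SandboxedEnvironment(autoescape=True)

user_supplied = {"focus": "<script>alert(1)</script> key revenue drivers"}
rendered = env.from_string("Summarize with a focus on {{ focus }}.").render(**user_supplied)
print(rendered)  # the tag is escaped rather than passed through verbatim
```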
🟡 MEDIUM - OpenAI Responses API Call Without Explicit Timeout (6 occurrences)
Agent: security. Category: security.

🟡 MEDIUM - Test assertions too vague to catch bugs (2 occurrences)
Agent: testing. Category: quality.
ℹ️ 25 issue(s) outside the PR diff:
🔴 CRITICAL - Redundant Optional usage with union type syntax
Agent: python. Category: quality.
Why this matters: type hints enable IDE autocomplete and catch type errors early.
Description: line 326 uses both `Optional[List[ToolContentBlock]]` and `| None` on the same field, creating redundancy and confusion. `Optional[X]` is equivalent to `X | None`.
Suggestion: change to `tool_calls: list[ToolContentBlock] | None = None`.
Confidence: 98%

🟠 HIGH - Redundant and illogical None comparison (2 occurrences)
Agent: python. Category: bug.
Why this matters: a redundant None check indicates a logic error or dead branch.

🟠 HIGH - Sequential async calls in loop instead of parallel gathering
Agent: performance. Category: performance.
Why this matters: N+1 queries cause severe performance degradation.
Description: the `_run` method in AITextSummarizerBlock processes chunks sequentially, using a for loop with `await` on each chunk. With multiple chunks this becomes O(n) serialized operations when they could be parallelized.
Suggestion: use `asyncio.gather()` or `asyncio.TaskGroup` to summarize all chunks concurrently, reducing wall-clock time from O(n*t) to O(t).
Confidence: 88%

🟠 HIGH - Expensive regex substitution on every parse failure
Agent: performance. Category: performance.
Description: line 1048 executes `re.sub(r'[A-Za-z0-9]', '*', response_text)` to censor responses for logging on every parse failure. The regex is recompiled each time and operates on potentially large text.
Suggestion: pre-compile the regex at module level: `CENSOR_PATTERN = re.compile(r'[A-Za-z0-9]')`. Better: only censor a fixed-size snippet instead of the entire response.
Confidence: 85%

🟠 HIGH - Overly broad exception handling
Agent: python. Category: quality.
Description: a bare Exception clause catches all exceptions, including SystemExit and KeyboardInterrupt. It should catch specific exception types.
Suggestion: replace `except Exception as e` with specific exception types like `(openai.APIError, anthropic.APIError, ValueError)`.
Confidence: 90%

🟠 HIGH - Input parameter 'input_data' is modified in-place (2 occurrences)
Agent: python. Category: quality.

🟡 MEDIUM - Test mocks internal method - limited integration coverage (2 occurrences)
Agent: testing. Category: quality.
Why this matters: improper mocks make tests brittle and unreliable.

🟡 MEDIUM - Missing type annotation for error parameter
Agent: python. Category: quality.
Why this matters: type hints prevent whole classes of bugs and improve IDE/refactor support.
Description: the parameter 'error' in the method 'invalid_response_feedback' lacks a type annotation. This method accepts different error types and should document them.
Suggestion: add a type annotation: `def invalid_response_feedback(self, error: Union[ValueError, JSONDecodeError, str], ...) -> str:`.
Confidence: 70%

🟡 MEDIUM - Double loop through response.content blocks
Agent: performance. Category: performance.
Description: the function loops through `resp.content` twice: lines 582-597 for tool_use and lines 605-608 for thinking-type extraction.
Suggestion: combine into a single iteration through `resp.content`, checking for both types in one pass.
Confidence: 70%

🟡 MEDIUM - Import inside function body (3 occurrences)
Agent: python. Category: quality.

🟡 MEDIUM - Test assertion comments contradict expected behavior
Agent: refactoring. Category: quality.
Description: line 187 states 'llm_call_count is only set on success, so it shows 1', but line 190 asserts `llm_call_count == 2`. The comment explanation is misleading; it should actually be retry_count + 1 = 2, as stated in line 190's inline comment.
Suggestion: remove or correct the misleading comment at line 187. The assertion at line 190 with its inline comment 'retry_count + 1 = 1 + 1 = 2' is correct, but line 187 contradicts this.
Confidence: 75%

🟡 MEDIUM - User input formatted into prompts via Jinja2
Agent: security. Category: security.
Description: user-supplied `prompt_values` are formatted into system and user prompts using the Jinja2 SandboxedEnvironment (via `fmt.format_string()`). While sandboxed, `autoescape=False` at initialization allows potential content manipulation in prompts.
Suggestion: consider enabling autoescape for the TextFormatter used in LLM blocks, or validate that `prompt_values` don't contain suspicious patterns. The SandboxedEnvironment mitigates template injection but doesn't prevent prompt manipulation.
Confidence: 65%

🟡 MEDIUM - User input directly embedded into LLM prompts
Agent: security. Category: security.
Description: user-supplied `input_data.focus` and `input_data.source_data` are directly embedded into prompts using f-strings. While this is common for LLM applications, sensitive data could be sent to external providers.
Suggestion: consider implementing PII detection or data sanitization for sensitive patterns before embedding user input into prompts sent to external LLM providers.
Confidence: 62%

🟡 MEDIUM - Missing Range Validation for chunk_overlap Parameter
Agent: security. Category: security.
Description: `chunk_overlap` has `ge=0` but no upper bound, and no cross-field validation ensures overlap < max_tokens, which could create invalid chunk configurations.
Suggestion: add validation to ensure `chunk_overlap < max_tokens` to prevent invalid chunking behavior.
Confidence: 70%

🟡 MEDIUM - Groq API Call Without Explicit Timeout (5 occurrences)
Agent: security. Category: security.

🟡 MEDIUM - Test assertions too vague to catch bugs
Agent: testing. Category: quality.
Description: the test uses assertions like …
Suggestion: replace …
Confidence: 75%

Review ID:
- Use `getattr` with fallback for tool call ID extraction (sketched below)
- Add return type annotation to `get_parallel_tool_calls_param`
- Add timeout (600s) to OpenAI Responses API call
- Add model context to error messages
- Handle `TimeoutError` in addition to `APIError`
- Remove duplicate imports in test file
- Remove debug print statements in test file
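A hedged sketch of the `call_id`/`id` fallback named in the first bullet; the item shape is assumed from the Responses API function-call output format, and this is not the PR's exact code.

```python
def tool_call_id(item) -> str | None:
    # Responses API function-call items normally expose `call_id`; fall back
    # to `id` if only that attribute is present on a given item or SDK version.
    return getattr(item, "call_id", None) or getattr(item, "id", None)
```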
The `raw_response` field is used by `smart_decision_maker.py` for conversation history. It expects a message dict with 'role' and 'content' keys, not the raw Response object.

- Construct a message dict with `role='assistant'` and the content (see the sketch below)
- Include `tool_calls` in OpenAI format when present
- Fixes multi-turn conversation and retry logic
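A minimal sketch of the assistant-message dict described above; the helper name is illustrative, `response` is assumed to be a Responses API result, and `tool_calls` an already-converted OpenAI-format list.

```python
def build_raw_response(response, tool_calls: list[dict] | None) -> dict:
    # smart_decision_maker.py expects an assistant message dict, not the raw
    # Response object, so re-shape the Responses API result accordingly.
    message: dict = {"role": "assistant", "content": response.output_text}
    if tool_calls:
        message["tool_calls"] = tool_calls  # already in OpenAI tool_calls format
    return message
```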
## Summary

Migrates OpenAI native API calls from the deprecated `chat.completions.create` endpoint to the new `responses.create` endpoint, as recommended by OpenAI.

Fixes #11624
## Changes

### Core Changes

- Updated the OpenAI provider in `llm.py` to use `client.responses.create()`
- Added `extract_responses_api_reasoning()` helper to parse reasoning output (handles both string and array summary formats)
- Added `extract_responses_api_tool_calls()` helper to parse function calls
- System messages are passed via the `instructions` parameter (Responses API requirement)
### Parameter Mapping (Chat Completions → Responses API)

- `messages` → `input` (non-system messages only)
- System message → `instructions` parameter
- `max_completion_tokens` → `max_output_tokens`
- `response_format={...}` → `text={"format": {...}}` (see the JSON-mode sketch below)
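As a concrete example of the last mapping, a hedged sketch for plain JSON mode; `responses_params` and the flag name are assumed local names, and the exact inner shape used for schema-based structured output may differ.

```python
# Chat Completions forced JSON output with response_format={"type": "json_object"};
# with the Responses API the same intent moves under the `text` parameter.
responses_params: dict = {}
force_json_output = True  # illustrative flag, mirroring the block's input name
if force_json_output:
    responses_params["text"] = {"format": {"type": "json_object"}}
```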
### Response Parsing (Chat Completions → Responses API)

- `choices[0].message.content` → `output_text`
- `usage.prompt_tokens` → `usage.input_tokens`
- `usage.completion_tokens` → `usage.output_tokens`
- `choices[0].message.tool_calls` → `output` items with `type="function_call"` (see the sketch below)
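A hedged sketch of parsing function calls out of the Responses API output; this is an illustration of the mapping above, not the PR's exact `extract_responses_api_tool_calls()` implementation.

```python
def extract_function_calls(response) -> list[dict] | None:
    # Collect `function_call` items from the Responses API `output` list and
    # re-shape them like Chat Completions tool_calls for downstream blocks.
    calls = []
    for item in getattr(response, "output", None) or []:
        if getattr(item, "type", None) != "function_call":
            continue
        calls.append({
            "id": getattr(item, "call_id", None) or getattr(item, "id", None),
            "type": "function",
            "function": {"name": item.name, "arguments": item.arguments},
        })
    return calls or None
```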
## Compatibility

### SDK Version

The Responses API is available in the openai SDK from v1.66.0 onwards.

### API Compatibility

- `llm_call()` function signature - UNCHANGED
- `LLMResponse` class structure - UNCHANGED
### Provider Impact

- `openai` - YES, modified (native OpenAI - uses Responses API)
- `anthropic` - NO (different SDK entirely)
- `groq` - NO (third-party API, Chat Completions compatible)
- `open_router` - NO (third-party API, Chat Completions compatible)
- `llama_api` - NO (third-party API, Chat Completions compatible)
- `ollama` - NO (uses ollama SDK)
- `aiml_api` - NO (third-party API, Chat Completions compatible)
- `v0` - NO (third-party API, Chat Completions compatible)
### Dependent Blocks Verified

- `smart_decision_maker.py` (line 508) - uses: response, tool_calls, prompt_tokens, completion_tokens, reasoning - COMPATIBLE
- `ai_condition.py` (line 113) - uses: response, prompt_tokens, completion_tokens, prompt - COMPATIBLE
- `perplexity.py` - does not use llm_call (uses a different API) - NOT AFFECTED
### Streaming Service

`backend/server/v2/chat/service.py` is NOT affected - it uses OpenRouter by default, which requires the Chat Completions API format.
## Testing

### Test File Updates

- Updated `test_llm.py` mocks to use `output_text` instead of `choices[0].message.content` (mock shape sketched below)
- Mock the `output` array for tool calls
- Use `usage.input_tokens`/`usage.output_tokens`
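A hedged sketch of the updated mock shape; the names are illustrative, not the exact fixtures in `test_llm.py`.

```python
from unittest.mock import AsyncMock, MagicMock

# The fake response now exposes Responses API fields instead of choices[0].
mock_response = MagicMock()
mock_response.output_text = "summary text"
mock_response.output = []  # no tool calls or reasoning items in this case
mock_response.usage.input_tokens = 10
mock_response.usage.output_tokens = 5

mock_client = MagicMock()
mock_client.responses.create = AsyncMock(return_value=mock_response)
```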
### Verification Performed

- Tool call ID extraction uses `call_id` with a fallback to `id`
- Reasoning extraction handles both string and array `summary` formats
### Recommended Manual Testing

- JSON output mode (`force_json_output=True`)
## Files Modified

1. `autogpt_platform/backend/backend/blocks/llm.py`
   - Added `extract_responses_api_reasoning()` helper
   - Added `extract_responses_api_tool_calls()` helper
   - Migrated the OpenAI provider to `responses.create`
   - System messages passed via the `instructions` parameter
2. `autogpt_platform/backend/backend/blocks/test/test_llm.py`

## References
## Checklist

### Changes

### Code Quality