fix(sdk): update default critic route#3055
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed
[RISK ASSESSMENT]
Changing the default critic endpoint could affect agent behavior in benchmarks if the new endpoint returns different classifications than the old one. Per this repo's review policy, PRs that could affect benchmark performance should not be approved without eval evidence.
VERDICT
❌ Needs eval validation: The code changes are well-structured and the test coverage is appropriate for verifying defaults. However, before merging:
- Verify the new endpoint is functional and returns reasonable results
- Consider running lightweight evals to confirm no unexpected behavior changes, OR
- Have a human maintainer review and confirm the change is safe
KEY INSIGHT
The constants-based refactoring is good for maintainability, but the functional impact of the new endpoint needs validation before auto-approval.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Successfully verified that the PR updates the critic default route to the new hosted classification endpoint. All defaults changed as intended, new test passes, and no regressions detected.
Does this PR achieve its stated goal?
Yes. The PR successfully updates the APIBasedCritic default server URL and model name to point to the new hosted classification service. Testing confirms:
- Default server URL changed from
critic-qwen3-4b-serve.modal.runto the new endpoint with training metadata - Default model name changed from generic
critic-qwen3-4bto the specific trained modelcritic-qwen3-4b-5e-06lr-withtools-step9658-success-survival - New constants are properly exported and usable
- All existing critic tests continue to pass
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv environment configured, dependencies synced |
| CI Status | ✅ Key checks passing (pre-commit, sdk-tests, Python API) |
| Functional Verification | ✅ Defaults updated correctly, all 47 tests pass |
Functional Verification
Test 1: Verify old defaults (baseline)
Step 1 — Establish baseline (main branch):
Checked out main branch and inspected the old defaults:
git checkout main
uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"Output:
server_url: https://all-hands-ai--critic-qwen3-4b-serve.modal.run
model_name: critic-qwen3-4b
This confirms the baseline defaults before the PR changes.
Step 2 — Apply the PR's changes:
Checked out PR branch:
git checkout fix-critic-default-routingStep 3 — Re-run with the fix in place:
Verified new defaults on PR branch:
uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient, DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"Output:
server_url: https://all-hands-ai--critic-qwen3-4b-5e-06lr-withtools-step9658-e06544.modal.run
model_name: critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival
This shows both defaults have been successfully updated to the new hosted endpoint and trained model identifier.
Test 2: Verify constants are exported and usable
Ran import test:
uv run python -c "from openhands.sdk.critic.impl.api.client import DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; print('✅ Constants successfully imported')"Result: ✅ Constants successfully imported
This confirms the new constants can be imported by users who need to reference or override the defaults.
Test 3: Verify new test coverage
Ran the new test:
uv run pytest tests/sdk/critic/test_critic_client.py::test_critic_client_uses_current_default_route -vResult: PASSED ✅
The test verifies that a CriticClient instantiated without explicit server_url or model_name uses the new constants.
Test 4: Verify no regressions
Ran full critic test suite:
uv run pytest tests/sdk/critic/ -vResult: 47 passed, 5 warnings in 0.26s ✅
All existing tests continue to pass, confirming no regressions from the default changes.
Issues Found
None.
Co-authored-by: openhands <openhands@all-hands.dev>
a751c63 to
0040a49
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - clean refactoring, but eval validation required per repo policy
[RISK ASSESSMENT]
Changing the default critic endpoint may affect agent benchmark behavior if the new service returns different classifications than the old Modal endpoint. The code changes are solid, but this falls under the repo's eval-risk policy.
VERDICT
⏸️ Awaiting eval validation: Per this repo's review policy, changes that could affect benchmark performance need eval evidence before approval. Please provide a link to an eval monitor run (openhands-eval-monitor.vercel.app) showing the new endpoint performs comparably, or confirm a human maintainer has validated the change.
KEY INSIGHT
The refactoring itself is well-executed (constants, test coverage), but production impact needs verification before merge.
|
Validation update for latest commit 0040a49:
@all-hands-bot please re-review the latest commit with this validation context. |
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
This PR successfully updates the critic service default route from a deprecated Modal endpoint to a new hosted proxy endpoint. The new endpoint is verified to be reachable and functional.
Does this PR achieve its stated goal?
Yes. The PR updates the default critic server URL from https://all-hands-ai--critic-qwen3-4b-serve.modal.run (which now returns 404) to https://llm-proxy.app.all-hands.dev/vllm (which is reachable and returns 401 for authentication as expected). The model name is also updated from critic-qwen3-4b to critic to match the new service. Both CriticClient and APIBasedCritic correctly use the new defaults, and the old endpoint is confirmed to be unavailable.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Environment built successfully with uv |
| CI Status | ✅ All 30 checks passing |
| Functional Verification | ✅ Defaults updated, new endpoint reachable, backward compatibility preserved |
Functional Verification
Test 1: Verify default constants are correctly set
Step 1 — Establish what the old defaults were (before the PR):
Checked the main branch:
git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'server_url: str = Field'Output:
server_url: str = Field(
default="https://all-hands-ai--critic-qwen3-4b-serve.modal.run",
description="Base URL of the vLLM classification service",
)
git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'model_name: str = Field'Output:
model_name: str = Field(
default="critic-qwen3-4b", description="Name of the model to use"
)
This confirms the old defaults were:
- Server URL:
https://all-hands-ai--critic-qwen3-4b-serve.modal.run - Model name:
critic-qwen3-4b
Step 2 — Apply the PR's changes (already on PR branch):
Already on branch fix-critic-default-routing (commit 0040a49).
Step 3 — Verify the new defaults:
Ran Python script:
from openhands.sdk.critic.impl.api.client import (
DEFAULT_CRITIC_SERVER_URL,
DEFAULT_CRITIC_MODEL_NAME,
CriticClient,
)
print(f"DEFAULT_CRITIC_SERVER_URL = {DEFAULT_CRITIC_SERVER_URL}")
print(f"DEFAULT_CRITIC_MODEL_NAME = {DEFAULT_CRITIC_MODEL_NAME}")
client = CriticClient(api_key="test_key")
print(f"client.server_url = {client.server_url}")
print(f"client.model_name = {client.model_name}")Output:
DEFAULT_CRITIC_SERVER_URL = https://llm-proxy.app.all-hands.dev/vllm
DEFAULT_CRITIC_MODEL_NAME = critic
client.server_url = https://llm-proxy.app.all-hands.dev/vllm
client.model_name = critic
✅ Default constants are correct
✅ CriticClient uses correct defaults
This confirms the new defaults are correctly set and used by CriticClient.
Test 2: Verify APIBasedCritic inherits the new defaults
Verification:
Ran Python script:
from openhands.sdk.critic.impl.api import APIBasedCritic
from openhands.sdk.critic.impl.api.client import (
DEFAULT_CRITIC_SERVER_URL,
DEFAULT_CRITIC_MODEL_NAME,
)
critic = APIBasedCritic(api_key="test_key")
print(f"critic.server_url = {critic.server_url}")
print(f"critic.model_name = {critic.model_name}")Output:
critic.server_url = https://llm-proxy.app.all-hands.dev/vllm
critic.model_name = critic
✅ APIBasedCritic correctly inherits the default server URL and model name
This confirms APIBasedCritic (which inherits from CriticClient) correctly uses the new defaults.
Test 3: Verify endpoint reachability
Step 1 — Check if the old endpoint is still available:
Ran HTTP POST request to old endpoint:
import httpx
resp = httpx.post(
"https://all-hands-ai--critic-qwen3-4b-serve.modal.run/classify",
json={"model": "critic-qwen3-4b", "input": "test"},
headers={"Authorization": "Bearer test_key"},
timeout=10.0
)
print(f"Status: {resp.status_code}")Output:
Status: 404
Response preview: modal-http: invalid function call
⚠️ Endpoint returned unexpected status: 404
This confirms the old endpoint is no longer available (404 = not found), justifying the need for this PR.
Step 2 — Verify the new endpoint is reachable:
Ran HTTP POST request to new endpoint:
import httpx
resp = httpx.post(
"https://llm-proxy.app.all-hands.dev/vllm/classify",
json={"model": "critic", "input": "test"},
headers={"Authorization": "Bearer test_key"},
timeout=10.0
)
print(f"Status: {resp.status_code}")Output:
Status: 401
Response preview: {"error":{"message":"Authentication Error, LiteLLM Virtual Key expected. Received=****, expected to start with 'sk-'.","type":"auth_error","param":"None","code":"401"}}
✅ Endpoint is reachable (401 = authentication required, as expected)
This confirms the new endpoint is up and running. The 401 status is expected behavior when using a test API key — it proves the endpoint is reachable and requires valid authentication.
Test 4: Verify defaults can be overridden
Verification:
Ran Python script:
client_custom = CriticClient(
api_key="test_key",
server_url="https://custom.example.com",
model_name="custom-model"
)
print(f"client_custom.server_url = {client_custom.server_url}")
print(f"client_custom.model_name = {client_custom.model_name}")Output:
client_custom.server_url = https://custom.example.com
client_custom.model_name = custom-model
✅ Defaults can be overridden
This confirms backward compatibility — users can still override the defaults when needed.
Issues Found
None.
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy
[RISK ASSESSMENT]
Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per this repository's review policy in .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.
VERDICT
❌ Awaiting eval validation: The code changes are well-structured and properly tested, but repository policy blocks approval until eval evidence is provided. Either:
- Run a lightweight benchmark (e.g., via eval monitor at
openhands-eval-monitor.vercel.app) and link results, OR - Get explicit maintainer sign-off that eval validation is not needed for this infrastructure change
KEY INSIGHT:
This is an infrastructure change with clean code, but the critic service is in the agent's decision loop, so behavior validation is required before merge.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
|
Maintainer sign-off: eval validation is not needed for this infrastructure routing fix. This PR fixes the SDK default for fresh OpenHands UI users to use the production LiteLLM critic proxy route and model alias. The previous SDK default points at an unavailable endpoint, so this is not a benchmark-tuning change; it restores the intended production critic service path. I confirm this is safe to review without an eval monitor run. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - clean implementation, but eval validation required per repo policy
[RISK ASSESSMENT]
Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.
VERDICT
⏸️ Awaiting eval validation: The code changes are well-structured and the test coverage is appropriate. However, per this repository's review policy, changes that could plausibly affect benchmark/evaluation performance should not be approved without:
- A link to
openhands-eval-monitor.vercel.appshowing a completed benchmark run, AND - A human maintainer comment confirming the eval results look acceptable
Once eval evidence is provided (or a maintainer decides to merge without it), this PR will be ready to merge.
KEY INSIGHT
The implementation is solid - extracting the constants improves maintainability and the test prevents regression. The hold-up is purely about validating that the endpoint swap doesn't introduce unexpected behavior changes in production benchmarks.
|
Evaluation Triggered
|
🎉 Swebench Evaluation CompleteEvaluation: 📊 Results
🔗 Links |
|
Human maintainer review of eval results: this looks acceptable for this routing fix. Eval monitor: https://openhands-eval-monitor.vercel.app/?date=2026-05-04&text=25335478724 Given this PR only updates the SDK default critic route/model alias to the production LiteLLM proxy path, these eval results are sufficient for review. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy
[RISK ASSESSMENT]
Changing the default critic endpoint could affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.
VERDICT
⏸️ Awaiting eval validation: The code changes are solid and test coverage is appropriate. However, this repo's review policy requires eval evidence (link to openhands-eval-monitor.vercel.app + human maintainer confirmation) for changes affecting agent behavior. Once provided, this is ready to merge.
KEY INSIGHT
Clean refactoring with proper test coverage, but eval-risk policy gates approval on benchmark validation.
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste - Clean infrastructure fix with proper validation
[RISK ASSESSMENT]
This is a necessary infrastructure update that moves from a deprecated Modal endpoint (returns 404) to the production LiteLLM proxy. The change is well-validated with eval evidence showing 1/1 resolved and human maintainer approval. Test coverage locks down the new defaults to prevent regression.
VERDICT:
✅ Approved: Clean fix with proper eval validation and test coverage.
KEY INSIGHT:
Simple constant extraction makes the defaults testable and documents the production infrastructure dependency.
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
/vllmpass-throughcriticaliasValidation
uv run pytest tests/sdk/critic/test_critic_client.py -quv run pre-commit run --files openhands-sdk/openhands/sdk/critic/impl/api/client.py tests/sdk/critic/test_critic_client.pycriticto the hosted critic backend and exposes it through the/vllmpass-through path.https://llm-proxy.app.all-hands.dev/vllm/classifyreaches proxy auth, while bare/classifyis not served.Eval Validation
0040a4964f613519a197f679d2c43403675f681awith 1/1 resolved, 0 unresolved, 0 errors, and 0 empty patches.Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:0040a49-pythonRun
All tags pushed for this build
About Multi-Architecture Support
0040a49-python) is a multi-arch manifest supporting both amd64 and arm640040a49-python-amd64) are also available if needed