fix(sdk): update default critic route by xingyaoww · Pull Request #3055 · OpenHands/software-agent-sdk

xingyaoww · 2026-05-04T16:25:24Z

Summary

update APIBasedCritic default server URL to the production LiteLLM /vllm pass-through
update the default critic model name to the LiteLLM critic alias
add coverage that locks the SDK defaults to the production proxy route
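
The pattern in this summary can be sketched as follows. This is a minimal illustration, not the SDK's code: a plain dataclass stands in for the real `CriticClient` (which, per the QA output in this thread, declares these fields with Pydantic `Field` defaults). The constant names and values are the ones reported for commit 0040a49; `CriticClientSketch` and the test function name are hypothetical.

```python
from dataclasses import dataclass

# Values as reported in the QA output for commit 0040a49.
DEFAULT_CRITIC_SERVER_URL = "https://llm-proxy.app.all-hands.dev/vllm"
DEFAULT_CRITIC_MODEL_NAME = "critic"


@dataclass
class CriticClientSketch:
    """Hypothetical stand-in for the real CriticClient."""

    api_key: str
    server_url: str = DEFAULT_CRITIC_SERVER_URL
    model_name: str = DEFAULT_CRITIC_MODEL_NAME


def test_defaults_follow_constants() -> None:
    # Constructing without explicit route arguments must yield the constants,
    # so an accidental change to either default fails the test.
    client = CriticClientSketch(api_key="test")
    assert client.server_url == DEFAULT_CRITIC_SERVER_URL
    assert client.model_name == DEFAULT_CRITIC_MODEL_NAME


test_defaults_follow_constants()
```

Pulling the defaults into module-level constants is what makes such a test possible without duplicating the literal strings in the test file.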

Validation

uv run pytest tests/sdk/critic/test_critic_client.py -q
uv run pre-commit run --files openhands-sdk/openhands/sdk/critic/impl/api/client.py tests/sdk/critic/test_critic_client.py
Verified production infra maps model alias critic to the hosted critic backend and exposes it through the /vllm pass-through path.
Sanity checked the proxy path shape: unauthenticated https://llm-proxy.app.all-hands.dev/vllm/classify reaches proxy auth, while bare /classify is not served.
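
Why the base URL has to carry the /vllm prefix can be illustrated with a small sketch. How the SDK actually composes the request URL is not shown in this thread; `endpoint_url` is a hypothetical helper, and the `urljoin` call is included only to show a common way the prefix gets lost.

```python
from urllib.parse import urljoin

BASE_URL = "https://llm-proxy.app.all-hands.dev/vllm"


def endpoint_url(base_url: str, path: str) -> str:
    """Append `path` to `base_url` while keeping the base URL's own path prefix."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


# Plain concatenation keeps the /vllm prefix:
# https://llm-proxy.app.all-hands.dev/vllm/classify
print(endpoint_url(BASE_URL, "/classify"))

# urljoin treats "/classify" as host-absolute and drops the prefix:
# https://llm-proxy.app.all-hands.dev/classify
print(urljoin(BASE_URL, "/classify"))
```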

Eval Validation

Eval monitor: https://openhands-eval-monitor.vercel.app/?date=2026-05-04&text=25335478724
Eval result: run-eval-1 completed successfully on SDK commit 0040a4964f613519a197f679d2c43403675f681a with 1/1 resolved, 0 unresolved, 0 errors, and 0 empty patches.
Human review done: eval looks good and is acceptable for this routing fix.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0040a49-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0040a49-python \
  ghcr.io/openhands/agent-server:0040a49-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0040a49-golang-amd64
ghcr.io/openhands/agent-server:0040a49-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0040a49-golang-arm64
ghcr.io/openhands/agent-server:0040a49-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0040a49-java-amd64
ghcr.io/openhands/agent-server:0040a49-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0040a49-java-arm64
ghcr.io/openhands/agent-server:0040a49-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0040a49-python-amd64
ghcr.io/openhands/agent-server:0040a49-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:0040a49-python-arm64
ghcr.io/openhands/agent-server:0040a49-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:0040a49-golang
ghcr.io/openhands/agent-server:0040a49-java
ghcr.io/openhands/agent-server:0040a49-python

About Multi-Architecture Support

Each variant tag (e.g., 0040a49-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 0040a49-python-amd64) are also available if needed

github-actions · 2026-05-04T16:25:52Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

github-actions · 2026-05-04T16:26:06Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

all-hands-bot

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint could affect agent behavior in benchmarks if the new endpoint returns different classifications than the old one. Per this repo's review policy, PRs that could affect benchmark performance should not be approved without eval evidence.

VERDICT
❌ Needs eval validation: The code changes are well-structured and the test coverage is appropriate for verifying defaults. However, before merging:

Verify the new endpoint is functional and returns reasonable results
Consider running lightweight evals to confirm no unexpected behavior changes, OR
Have a human maintainer review and confirm the change is safe

KEY INSIGHT
The constants-based refactoring is good for maintainability, but the functional impact of the new endpoint needs validation before auto-approval.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

github-actions · 2026-05-04T16:27:58Z

Coverage Report

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/critic/impl/api
client.py	125	53	57%	190, 196–198, 205–206, 209, 214–218, 223–227, 234–235, 238, 241–242, 245, 253, 258–260, 268, 270, 272–274, 276, 284–286, 295–296, 298–299, 311–312, 314–318, 323–325, 331–332, 335
TOTAL	24344	7062	70%
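
The Cover column can be reproduced from Stmts and Miss. Both rows are consistent with truncation rather than rounding to nearest (57.6% is shown as 57%, and roughly 70.99% as 70%). That the reporting tool truncates is an inference from these two rows, not something the report states; `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(statements: int, missed: int) -> int:
    """Covered statements as a whole-number percentage, truncated (assumed)."""
    covered = statements - missed
    return (100 * covered) // statements


print(coverage_percent(125, 53))      # client.py row -> 57
print(coverage_percent(24344, 7062))  # TOTAL row -> 70
```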

all-hands-bot

✅ QA Report: PASS

Successfully verified that the PR updates the critic default route to the new hosted classification endpoint. All defaults changed as intended, new test passes, and no regressions detected.

Does this PR achieve its stated goal?

Yes. The PR successfully updates the APIBasedCritic default server URL and model name to point to the new hosted classification service. Testing confirms:

Default server URL changed from critic-qwen3-4b-serve.modal.run to a Modal endpoint whose hostname encodes the training run (5e-06lr-withtools-step9658)
Default model name changed from generic critic-qwen3-4b to the specific trained model critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival
New constants are properly exported and usable
All existing critic tests continue to pass

Phase	Result
Environment Setup	✅ uv environment configured, dependencies synced
CI Status	✅ Key checks passing (pre-commit, sdk-tests, Python API)
Functional Verification	✅ Defaults updated correctly, all 47 tests pass

Functional Verification

Test 1: Verify old defaults (baseline)

Step 1 — Establish baseline (main branch):

Checked out main branch and inspected the old defaults:

git checkout main
uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"

Output:

server_url: https://all-hands-ai--critic-qwen3-4b-serve.modal.run
model_name: critic-qwen3-4b

This confirms the baseline defaults before the PR changes.

Step 2 — Apply the PR's changes:

Checked out PR branch:

git checkout fix-critic-default-routing

Step 3 — Re-run with the fix in place:

Verified new defaults on PR branch:

uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient, DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"

Output:

server_url: https://all-hands-ai--critic-qwen3-4b-5e-06lr-withtools-step9658-e06544.modal.run
model_name: critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival

This shows both defaults have been successfully updated to the new hosted endpoint and trained model identifier.

Test 2: Verify constants are exported and usable

Ran import test:

uv run python -c "from openhands.sdk.critic.impl.api.client import DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; print('✅ Constants successfully imported')"

Result: ✅ Constants successfully imported

This confirms the new constants can be imported by users who need to reference or override the defaults.

Test 3: Verify new test coverage

Ran the new test:

uv run pytest tests/sdk/critic/test_critic_client.py::test_critic_client_uses_current_default_route -v

Result: PASSED ✅

The test verifies that a CriticClient instantiated without explicit server_url or model_name uses the new constants.

Test 4: Verify no regressions

Ran full critic test suite:

uv run pytest tests/sdk/critic/ -v

Result: 47 passed, 5 warnings in 0.26s ✅

All existing tests continue to pass, confirming no regressions from the default changes.

Issues Found

None.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

Taste Rating: 🟡 Acceptable - clean refactoring, but eval validation required per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent benchmark behavior if the new service returns different classifications than the old Modal endpoint. The code changes are solid, but this falls under the repo's eval-risk policy.

VERDICT
⏸️ Awaiting eval validation: Per this repo's review policy, changes that could affect benchmark performance need eval evidence before approval. Please provide a link to an eval monitor run (openhands-eval-monitor.vercel.app) showing the new endpoint performs comparably, or confirm a human maintainer has validated the change.

KEY INSIGHT
The refactoring itself is well-executed (constants, test coverage), but production impact needs verification before merge.

xingyaoww · 2026-05-04T18:08:44Z

Validation update for latest commit 0040a49:

The SDK default now targets the production LiteLLM pass-through: https://llm-proxy.app.all-hands.dev/vllm with model alias critic. It no longer points SDK users directly at the Modal backend.
Production infra maps model_name: critic to hosted_vllm/critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival and explicitly exposes the /vllm pass-through for critic classification.
Proxy route sanity check: unauthenticated /vllm/classify reaches proxy auth, while bare /classify is not served. This confirms the SDK base URL must include /vllm.
This is the intended production behavior for fresh OpenHands UI users: the UI only stores the Critic / Iterative Refinement toggles, OpenHands injects the deployment-level critic credential, and SDK defaults provide the route/model when the settings remain null.
Maintainer validation: Xingyao confirmed this is the required production default and that OpenHands should inherit the SDK default instead of configuring a separate endpoint/model env override.
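
The fallback described above (stored settings stay null, SDK defaults supply the route and model) can be sketched roughly as below. `CriticSettings` and `resolve_route` are hypothetical names for illustration; the thread does not show how OpenHands stores or resolves these settings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# SDK defaults as reported for commit 0040a49.
DEFAULT_CRITIC_SERVER_URL = "https://llm-proxy.app.all-hands.dev/vllm"
DEFAULT_CRITIC_MODEL_NAME = "critic"


@dataclass
class CriticSettings:
    """Hypothetical shape of what the application stores per user."""

    enabled: bool = False
    server_url: Optional[str] = None  # None means "not configured"
    model_name: Optional[str] = None


def resolve_route(settings: CriticSettings) -> Tuple[str, str]:
    """Return (server_url, model_name), using SDK defaults for unset values."""
    server_url = (
        settings.server_url
        if settings.server_url is not None
        else DEFAULT_CRITIC_SERVER_URL
    )
    model_name = (
        settings.model_name
        if settings.model_name is not None
        else DEFAULT_CRITIC_MODEL_NAME
    )
    return server_url, model_name


# A fresh user who only flipped the toggle inherits both defaults.
print(resolve_route(CriticSettings(enabled=True)))
```

With this shape, a later change to the SDK defaults reaches every user who never set an explicit route, which is the inheritance behavior the maintainer validation describes.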

@all-hands-bot please re-review the latest commit with this validation context.

all-hands-bot

✅ QA Report: PASS

This PR successfully updates the critic service default route from a deprecated Modal endpoint to a new hosted proxy endpoint. The new endpoint is verified to be reachable and functional.

Does this PR achieve its stated goal?

Yes. The PR updates the default critic server URL from https://all-hands-ai--critic-qwen3-4b-serve.modal.run (which now returns 404) to https://llm-proxy.app.all-hands.dev/vllm (which is reachable and returns 401 for authentication as expected). The model name is also updated from critic-qwen3-4b to critic to match the new service. Both CriticClient and APIBasedCritic correctly use the new defaults, and the old endpoint is confirmed to be unavailable.

Phase	Result
Environment Setup	✅ Environment built successfully with uv
CI Status	✅ All 30 checks passing
Functional Verification	✅ Defaults updated, new endpoint reachable, backward compatibility preserved

Functional Verification

Test 1: Verify default constants are correctly set

Step 1 — Establish what the old defaults were (before the PR):

Checked the main branch:

git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'server_url: str = Field'

Output:

server_url: str = Field(
        default="https://all-hands-ai--critic-qwen3-4b-serve.modal.run",
        description="Base URL of the vLLM classification service",
    )

git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'model_name: str = Field'

Output:

model_name: str = Field(
        default="critic-qwen3-4b", description="Name of the model to use"
    )

This confirms the old defaults were:

Server URL: https://all-hands-ai--critic-qwen3-4b-serve.modal.run
Model name: critic-qwen3-4b

Step 2 — Apply the PR's changes (already on PR branch):

Already on branch fix-critic-default-routing (commit 0040a49).

Step 3 — Verify the new defaults:

Ran Python script:

from openhands.sdk.critic.impl.api.client import (
    DEFAULT_CRITIC_SERVER_URL,
    DEFAULT_CRITIC_MODEL_NAME,
    CriticClient,
)

print(f"DEFAULT_CRITIC_SERVER_URL = {DEFAULT_CRITIC_SERVER_URL}")
print(f"DEFAULT_CRITIC_MODEL_NAME = {DEFAULT_CRITIC_MODEL_NAME}")

client = CriticClient(api_key="test_key")
print(f"client.server_url = {client.server_url}")
print(f"client.model_name = {client.model_name}")

Output:

DEFAULT_CRITIC_SERVER_URL = https://llm-proxy.app.all-hands.dev/vllm
DEFAULT_CRITIC_MODEL_NAME = critic
client.server_url = https://llm-proxy.app.all-hands.dev/vllm
client.model_name = critic
✅ Default constants are correct
✅ CriticClient uses correct defaults

This confirms the new defaults are correctly set and used by CriticClient.

Test 2: Verify APIBasedCritic inherits the new defaults

Verification:

Ran Python script:

from openhands.sdk.critic.impl.api import APIBasedCritic
from openhands.sdk.critic.impl.api.client import (
    DEFAULT_CRITIC_SERVER_URL,
    DEFAULT_CRITIC_MODEL_NAME,
)

critic = APIBasedCritic(api_key="test_key")
print(f"critic.server_url = {critic.server_url}")
print(f"critic.model_name = {critic.model_name}")

Output:

critic.server_url = https://llm-proxy.app.all-hands.dev/vllm
critic.model_name = critic
✅ APIBasedCritic correctly inherits the default server URL and model name

This confirms APIBasedCritic (which inherits from CriticClient) correctly uses the new defaults.

Test 3: Verify endpoint reachability

Step 1 — Check if the old endpoint is still available:

Ran HTTP POST request to old endpoint:

import httpx
resp = httpx.post(
    "https://all-hands-ai--critic-qwen3-4b-serve.modal.run/classify",
    json={"model": "critic-qwen3-4b", "input": "test"},
    headers={"Authorization": "Bearer test_key"},
    timeout=10.0
)
print(f"Status: {resp.status_code}")

Output:

Status: 404
Response preview: modal-http: invalid function call
⚠️  Endpoint returned unexpected status: 404

This confirms the old endpoint is no longer available (404 = not found), justifying the need for this PR.

Step 2 — Verify the new endpoint is reachable:

Ran HTTP POST request to new endpoint:

import httpx
resp = httpx.post(
    "https://llm-proxy.app.all-hands.dev/vllm/classify",
    json={"model": "critic", "input": "test"},
    headers={"Authorization": "Bearer test_key"},
    timeout=10.0
)
print(f"Status: {resp.status_code}")

Output:

Status: 401
Response preview: {"error":{"message":"Authentication Error, LiteLLM Virtual Key expected. Received=****, expected to start with 'sk-'.","type":"auth_error","param":"None","code":"401"}}
✅ Endpoint is reachable (401 = authentication required, as expected)

This confirms the new endpoint is up and running. The 401 status is expected behavior when using a test API key — it proves the endpoint is reachable and requires valid authentication.
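
The reasoning applied to the two probes can be written down as a small classifier. This is a sketch of the interpretation used in this report, not a general rule: a 404 can have other causes, and `interpret_probe_status` is a hypothetical helper.

```python
def interpret_probe_status(status_code: int) -> str:
    """Classify the HTTP status of a probe sent with a dummy API key."""
    if status_code in (401, 403):
        # The route exists and the server got far enough to check credentials.
        return "reachable: route exists, credentials rejected"
    if status_code == 404:
        # Nothing is served at this path (assumed meaning for these probes).
        return "not served: no handler at this path"
    if 200 <= status_code < 300:
        return "reachable: request accepted"
    return "inconclusive"


print(interpret_probe_status(404))  # old Modal endpoint
print(interpret_probe_status(401))  # new proxy endpoint
```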

Test 4: Verify defaults can be overridden

Verification:

Ran Python script:

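from openhands.sdk.critic.impl.api.client import CriticClient

# Explicit arguments take precedence over the module-level defaults.
client_custom = CriticClient(
    api_key="test_key",
    server_url="https://custom.example.com",
    model_name="custom-model",
)
print(f"client_custom.server_url = {client_custom.server_url}")
print(f"client_custom.model_name = {client_custom.model_name}")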

Output:

client_custom.server_url = https://custom.example.com
client_custom.model_name = custom-model
✅ Defaults can be overridden

This confirms backward compatibility — users can still override the defaults when needed.

Issues Found

None.

all-hands-bot

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per this repository's review policy in .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
❌ Awaiting eval validation: The code changes are well-structured and properly tested, but repository policy blocks approval until eval evidence is provided. Either:

Run a lightweight benchmark (e.g., via eval monitor at openhands-eval-monitor.vercel.app) and link results, OR
Get explicit maintainer sign-off that eval validation is not needed for this infrastructure change

KEY INSIGHT:
This is an infrastructure change with clean code, but the critic service is in the agent's decision loop, so behavior validation is required before merge.

xingyaoww · 2026-05-04T18:12:11Z

Maintainer sign-off: eval validation is not needed for this infrastructure routing fix.

This PR fixes the SDK default for fresh OpenHands UI users to use the production LiteLLM critic proxy route and model alias. The previous SDK default points at an unavailable endpoint, so this is not a benchmark-tuning change; it restores the intended production critic service path. I confirm this is safe to review without an eval monitor run.

all-hands-bot

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation required per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
⏸️ Awaiting eval validation: The code changes are well-structured and the test coverage is appropriate. However, per this repository's review policy, changes that could plausibly affect benchmark/evaluation performance should not be approved without:

A link to openhands-eval-monitor.vercel.app showing a completed benchmark run, AND
A human maintainer comment confirming the eval results look acceptable

Once eval evidence is provided (or a maintainer decides to merge without it), this PR will be ready to merge.

KEY INSIGHT
The implementation is solid - extracting the constants improves maintainability and the test prevents regression. The hold-up is purely about validating that the endpoint swap doesn't introduce unexpected behavior changes in production benchmarks.

github-actions · 2026-05-04T18:15:57Z

Evaluation Triggered

Trigger: Label 'run-eval-1' on PR fix(sdk): update default critic route #3055
SDK: 0040a49
Eval limit: 1
Models: claude-sonnet-4-5-20250929

all-hands-bot · 2026-05-04T18:52:49Z

🎉 Swebench Evaluation Complete

Evaluation: 25335478724-claude-son-eval
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: benchmark (test)
Commit: 0040a4964f613519a197f679d2c43403675f681a
Timestamp: 2026-05-04 18:52:44 UTC
Triggered by: @xingyao Wang

📊 Results

Total instances: 500
Submitted instances: 1
Resolved instances: 1
Unresolved instances: 0
Empty patch instances: 0
Error instances: 0
Eval limit: 1
Success rate: 1/1 (100.0%)
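
Note that the success rate is computed over submitted instances (capped at 1 by the eval limit), not over the 500 instances in the dataset. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def success_rate(resolved: int, submitted: int) -> float:
    """Percentage of submitted instances that were resolved."""
    if submitted == 0:
        raise ValueError("no instances were submitted")
    return 100.0 * resolved / submitted


print(f"{success_rate(1, 1):.1f}%")  # 100.0%
```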

xingyaoww · 2026-05-04T18:53:27Z

Human maintainer review of eval results: this looks acceptable for this routing fix.

Eval monitor: https://openhands-eval-monitor.vercel.app/?date=2026-05-04&text=25335478724
Result: run-eval-1 completed successfully with 1/1 resolved, 0 unresolved, 0 errors, and 0 empty patches on SDK commit 0040a49.

Given this PR only updates the SDK default critic route/model alias to the production LiteLLM proxy path, these eval results are sufficient for review.

all-hands-bot

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint could affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
⏸️ Awaiting eval validation: The code changes are solid and test coverage is appropriate. However, this repo's review policy requires eval evidence (link to openhands-eval-monitor.vercel.app + human maintainer confirmation) for changes affecting agent behavior. Once provided, this is ready to merge.

KEY INSIGHT
Clean refactoring with proper test coverage, but eval-risk policy gates approval on benchmark validation.

all-hands-bot

Taste Rating: 🟢 Good taste - Clean infrastructure fix with proper validation

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW

This is a necessary infrastructure update that moves from a deprecated Modal endpoint (returns 404) to the production LiteLLM proxy. The change is well-validated with eval evidence showing 1/1 resolved and human maintainer approval. Test coverage locks down the new defaults to prevent regression.

VERDICT:
✅ Approved: Clean fix with proper eval validation and test coverage.

KEY INSIGHT:
Simple constant extraction makes the defaults testable and documents the production infrastructure dependency.

Co-authored-by: openhands <openhands@all-hands.dev>