Skip to content

fix(sdk): update default critic route#3055

Merged
xingyaoww merged 1 commit into
mainfrom
fix-critic-default-routing
May 4, 2026
Merged

fix(sdk): update default critic route#3055
xingyaoww merged 1 commit into
mainfrom
fix-critic-default-routing

Conversation

@xingyaoww

@xingyaoww xingyaoww commented May 4, 2026

Copy link
Copy Markdown
Member

Summary

  • update APIBasedCritic default server URL to the production LiteLLM /vllm pass-through
  • update the default critic model name to the LiteLLM critic alias
  • add coverage that locks the SDK defaults to the production proxy route

Validation

  • uv run pytest tests/sdk/critic/test_critic_client.py -q
  • uv run pre-commit run --files openhands-sdk/openhands/sdk/critic/impl/api/client.py tests/sdk/critic/test_critic_client.py
  • Verified production infra maps model alias critic to the hosted critic backend and exposes it through the /vllm pass-through path.
  • Sanity checked the proxy path shape: unauthenticated https://llm-proxy.app.all-hands.dev/vllm/classify reaches proxy auth, while bare /classify is not served.

Eval Validation


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0040a49-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-0040a49-python \
  ghcr.io/openhands/agent-server:0040a49-python

All tags pushed for this build

ghcr.io/openhands/agent-server:0040a49-golang-amd64
ghcr.io/openhands/agent-server:0040a49-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:0040a49-golang-arm64
ghcr.io/openhands/agent-server:0040a49-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:0040a49-java-amd64
ghcr.io/openhands/agent-server:0040a49-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:0040a49-java-arm64
ghcr.io/openhands/agent-server:0040a49-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:0040a49-python-amd64
ghcr.io/openhands/agent-server:0040a49-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:0040a49-python-arm64
ghcr.io/openhands/agent-server:0040a49-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:0040a49-golang
ghcr.io/openhands/agent-server:0040a49-java
ghcr.io/openhands/agent-server:0040a49-python

About Multi-Architecture Support

  • Each variant tag (e.g., 0040a49-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 0040a49-python-amd64) are also available if needed

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint could affect agent behavior in benchmarks if the new endpoint returns different classifications than the old one. Per this repo's review policy, PRs that could affect benchmark performance should not be approved without eval evidence.

VERDICT
Needs eval validation: The code changes are well-structured and the test coverage is appropriate for verifying defaults. However, before merging:

  • Verify the new endpoint is functional and returns reasonable results
  • Consider running lightweight evals to confirm no unexpected behavior changes, OR
  • Have a human maintainer review and confirm the change is safe

KEY INSIGHT
The constants-based refactoring is good for maintainability, but the functional impact of the new endpoint needs validation before auto-approval.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/critic/impl/api
   client.py1255357%190, 196–198, 205–206, 209, 214–218, 223–227, 234–235, 238, 241–242, 245, 253, 258–260, 268, 270, 272–274, 276, 284–286, 295–296, 298–299, 311–312, 314–318, 323–325, 331–332, 335
TOTAL24344706270% 

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Successfully verified that the PR updates the critic default route to the new hosted classification endpoint. All defaults changed as intended, new test passes, and no regressions detected.

Does this PR achieve its stated goal?

Yes. The PR successfully updates the APIBasedCritic default server URL and model name to point to the new hosted classification service. Testing confirms:

  1. Default server URL changed from critic-qwen3-4b-serve.modal.run to the new endpoint with training metadata
  2. Default model name changed from generic critic-qwen3-4b to the specific trained model critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival
  3. New constants are properly exported and usable
  4. All existing critic tests continue to pass
Phase Result
Environment Setup ✅ uv environment configured, dependencies synced
CI Status ✅ Key checks passing (pre-commit, sdk-tests, Python API)
Functional Verification ✅ Defaults updated correctly, all 47 tests pass
Functional Verification

Test 1: Verify old defaults (baseline)

Step 1 — Establish baseline (main branch):

Checked out main branch and inspected the old defaults:

git checkout main
uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"

Output:

server_url: https://all-hands-ai--critic-qwen3-4b-serve.modal.run
model_name: critic-qwen3-4b

This confirms the baseline defaults before the PR changes.

Step 2 — Apply the PR's changes:

Checked out PR branch:

git checkout fix-critic-default-routing

Step 3 — Re-run with the fix in place:

Verified new defaults on PR branch:

uv run python -c "from openhands.sdk.critic.impl.api.client import CriticClient, DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; c = CriticClient(api_key='test'); print(f'server_url: {c.server_url}'); print(f'model_name: {c.model_name}')"

Output:

server_url: https://all-hands-ai--critic-qwen3-4b-5e-06lr-withtools-step9658-e06544.modal.run
model_name: critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival

This shows both defaults have been successfully updated to the new hosted endpoint and trained model identifier.


Test 2: Verify constants are exported and usable

Ran import test:

uv run python -c "from openhands.sdk.critic.impl.api.client import DEFAULT_CRITIC_SERVER_URL, DEFAULT_CRITIC_MODEL_NAME; print('✅ Constants successfully imported')"

Result: ✅ Constants successfully imported

This confirms the new constants can be imported by users who need to reference or override the defaults.


Test 3: Verify new test coverage

Ran the new test:

uv run pytest tests/sdk/critic/test_critic_client.py::test_critic_client_uses_current_default_route -v

Result: PASSED

The test verifies that a CriticClient instantiated without explicit server_url or model_name uses the new constants.


Test 4: Verify no regressions

Ran full critic test suite:

uv run pytest tests/sdk/critic/ -v

Result: 47 passed, 5 warnings in 0.26s

All existing tests continue to pass, confirming no regressions from the default changes.

Issues Found

None.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww force-pushed the fix-critic-default-routing branch from a751c63 to 0040a49 Compare May 4, 2026 17:12
@xingyaoww xingyaoww requested a review from all-hands-bot May 4, 2026 18:04

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - clean refactoring, but eval validation required per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent benchmark behavior if the new service returns different classifications than the old Modal endpoint. The code changes are solid, but this falls under the repo's eval-risk policy.

VERDICT
⏸️ Awaiting eval validation: Per this repo's review policy, changes that could affect benchmark performance need eval evidence before approval. Please provide a link to an eval monitor run (openhands-eval-monitor.vercel.app) showing the new endpoint performs comparably, or confirm a human maintainer has validated the change.

KEY INSIGHT
The refactoring itself is well-executed (constants, test coverage), but production impact needs verification before merge.

@xingyaoww

xingyaoww commented May 4, 2026

Copy link
Copy Markdown
Member Author

Validation update for latest commit 0040a49:

  • The SDK default now targets the production LiteLLM pass-through: https://llm-proxy.app.all-hands.dev/vllm with model alias critic. It no longer points SDK users directly at the Modal backend.
  • Production infra maps model_name: critic to hosted_vllm/critic-qwen3-4b-5e-06lr-withtools-step9658-success-survival and explicitly exposes the /vllm pass-through for critic classification.
  • Proxy route sanity check: unauthenticated /vllm/classify reaches proxy auth, while bare /classify is not served. This confirms the SDK base URL must include /vllm.
  • This is the intended production behavior for fresh OpenHands UI users: the UI only stores the Critic / Iterative Refinement toggles, OpenHands injects the deployment-level critic credential, and SDK defaults provide the route/model when the settings remain null.
  • Maintainer validation: Xingyao confirmed this is the required production default and that OpenHands should inherit the SDK default instead of configuring a separate endpoint/model env override.

@all-hands-bot please re-review the latest commit with this validation context.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

This PR successfully updates the critic service default route from a deprecated Modal endpoint to a new hosted proxy endpoint. The new endpoint is verified to be reachable and functional.

Does this PR achieve its stated goal?

Yes. The PR updates the default critic server URL from https://all-hands-ai--critic-qwen3-4b-serve.modal.run (which now returns 404) to https://llm-proxy.app.all-hands.dev/vllm (which is reachable and returns 401 for authentication as expected). The model name is also updated from critic-qwen3-4b to critic to match the new service. Both CriticClient and APIBasedCritic correctly use the new defaults, and the old endpoint is confirmed to be unavailable.

Phase Result
Environment Setup ✅ Environment built successfully with uv
CI Status ✅ All 30 checks passing
Functional Verification ✅ Defaults updated, new endpoint reachable, backward compatibility preserved
Functional Verification

Test 1: Verify default constants are correctly set

Step 1 — Establish what the old defaults were (before the PR):

Checked the main branch:

git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'server_url: str = Field'

Output:

server_url: str = Field(
        default="https://all-hands-ai--critic-qwen3-4b-serve.modal.run",
        description="Base URL of the vLLM classification service",
    )
git show origin/main:openhands-sdk/openhands/sdk/critic/impl/api/client.py | grep -A 3 'model_name: str = Field'

Output:

model_name: str = Field(
        default="critic-qwen3-4b", description="Name of the model to use"
    )

This confirms the old defaults were:

  • Server URL: https://all-hands-ai--critic-qwen3-4b-serve.modal.run
  • Model name: critic-qwen3-4b

Step 2 — Apply the PR's changes (already on PR branch):

Already on branch fix-critic-default-routing (commit 0040a49).

Step 3 — Verify the new defaults:

Ran Python script:

from openhands.sdk.critic.impl.api.client import (
    DEFAULT_CRITIC_SERVER_URL,
    DEFAULT_CRITIC_MODEL_NAME,
    CriticClient,
)

print(f"DEFAULT_CRITIC_SERVER_URL = {DEFAULT_CRITIC_SERVER_URL}")
print(f"DEFAULT_CRITIC_MODEL_NAME = {DEFAULT_CRITIC_MODEL_NAME}")

client = CriticClient(api_key="test_key")
print(f"client.server_url = {client.server_url}")
print(f"client.model_name = {client.model_name}")

Output:

DEFAULT_CRITIC_SERVER_URL = https://llm-proxy.app.all-hands.dev/vllm
DEFAULT_CRITIC_MODEL_NAME = critic
client.server_url = https://llm-proxy.app.all-hands.dev/vllm
client.model_name = critic
✅ Default constants are correct
✅ CriticClient uses correct defaults

This confirms the new defaults are correctly set and used by CriticClient.

Test 2: Verify APIBasedCritic inherits the new defaults

Verification:

Ran Python script:

from openhands.sdk.critic.impl.api import APIBasedCritic
from openhands.sdk.critic.impl.api.client import (
    DEFAULT_CRITIC_SERVER_URL,
    DEFAULT_CRITIC_MODEL_NAME,
)

critic = APIBasedCritic(api_key="test_key")
print(f"critic.server_url = {critic.server_url}")
print(f"critic.model_name = {critic.model_name}")

Output:

critic.server_url = https://llm-proxy.app.all-hands.dev/vllm
critic.model_name = critic
✅ APIBasedCritic correctly inherits the default server URL and model name

This confirms APIBasedCritic (which inherits from CriticClient) correctly uses the new defaults.

Test 3: Verify endpoint reachability

Step 1 — Check if the old endpoint is still available:

Ran HTTP POST request to old endpoint:

import httpx
resp = httpx.post(
    "https://all-hands-ai--critic-qwen3-4b-serve.modal.run/classify",
    json={"model": "critic-qwen3-4b", "input": "test"},
    headers={"Authorization": "Bearer test_key"},
    timeout=10.0
)
print(f"Status: {resp.status_code}")

Output:

Status: 404
Response preview: modal-http: invalid function call
⚠️  Endpoint returned unexpected status: 404

This confirms the old endpoint is no longer available (404 = not found), justifying the need for this PR.

Step 2 — Verify the new endpoint is reachable:

Ran HTTP POST request to new endpoint:

import httpx
resp = httpx.post(
    "https://llm-proxy.app.all-hands.dev/vllm/classify",
    json={"model": "critic", "input": "test"},
    headers={"Authorization": "Bearer test_key"},
    timeout=10.0
)
print(f"Status: {resp.status_code}")

Output:

Status: 401
Response preview: {"error":{"message":"Authentication Error, LiteLLM Virtual Key expected. Received=****, expected to start with 'sk-'.","type":"auth_error","param":"None","code":"401"}}
✅ Endpoint is reachable (401 = authentication required, as expected)

This confirms the new endpoint is up and running. The 401 status is expected behavior when using a test API key — it proves the endpoint is reachable and requires valid authentication.

Test 4: Verify defaults can be overridden

Verification:

Ran Python script:

client_custom = CriticClient(
    api_key="test_key",
    server_url="https://custom.example.com",
    model_name="custom-model"
)
print(f"client_custom.server_url = {client_custom.server_url}")
print(f"client_custom.model_name = {client_custom.model_name}")

Output:

client_custom.server_url = https://custom.example.com
client_custom.model_name = custom-model
✅ Defaults can be overridden

This confirms backward compatibility — users can still override the defaults when needed.

Issues Found

None.

@xingyaoww xingyaoww added the review-this This label triggers a PR review by OpenHands label May 4, 2026

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per this repository's review policy in .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
Awaiting eval validation: The code changes are well-structured and properly tested, but repository policy blocks approval until eval evidence is provided. Either:

  1. Run a lightweight benchmark (e.g., via eval monitor at openhands-eval-monitor.vercel.app) and link results, OR
  2. Get explicit maintainer sign-off that eval validation is not needed for this infrastructure change

KEY INSIGHT:
This is an infrastructure change with clean code, but the critic service is in the agent's decision loop, so behavior validation is required before merge.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@xingyaoww

Copy link
Copy Markdown
Member Author

Maintainer sign-off: eval validation is not needed for this infrastructure routing fix.

This PR fixes the SDK default for fresh OpenHands UI users to use the production LiteLLM critic proxy route and model alias. The previous SDK default points at an unavailable endpoint, so this is not a benchmark-tuning change; it restores the intended production critic service path. I confirm this is safe to review without an eval monitor run.

@xingyaoww xingyaoww added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels May 4, 2026

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation required per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint may affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
⏸️ Awaiting eval validation: The code changes are well-structured and the test coverage is appropriate. However, per this repository's review policy, changes that could plausibly affect benchmark/evaluation performance should not be approved without:

  1. A link to openhands-eval-monitor.vercel.app showing a completed benchmark run, AND
  2. A human maintainer comment confirming the eval results look acceptable

Once eval evidence is provided (or a maintainer decides to merge without it), this PR will be ready to merge.

KEY INSIGHT
The implementation is solid - extracting the constants improves maintainability and the test prevents regression. The hold-up is purely about validating that the endpoint swap doesn't introduce unexpected behavior changes in production benchmarks.

@xingyaoww xingyaoww added the run-eval-1 Runs evaluation on 1 SWE-bench instance label May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Evaluation Triggered

@all-hands-bot

Copy link
Copy Markdown
Collaborator

🎉 Swebench Evaluation Complete

Evaluation: 25335478724-claude-son-eval
Model: litellm_proxy/claude-sonnet-4-5-20250929
Dataset: benchmark (test)
Commit: 0040a4964f613519a197f679d2c43403675f681a
Timestamp: 2026-05-04 18:52:44 UTC
Triggered by: @xingyao Wang

📊 Results

  • Total instances: 500
  • Submitted instances: 1
  • Resolved instances: 1
  • Unresolved instances: 0
  • Empty patch instances: 0
  • Error instances: 0
  • Eval limit: 1
  • Success rate: 1/1 (100.0%)

🔗 Links

Full Archive
Eval Monitor

@xingyaoww

Copy link
Copy Markdown
Member Author

Human maintainer review of eval results: this looks acceptable for this routing fix.

Eval monitor: https://openhands-eval-monitor.vercel.app/?date=2026-05-04&text=25335478724
Result: run-eval-1 completed successfully with 1/1 resolved, 0 unresolved, 0 errors, and 0 empty patches on SDK commit 0040a49.

Given this PR only updates the SDK default critic route/model alias to the production LiteLLM proxy path, these eval results are sufficient for review.

@xingyaoww xingyaoww added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels May 4, 2026

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - clean implementation, but eval validation needed per repo policy

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟡 MEDIUM

Changing the default critic endpoint could affect agent behavior in benchmarks if the new service returns different classifications than the old Modal endpoint. Per .agents/skills/custom-codereview-guide.md, changes that could affect benchmark performance require eval evidence before approval.

VERDICT
⏸️ Awaiting eval validation: The code changes are solid and test coverage is appropriate. However, this repo's review policy requires eval evidence (link to openhands-eval-monitor.vercel.app + human maintainer confirmation) for changes affecting agent behavior. Once provided, this is ready to merge.

KEY INSIGHT
Clean refactoring with proper test coverage, but eval-risk policy gates approval on benchmark validation.

@xingyaoww xingyaoww added review-this This label triggers a PR review by OpenHands and removed review-this This label triggers a PR review by OpenHands labels May 4, 2026

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟢 Good taste - Clean infrastructure fix with proper validation

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW

This is a necessary infrastructure update that moves from a deprecated Modal endpoint (returns 404) to the production LiteLLM proxy. The change is well-validated with eval evidence showing 1/1 resolved and human maintainer approval. Test coverage locks down the new defaults to prevent regression.

VERDICT:
Approved: Clean fix with proper eval validation and test coverage.

KEY INSIGHT:
Simple constant extraction makes the defaults testable and documents the production infrastructure dependency.

@xingyaoww xingyaoww merged commit b5497e6 into main May 4, 2026
138 checks passed
@xingyaoww xingyaoww deleted the fix-critic-default-routing branch May 4, 2026 19:28
StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026
Co-authored-by: openhands <openhands@all-hands.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this This label triggers a PR review by OpenHands run-eval-1 Runs evaluation on 1 SWE-bench instance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants