Release v1.20.1 by all-hands-bot · Pull Request #3070 · OpenHands/software-agent-sdk

all-hands-bot · 2026-05-05T20:49:30Z

Release v1.20.1

This PR prepares the release for version 1.20.1.

Release Checklist

Version set to 1.20.1
Fix any deprecation deadlines if they exist
Integration tests pass (tagged with integration-test)
Behavior tests pass (tagged with behavior-test)
Example tests pass (tagged with test-examples)
Evaluation on OpenHands Index

What happens on merge

When this PR is merged, the create-release.yml workflow will automatically:

Create a GitHub release with tag v1.20.1 and auto-generated notes
Trigger pypi-release.yml to publish all packages to PyPI
Trigger version-bump-prs.yml to create downstream version bump PRs

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:cee855a-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-cee855a-python \
  ghcr.io/openhands/agent-server:cee855a-python

All tags pushed for this build

ghcr.io/openhands/agent-server:cee855a-golang-amd64
ghcr.io/openhands/agent-server:cee855a-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:cee855a-golang-arm64
ghcr.io/openhands/agent-server:cee855a-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:cee855a-java-amd64
ghcr.io/openhands/agent-server:cee855a-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:cee855a-java-arm64
ghcr.io/openhands/agent-server:cee855a-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:cee855a-python-amd64
ghcr.io/openhands/agent-server:cee855a-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:cee855a-python-arm64
ghcr.io/openhands/agent-server:cee855a-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:cee855a-golang
ghcr.io/openhands/agent-server:cee855a-java
ghcr.io/openhands/agent-server:cee855a-python

About Multi-Architecture Support

Each variant tag (e.g., cee855a-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., cee855a-python-amd64) are also available if needed
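The tag list above appears to follow a regular naming scheme. The sketch below reproduces it from the variants table. The escaping rule (`/` becomes `_s_`, `:` becomes `_tag_`) is inferred from the listed tags and is not confirmed by the build scripts; the function names are hypothetical.

```python
# Variant -> base image, copied from the "Variants & Base Images" table.
VARIANTS = {
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.13-nodejs22-slim",
    "golang": "golang:1.21-bookworm",
}
ARCHES = ("amd64", "arm64")
IMAGE = "ghcr.io/openhands/agent-server"


def sanitize_base_image(base_image: str) -> str:
    """Assumed escaping, inferred from the tag list: '/' -> '_s_', ':' -> '_tag_'."""
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def tags_for_build(short_sha: str) -> list[str]:
    """Return every tag expected for one build: per-arch tags plus one manifest tag per variant."""
    tags = []
    for variant, base_image in VARIANTS.items():
        for arch in ARCHES:
            # Short per-architecture tag, e.g. cee855a-python-amd64
            tags.append(f"{IMAGE}:{short_sha}-{variant}-{arch}")
            # Long per-architecture tag that encodes the base image
            tags.append(f"{IMAGE}:{short_sha}-{sanitize_base_image(base_image)}-{arch}")
        # Multi-arch manifest tag, e.g. cee855a-python
        tags.append(f"{IMAGE}:{short_sha}-{variant}")
    return tags


if __name__ == "__main__":
    for tag in tags_for_build("cee855a"):
        print(tag)
```

For this build, `tags_for_build("cee855a")` yields 15 tags: four per-architecture tags and one manifest tag for each of the three variants.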

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-05T20:49:44Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-05T20:49:45Z

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-05-05T20:49:55Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-05T20:50:05Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

Taste Rating: 🟢 Good taste

Clean release PR with consistent version bumps across all packages. The uv.lock changes appear to be from a uv version update that shifts from absolute timestamps to relative span-based exclusions while maintaining the 7-day guardrail.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Version-only changes with no code logic modifications. Standard release process. The lockfile update maintains the workspace 7-day freshness guardrail via exclude-newer-span = "P7D".

VERDICT:
✅ Worth merging: Version bumps are correct and consistent. Complete the release checklist (tests, eval) before merging.

KEY INSIGHT:
The uv.lock exclude-newer change to epoch-zero is a uv implementation detail; the 7-day supply-chain guardrail remains enforced via the span field.
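To make the span-based guardrail concrete, here is a minimal sketch of how a relative span such as "P7D" could translate into a cutoff. It is not uv's implementation: the helper names are hypothetical, only the simple ISO 8601 `PnD` form is parsed, and the assumption is that anything uploaded after `now - span` is excluded.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_day_span(span: str) -> timedelta:
    """Parse the simple ISO 8601 'PnD' duration form only (e.g. 'P7D')."""
    match = re.fullmatch(r"P(\d+)D", span)
    if match is None:
        raise ValueError(f"unsupported span: {span!r}")
    return timedelta(days=int(match.group(1)))


def is_excluded(uploaded_at: datetime, now: datetime, span: str = "P7D") -> bool:
    """Assumed rule: a release uploaded after (now - span) is too new to use."""
    cutoff = now - parse_day_span(span)
    return uploaded_at > cutoff


if __name__ == "__main__":
    now = datetime(2026, 5, 5, tzinfo=timezone.utc)
    recent = datetime(2026, 5, 1, tzinfo=timezone.utc)  # 4 days old
    older = datetime(2026, 4, 20, tzinfo=timezone.utc)  # 15 days old
    print(is_excluded(recent, now), is_excluded(older, now))
```

Under these assumptions the cutoff moves with the resolution time, so the lockfile no longer needs to record an absolute timestamp.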

all-hands-bot

✅ QA Report: PASS

Version bump from 1.20.0 to 1.20.1 is correctly applied across all packages, builds succeed, and packages are functional.

Does this PR achieve its stated goal?

Yes. This PR successfully prepares version 1.20.1 for release. All four packages (openhands-sdk, openhands-tools, openhands-workspace, openhands-agent-server) have been updated to version 1.20.1 in their pyproject.toml files, the GitHub workflow default value is updated, the lockfile reflects the new versions, and all packages build successfully into distributable wheels and source distributions. The packages can be imported and function correctly with the new version.

Phase	Result
Environment Setup	✅ `make build` completed successfully, all dependencies installed
CI Status	⏳ 22 passing, 12 pending, 0 failing — core checks (pre-commit, API breakage, tests) all green
Functional Verification	✅ All version updates verified, packages build and import correctly

Functional Verification

Test 1: Version Number Updates

Step 1 — Establish baseline (main branch):
Checked version on main branch:

git show origin/main:openhands-sdk/pyproject.toml | grep "^version"

Output:

version = "1.20.0"

This confirms the current release version is 1.20.0.

Step 2 — Verify PR changes:
On the PR branch (rel-1.20.1), checked all package versions:

grep -n "^version" openhands-sdk/pyproject.toml openhands-tools/pyproject.toml \
  openhands-workspace/pyproject.toml openhands-agent-server/pyproject.toml

Output:

openhands-sdk/pyproject.toml:3:version = "1.20.1"
openhands-tools/pyproject.toml:3:version = "1.20.1"
openhands-workspace/pyproject.toml:3:version = "1.20.1"
openhands-agent-server/pyproject.toml:3:version = "1.20.1"

✅ Verdict: All four packages correctly updated from 1.20.0 to 1.20.1

Step 3 — Verify workflow default update:

grep -A3 "sdk_ref:" .github/workflows/run-eval.yml | grep -E "(default:|sdk_ref:)"

Output:

sdk_ref:
                default: v1.20.1

✅ Verdict: GitHub workflow default correctly updated to v1.20.1

Test 2: Package Build Verification

Step 1 — Set up development environment:

make build

Output (excerpt):

Resolved 402 packages in 1ms
      Built openhands-workspace @ file:///.../openhands-workspace
      Built openhands-agent-server @ file:///.../openhands-agent-server
      Built openhands-sdk @ file:///.../openhands-sdk
      Built openhands-tools @ file:///.../openhands-tools
...
Installed 233 packages in 454ms
 + openhands-agent-server==1.20.1
 + openhands-sdk==1.20.1
 + openhands-tools==1.20.1
 + openhands-workspace==1.20.1
...
Build complete! Development environment is ready.

✅ Verdict: All packages install successfully with version 1.20.1

Step 2 — Build distribution packages:

uv build --all-packages -o /tmp/dist

Output:

[openhands-agent-server] Building source distribution...
[openhands-sdk] Building source distribution...
[openhands-tools] Building source distribution...
[openhands-workspace] Building source distribution...
...
Successfully built /tmp/dist/openhands_agent_server-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_agent_server-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_sdk-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_sdk-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_tools-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_tools-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_workspace-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_workspace-1.20.1-py3-none-any.whl

Verified artifacts:

ls -lh /tmp/dist/

Output:

-rw-r--r-- 104K openhands_agent_server-1.20.1-py3-none-any.whl
-rw-r--r--  88K openhands_agent_server-1.20.1.tar.gz
-rw-r--r-- 508K openhands_sdk-1.20.1-py3-none-any.whl
-rw-r--r-- 406K openhands_sdk-1.20.1.tar.gz
-rw-r--r-- 173K openhands_tools-1.20.1-py3-none-any.whl
-rw-r--r-- 130K openhands_tools-1.20.1.tar.gz
-rw-r--r--  35K openhands_workspace-1.20.1-py3-none-any.whl
-rw-r--r--  31K openhands_workspace-1.20.1.tar.gz

✅ Verdict: All 8 distribution files (4 wheels + 4 source distributions) built successfully with version 1.20.1
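The artifact check above can also be expressed as a small script. The sketch below derives the eight expected file names from the package names and version, following the naming visible in the `uv build` output (hyphens become underscores, wheels are tagged `py3-none-any`). The helper names are hypothetical.

```python
from pathlib import Path

PACKAGES = (
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
)


def expected_artifacts(version: str) -> list[str]:
    """One wheel and one sdist per package, named as in the build output above."""
    names = []
    for package in PACKAGES:
        normalized = package.replace("-", "_")
        names.append(f"{normalized}-{version}-py3-none-any.whl")
        names.append(f"{normalized}-{version}.tar.gz")
    return sorted(names)


def missing_artifacts(dist_dir: Path, version: str) -> list[str]:
    """Return the expected file names that are not present in dist_dir."""
    present = {path.name for path in dist_dir.iterdir()}
    return [name for name in expected_artifacts(version) if name not in present]
```

For example, `missing_artifacts(Path("/tmp/dist"), "1.20.1")` would be empty if the directory matches the `ls -lh` listing above.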

Test 3: Package Import and Version Verification

Step 1 — Verify packages are importable and report correct versions:
Created and ran a verification script:

import importlib

# Packages whose version is checked via __version__ (distribution name -> import name)
versioned_packages = {
    "openhands-sdk": "openhands.sdk",
    "openhands-tools": "openhands.tools",
}

# Packages that are only checked for importability
importable_packages = {
    "openhands-workspace": "openhands.workspace",
    "openhands-agent-server": "openhands.agent_server",
}

for package_name, module_name in versioned_packages.items():
    module = importlib.import_module(module_name)
    version = module.__version__
    assert version == "1.20.1", f"{package_name} version mismatch: {version}"
    print(f"✓ {package_name}: {version}")

for package_name, module_name in importable_packages.items():
    importlib.import_module(module_name)
    print(f"✓ {package_name}: Successfully imported")

print("\n✓ All packages installed and functional")

Output:

✓ openhands-sdk: 1.20.1
✓ openhands-tools: 1.20.1
✓ openhands-workspace: Successfully imported
✓ openhands-agent-server: Successfully imported

✓ All packages installed and functional

✅ Verdict: All packages import successfully and versioned packages report 1.20.1

Issues Found

None.

Summary: This release PR correctly updates all version numbers from 1.20.0 to 1.20.1, builds successfully, and all packages remain functional. The PR is ready for the automated release workflow to create the GitHub release and publish to PyPI upon merge.

github-actions · 2026-05-05T20:56:43Z

Coverage Report

File	Stmts	Miss	Cover	Missing
TOTAL	25480	5886	76%

report-only-changed-files is enabled. No files were changed during this commit :)

github-actions · 2026-05-05T20:59:59Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-05-05 21:13:22 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	25.8s	$0.04
01_standalone_sdk/03_activate_skill.py	✅ PASS	21.7s	$0.03
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	12.1s	$0.01
01_standalone_sdk/07_mcp_integration.py	✅ PASS	33.9s	$0.02
01_standalone_sdk/09_pause_example.py	✅ PASS	14.4s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	40.9s	$0.02
01_standalone_sdk/11_async.py	✅ PASS	45.4s	$0.05
01_standalone_sdk/12_custom_secrets.py	✅ PASS	9.3s	$0.01
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	29.6s	$0.02
01_standalone_sdk/14_context_condenser.py	✅ PASS	2m 48s	$0.19
01_standalone_sdk/17_image_input.py	✅ PASS	22.9s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	22.2s	$0.02
01_standalone_sdk/19_llm_routing.py	✅ PASS	15.0s	$0.02
01_standalone_sdk/20_stuck_detector.py	✅ PASS	14.7s	$0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	9.2s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	13.3s	$0.01
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	1m 6s	$0.01
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	48.5s	$0.05
01_standalone_sdk/25_agent_delegation.py	✅ PASS	59.7s	$0.07
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	22.8s	$0.03
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	31.2s	$0.04
01_standalone_sdk/29_llm_streaming.py	✅ PASS	35.0s	$0.03
01_standalone_sdk/30_tom_agent.py	✅ PASS	12.9s	$0.01
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	9m 43s	$0.75
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	17.1s	$0.01
01_standalone_sdk/34_critic_example.py	✅ PASS	2m 38s	$0.22
01_standalone_sdk/36_event_json_to_openai_messages.py	✅ PASS	10.4s	$0.00
01_standalone_sdk/37_llm_profile_store/main.py	✅ PASS	3.8s	$0.00
01_standalone_sdk/38_browser_session_recording.py	✅ PASS	33.3s	$0.03
01_standalone_sdk/39_llm_fallback.py	✅ PASS	10.4s	$0.01
01_standalone_sdk/40_acp_agent_example.py	✅ PASS	29.6s	$0.13
01_standalone_sdk/41_task_tool_set.py	✅ PASS	25.4s	$0.03
01_standalone_sdk/42_file_based_subagents.py	✅ PASS	2m 4s	$0.10
01_standalone_sdk/43_mixed_marketplace_skills/main.py	✅ PASS	3.4s	$0.00
01_standalone_sdk/44_model_switching_in_convo.py	✅ PASS	7.5s	$0.01
01_standalone_sdk/45_parallel_tool_execution.py	✅ PASS	3m 39s	$0.54
01_standalone_sdk/46_agent_settings.py	✅ PASS	10.9s	$0.01
01_standalone_sdk/47_defense_in_depth_security.py	✅ PASS	3.3s	$0.00
01_standalone_sdk/48_conversation_fork.py	✅ PASS	13.1s	$0.00
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	38.7s	$0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	✅ PASS	1m 13s	$0.07
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	✅ PASS	1m 45s	--
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	✅ PASS	1m 41s	$0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	31.4s	$0.04
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	✅ PASS	3m 7s	$0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py	✅ PASS	46.3s	$0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py	✅ PASS	30.7s	$0.06
02_remote_agent_server/11_conversation_fork.py	✅ PASS	35.0s	$0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	17.7s	$0.02
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	35.5s	$0.10
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	13.6s	$0.02
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	16.7s	$0.02

✅ All tests passed!

Total: 52 | Passed: 52 | Failed: 0 | Total Cost: $3.11

View full workflow run

github-actions · 2026-05-05T21:02:10Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-05-05 21:16:06 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	27.8s	$0.03
01_standalone_sdk/03_activate_skill.py	✅ PASS	20.9s	$0.02
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	11.6s	$0.00
01_standalone_sdk/07_mcp_integration.py	✅ PASS	30.4s	$0.01
01_standalone_sdk/09_pause_example.py	✅ PASS	12.5s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	34.1s	$0.02
01_standalone_sdk/11_async.py	✅ PASS	27.7s	$0.03
01_standalone_sdk/12_custom_secrets.py	✅ PASS	10.6s	$0.00
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	36.2s	$0.02
01_standalone_sdk/14_context_condenser.py	✅ PASS	2m 25s	$0.17
01_standalone_sdk/17_image_input.py	✅ PASS	22.6s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	25.7s	$0.02
01_standalone_sdk/19_llm_routing.py	✅ PASS	13.7s	$0.01
01_standalone_sdk/20_stuck_detector.py	✅ PASS	21.7s	$0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	9.7s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	13.2s	$0.01
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	1m 1s	$0.01
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	3m 46s	$0.26
01_standalone_sdk/25_agent_delegation.py	✅ PASS	54.0s	$0.06
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	25.5s	$0.03
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	32.9s	$0.03
01_standalone_sdk/29_llm_streaming.py	✅ PASS	47.6s	$0.02
01_standalone_sdk/30_tom_agent.py	✅ PASS	9.5s	$0.00
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	4m 30s	$0.31
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	21.9s	$0.02
01_standalone_sdk/34_critic_example.py	✅ PASS	9m 42s	$0.93
01_standalone_sdk/36_event_json_to_openai_messages.py	✅ PASS	17.2s	$0.01
01_standalone_sdk/37_llm_profile_store/main.py	✅ PASS	8.3s	$0.00
01_standalone_sdk/38_browser_session_recording.py	✅ PASS	27.6s	$0.02
01_standalone_sdk/39_llm_fallback.py	✅ PASS	10.5s	$0.00
01_standalone_sdk/40_acp_agent_example.py	✅ PASS	34.8s	$0.06
01_standalone_sdk/41_task_tool_set.py	✅ PASS	37.5s	$0.03
01_standalone_sdk/42_file_based_subagents.py	✅ PASS	1m 49s	$0.09
01_standalone_sdk/43_mixed_marketplace_skills/main.py	✅ PASS	7.2s	$0.00
01_standalone_sdk/44_model_switching_in_convo.py	✅ PASS	8.1s	$0.01
01_standalone_sdk/45_parallel_tool_execution.py	✅ PASS	3m 48s	$0.48
01_standalone_sdk/46_agent_settings.py	✅ PASS	12.3s	$0.01
01_standalone_sdk/47_defense_in_depth_security.py	✅ PASS	3.4s	$0.00
01_standalone_sdk/48_conversation_fork.py	✅ PASS	13.8s	$0.00
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	49.0s	$0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	✅ PASS	1m 32s	$0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	✅ PASS	1m 12s	$0.09
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	✅ PASS	1m 11s	$0.04
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	27.9s	$0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	✅ PASS	3m 50s	$0.05
02_remote_agent_server/09_acp_agent_with_remote_runtime.py	✅ PASS	46.2s	$0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py	✅ PASS	27.2s	$0.01
02_remote_agent_server/11_conversation_fork.py	✅ PASS	41.7s	$0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	23.4s	$0.02
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	32.7s	$0.07
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	19.2s	$0.01
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	23.9s	$0.02

✅ All tests passed!

Total: 52 | Passed: 52 | Failed: 0 | Total Cost: $3.29

View full workflow run

github-actions · 2026-05-05T21:05:16Z

🧪 Integration Tests Results

Overall Success Rate: 90.0%
Total Cost: $8.70
Models Tested: 4
Timestamp: 2026-05-05 21:05:08 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_moonshot_kimi_k2_thinking: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_reasoner: 📥 View & Download Logs
litellm_proxy_gemini_3.1_pro_preview: 📥 View & Download Logs
litellm_proxy_anthropic_claude_sonnet_4_6: 📥 View & Download Logs

📊 Summary

Model	Overall	Tests Passed	Total	Cost	Tokens
litellm_proxy_moonshot_kimi_k2_thinking	80.0%	4/5	5	$0.98	4,307,441
litellm_proxy_deepseek_deepseek_reasoner	100.0%	5/5	5	$0.33	3,555,534
litellm_proxy_gemini_3.1_pro_preview	80.0%	4/5	5	$4.50	6,884,154
litellm_proxy_anthropic_claude_sonnet_4_6	100.0%	5/5	5	$2.89	3,976,322

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

Success Rate: 80.0% (4/5)
Total Cost: $0.98
Token Usage: prompt: 4,262,937, completion: 44,504, cache_read: 3,956,480
Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_cee855a_kimi_k2_thinking_run_N5_20260505_205129

Failed Tests:

b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the primary requested file (train_smolvla_example.py) with excellent quality that follows the same format as existing examples in the repository (ACT, Diffusion). The code is well-structured, properly documented, and implements the training functionality correctly with SmolVLA-specific components.

However, the agent VIOLATED the explicit evaluation criteria by creating TWO markdown documentation files instead of the allowed ONE:

README.md (acceptable - pertains to the training script)
TRAINING_SCRIPT_SUMMARY.md (NOT acceptable - redundant/unnecessary)

The evaluation criteria explicitly state: "Only one README.md file is acceptable if it pertains to the new training script." and "Avoid creating any additional files that were not explicitly requested."

The agent created TRAINING_SCRIPT_SUMMARY.md as an unnecessary redundant file that duplicates information already present in README.md. While the intent was helpful (providing comprehensive documentation), it violates the stated constraints about not creating redundant files beyond what was requested.

Strengths:

Main deliverable (train_smolvla_example.py) is high-quality and complete
Thorough codebase exploration and understanding
Proper implementation following repository patterns
Good command-line interface with flexible parameters
Appropriate handling of SmolVLA-specific preprocessing

Violations:

Created one too many markdown files (2 instead of 1 allowed) (confidence=0.80) (Cost: $0.26)

litellm_proxy_deepseek_deepseek_reasoner

Success Rate: 100.0% (5/5)
Total Cost: $0.33
Token Usage: prompt: 3,513,222, completion: 42,312, cache_read: 3,237,760, reasoning: 12,761
Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_cee855a_deepseek_v3_2_reasoner_run_N5_20260505_205120

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 80.0% (4/5)
Total Cost: $4.50
Token Usage: prompt: 6,836,284, completion: 47,870, cache_read: 5,331,880, reasoning: 18,831
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_cee855a_gemini_3_1_pro_run_N5_20260505_205129

Failed Tests:

b01_no_premature_implementation: Agent behavior was inappropriate according to LLM judge. Judge reasoning: The agent's core exploration and analysis was excellent—it thoroughly examined the codebase, identified key architectural constraints (the hardcoded state.execution_status = FINISHED in agent.step()), and provided well-reasoned guidance on why conversation_callback is insufficient and why a wrapper class approach is preferable. The final explanation and code example are of high quality and directly address the user's question.

However, the agent violated a critical evaluation criterion: NOT create new files or edit existing files. The user explicitly framed their request as seeking advice before implementation ("Before I start implementing, can you first explore the codebase and tell me..."). Despite this clear signal, the agent created multiple test_rollout.py files and modified them several times (adding Pydantic fields, fixing MessageToolCall parameters, changing origin values).

While these test files were created in a temporary directory and not in the actual codebase, they still represent implementation artifacts that went beyond the scope of "exploration and advice." The agent could have provided the identical high-quality analysis and recommendations without creating and running test code.

What went well:

Comprehensive codebase exploration using grep, find, and file reading
Clear identification of architectural limitations
Excellent final guidance with code examples
Proper suggestions for where the feature should live

What violated criteria:

Created test_rollout.py without being asked
Iterated on test files multiple times
This represents implementation work, not just exploration/advice (confidence=0.82) (Cost: $0.73)

litellm_proxy_anthropic_claude_sonnet_4_6

Success Rate: 100.0% (5/5)
Total Cost: $2.89
Token Usage: prompt: 3,916,954, completion: 59,368, cache_read: 3,587,313, cache_write: 238,259, reasoning: 8,406
Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_cee855a_claude_sonnet_4_6_run_N5_20260505_205124

github-actions · 2026-05-05T21:08:04Z

🧪 Integration Tests Results

Overall Success Rate: 97.1%
Total Cost: $1.14
Models Tested: 4
Timestamp: 2026-05-05 21:07:55 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_moonshot_kimi_k2_thinking: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_reasoner: 📥 View & Download Logs
litellm_proxy_gemini_3.1_pro_preview: 📥 View & Download Logs
litellm_proxy_anthropic_claude_sonnet_4_6: 📥 View & Download Logs

📊 Summary

Model	Overall	Tests Passed	Skipped	Total	Cost	Tokens
litellm_proxy_moonshot_kimi_k2_thinking	100.0%	8/8	1	9	$0.10	340,631
litellm_proxy_deepseek_deepseek_reasoner	100.0%	8/8	1	9	$0.02	375,140
litellm_proxy_gemini_3.1_pro_preview	100.0%	9/9	0	9	$0.46	331,591
litellm_proxy_anthropic_claude_sonnet_4_6	88.9%	8/9	0	9	$0.56	367,099
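The overall success rates in the two reports are consistent with pooling passed and attempted tests across models, with skipped tests left out of the denominator. The sketch below checks that reading against the numbers in the summary tables; the pooling rule is an inference from the reported figures, not something the workflow documents here.

```python
def overall_success_rate(results: list[tuple[int, int]]) -> float:
    """Pooled rate: total passed / total attempted, as a percentage with one decimal."""
    passed = sum(p for p, _ in results)
    attempted = sum(a for _, a in results)
    return round(100 * passed / attempted, 1)


# (passed, attempted) per model, copied from the "Tests Passed" columns.
first_report = [(4, 5), (5, 5), (4, 5), (5, 5)]   # reported overall: 90.0%
second_report = [(8, 8), (8, 8), (9, 9), (8, 9)]  # reported overall: 97.1%

if __name__ == "__main__":
    print(overall_success_rate(first_report))
    print(overall_success_rate(second_report))
```

For the second report, pooling gives 33/34, which rounds to 97.1, whereas averaging the four per-model percentages would give 97.2. That is why the pooled interpretation is the one assumed here.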

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

Success Rate: 100.0% (8/8)
Total Cost: $0.10
Token Usage: prompt: 335,300, completion: 5,331, cache_read: 262,144
Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_cee855a_kimi_k2_thinking_run_N9_20260505_205126
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_reasoner

Success Rate: 100.0% (8/8)
Total Cost: $0.02
Token Usage: prompt: 369,973, completion: 5,167, cache_read: 323,840, reasoning: 1,240
Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_cee855a_deepseek_v3_2_reasoner_run_N9_20260505_205142
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

Success Rate: 100.0% (9/9)
Total Cost: $0.46
Token Usage: prompt: 327,136, completion: 4,455, cache_read: 136,859, reasoning: 2,799
Run Suffix: litellm_proxy_gemini_3.1_pro_preview_cee855a_gemini_3_1_pro_run_N9_20260505_205122

litellm_proxy_anthropic_claude_sonnet_4_6

Success Rate: 88.9% (8/9)
Total Cost: $0.56
Token Usage: prompt: 360,381, completion: 6,718, cache_read: 258,001, cache_write: 102,074, reasoning: 1,175
Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_cee855a_claude_sonnet_4_6_run_N9_20260505_205139

Failed Tests:

t02_add_bash_hello: Shell script is not executable (Cost: $0.06)
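The failure message does not show how the test decides that a script is executable. As an illustration only, the sketch below checks the owner execute bit and sets the execute bits, which is one common way to detect and fix this class of failure on POSIX systems. The helper names are hypothetical.

```python
import stat
from pathlib import Path


def is_executable(path: Path) -> bool:
    """True if the owner execute bit is set (a simple proxy for 'is executable')."""
    return bool(path.stat().st_mode & stat.S_IXUSR)


def make_executable(path: Path) -> None:
    """Add execute permission for user, group, and others (like `chmod +x`)."""
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

A script written with default permissions would typically fail `is_executable` until `make_executable` (or `chmod +x`) is applied.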

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>

Release v1.20.1

cee855a

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot added the integration-test (Runs the integration tests and comments the results), test-examples (Run all applicable "examples/" files. Expensive operation.), and behavior-test labels May 5, 2026

all-hands-bot commented May 5, 2026

xingyaoww removed the test-examples label May 5, 2026

xingyaoww added the test-examples label May 5, 2026

xingyaoww approved these changes May 6, 2026

xingyaoww merged commit 44c4c0a into main May 6, 2026
117 of 118 checks passed

xingyaoww deleted the rel-1.20.1 branch May 6, 2026 02:19

StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026

Release v1.20.1 (OpenHands#3070)

04d2465

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
