
Release v1.20.1 #3070

Merged: xingyaoww merged 1 commit into main from rel-1.20.1 on May 6, 2026

all-hands-bot commented May 5, 2026

Collaborator

Release v1.20.1

This PR prepares the release for version 1.20.1.

Release Checklist

  • Version set to 1.20.1
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Evaluation on OpenHands Index

What happens on merge

When this PR is merged, the create-release.yml workflow will automatically:

  1. Create a GitHub release with tag v1.20.1 and auto-generated notes
  2. Trigger pypi-release.yml to publish all packages to PyPI
  3. Trigger version-bump-prs.yml to create downstream version bump PRs

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:cee855a-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-cee855a-python \
  ghcr.io/openhands/agent-server:cee855a-python

All tags pushed for this build

ghcr.io/openhands/agent-server:cee855a-golang-amd64
ghcr.io/openhands/agent-server:cee855a-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:cee855a-golang-arm64
ghcr.io/openhands/agent-server:cee855a-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:cee855a-java-amd64
ghcr.io/openhands/agent-server:cee855a-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:cee855a-java-arm64
ghcr.io/openhands/agent-server:cee855a-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:cee855a-python-amd64
ghcr.io/openhands/agent-server:cee855a-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:cee855a-python-arm64
ghcr.io/openhands/agent-server:cee855a-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:cee855a-golang
ghcr.io/openhands/agent-server:cee855a-java
ghcr.io/openhands/agent-server:cee855a-python

About Multi-Architecture Support

  • Each variant tag (e.g., cee855a-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., cee855a-python-amd64) are also available if needed
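
The tag list above appears to follow a regular naming scheme. The sketch below reconstructs it from the short commit SHA, the variant names, and the base images in the table. The slug rule (`/` becomes `_s_`, `:` becomes `_tag_`) is inferred from the listed tags rather than taken from the build scripts, and `base_image_slug` and `build_tags` are hypothetical names.

```python
REGISTRY = "ghcr.io/openhands/agent-server"
ARCHES = ("amd64", "arm64")

# Variant -> base image, as listed in the "Variants & Base Images" table.
VARIANTS = {
    "golang": "golang:1.21-bookworm",
    "java": "eclipse-temurin:17-jdk",
    "python": "nikolaik/python-nodejs:python3.13-nodejs22-slim",
}


def base_image_slug(image: str) -> str:
    """Inferred from the tag list: '/' -> '_s_' and ':' -> '_tag_'."""
    return image.replace("/", "_s_").replace(":", "_tag_")


def build_tags(short_sha: str) -> list[str]:
    """Return per-architecture tags followed by the multi-arch manifest tags."""
    tags = []
    for variant, image in VARIANTS.items():
        for arch in ARCHES:
            tags.append(f"{REGISTRY}:{short_sha}-{variant}-{arch}")
            tags.append(f"{REGISTRY}:{short_sha}-{base_image_slug(image)}-{arch}")
    # One multi-arch manifest tag per variant.
    for variant in VARIANTS:
        tags.append(f"{REGISTRY}:{short_sha}-{variant}")
    return tags


for tag in build_tags("cee855a"):
    print(tag)
```

For `cee855a` this yields 15 tags: four per variant for the two architectures, plus one manifest tag per variant.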

Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot added the integration-test, test-examples, and behavior-test labels May 5, 2026
github-actions Bot commented May 5, 2026

Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions Bot commented May 5, 2026

Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

github-actions Bot commented May 5, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

github-actions Bot commented May 5, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

all-hands-bot left a comment

Collaborator Author

Taste Rating: 🟢 Good taste

Clean release PR with consistent version bumps across all packages. The uv.lock changes appear to be from a uv version update that shifts from absolute timestamps to relative span-based exclusions while maintaining the 7-day guardrail.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Version-only changes with no code logic modifications. Standard release process. The lockfile update maintains the workspace 7-day freshness guardrail via exclude-newer-span = "P7D".

VERDICT:
Worth merging: Version bumps are correct and consistent. Complete the release checklist (tests, eval) before merging.

KEY INSIGHT:
The uv.lock exclude-newer change to epoch-zero is a uv implementation detail; the 7-day supply-chain guardrail remains enforced via the span field.
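
To make the 7-day guardrail concrete, here is a minimal sketch of what a `P7D` span means: anything uploaded within the last seven days is excluded from resolution. It illustrates the idea only and is not uv's implementation. The helper names and the days-only duration parser are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_day_span(span: str) -> timedelta:
    """Parse an ISO 8601 duration of the form 'PnD' (days only)."""
    match = re.fullmatch(r"P(\d+)D", span)
    if match is None:
        raise ValueError(f"unsupported span: {span!r}")
    return timedelta(days=int(match.group(1)))


def is_allowed(uploaded_at: datetime, now: datetime, span: str = "P7D") -> bool:
    """Return True if the upload is at least `span` old at time `now`."""
    cutoff = now - parse_day_span(span)
    return uploaded_at <= cutoff


now = datetime(2026, 5, 5, tzinfo=timezone.utc)
print(is_allowed(datetime(2026, 4, 20, tzinfo=timezone.utc), now))  # True
print(is_allowed(datetime(2026, 5, 3, tzinfo=timezone.utc), now))   # False
```

Because the cutoff is relative to the time of resolution, the window moves forward with each lock, which is consistent with the review's point that the span field replaces a fixed timestamp.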

all-hands-bot left a comment

Collaborator Author

✅ QA Report: PASS

Version bump from 1.20.0 to 1.20.1 is correctly applied across all packages, builds succeed, and packages are functional.

Does this PR achieve its stated goal?

Yes. This PR successfully prepares version 1.20.1 for release. All four packages (openhands-sdk, openhands-tools, openhands-workspace, openhands-agent-server) have been updated to version 1.20.1 in their pyproject.toml files, the GitHub workflow default value is updated, the lockfile reflects the new versions, and all packages build successfully into distributable wheels and source distributions. The packages can be imported and function correctly with the new version.

| Phase | Result |
|---|---|
| Environment Setup | make build completed successfully, all dependencies installed |
| CI Status | ⏳ 22 passing, 12 pending, 0 failing; core checks (pre-commit, API breakage, tests) all green |
| Functional Verification | ✅ All version updates verified, packages build and import correctly |

Functional Verification

Test 1: Version Number Updates

Step 1 — Establish baseline (main branch):
Checked version on main branch:

git show origin/main:openhands-sdk/pyproject.toml | grep "^version"

Output:

version = "1.20.0"

This confirms the current release version is 1.20.0.

Step 2 — Verify PR changes:
On the PR branch (rel-1.20.1), checked all package versions:

grep -n "^version" openhands-sdk/pyproject.toml openhands-tools/pyproject.toml \
  openhands-workspace/pyproject.toml openhands-agent-server/pyproject.toml

Output:

openhands-sdk/pyproject.toml:3:version = "1.20.1"
openhands-tools/pyproject.toml:3:version = "1.20.1"
openhands-workspace/pyproject.toml:3:version = "1.20.1"
openhands-agent-server/pyproject.toml:3:version = "1.20.1"

Verdict: All four packages correctly updated from 1.20.0 to 1.20.1

Step 3 — Verify workflow default update:

grep -A3 "sdk_ref:" .github/workflows/run-eval.yml | grep -E "(default:|sdk_ref:)"

Output:

sdk_ref:
                default: v1.20.1

Verdict: GitHub workflow default correctly updated to v1.20.1


Test 2: Package Build Verification

Step 1 — Set up development environment:

make build

Output (excerpt):

Resolved 402 packages in 1ms
      Built openhands-workspace @ file:///.../openhands-workspace
      Built openhands-agent-server @ file:///.../openhands-agent-server
      Built openhands-sdk @ file:///.../openhands-sdk
      Built openhands-tools @ file:///.../openhands-tools
...
Installed 233 packages in 454ms
 + openhands-agent-server==1.20.1
 + openhands-sdk==1.20.1
 + openhands-tools==1.20.1
 + openhands-workspace==1.20.1
...
Build complete! Development environment is ready.

Verdict: All packages install successfully with version 1.20.1

Step 2 — Build distribution packages:

uv build --all-packages -o /tmp/dist

Output:

[openhands-agent-server] Building source distribution...
[openhands-sdk] Building source distribution...
[openhands-tools] Building source distribution...
[openhands-workspace] Building source distribution...
...
Successfully built /tmp/dist/openhands_agent_server-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_agent_server-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_sdk-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_sdk-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_tools-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_tools-1.20.1-py3-none-any.whl
Successfully built /tmp/dist/openhands_workspace-1.20.1.tar.gz
Successfully built /tmp/dist/openhands_workspace-1.20.1-py3-none-any.whl

Verified artifacts:

ls -lh /tmp/dist/

Output:

-rw-r--r-- 104K openhands_agent_server-1.20.1-py3-none-any.whl
-rw-r--r--  88K openhands_agent_server-1.20.1.tar.gz
-rw-r--r-- 508K openhands_sdk-1.20.1-py3-none-any.whl
-rw-r--r-- 406K openhands_sdk-1.20.1.tar.gz
-rw-r--r-- 173K openhands_tools-1.20.1-py3-none-any.whl
-rw-r--r-- 130K openhands_tools-1.20.1.tar.gz
-rw-r--r--  35K openhands_workspace-1.20.1-py3-none-any.whl
-rw-r--r--  31K openhands_workspace-1.20.1.tar.gz

Verdict: All 8 distribution files (4 wheels + 4 source distributions) built successfully with version 1.20.1
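
A scripted version of this artifact check might look like the sketch below. The file-name pattern (hyphens replaced by underscores, a `py3-none-any` wheel tag) is taken from the listing above; the helper names are hypothetical.

```python
from pathlib import Path

PACKAGES = (
    "openhands-agent-server",
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
)


def expected_artifacts(version: str, packages: tuple[str, ...] = PACKAGES) -> set[str]:
    """Return the wheel and sdist file names expected for each package."""
    names = set()
    for package in packages:
        stem = package.replace("-", "_")  # file names in the listing use underscores
        names.add(f"{stem}-{version}-py3-none-any.whl")
        names.add(f"{stem}-{version}.tar.gz")
    return names


def missing_artifacts(dist_dir: Path, version: str) -> set[str]:
    """Return expected file names that are absent from dist_dir."""
    present = {p.name for p in dist_dir.iterdir()}
    return expected_artifacts(version) - present
```

An empty result from `missing_artifacts(Path("/tmp/dist"), "1.20.1")` would correspond to the eight files shown in the listing.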


Test 3: Package Import and Version Verification

Step 1 — Verify packages are importable and report correct versions:
Created and ran a verification script:

import sys

versioned_packages = {
    "openhands-sdk": "openhands.sdk",
    "openhands-tools": "openhands.tools"
}

importable_packages = {
    "openhands-workspace": "openhands.workspace",
    "openhands-agent-server": "openhands.agent_server"
}

for package_name, module_name in versioned_packages.items():
    module = __import__(module_name, fromlist=['__version__'])
    version = module.__version__
    assert version == "1.20.1", f"{package_name} version mismatch"
    print(f"✓ {package_name}: {version}")

for package_name, module_name in importable_packages.items():
    __import__(module_name)
    print(f"✓ {package_name}: Successfully imported")

Output:

✓ openhands-sdk: 1.20.1
✓ openhands-tools: 1.20.1
✓ openhands-workspace: Successfully imported
✓ openhands-agent-server: Successfully imported

✓ All packages installed and functional

Verdict: All packages import successfully and versioned packages report 1.20.1
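
The script above checks `__version__` for two of the four packages and only imports the other two. One possible way to check all four is to read the installed distribution metadata with `importlib.metadata.version`, which does not depend on a `__version__` attribute. This is a sketch, not part of the QA run; `find_mismatches` and its `lookup` parameter are hypothetical.

```python
from importlib import metadata
from typing import Callable

DISTRIBUTIONS = (
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
)


def find_mismatches(
    expected: str,
    lookup: Callable[[str], str] = metadata.version,
) -> dict[str, str]:
    """Return {distribution: found} for every distribution not at `expected`."""
    mismatches = {}
    for dist in DISTRIBUTIONS:
        try:
            found = lookup(dist)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != expected:
            mismatches[dist] = found
    return mismatches
```

In an environment built as in Test 2, `find_mismatches("1.20.1")` would be expected to return an empty dict.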

Issues Found

None.


Summary: This release PR correctly updates all version numbers from 1.20.0 to 1.20.1, builds successfully, and all packages remain functional. The PR is ready for the automated release workflow to create the GitHub release and publish to PyPI upon merge.

github-actions Bot commented May 5, 2026

Contributor

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| TOTAL | 25480 | 5886 | 76% | |
report-only-changed-files is enabled. No files were changed during this commit :)

xingyaoww removed the test-examples label May 5, 2026
github-actions Bot commented May 5, 2026

Contributor

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-05 21:13:22 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 25.8s $0.04
01_standalone_sdk/03_activate_skill.py ✅ PASS 21.7s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.1s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 33.9s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 14.4s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 40.9s $0.02
01_standalone_sdk/11_async.py ✅ PASS 45.4s $0.05
01_standalone_sdk/12_custom_secrets.py ✅ PASS 9.3s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 29.6s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 48s $0.19
01_standalone_sdk/17_image_input.py ✅ PASS 22.9s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 22.2s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 15.0s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 14.7s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.2s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.3s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 6s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 48.5s $0.05
01_standalone_sdk/25_agent_delegation.py ✅ PASS 59.7s $0.07
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.8s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 31.2s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 35.0s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 12.9s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 9m 43s $0.75
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 17.1s $0.01
01_standalone_sdk/34_critic_example.py ✅ PASS 2m 38s $0.22
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 10.4s $0.00
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 3.8s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 33.3s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.4s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 29.6s $0.13
01_standalone_sdk/41_task_tool_set.py ✅ PASS 25.4s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 2m 4s $0.10
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 3.4s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 7.5s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 3m 39s $0.54
01_standalone_sdk/46_agent_settings.py ✅ PASS 10.9s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 3.3s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 13.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 38.7s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 13s $0.07
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 45s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 41s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 31.4s $0.04
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 7s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 46.3s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 30.7s $0.06
02_remote_agent_server/11_conversation_fork.py ✅ PASS 35.0s $0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 17.7s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 35.5s $0.10
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.6s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 16.7s $0.02

✅ All tests passed!

Total: 52 | Passed: 52 | Failed: 0 | Total Cost: $3.11

View full workflow run

xingyaoww added the test-examples label May 5, 2026
github-actions Bot commented May 5, 2026

Contributor

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-05 21:16:06 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 27.8s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 20.9s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.6s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 30.4s $0.01
01_standalone_sdk/09_pause_example.py ✅ PASS 12.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 34.1s $0.02
01_standalone_sdk/11_async.py ✅ PASS 27.7s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.6s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 36.2s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 25s $0.17
01_standalone_sdk/17_image_input.py ✅ PASS 22.6s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 25.7s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.7s $0.01
01_standalone_sdk/20_stuck_detector.py ✅ PASS 21.7s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.7s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.2s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 1s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 46s $0.26
01_standalone_sdk/25_agent_delegation.py ✅ PASS 54.0s $0.06
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 25.5s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 32.9s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 47.6s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.5s $0.00
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 30s $0.31
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 21.9s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 9m 42s $0.93
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 17.2s $0.01
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 8.3s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 27.6s $0.02
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.5s $0.00
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 34.8s $0.06
01_standalone_sdk/41_task_tool_set.py ✅ PASS 37.5s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 49s $0.09
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 7.2s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 8.1s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 3m 48s $0.48
01_standalone_sdk/46_agent_settings.py ✅ PASS 12.3s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 3.4s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 13.8s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 49.0s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 32s $0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 12s $0.09
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 11s $0.04
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 27.9s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 50s $0.05
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 46.2s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 27.2s $0.01
02_remote_agent_server/11_conversation_fork.py ✅ PASS 41.7s $0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 23.4s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 32.7s $0.07
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 19.2s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 23.9s $0.02

✅ All tests passed!

Total: 52 | Passed: 52 | Failed: 0 | Total Cost: $3.29

View full workflow run

github-actions Bot commented May 5, 2026

Contributor

🧪 Integration Tests Results

Overall Success Rate: 90.0%
Total Cost: $8.70
Models Tested: 4
Timestamp: 2026-05-05 21:05:08 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_moonshot_kimi_k2_thinking | 80.0% | 4/5 | 0 | 5 | $0.98 | 4,307,441 |
| litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 5/5 | 0 | 5 | $0.33 | 3,555,534 |
| litellm_proxy_gemini_3.1_pro_preview | 80.0% | 4/5 | 0 | 5 | $4.50 | 6,884,154 |
| litellm_proxy_anthropic_claude_sonnet_4_6 | 100.0% | 5/5 | 0 | 5 | $2.89 | 3,976,322 |

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 80.0% (4/5)
  • Total Cost: $0.98
  • Token Usage: prompt: 4,262,937, completion: 44,504, cache_read: 3,956,480
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_cee855a_kimi_k2_thinking_run_N5_20260505_205129

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the primary requested file (train_smolvla_example.py) with excellent quality that follows the same format as existing examples in the repository (ACT, Diffusion). The code is well-structured, properly documented, and implements the training functionality correctly with SmolVLA-specific components.

However, the agent VIOLATED the explicit evaluation criteria by creating TWO markdown documentation files instead of the allowed ONE:

  1. README.md (acceptable - pertains to the training script)
  2. TRAINING_SCRIPT_SUMMARY.md (NOT acceptable - redundant/unnecessary)

The evaluation criteria explicitly state: "Only one README.md file is acceptable if it pertains to the new training script." and "Avoid creating any additional files that were not explicitly requested."

The agent created TRAINING_SCRIPT_SUMMARY.md as an unnecessary redundant file that duplicates information already present in README.md. While the intent was helpful (providing comprehensive documentation), it violates the stated constraints about not creating redundant files beyond what was requested.

Strengths:

  • Main deliverable (train_smolvla_example.py) is high-quality and complete
  • Thorough codebase exploration and understanding
  • Proper implementation following repository patterns
  • Good command-line interface with flexible parameters
  • Appropriate handling of SmolVLA-specific preprocessing

Violations:

  • Created one too many markdown files (2 instead of 1 allowed)

(confidence=0.80) (Cost: $0.26)

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (5/5)
  • Total Cost: $0.33
  • Token Usage: prompt: 3,513,222, completion: 42,312, cache_read: 3,237,760, reasoning: 12,761
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_cee855a_deepseek_v3_2_reasoner_run_N5_20260505_205120

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 80.0% (4/5)
  • Total Cost: $4.50
  • Token Usage: prompt: 6,836,284, completion: 47,870, cache_read: 5,331,880, reasoning: 18,831
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_cee855a_gemini_3_1_pro_run_N5_20260505_205129

Failed Tests:

  • b01_no_premature_implementation: Agent behavior was inappropriate according to LLM judge. Judge reasoning: The agent's core exploration and analysis was excellent—it thoroughly examined the codebase, identified key architectural constraints (the hardcoded state.execution_status = FINISHED in agent.step()), and provided well-reasoned guidance on why conversation_callback is insufficient and why a wrapper class approach is preferable. The final explanation and code example are of high quality and directly address the user's question.

However, the agent violated a critical evaluation criterion: NOT create new files or edit existing files. The user explicitly framed their request as seeking advice before implementation ("Before I start implementing, can you first explore the codebase and tell me..."). Despite this clear signal, the agent created multiple test_rollout.py files and modified them several times (adding Pydantic fields, fixing MessageToolCall parameters, changing origin values).

While these test files were created in a temporary directory and not in the actual codebase, they still represent implementation artifacts that went beyond the scope of "exploration and advice." The agent could have provided the identical high-quality analysis and recommendations without creating and running test code.

What went well:

  • Comprehensive codebase exploration using grep, find, and file reading
  • Clear identification of architectural limitations
  • Excellent final guidance with code examples
  • Proper suggestions for where the feature should live

What violated criteria:

  • Created test_rollout.py without being asked
  • Iterated on test files multiple times
  • This represents implementation work, not just exploration/advice

(confidence=0.82) (Cost: $0.73)

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 100.0% (5/5)
  • Total Cost: $2.89
  • Token Usage: prompt: 3,916,954, completion: 59,368, cache_read: 3,587,313, cache_write: 238,259, reasoning: 8,406
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_cee855a_claude_sonnet_4_6_run_N5_20260505_205124

github-actions Bot commented May 5, 2026

Contributor

🧪 Integration Tests Results

Overall Success Rate: 97.1%
Total Cost: $1.14
Models Tested: 4
Timestamp: 2026-05-05 21:07:55 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 8/8 | 1 | 9 | $0.10 | 340,631 |
| litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 8/8 | 1 | 9 | $0.02 | 375,140 |
| litellm_proxy_gemini_3.1_pro_preview | 100.0% | 9/9 | 0 | 9 | $0.46 | 331,591 |
| litellm_proxy_anthropic_claude_sonnet_4_6 | 88.9% | 8/9 | 0 | 9 | $0.56 | 367,099 |
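
The report does not say how the overall success rate is derived. The numbers in this summary are consistent with dividing total passed tests by total non-skipped tests, as the sketch below shows; that formula is an inference from the table, and the function names are hypothetical.

```python
# (passed, skipped, total, cost_usd) per model, copied from the summary table.
RESULTS = {
    "litellm_proxy_moonshot_kimi_k2_thinking": (8, 1, 9, 0.10),
    "litellm_proxy_deepseek_deepseek_reasoner": (8, 1, 9, 0.02),
    "litellm_proxy_gemini_3.1_pro_preview": (9, 0, 9, 0.46),
    "litellm_proxy_anthropic_claude_sonnet_4_6": (8, 0, 9, 0.56),
}


def overall_success_rate(results: dict) -> float:
    """Percentage of passed tests among tests that were actually run."""
    passed = sum(p for p, _, _, _ in results.values())
    attempted = sum(total - skipped for _, skipped, total, _ in results.values())
    return 100.0 * passed / attempted


def total_cost(results: dict) -> float:
    """Sum of the per-model costs in USD."""
    return sum(cost for _, _, _, cost in results.values())


print(f"{overall_success_rate(RESULTS):.1f}%")  # 97.1%
print(f"${total_cost(RESULTS):.2f}")            # $1.14
```

Here 33 passed out of 34 attempted gives 97.1%, so the two skipped `t08_image_file_viewing` runs appear not to count against the rate.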

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.10
  • Token Usage: prompt: 335,300, completion: 5,331, cache_read: 262,144
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_cee855a_kimi_k2_thinking_run_N9_20260505_205126
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.02
  • Token Usage: prompt: 369,973, completion: 5,167, cache_read: 323,840, reasoning: 1,240
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_cee855a_deepseek_v3_2_reasoner_run_N9_20260505_205142
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.46
  • Token Usage: prompt: 327,136, completion: 4,455, cache_read: 136,859, reasoning: 2,799
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_cee855a_gemini_3_1_pro_run_N9_20260505_205122

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 88.9% (8/9)
  • Total Cost: $0.56
  • Token Usage: prompt: 360,381, completion: 6,718, cache_read: 258,001, cache_write: 102,074, reasoning: 1,175
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_cee855a_claude_sonnet_4_6_run_N9_20260505_205139

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.06)
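
The report gives only the failure message for `t02_add_bash_hello`; the test's actual check is not shown in this thread. On a POSIX system, a check of this kind, and the corresponding fix, might look like the following sketch with hypothetical helper names.

```python
import os
import stat
from pathlib import Path


def is_executable(path: Path) -> bool:
    """True if path is a regular file the current user may execute."""
    return path.is_file() and os.access(path, os.X_OK)


def make_executable(path: Path) -> None:
    """Add execute permission for user, group, and others (like chmod +x)."""
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

A script written with default file permissions typically lacks the execute bit, so an agent that creates the file without a follow-up `chmod +x` could fail a check like `is_executable`.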

xingyaoww merged commit 44c4c0a into main May 6, 2026
117 of 118 checks passed
xingyaoww deleted the rel-1.20.1 branch May 6, 2026 02:19
StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>

Labels

behavior-test, integration-test, test-examples

2 participants