Release v1.20.0 #3038

Merged
xingyaoww merged 4 commits into main from rel-1.20.0
May 4, 2026

@all-hands-bot

@all-hands-bot all-hands-bot commented May 1, 2026

Collaborator

Release v1.20.0

This PR prepares the release for version 1.20.0.

Release Checklist

  • Version set to 1.20.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.20.0
    • Select branch: rel-1.20.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:1635059-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-1635059-python \
  ghcr.io/openhands/agent-server:1635059-python

All tags pushed for this build

ghcr.io/openhands/agent-server:1635059-golang-amd64
ghcr.io/openhands/agent-server:1635059-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:1635059-golang-arm64
ghcr.io/openhands/agent-server:1635059-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:1635059-java-amd64
ghcr.io/openhands/agent-server:1635059-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:1635059-java-arm64
ghcr.io/openhands/agent-server:1635059-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:1635059-python-amd64
ghcr.io/openhands/agent-server:1635059-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:1635059-python-arm64
ghcr.io/openhands/agent-server:1635059-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:1635059-golang
ghcr.io/openhands/agent-server:1635059-java
ghcr.io/openhands/agent-server:1635059-python

About Multi-Architecture Support

  • Each variant tag (e.g., 1635059-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 1635059-python-amd64) are also available if needed
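
As a rough illustration of the naming scheme described above, the variant tags for a build can be derived from the short commit SHA, the variant name, and the architecture. This is only a sketch: `variant_tags` is a hypothetical helper, not part of the SDK or its CI, and it does not cover the base-image-derived tags (such as `1635059-golang_tag_1.21-bookworm-amd64`) that also appear in the list.

```python
REGISTRY = "ghcr.io/openhands/agent-server"


def variant_tags(short_sha, variant, arches=("amd64", "arm64")):
    """Return the multi-arch manifest tag followed by the per-architecture tags.

    Hypothetical helper: the pattern is inferred from the tag list in this
    comment, not taken from the build scripts.
    """
    manifest = f"{short_sha}-{variant}"
    return [manifest] + [f"{manifest}-{arch}" for arch in arches]


# Print the fully qualified image references for the python variant.
for tag in variant_tags("1635059", "python"):
    print(f"{REGISTRY}:{tag}")
```

The first tag is the one to pull in most cases, since Docker resolves the matching architecture from the manifest.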

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot added the integration-test, test-examples, and behavior-test labels May 1, 2026
@github-actions

github-actions Bot commented May 1, 2026

Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions

github-actions Bot commented May 1, 2026

Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions

github-actions Bot commented May 1, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions

github-actions Bot commented May 1, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@all-hands-bot all-hands-bot left a comment

Collaborator Author

🟢 Good taste - Clean, mechanical release version bump.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Pure version number updates for release v1.20.0. All packages consistently bumped from 1.19.1 to 1.20.0, eval workflow default updated, and lock file regenerated correctly. No code changes or behavioral modifications.

VERDICT:
Worth merging: Standard release process, all version updates are consistent.

KEY INSIGHT:
Textbook release PR - mechanical version bumps with proper lock file maintenance.


Note: Cannot formally approve as this appears to be my own PR, but this is ready to merge from a technical review perspective.

@all-hands-bot all-hands-bot left a comment

Collaborator Author

✅ QA Report: PASS

Version bump from 1.19.1 to 1.20.0 is complete and consistent across all packages. Built packages correctly report version 1.20.0 and all artifacts are ready for release.

Does this PR achieve its stated goal?

Yes. This PR successfully prepares the repository for version 1.20.0 release. All version strings have been updated from 1.19.1 to 1.20.0 across all four packages (openhands-sdk, openhands-tools, openhands-workspace, openhands-agent-server), the lockfile, and the workflow configuration. The packages build successfully and report the correct version through their metadata. The version bump is the only change required for this release PR, and it has been executed correctly.

| Phase | Result |
|---|---|
| Environment Setup | uv sync --dev completed, all packages installed |
| CI Status | ⚠️ 2 checks failing (agent-server-tests, check) - unrelated to version bump |
| Functional Verification | ✅ All version strings updated consistently, packages build successfully |

Functional Verification

Test 1: Version Consistency Across Configuration Files

Step 1 — Establish baseline (main branch):
Checked version in all pyproject.toml files on main branch:

openhands-sdk/pyproject.toml: version = "1.19.1"
openhands-tools/pyproject.toml: version = "1.19.1"
openhands-workspace/pyproject.toml: version = "1.19.1"
openhands-agent-server/pyproject.toml: version = "1.19.1"
.github/workflows/run-eval.yml: default: v1.19.1

This confirms the baseline version is 1.19.1 before this PR.

Step 2 — Apply the PR's changes:
Checked out the rel-1.20.0 branch (commit f1621e9).

Step 3 — Verify version updates:
Ran grep '^version' */pyproject.toml:

openhands-sdk/pyproject.toml: version = "1.20.0"
openhands-tools/pyproject.toml: version = "1.20.0"
openhands-workspace/pyproject.toml: version = "1.20.0"
openhands-agent-server/pyproject.toml: version = "1.20.0"

Checked uv.lock:

name = "openhands-agent-server"
version = "1.20.0"
name = "openhands-sdk"
version = "1.20.0"
name = "openhands-tools"
version = "1.20.0"
name = "openhands-workspace"
version = "1.20.0"

Checked workflow default:

sdk_ref:
  description: SDK commit/ref to evaluate...
  required: true
  default: v1.20.0

This confirms all configuration files have been updated consistently to 1.20.0.
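
The grep-based check in Step 3 could also be scripted. The sketch below mirrors `grep '^version'` with a regular expression over the text of each pyproject.toml and reports any file whose version differs from the expected one. The helper names are hypothetical, and the sample contents (including the stale `1.19.1` entry) are made up for illustration rather than taken from the repository.

```python
import re
from typing import Dict, Optional

# Same idea as `grep '^version'`: a line that starts with `version = "..."`.
VERSION_LINE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def declared_version(pyproject_text: str) -> Optional[str]:
    """Return the first version declared at the start of a line, if any."""
    match = VERSION_LINE.search(pyproject_text)
    return match.group(1) if match else None


def mismatches(files: Dict[str, str], expected: str) -> Dict[str, Optional[str]]:
    """Map each path whose declared version differs from `expected` to what was found."""
    found = {path: declared_version(text) for path, text in files.items()}
    return {path: version for path, version in found.items() if version != expected}


# Illustrative inputs only; the second entry simulates a package that was missed.
sample = {
    "openhands-sdk/pyproject.toml": '[project]\nname = "openhands-sdk"\nversion = "1.20.0"\n',
    "openhands-tools/pyproject.toml": '[project]\nname = "openhands-tools"\nversion = "1.19.1"\n',
}
print(mismatches(sample, "1.20.0"))
```

An empty result would correspond to the consistent state reported above.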


Test 2: Package Metadata Reports Correct Version

Step 1 — Verify installed packages report version 1.20.0:
Ran uv sync --dev to install packages, then checked versions:

from importlib.metadata import version
for pkg in ['openhands-sdk', 'openhands-tools', 'openhands-workspace', 'openhands-agent-server']:
    print(f'{pkg}: {version(pkg)}')

Output:

openhands-sdk: 1.20.0
openhands-tools: 1.20.0
openhands-workspace: 1.20.0
openhands-agent-server: 1.20.0

Imported the SDK and verified the banner shows the new version:

from openhands.sdk import Agent, LLM, Conversation

Banner output:

+----------------------------------------------------------------------+
|  OpenHands SDK v1.20.0                                               |
|                                                                      |
|  Report a bug: github.com/OpenHands/software-agent-sdk/issues        |
|  Get help: openhands.dev/joinslack                                   |
|  Scale up: openhands.dev/product/sdk                                 |
+----------------------------------------------------------------------+

This confirms the installed packages correctly report version 1.20.0 through their metadata and user-facing version displays.


Test 3: Packages Build Successfully with Correct Version

Step 1 — Build openhands-sdk package:
Ran uv build openhands-sdk/ --out-dir /tmp/dist-test

Output:

Successfully built /tmp/dist-test/openhands_sdk-1.20.0.tar.gz
Successfully built /tmp/dist-test/openhands_sdk-1.20.0-py3-none-any.whl

Verified wheel metadata:

Name: openhands-sdk
Version: 1.20.0

Step 2 — Build openhands-agent-server package:
Ran uv build openhands-agent-server/ --out-dir /tmp/dist-test-server

Output:

Successfully built /tmp/dist-test-server/openhands_agent_server-1.20.0.tar.gz
Successfully built /tmp/dist-test-server/openhands_agent_server-1.20.0-py3-none-any.whl

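This confirms that the two packages built here (openhands-sdk and openhands-agent-server) build successfully and that their build artifacts (wheels, tarballs) carry the correct version 1.20.0 in their filenames and, for the openhands-sdk wheel checked above, in their metadata. These are the kinds of artifacts that will be published to PyPI.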


Test 4: No Unintended Version References

Searched for any remaining 1.19.1 references:

git grep -n "1\.19\.1" -- '*.py' '*.toml' '*.yml' '*.yaml' '*.md' '*.txt'

Found only:

openhands-sdk/openhands/sdk/tool/registry.py:165:  deprecated_in="1.19.1",

Verified context:

warn_deprecated(
    "register_tool(callable_factory)",
    deprecated_in="1.19.1",
    removed_in="1.24.0",
    ...

This is correct — it's deprecation metadata recording when a feature was deprecated, not the current version. No version strings were missed.
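
The triage in this test can be sketched as a small filter: any hit for the old version is treated as suspicious unless it sits in deprecation metadata. This is a minimal sketch under the assumption that a `deprecated_in=` marker on the same line is enough to identify such metadata. The `classify` helper and the second sample hit are hypothetical.

```python
OLD_VERSION = "1.19.1"


def classify(hits):
    """Split grep-style hits mentioning OLD_VERSION into (expected, suspicious)."""
    expected, suspicious = [], []
    for line in hits:
        if OLD_VERSION not in line:
            continue
        # Deprecation metadata records when a feature was deprecated,
        # so the old version number is supposed to stay there.
        if "deprecated_in=" in line:
            expected.append(line)
        else:
            suspicious.append(line)
    return expected, suspicious


hits = [
    'openhands-sdk/openhands/sdk/tool/registry.py:165:  deprecated_in="1.19.1",',
    'some/package/pyproject.toml:3:version = "1.19.1"',  # hypothetical missed bump
]
expected, suspicious = classify(hits)
print("expected:", expected)
print("suspicious:", suspicious)
```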

Issues Found

None related to the version bump. The PR achieves its stated goal.

Note on CI failures: Two checks are failing (agent-server-tests, check), but these appear to be pre-existing or unrelated to the version bump itself. The "Check package versions" CI check passes, confirming version consistency. The failing checks should be addressed as part of the release process per the PR checklist.

@github-actions

github-actions Bot commented May 1, 2026

Contributor

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-agent-server/openhands/agent_server/config.py | 68 | 3 | 95% | 29, 42, 193 |
| openhands-agent-server/openhands/agent_server/file_router.py | 66 | 12 | 81% | 56–58, 94–96, 124–127, 130–131 |
| openhands-sdk/openhands/sdk/llm/llm.py | 532 | 88 | 83% | 465, 489, 532, 795, 904, 906–907, 935, 981, 992–994, 998, 1004–1007, 1009–1016, 1024–1026, 1036–1038, 1041–1042, 1046, 1049–1050, 1052–1053, 1055, 1286–1287, 1492–1493, 1502, 1515, 1517–1522, 1524–1541, 1544–1548, 1550–1551, 1557–1566, 1623, 1625 |
| TOTAL | 25326 | 5878 | 76% | |

Remove due LLM and agent-server deprecated APIs while extending the context.skills import shim for downstream migration.

Co-authored-by: openhands <openhands@all-hands.dev>
Teach the REST API breakage check to allow OpenAPI schema property removals after their documented deprecation deadline, matching the release treatment for removed operations.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig requested a review from xingyaoww May 3, 2026 13:47

@xingyaoww xingyaoww left a comment

Member

LGTM

@xingyaoww xingyaoww left a comment

Member

Actually, we haven't done the test-examples/integration-test QA yet.

@xingyaoww

Member

I might as well bring the latest updates from main into this release.

@xingyaoww added the test-examples and integration-test labels and removed the integration-test and test-examples labels May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions

github-actions Bot commented May 4, 2026

Contributor

🧪 Integration Tests Results

Overall Success Rate: 97.1%
Total Cost: $1.12
Models Tested: 4
Timestamp: 2026-05-04 14:20:19 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 8/8 | 1 | 9 | $0.09 | 310,116 |
| litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 8/8 | 1 | 9 | $0.02 | 314,496 |
| litellm_proxy_gemini_3.1_pro_preview | 100.0% | 9/9 | 0 | 9 | $0.45 | 320,118 |
| litellm_proxy_anthropic_claude_sonnet_4_6 | 88.9% | 8/9 | 0 | 9 | $0.56 | 375,855 |
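
The headline figures appear to follow from pooling the per-model rows. The sketch below assumes that skipped tests are excluded from the denominator and that the overall rate is passed tests divided by executed tests across all models. That reading reproduces the reported 97.1% and $1.12, but the aggregation logic itself is not shown in this report.

```python
# (passed, executed, cost in USD) per model, copied from the summary rows.
results = {
    "litellm_proxy_moonshot_kimi_k2_thinking": (8, 8, 0.09),
    "litellm_proxy_deepseek_deepseek_reasoner": (8, 8, 0.02),
    "litellm_proxy_gemini_3.1_pro_preview": (9, 9, 0.45),
    "litellm_proxy_anthropic_claude_sonnet_4_6": (8, 9, 0.56),
}

passed = sum(p for p, _, _ in results.values())
executed = sum(n for _, n, _ in results.values())
total_cost = sum(c for _, _, c in results.values())

# Assumed aggregation: pooled pass rate over executed (non-skipped) tests.
rate = 100 * passed / executed

print(f"{passed}/{executed} = {rate:.1f}%")
print(f"total cost: ${total_cost:.2f}")
```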

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.09
  • Token Usage: prompt: 304,592, completion: 5,524, cache_read: 239,360
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1635059_kimi_k2_thinking_run_N9_20260504_141649
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.02
  • Token Usage: prompt: 309,803, completion: 4,693, cache_read: 271,872, reasoning: 1,251
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_1635059_deepseek_v3_2_reasoner_run_N9_20260504_141651
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3.1_pro_preview

  • Success Rate: 100.0% (9/9)
  • Total Cost: $0.45
  • Token Usage: prompt: 315,442, completion: 4,676, cache_read: 129,868, reasoning: 3,004
  • Run Suffix: litellm_proxy_gemini_3.1_pro_preview_1635059_gemini_3_1_pro_run_N9_20260504_141651

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 88.9% (8/9)
  • Total Cost: $0.56
  • Token Usage: prompt: 369,487, completion: 6,368, cache_read: 266,710, cache_write: 102,463, reasoning: 914
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_1635059_claude_sonnet_4_6_run_N9_20260504_141654

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.06)

@github-actions

github-actions Bot commented May 4, 2026

Contributor

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-04 14:37:26 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 31.0s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 22.2s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.1s $0.01
01_standalone_sdk/07_mcp_integration.py ❌ FAIL (Exit code 1) 40.1s --
01_standalone_sdk/09_pause_example.py ✅ PASS 14.0s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 45.0s $0.02
01_standalone_sdk/11_async.py ✅ PASS 43.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.1s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 37.1s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Exit code 1) 15.1s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.7s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 16.5s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 18.1s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 26.1s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 24.4s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 22.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 2m 45s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 52s $0.28
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 23s $0.07
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 21.2s $0.03
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 40.3s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 42.0s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.4s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 29s $0.31
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 33.5s $0.04
01_standalone_sdk/34_critic_example.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 17.6s $0.01
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 3.9s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 51.2s $0.04
01_standalone_sdk/39_llm_fallback.py ✅ PASS 21.7s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 33.1s $0.14
01_standalone_sdk/41_task_tool_set.py ✅ PASS 35.2s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 39s $0.09
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 6.3s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 16.8s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 3m 1s $0.31
01_standalone_sdk/46_agent_settings.py ✅ PASS 10.7s $0.00
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 3.1s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 15.6s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 46.2s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 2m 0s $0.07
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 21s $0.06
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 50s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 42.2s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 38s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 1m 1s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 29.0s $0.05
02_remote_agent_server/11_conversation_fork.py ✅ PASS 1m 5s $0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 30.0s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 49.7s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 17.9s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 20.7s $0.02

❌ Some tests failed

Total: 52 | Passed: 49 | Failed: 3 | Total Cost: $2.26

Failed examples:

  • examples/01_standalone_sdk/07_mcp_integration.py: Exit code 1
  • examples/01_standalone_sdk/14_context_condenser.py: Exit code 1
  • examples/01_standalone_sdk/34_critic_example.py: Timed out after 600 seconds

View full workflow run

@xingyaoww

Member

@OpenHands investigate the failures in #3038 (comment). Should I be worried? Can you re-run only those three failing examples and see whether they pass?

@openhands-ai

openhands-ai Bot commented May 4, 2026


I'm on it! xingyaoww can track my progress at all-hands.dev

Member

Investigation of Example Test Failures

This comment was created by an AI agent (OpenHands).

I investigated the 3 failing examples from the workflow run. Here's the analysis:

1. 07_mcp_integration.py — ❌ Exit code 1

Root cause: Transient LLM proxy authentication error

litellm.AuthenticationError: AuthenticationError: Litellm_proxyException - Authentication Error,

This is a transient auth failure with the LLM proxy — not a code bug.

2. 14_context_condenser.py — ❌ Exit code 1

Root cause: Transient LLM proxy authentication error

litellm.AuthenticationError: AuthenticationError: Litellm_proxyException - Authentication Error, All connection attempts failed

Same class of transient proxy auth failure. Not a code bug.

3. 34_critic_example.py — ❌ Timed out (600s)

Root cause: Critic iterative refinement didn't reach quality threshold with claude-haiku

The critic kept trying to refine but never hit the 0.700 threshold across 3 iterations:

  • Iteration 1: score 0.620
  • Iteration 2: score 0.626
  • Iteration 3: score 0.504

Each iteration involved full agent runs with claude-haiku, so the total time exceeded the 10-minute timeout. This is model-quality dependent and not a code regression.
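
To make the threshold check concrete, the sketch below walks the three reported scores and looks for the first iteration that clears 0.700. It is only an illustration: `first_passing_iteration` is a hypothetical helper, and whether the real critic compares with `>=` or `>` is an assumption here.

```python
from typing import List, Optional

THRESHOLD = 0.700
scores = [0.620, 0.626, 0.504]  # iterations 1-3 from the failed run


def first_passing_iteration(scores: List[float], threshold: float) -> Optional[int]:
    """Return the 1-based iteration whose score reaches the threshold, else None."""
    for iteration, score in enumerate(scores, start=1):
        if score >= threshold:  # assumed comparison; the example's actual rule is not shown
            return iteration
    return None


print(first_passing_iteration(scores, THRESHOLD))
print(f"best score: {max(scores):.3f}")
```

With no iteration reaching the threshold, the example keeps refining until it runs out of iterations or, as here, hits the 600-second limit first.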

Summary: Nothing to worry about

All 3 failures are flaky/infrastructure-related, not code regressions:

  • 2 are transient LLM proxy authentication errors
  • 1 is a timeout due to the critic example being too slow with claude-haiku (the critic quality threshold of 0.700 is hard for haiku to consistently reach)

I've re-triggered the failed jobs — you can monitor the re-run here: https://github.com/OpenHands/software-agent-sdk/actions/runs/25324058504

@github-actions

github-actions Bot commented May 4, 2026

Contributor

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-05-04 15:08:46 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 32.5s $0.04
01_standalone_sdk/03_activate_skill.py ✅ PASS 33.0s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.0s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.3s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 12.5s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 34.0s $0.03
01_standalone_sdk/11_async.py ✅ PASS 47.0s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 9.0s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 35.4s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Exit code 1) 3m 42s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.2s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 18.0s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 28.2s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.1s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 14.5s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 29.4s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 2m 4s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 22s $0.33
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 15s $0.07
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 17.1s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 38.5s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 47.5s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 8.6s $0.01
01_standalone_sdk/31_iterative_refinement.py ❌ FAIL (Exit code 1) 7m 33s --
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 33.3s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 7m 59s $0.68
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 26.6s $0.01
01_standalone_sdk/37_llm_profile_store/main.py ✅ PASS 7.9s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 31.3s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.7s $0.00
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 33.6s $0.13
01_standalone_sdk/41_task_tool_set.py ✅ PASS 29.3s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 49s $0.08
01_standalone_sdk/43_mixed_marketplace_skills/main.py ✅ PASS 4.9s $0.00
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 7.6s $0.01
01_standalone_sdk/45_parallel_tool_execution.py ✅ PASS 3m 45s $0.51
01_standalone_sdk/46_agent_settings.py ✅ PASS 12.3s $0.01
01_standalone_sdk/47_defense_in_depth_security.py ✅ PASS 3.0s $0.00
01_standalone_sdk/48_conversation_fork.py ✅ PASS 13.6s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 34.0s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 2m 2s $0.07
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 17s $0.09
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 23s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 29.9s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 46s $0.02
02_remote_agent_server/09_acp_agent_with_remote_runtime.py ✅ PASS 1m 8s $0.11
02_remote_agent_server/10_cloud_workspace_share_credentials.py ✅ PASS 34.1s $0.05
02_remote_agent_server/11_conversation_fork.py ✅ PASS 59.1s $0.00
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 27.0s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ❌ FAIL (Exit code 1) 32.2s --
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 24.6s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 22.9s $0.02

❌ Some tests failed

Total: 52 | Passed: 49 | Failed: 3 | Total Cost: $2.79

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Exit code 1
  • examples/01_standalone_sdk/31_iterative_refinement.py: Exit code 1
  • examples/04_llm_specific_tools/02_gemini_file_tools.py: Exit code 1

View full workflow run

Member

Re-run Results Update

This comment was created by an AI agent (OpenHands).

The re-run of the failed jobs has completed. Here's the comparison:

Original 3 failures → Re-run results:

| Example | Original Run | Re-run | Verdict |
|---|---|---|---|
| 07_mcp_integration.py | ❌ Auth Error | ✅ PASS | Flaky: transient proxy auth |
| 14_context_condenser.py | ❌ Auth Error | ❌ Auth Error | Flaky: transient proxy auth (hit it again) |
| 34_critic_example.py | ❌ Timeout (600s) | ✅ PASS | Flaky: model-dependent timing |

New failures in re-run (also transient):

  • 31_iterative_refinement.py: AuthenticationError: Litellm_proxyException - Authentication Error,
  • 02_gemini_file_tools.py: AuthenticationError: Server disconnected without sending a response.

Bottom line: No need to worry

Apart from the original 34_critic_example.py timeout, every failure across both runs is a transient LLM proxy authentication error. A different subset of examples fails each time, which points to infrastructure flakiness rather than a code regression. The originally failing 07_mcp_integration.py and 34_critic_example.py now pass cleanly.

49 of 52 examples passed in each run, although not the same 49: 47 passed in both. Only 14_context_condenser.py failed twice, both times with a proxy auth error.
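
The "different subset each time" argument can be checked with plain set arithmetic over the two failure lists reported above. This is a small sketch using only the example names and the total of 52 from the two workflow comments.

```python
TOTAL_EXAMPLES = 52

first_run_failures = {
    "07_mcp_integration.py",
    "14_context_condenser.py",
    "34_critic_example.py",
}
rerun_failures = {
    "14_context_condenser.py",
    "31_iterative_refinement.py",
    "02_gemini_file_tools.py",
}

failed_both = first_run_failures & rerun_failures   # failed in both runs
recovered = first_run_failures - rerun_failures     # failed first, passed on re-run
newly_failed = rerun_failures - first_run_failures  # passed first, failed on re-run
passed_both = TOTAL_EXAMPLES - len(first_run_failures | rerun_failures)

print("failed in both runs:", sorted(failed_both))
print("recovered on re-run:", sorted(recovered))
print("new failures on re-run:", sorted(newly_failed))
print("passed in both runs:", passed_both)
```

A failure that persisted across many re-runs would be a stronger hint of a real regression. With two runs, the single overlap (14_context_condenser.py) is consistent with flakiness but does not rule out a problem specific to that example.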

@openhands-ai

openhands-ai Bot commented May 4, 2026


Summary

The user asked me to investigate the 3 failing examples from the test-examples workflow on PR #3038 and determine if they should be worried, then re-run those three failing examples.

Findings

No need to worry — all failures are transient infrastructure issues, not code bugs.

| Example | Failure Type | Re-run Result | Root Cause |
|---|---|---|---|
| 07_mcp_integration.py | Auth Error | ✅ PASS | Transient LLM proxy auth failure |
| 14_context_condenser.py | Auth Error | ❌ Auth Error (flaky) | Transient LLM proxy auth failure |
| 34_critic_example.py | Timeout (600s) | ✅ PASS | Critic iterative refinement too slow with claude-haiku (scores 0.620→0.626→0.504, never reached 0.700 threshold) |

Key evidence it's flaky: The re-run produced a different set of 3 failures — previously-passing examples (31_iterative_refinement.py, 02_gemini_file_tools.py) now failed with the same proxy auth errors, while 2 of the original 3 failures now pass. This rotation of failures confirms infrastructure flakiness.

Actions Taken

  • Investigated the failure by pulling workflow logs and identifying root causes for all 3 failures
  • Re-ran the failed jobs via gh run rerun --failed
  • Waited for the re-run to complete and analyzed the new results
  • Posted two detailed analysis comments on PR Release v1.20.0 #3038 (initial investigation + re-run results)
  • No code changes were made (none were needed — these aren't code bugs)

@xingyaoww xingyaoww left a comment

Member

OK Let's ship!

@xingyaoww xingyaoww merged commit 4cc0ebd into main May 4, 2026
70 of 73 checks passed
@xingyaoww xingyaoww deleted the rel-1.20.0 branch May 4, 2026 15:30
StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>
Labels

behavior-test, integration-test, test-examples
