Skip to content

fix(view): prevent KeyError on duplicate observation-like events#3051

Merged
VascoSch92 merged 4 commits into
mainfrom
vasco/crash-recovery-can-create-duplicate-tool-results-and-crash-condensation-with-KeyError
May 5, 2026
Merged

fix(view): prevent KeyError on duplicate observation-like events#3051
VascoSch92 merged 4 commits into
mainfrom
vasco/crash-recovery-can-create-duplicate-tool-results-and-crash-condensation-with-KeyError

Conversation

@VascoSch92

@VascoSch92 VascoSch92 commented May 4, 2026

Copy link
Copy Markdown
Member
  • A human has tested these changes.

Summary

  • Make ToolCallMatchingProperty.enforce() cardinality-aware so duplicate observation-like events for the same tool_call_id (e.g. an AgentErrorEvent from crash recovery + a late ObservationEvent) get dropped instead of
    slipping past membership-only checks.
  • Replace set.remove() with set.discard() + logger.warning in manipulation_indices() so a duplicate slipping past enforce() no longer crashes condensation with a raw KeyError.
  • Add a reproduction script under .pr/ and a property-level regression test.

Issue Number

#3045.

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:58cdeb8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-58cdeb8-python \
  ghcr.io/openhands/agent-server:58cdeb8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:58cdeb8-golang-amd64
ghcr.io/openhands/agent-server:58cdeb8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:58cdeb8-golang-arm64
ghcr.io/openhands/agent-server:58cdeb8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:58cdeb8-java-amd64
ghcr.io/openhands/agent-server:58cdeb8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:58cdeb8-java-arm64
ghcr.io/openhands/agent-server:58cdeb8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:58cdeb8-python-amd64
ghcr.io/openhands/agent-server:58cdeb8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:58cdeb8-python-arm64
ghcr.io/openhands/agent-server:58cdeb8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:58cdeb8-golang
ghcr.io/openhands/agent-server:58cdeb8-java
ghcr.io/openhands/agent-server:58cdeb8-python

About Multi-Architecture Support

  • Each variant tag (e.g., 58cdeb8-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 58cdeb8-python-amd64) are also available if needed

@VascoSch92 VascoSch92 requested a review from all-hands-bot May 4, 2026 06:39
@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

PR Artifacts Cleaned Up

The .pr/ directory has been automatically removed.

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
TOTAL24367706271% 
report-only-changed-files is enabled. No files were changed during this commit :)

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean bug fix for the duplicate observation-like events crash. The two-layer defense (enforce detects duplicates + manipulation_indices uses discard) is pragmatic and well-tested.

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW

Bug fix with straightforward logic addressing a real production issue (crash recovery duplicates). Changes are isolated to duplicate detection and don't affect normal operation. Good test coverage for the primary fix.

@VascoSch92 VascoSch92 marked this pull request as ready for review May 4, 2026 06:49

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean bug fix with appropriate two-layer defense.

The duplicate detection logic in enforce() is straightforward and the defensive discard() fallback in manipulation_indices() is pragmatic. Test coverage directly validates the crash recovery scenario.

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW

Isolated bug fix addressing a real crash scenario. Changes are defensive and well-tested. No impact on normal operation.

@VascoSch92 VascoSch92 requested review from csmith49 and neubig May 4, 2026 06:52

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Bug fix verified - duplicate observation events are now properly detected and removed, preventing KeyError crashes in condensation.

Does this PR achieve its stated goal?

Yes. The PR successfully fixes issue #3045 where crash recovery could create duplicate observation-like events (AgentErrorEvent + late ObservationEvent) with the same tool_call_id, causing a KeyError in condensation. I reproduced the exact crash scenario on main (KeyError raised), then verified the fix eliminates the crash by: (1) making enforce() cardinality-aware to drop duplicates, and (2) using discard() in manipulation_indices() as a defensive layer. The reproduction script demonstrates the fix working correctly, and the new regression test passes.

Phase Result
Environment Setup ✅ Dependencies installed, uv environment ready
CI Status ✅ All 29 checks passing (sdk-tests, agent-server-tests, pre-commit, coverage, etc.)
Functional Verification ✅ Bug reproduced on main, fix verified on PR branch
Functional Verification

Test 1: Reproduce the KeyError crash on main branch

Step 1 — Reproduce the bug (without the fix):

Checked out origin/main and ran the reproduction script that simulates crash recovery creating duplicate observation-like events:

cd /home/runner/work/software-agent-sdk/software-agent-sdk/pr-repo
git checkout origin/main
OPENHANDS_SUPPRESS_BANNER=1 uv run --frozen python /tmp/reproduce_issue_3045.py

Output:

==============================================================================
Event sequence (1 ActionEvent + 2 observation-like events for the SAME tool_call_id):
  [0] MessageEvent            tool_call_id=
  [1] ActionEvent             tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm
  [2] AgentErrorEvent         tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm
  [3] ObservationEvent        tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm

ToolCallMatchingProperty.enforce() => events_to_remove = set()

Calling ToolCallMatchingProperty.manipulation_indices(events)...
  -> KeyError: KeyError('toolu_01WK8m53DJP6KkuqVCBSbYzm')  (BUG present)

Building View.from_events() and accessing view.manipulation_indices...
  view len = 4
  -> KeyError: KeyError('toolu_01WK8m53DJP6KkuqVCBSbYzm')  (BUG present)

Interpretation: This confirms the bug exists. The scenario has 1 ActionEvent followed by 2 observation-like events (AgentErrorEvent from crash recovery + late ObservationEvent) with the same tool_call_id. The old enforce() returns events_to_remove = set() (fails to detect the duplicate), and both manipulation_indices() calls crash with KeyError when trying to remove the same tool_call_id twice from the pending set.

Step 2 — Apply the PR's changes:

Checked out the PR branch:

git checkout vasco/crash-recovery-can-create-duplicate-tool-results-and-crash-condensation-with-KeyError

Step 3 — Re-run with the fix in place:

Ran the same reproduction script:

OPENHANDS_SUPPRESS_BANNER=1 uv run --frozen python /tmp/reproduce_issue_3045.py

Output:

==============================================================================
Event sequence (1 ActionEvent + 2 observation-like events for the SAME tool_call_id):
  [0] MessageEvent            tool_call_id=
  [1] ActionEvent             tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm
  [2] AgentErrorEvent         tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm
  [3] ObservationEvent        tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm

ToolCallMatchingProperty.enforce() => events_to_remove = {'obs-late'}

Calling ToolCallMatchingProperty.manipulation_indices(events)...
{"message": "Duplicate observation-like event for tool_call_id=toolu_01WK8m53DJP6KkuqVCBSbYzm"}
  -> no KeyError (BUG fixed)

Building View.from_events() and accessing view.manipulation_indices...
{"message": "Property <class 'openhands.sdk.context.view.properties.tool_call_matching.ToolCallMatchingProperty'> enforced, 1 events dropped."}
  view len = 3
  -> no KeyError (BUG fixed)

Interpretation: The fix works correctly. Now enforce() detects the duplicate and returns events_to_remove = {'obs-late'}, preventing it from reaching manipulation_indices(). The defensive discard() in manipulation_indices() logs a warning for any duplicate that might slip through but does not crash. The View correctly drops 1 event (view len = 3 instead of 4), and no KeyError is raised.


Test 2: Verify regression test passes

Ran the new test case that covers this scenario:

OPENHANDS_SUPPRESS_BANNER=1 uv run --frozen pytest tests/sdk/context/view/properties/test_tool_call_matching.py -k "duplicate" -v

Output:

tests/sdk/context/view/properties/test_tool_call_matching.py::TestToolCallMatchingPropertyEnforcement::test_duplicate_observation_like_events_for_same_tool_call_id PASSED [100%]

======================== 1 passed, 14 deselected, 5 warnings in 0.05s ========================

The regression test passes, confirming the fix handles the crash recovery duplicate scenario correctly.


Test 3: Verify no regressions in existing tests

Ran the entire tool_call_matching test suite:

OPENHANDS_SUPPRESS_BANNER=1 uv run --frozen pytest tests/sdk/context/view/properties/test_tool_call_matching.py -v

Output:

======================== 15 passed, 5 warnings in 0.72s ========================

All 15 tests pass (including 10 existing tests for enforcement and manipulation indices), confirming no regressions.

Issues Found

None.

@csmith49 csmith49 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend we turn this detection into a separate property implementation. Will help keep the duplicate item dropping logic easy-to-find (in case we need to change if we take first or last instance), and if we put it first in the property list it guarantees it's true for all downstream .enforce() calls.

One note there is that there's no way to ensure this property via manipulation indices, but that system is built with the assumption that the input view is already well-formed. For this property we can just return a complete set of indices and throw a warning if somehow the duplicate surfaces at that point.

@VascoSch92 VascoSch92 requested a review from all-hands-bot May 4, 2026 17:29
…ol-results-and-crash-condensation-with-KeyError

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean bug fix that addresses previous review feedback.

Previous review concerns fully resolved:

  • ✅ Separate property implementation (ObservationUniquenessProperty) created
  • ✅ Property correctly ordered first in property list to run before ToolCallMatchingProperty
  • ✅ Good defensive programming: manipulation_indices() logs warnings instead of crashing

The two-layer defense (enforce drops duplicates + manipulation_indices warns) is pragmatic and handles the crash recovery scenario cleanly.

[RISK ASSESSMENT]
⚠️ Risk Assessment: 🟢 LOW

Isolated bug fix for duplicate observations during crash recovery. Changes are defensive and well-tested. No impact on normal operation (non-duplicate case).

VERDICT:
Worth merging

@VascoSch92 VascoSch92 requested a review from csmith49 May 4, 2026 17:32

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

The PR successfully prevents KeyError crashes when duplicate observation-like events exist for the same tool_call_id during crash recovery scenarios.

Does this PR achieve its stated goal?

Yes. The PR fixes issue #3045 by preventing KeyError during condensation when crash recovery creates duplicate observation-like events (e.g., an AgentErrorEvent + a late ObservationEvent with the same tool_call_id). I verified this end-to-end by reproducing the exact crash scenario on main and confirming it's fixed on this PR branch. The solution is architecturally sound: a new ObservationUniquenessProperty runs before ToolCallMatchingProperty and removes duplicates, preventing them from reaching the strict pairing logic that would crash.

Phase Result
Environment Setup ✅ Built successfully with make build
CI Status ✅ All critical checks passing (sdk-tests, pre-commit, Python API, cross-tests)
Functional Verification ✅ Crash scenario reproduced on main, confirmed fixed on PR branch
Functional Verification

Test 1: Reproduce the KeyError on main branch

Step 1 — Establish baseline (main branch without fix):

Checked out main branch and created a reproduction scenario with duplicate observation-like events:

  • ActionEvent with tool_call_id="call_1"
  • AgentErrorEvent (crash recovery) with tool_call_id="call_1"
  • Late ObservationEvent with tool_call_id="call_1"

Ran ToolCallMatchingProperty.manipulation_indices() directly on these events:

Testing scenario with duplicate observation-like events on main branch...
  Action: action_1 (tool_call_id=call_1)
  AgentError: 6f62c620-de8a-4f7d-b8fd-f1703d7111dd (tool_call_id=call_1)
  Late Observation: obs_late (tool_call_id=call_1)

Calling ToolCallMatchingProperty.manipulation_indices()...
  ✓ KeyError occurred as expected: 'call_1'
  This is the bug that the PR fixes!

This confirms the bug exists on main: the second observation-like event tries to remove() the same tool_call_id from the pending set, causing a KeyError.

Step 2 — Apply the PR's changes:

Checked out PR branch vasco/crash-recovery-can-create-duplicate-tool-results-and-crash-condensation-with-KeyError

Step 3 — Re-run with the fix in place:

Ran the same scenario through the property pipeline (ObservationUniquenessProperty → ToolCallMatchingProperty):

Testing ObservationUniquenessProperty → ToolCallMatchingProperty pipeline...
Initial events: ['action_1', '0a4047c9-5cb5-4430-a094-655b9553030a', 'obs_late']

Step 1: Applying ObservationUniquenessProperty.enforce()...
  Events to remove: {'obs_late'}
  Remaining events: ['action_1', '0a4047c9-5cb5-4430-a094-655b9553030a']

Step 2: Applying ToolCallMatchingProperty...
  Calling enforce()...
  ✓ enforce() succeeded, events to remove: set()
  Calling manipulation_indices()...
  ✓ manipulation_indices() succeeded: ManipulationIndices({0, 2})

✓ Pipeline completed without KeyError!
Final events: ['action_1', '0a4047c9-5cb5-4430-a094-655b9553030a']

This shows the fix works: ObservationUniquenessProperty removes the duplicate observation before ToolCallMatchingProperty sees it, preventing the KeyError.

Test 2: Unit tests verification

Ran the new test suite:

uv run pytest tests/sdk/context/view/properties/test_observation_uniqueness.py -v

All 4 tests passed:

  • test_enforce_drops_late_observation_after_agent_error
  • test_enforce_no_duplicates_returns_empty
  • test_manipulation_indices_returns_complete_for_well_formed_view
  • test_manipulation_indices_warns_but_does_not_crash_on_duplicates

Test 3: Property registration order

Verified that ObservationUniquenessProperty is registered before ToolCallMatchingProperty in ALL_PROPERTIES:

ALL_PROPERTIES: list[ViewPropertyBase] = [
    ObservationUniquenessProperty(),  # index 0
    BatchAtomicityProperty(),          # index 1
    ToolCallMatchingProperty(),        # index 2
    ToolLoopAtomicityProperty(),       # index 3
]

This order ensures duplicates are removed before the pairing logic runs.

Issues Found

None.

@VascoSch92 VascoSch92 merged commit 0feb33b into main May 5, 2026
41 checks passed
@VascoSch92 VascoSch92 deleted the vasco/crash-recovery-can-create-duplicate-tool-results-and-crash-condensation-with-KeyError branch May 5, 2026 06:42
StressTestor pushed a commit to StressTestor/software-agent-sdk that referenced this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants