Skip to content

feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps#670

Merged
drewdrewthis merged 22 commits into
mainfrom
fix/issue666-voice-modality-per-role
Jun 18, 2026
Merged

feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps#670
drewdrewthis merged 22 commits into
mainfrom
fix/issue666-voice-modality-per-role

Conversation

@drewdrewthis

@drewdrewthis drewdrewthis commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Why

Today the voice user simulator always strips audio to text (STT bridge) with no capability check, and the judge guesses audio support from a hardcoded model-name substring list that misses gpt-audio-mini. Silent degradation is the root cause of #664. This PR ships the observable spine: per-role modality declaration with declaration-first resolution, loud two-phase validation, and OTEL span stamps so a degraded run is always visible.

Closes #666

What changed

  • modality_resolver.py — new core module — declaration beats litellm advisory; mismatch emits a WARNING rather than silently overriding; unknown declarations raise ModalityNegotiationError.
  • UserSimulatorAgent and JudgeAgent wired to resolvermodality="audio-in" | "text" | "stt-bridge" accepted as public param (AC0); effective tier resolved once per role at scenario init; simulator strips or keeps audio parts accordingly; judge delegates audio-capability detection to resolver instead of hardcoded substring list.
  • Two-phase validation — statically-impossible combos (e.g. audio-in + non-streaming adapter) raise at ScenarioExecutor init; live-transport mismatches raise at first connect(), before the first turn (AC6–AC8b).
  • OTEL span stampsscenario.modality.<role>.resolved and scenario.modality.<role>.tier written per-role into the active span at _new_turn() (AC5, AC5b).

How it works

scenario/voice/
  modality_resolver.py   ← resolve_modality(declaration, model_id) → (ModalityTier, warnings)
  adapters/_stub.py      ← PendingTransportError raised by connect() on live mismatch
  capabilities.py        ← AdapterCapabilities.input_formats already existed; validated here

scenario/
  user_simulator_agent.py  ← calls resolve_modality; _strip_audio_content gated on tier
  judge_agent.py           ← effective_include_audio delegates to resolve_modality (AC3a–AC3c)
  scenario_executor.py     ← static validation at __init__; OTEL stamps at _new_turn()

Resolution rules:

  1. No declaration → litellm advisory is truth (no warning).
  2. Declaration matches advisory → use declared tier (no warning).
  3. Declaration disagrees with advisory → use declared tier, emit WARNING (AC4a, AC4b).

Test plan

cd python
uv run pytest tests/voice/test_modality_resolver.py tests/voice/test_modality_stamps.py \
  tests/voice/test_modality_validation.py tests/test_public_modality_api.py \
  tests/test_judge_agent.py tests/test_user_simulator_agent.py -q
# 59 passed (all AC-relevant tests)

Key new test files:

  • tests/voice/test_modality_resolver.py — AC4a, AC4b
  • tests/voice/test_modality_validation.py — AC6–AC8b
  • tests/voice/test_modality_stamps.py — AC5, AC5b (tier + resolved keys)
  • tests/test_judge_agent.py::test_ac9_transcribe_segments_invoked_for_text_judge — AC9 spy
  • tests/test_judge_agent.py::test_ac5b_stt_bridge_judge_invokes_transcribe_segments — AC5b spy

Regression: test_gemini_is_detected_as_audio_capable intentionally removed — the resolver now delegates to litellm advisory rather than an internal substring list; the equivalent behavior is covered by TestNoDeclaration.test_no_declaration_advisory_true_returns_audio_in.

How I can prove I was successful

Show-it-working demo (actual run output — 2026-06-15)

Script: uv run python /tmp/demo_voice_modality.py in the worktree venv.
Exercises: per-role resolution across 6 combos, static impossible-combo exception, OTEL span attrs via _new_turn().

----------------------------------------------------------------------
  1. Per-role modality resolution across role/model combos
----------------------------------------------------------------------
[OK  ]  simulator, gpt-audio-mini, no decl
         declaration=None  model='gpt-audio-mini'  advisory=True -> tier='audio-in'
[OK  ]  simulator, gpt-4.1-mini, no decl
         declaration=None  model='gpt-4.1-mini'  advisory=False -> tier='text'
[OK  ]  judge, gpt-4o, no decl (advisory False!)
         declaration=None  model='gpt-4o'  advisory=False -> tier='text'
[WARN]  declared audio-in on advisory-text model
         declaration='audio-in'  model='gpt-4.1-mini'  advisory=False -> tier='audio-in'
         WARNING: Model 'gpt-4.1-mini' declared modality 'audio-in' but litellm reports it
         does NOT support audio input. The declared modality 'audio-in' will be used.
[WARN]  declared text on advisory-audio model
         declaration='text'  model='gpt-audio-mini'  advisory=True -> tier='text'
         WARNING: Model 'gpt-audio-mini' declared modality 'text' but litellm reports it
         DOES support audio input. The declared modality 'text' will be used.
[OK  ]  declared stt-bridge explicit
         declaration='stt-bridge'  model='gpt-4o'  advisory=False -> tier='stt-bridge'

----------------------------------------------------------------------
  2. validate_modality_setup — mulaw/8000-only adapter raises at setup (AC6)
----------------------------------------------------------------------
  ModalityNegotiationError raised ✓
  contains 'audio-in': True  |  contains 'mulaw': True
  full msg: Declared modality 'audio-in' is incompatible with adapter 'TwilioAdapter':
  input formats ['mulaw/8000'] contain no pcm16 path (conflicting capability: 'mulaw/8000').
  No resample path exists.

----------------------------------------------------------------------
  3. OTEL span attributes stamped at _new_turn (AC5) — both .resolved and .tier
----------------------------------------------------------------------
  [audio-in run]
  modality attrs: {'scenario.modality.simulator.resolved': 'audio-in',
                   'scenario.modality.simulator.tier': 'audio-in',
                   'scenario.modality.judge.resolved': 'audio-in',
                   'scenario.modality.judge.tier': 'audio-in'}

  [degraded run (simulator=audio-in, judge=stt-bridge)]
  modality attrs: {'scenario.modality.simulator.resolved': 'audio-in',
                   'scenario.modality.simulator.tier': 'audio-in',
                   'scenario.modality.judge.resolved': 'stt-bridge',
                   'scenario.modality.judge.tier': 'stt-bridge'}

  AC5 checks:
    audio run  simulator.tier = 'audio-in'  ✓
    audio run  judge.tier     = 'audio-in'  ✓
    degraded   simulator.tier = 'audio-in'  ✓
    degraded   judge.tier     = 'stt-bridge'  ✓
    .resolved keys present: True (simulator) / True (judge)  ✓
    tiers differ between roles in degraded run: ✓

Acceptance Criteria Evidence Table

# Criterion Evidence Status
AC0 modality= public param on both agents user_simulator_agent.py:162, judge_agent.py:251; 4 test_ac0_* tests pass ✅ PASS
AC1 audio-capable simulator retains audio parts test_audio_in_simulator_retains_audio_parts pass; demo: gpt-audio-minitier='audio-in' ✅ PASS
AC2 text simulator strips audio (regression) test_text_simulator_strips_audio_with_placeholders pass ✅ PASS
AC3a gpt-audio-mini judge receives audio test_gpt_audio_mini_judge_receives_audio pass ✅ PASS
AC3b gpt-4o judge (no decl) → transcript path test_gpt4o_judge_no_declaration_takes_transcript_path pass; demo: gpt-4o advisory=False → tier='text' ✅ PASS
AC3c explicit include_audio=False wins test_explicit_include_audio_false_wins pass ✅ PASS
AC4a declared audio-in on advisory-text → declared tier + WARNING Demo: WARNING text captured, tier='audio-in'; TestAC4a (5 tests) pass ✅ PASS
AC4b declared text on advisory-audio → declared tier + WARNING Demo: WARNING text captured, tier='text'; TestAC4b (4 tests) pass ✅ PASS
AC5 OTEL scenario.modality.<role>.resolved AND .tier per role Demo: both keys present in span attrs for audio-in and degraded runs; test_ac5_* tests assert .resolved + .tier; scenario_executor.py:462-464 ✅ PASS
AC5b STT-bridge explicit outcome; transcribe_segments spy resolve_modality(decl='stt-bridge')STT_BRIDGE ✅; span .tier='stt-bridge' ✅; test_ac5b_stt_bridge_judge_invokes_transcribe_segments spies transcribe_segments(recording) ✅ PASS
AC6 static impossible combo raises with both tokens Demo: ModalityNegotiationError message contains 'audio-in' + 'mulaw'; TestAC6StaticValidation (7 tests) pass ✅ PASS
AC7 live-transport mismatch → ModalityNegotiationError before first turn test_ac7_live_transport_failure_raises_before_first_turn + test_ac7_error_is_modality_negotiation_error_not_pending_transport pass ✅ PASS
AC8a interrupt(after_words=N) gate fires at connect, not step-exec test_ac8a_interrupt_after_words_raises_at_connect_not_step_execution pass ✅ PASS
AC8b dtmf gate unchanged test_ac8b_dtmf_gate_unchanged pass ✅ PASS
AC9 transcribe_segments runs over VoiceRecording for text judge (regression) test_ac9_transcribe_segments_invoked_for_text_judge spies transcribe_segments called with VoiceRecording for gpt-4o (TEXT tier) ✅; judge_agent.py:474 ✅ PASS
AC10 capability matrix byte-identical (no new field) uv run python scripts/gen_capability_matrix.py → "No changes: capability-matrix.mdx" ✅ PASS

Verdict: 16/16 PASS — ALL PASS

Ubuntu and others added 13 commits June 15, 2026 00:28
…tion-first + litellm advisory

Implements AC4a, AC4b scenarios from specs/voice-modality-negotiation.feature:
- resolve_modality(): declaration wins, litellm advisory warns on mismatch, both directions
- ModalityNegotiationError: shared exception type for setup/connect validation (AC6, AC7)
- ModalityTier: audio-in, stt-bridge, text

Foundational module; all other bundles depend on this.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Audio-capable models (advisory audio-in) now receive raw audio parts;
text-only models strip audio exactly as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olver — AC3a, AC3b, AC3c, AC9

gpt-audio-mini now correctly resolves to audio-in (was missed by old list).
gpt-4o now correctly takes the transcript path (litellm advisory=False).
include_audio explicit override still wins (AC3c preserved).
transcribe_segments unchanged for text-modality judges (AC9 regression passes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Static impossible combo (audio-in × mulaw/8000) raises ModalityNegotiationError at setup.
Live transport failure at first-connect re-raises as ModalityNegotiationError with requirement token.
interrupt(after_words=N) capability gate moved to first-connect (before first turn).
dtmf gate unchanged (AC8b regression).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
UserSimulatorAgent(modality="audio-in") and JudgeAgent(modality="text") now accepted.
Declaration reaches resolve_modality() as the explicit declaration arg.
Documented in docstrings (user-facing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…utes — AC5, AC5b

scenario.modality.<role>.resolved and scenario.modality.<role>.tier stamped on
root span at the start of each turn. Populated in run() from resolve_modality()
for UserSimulatorAgent (simulator) and JudgeAgent (judge).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y field added — AC10b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 825b07de-c2f8-4227-8dc2-85ddfb2dfdb2

📥 Commits

Reviewing files that changed from the base of the PR and between a3a9f34 and f85c8be.

📒 Files selected for processing (2)
  • python/tests/test_judge_agent.py
  • python/tests/voice/test_modality_stamps.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/tests/test_judge_agent.py
  • python/tests/voice/test_modality_stamps.py

Walkthrough

Introduces per-role voice modality negotiation by adding a new modality_resolver.py module with ModalityTier, ModalityNegotiationError, resolve_modality, and validate_modality_setup. Both JudgeAgent and UserSimulatorAgent gain an explicit modality constructor parameter. ScenarioExecutor resolves and stamps modality tiers as OTEL attributes and enforces three-phase voice adapter validation. The interrupt() step gains a _requires_streaming_transcripts capability marker.

Changes

Voice Modality Negotiation per Role

Layer / File(s) Summary
ModalityTier contract and resolve_modality implementation
python/scenario/voice/modality_resolver.py, python/scenario/voice/__init__.py, python/tests/voice/test_modality_resolver.py
New modality_resolver.py defines ModalityTier (audio-in, stt-bridge, text), ModalityNegotiationError, _litellm_advisory, resolve_modality (declaration vs advisory with mismatch warnings), and validate_modality_setup (pcm16 requirement for audio-in). All three symbols are re-exported from voice/__init__.py. Resolver tests cover no-declaration, declaration-agreement, AC4a/AC4b mismatches, and unknown-declaration error cases.
modality parameter on JudgeAgent and UserSimulatorAgent
python/scenario/judge_agent.py, python/scenario/user_simulator_agent.py, python/tests/test_judge_agent.py, python/tests/test_public_modality_api.py, python/tests/test_user_simulator_agent.py, python/tests/voice/test_judge_audio_transcribe.py, python/tests/voice/test_judge_voice.py
Both agents accept modality: Optional[str] = None. JudgeAgent.effective_include_audio replaces the model-substring heuristic with resolve_modality, enabling audio only for AUDIO_IN tier. UserSimulatorAgent._generate_text resolves the tier and conditionally strips audio content via _strip_audio_content for non-audio-in models. Tests verify resolver delegation, audio-part retention/stripping, explicit include_audio override precedence, and multi-model behavior.
ScenarioExecutor: run resolution, OTEL stamping, and three-phase voice connect
python/scenario/scenario_executor.py, python/scenario/voice/script_steps.py, python/tests/voice/test_modality_stamps.py, python/tests/voice/test_modality_validation.py
run() resolves and stores ModalityTier per role in _modality_resolutions. _new_turn() stamps scenario.modality.<role>.tier as root span attributes. _voice_connect_all() runs three phases: static validate_modality_setup, connect with PendingTransportError wrapped as ModalityNegotiationError, and script-step streaming_transcripts capability check raising UnsupportedCapabilityError. interrupt(after_words=N) marks steps with _requires_streaming_transcripts=True. Tests cover AC5 OTEL stamping, AC6 static negotiation, AC7 live connect failures, and AC8a/AC8b streaming transcript gating.
Capability matrix regression test and spec tag
python/tests/test_capability_matrix.py, specs/voice-agents.feature
Adds a byte-identical regression test for the capability-matrix MDX generator (runs gen_capability_matrix.py via uv run and asserts file remains unchanged). The voice-agents.feature scenario gains the @unit tag.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(70, 130, 180, 0.5)
        Note over ScenarioExecutor: run() — modality resolution
    end
    ScenarioExecutor->>resolve_modality: declaration=simulator.modality, model_id
    resolve_modality-->>ScenarioExecutor: (ModalityTier, warnings)
    ScenarioExecutor->>resolve_modality: declaration=judge.modality, model_id
    resolve_modality-->>ScenarioExecutor: (ModalityTier, warnings)
    ScenarioExecutor->>ScenarioExecutor: store _modality_resolutions

    rect rgba(60, 179, 113, 0.5)
        Note over ScenarioExecutor: _voice_connect_all() — three-phase validation
    end
    ScenarioExecutor->>validate_modality_setup: tier + adapter.capabilities.input_formats
    validate_modality_setup-->>ScenarioExecutor: OK or ModalityNegotiationError

    ScenarioExecutor->>VoiceAgentAdapter: connect()
    VoiceAgentAdapter-->>ScenarioExecutor: OK or PendingTransportError → ModalityNegotiationError

    ScenarioExecutor->>ScenarioExecutor: scan script steps for _requires_streaming_transcripts
    ScenarioExecutor->>VoiceAgentAdapter: check capabilities.streaming_transcripts
    VoiceAgentAdapter-->>ScenarioExecutor: missing → UnsupportedCapabilityError

    rect rgba(178, 102, 255, 0.5)
        Note over ScenarioExecutor: _new_turn() — OTEL stamping
    end
    ScenarioExecutor->>LangwatchRootSpan: set_attributes(scenario.modality.<role>.tier)
Loading

Possibly Related PRs

  • langwatch/scenario#665: Main PR's ScenarioExecutor now gates interrupt steps on adapters' capabilities.streaming_transcripts, and retrieved PR changes PipecatAgentAdapter.capabilities to advertise streaming_transcripts=False, directly impacting how that gate behaves.

Poem

🐇 A model name substring? How passé!
Now ModalityTier leads the way —
audio-in, stt-bridge, or text,
Each role resolved, no guessing what's next.
Three phases validate before the first turn,
And OTEL spans stamp every tier in turn! 🎙️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps' accurately summarizes the main architectural change: adding per-role modality negotiation with declaration-first resolution, two-phase validation, and OpenTelemetry stamping as specified in the PR objectives.
Description check ✅ Passed The description comprehensively explains the rationale (silent degradation), what changed (modality_resolver core, per-agent wiring, two-phase validation, OTEL stamps), and how it works with working examples and acceptance criteria evidence. It directly relates to all changes in the changeset.
Linked Issues check ✅ Passed The PR fully addresses all six key objectives from #666: (1) allows audio-capable models to receive audio directly via tier-conditional stripping [AC1–AC2], (2) adds public modality parameter [AC0], (3) implements two-phase validation [AC6–AC8b], (4) replaces hardcoded judge detection with resolver [AC3a–AC3c], (5) stamps modality into OTEL spans [AC5, AC5b], and (6) adopts declaration-first architecture with litellm-as-advisory [AC4a–AC4b]. All 16 acceptance criteria pass.
Out of Scope Changes check ✅ Passed All changes are within scope of per-role modality negotiation: modality_resolver core logic, agent wiring, scenario executor validation/stamping, two-phase gating, test coverage for resolver/validation/stamping/judge behavior. Feature flagging (voice/init.py), test regression (script_steps.py marker), and capability matrix regression test are all necessary supporting changes. No unrelated refactoring or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue666-voice-modality-per-role

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@drewdrewthis drewdrewthis added the in-progress Workflow: in-progress label Jun 15, 2026
Comment thread python/tests/voice/test_modality_stamps.py Fixed
Comment thread python/tests/voice/test_modality_validation.py Fixed
Ubuntu and others added 4 commits June 15, 2026 00:54
Capture and emit warnings from resolve_modality() calls at three
call sites (simulator setup, judge setup, voice agent setup) instead
of silently discarding them via underscore binding. The resolver
contract requires all warnings be logged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing contract failure

The "Capability matrix is rendered into adapter docs" scenario in
voice-agents.feature was missing a required @unit/@integration/@e2e tag,
causing test_feature_file_contract tests to fail on main and on this PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Audio content dict literals in AgentInput tests are valid at runtime but
pyright can't narrow them to ChatCompletionMessageParam. Suppressed with
# type: ignore[arg-type] — same pattern already used in this file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewdrewthis drewdrewthis marked this pull request as ready for review June 15, 2026 08:20
@drewdrewthis drewdrewthis self-assigned this Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
python/scenario/voice/modality_resolver.py (1)

62-65: 💤 Low value

Consider using raise ... from None for cleaner exception context.

When converting ValueError to ModalityNegotiationError, the original ValueError context isn't useful to users—they only need to know the declaration is invalid and see the valid values. Suppressing the chain makes tracebacks cleaner.

♻️ Proposed fix
     try:
         declared_tier = ModalityTier(declaration)
     except ValueError:
-        raise ModalityNegotiationError(
+        raise ModalityNegotiationError(
             f"Unknown modality declaration {declaration!r}; valid values: "
             + ", ".join(t.value for t in ModalityTier)
-        )
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/scenario/voice/modality_resolver.py` around lines 62 - 65, The
ModalityNegotiationError being raised in the modality validation block does not
suppress the exception context chain, which clutters the traceback with
unnecessary information about the original ValueError. Add `from None` to the
raise statement for ModalityNegotiationError to suppress the exception context
chain and provide a cleaner traceback to users that focuses only on the helpful
information about the invalid declaration and valid values.
python/scenario/scenario_executor.py (1)

462-464: ⚡ Quick win

Redundant OTEL attribute stamping.

Lines 463–464 both set scenario.modality.{role}.resolved and scenario.modality.{role}.tier to the same tier_value. This duplication increases trace size without adding information.

Consider setting only one attribute per role, or clarify in a comment why both are needed.

♻️ Suggested simplification
 attrs = {
     "langwatch.origin": "simulation",
     "scenario.run_id": self._scenario_run_id,
 }
 for role, tier_value in getattr(self, '_modality_resolutions', {}).items():
-    attrs[f"scenario.modality.{role}.resolved"] = tier_value
     attrs[f"scenario.modality.{role}.tier"] = tier_value
 self._trace.root_span.set_attributes(attrs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/scenario/scenario_executor.py` around lines 462 - 464, The OTEL
attribute setting in the loop over self._modality_resolutions is redundant: both
lines 463 and 464 assign the same tier_value to different attribute keys for the
same role. Remove the duplicate line that sets the scenario.modality.{role}.tier
attribute (keeping only scenario.modality.{role}.resolved), or if both
attributes are actually needed for different purposes, add a clarifying comment
explaining why the duplication is intentional. The loop structure iterating over
the modality resolutions should be preserved, but eliminate the unnecessary
second attrs assignment per role iteration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/scenario/scenario_executor.py`:
- Around line 772-776: The exception handler in the connect() error path is
catching all exceptions with `except Exception as e:` and wrapping them as
ModalityNegotiationError, which masks the true nature of non-modality failures
like network timeouts or authentication errors. According to AC7, only
PendingTransportError should be caught and wrapped as ModalityNegotiationError.
Change the except clause to specifically catch only PendingTransportError
instead of the broad Exception class, allowing other exceptions to propagate
unchanged so their true nature is visible to the caller.

---

Nitpick comments:
In `@python/scenario/scenario_executor.py`:
- Around line 462-464: The OTEL attribute setting in the loop over
self._modality_resolutions is redundant: both lines 463 and 464 assign the same
tier_value to different attribute keys for the same role. Remove the duplicate
line that sets the scenario.modality.{role}.tier attribute (keeping only
scenario.modality.{role}.resolved), or if both attributes are actually needed
for different purposes, add a clarifying comment explaining why the duplication
is intentional. The loop structure iterating over the modality resolutions
should be preserved, but eliminate the unnecessary second attrs assignment per
role iteration.

In `@python/scenario/voice/modality_resolver.py`:
- Around line 62-65: The ModalityNegotiationError being raised in the modality
validation block does not suppress the exception context chain, which clutters
the traceback with unnecessary information about the original ValueError. Add
`from None` to the raise statement for ModalityNegotiationError to suppress the
exception context chain and provide a cleaner traceback to users that focuses
only on the helpful information about the invalid declaration and valid values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2ca81ac-66b3-42dc-be22-29af72edfea2

📥 Commits

Reviewing files that changed from the base of the PR and between 21226c0 and 02eca46.

📒 Files selected for processing (16)
  • python/scenario/judge_agent.py
  • python/scenario/scenario_executor.py
  • python/scenario/user_simulator_agent.py
  • python/scenario/voice/__init__.py
  • python/scenario/voice/modality_resolver.py
  • python/scenario/voice/script_steps.py
  • python/tests/test_capability_matrix.py
  • python/tests/test_judge_agent.py
  • python/tests/test_public_modality_api.py
  • python/tests/test_user_simulator_agent.py
  • python/tests/voice/test_judge_audio_transcribe.py
  • python/tests/voice/test_judge_voice.py
  • python/tests/voice/test_modality_resolver.py
  • python/tests/voice/test_modality_stamps.py
  • python/tests/voice/test_modality_validation.py
  • specs/voice-agents.feature

Comment thread python/scenario/scenario_executor.py
Broad `except Exception` was masking network timeouts, auth errors, and
bugs as ModalityNegotiationError. Only PendingTransportError signals a
live-transport modality mismatch per AC7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewdrewthis drewdrewthis added the prove-it-clean All ACs verified by /prove-it at this HEAD label Jun 15, 2026
Comment thread python/scenario/scenario_executor.py
Comment thread python/tests/voice/test_modality_stamps.py
@drewdrewthis

drewdrewthis commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Review verdict: READY

Reviewed at: f85c8be31109bb375d66d4c9871639a8e09fae7c · Run: /review --since 0c84c38 (own-PR, scoped)

Scope: 1 file, 1 line Python — warnings → _warnings lint rename (Ruff RUF059). All checklist reviewers (principles, hygiene, security, test, proof) returned no concerns. Personas: no concerns. Stale-marker refresh justified by CI green on HEAD (14 pass / 0 fail, including test (3.12) and Analyze (python)).

Fixes applied this review pass

  • [proof-reviewer][hygiene][test-reviewer] test_judge_agent.py:482 pyright error — messages=[audio_message] (dict not assignable to ChatCompletionMessageParam) fixed with cast(Any, [audio_message]). Follows existing pattern at line 263. (commit 0c84c38)
  • [hygiene][test-reviewer] Duplicate VoiceRecording import in both spy test functions — unused first import removed. (same commit)
  • [coderabbitai] test_modality_stamps.py:120 unused warnings binding renamed _warnings to satisfy Ruff RUF059. (commit f85c8be)

Non-blocking (Decide / New Issue)

  • [principles] scenario_executor.py:463-464 stamps the identical tier_value into both .resolved and .tier. Intentional per feature spec ("exact keys") but undocumented — consider a one-line # emit both; .tier is legacy, .resolved is canonical comment if this is a migration step.
  • [test-reviewer] test_ac5b_stt_bridge_judge_invokes_transcribe_segments patches _litellm_advisory even with explicit modality='stt-bridge'; may be redundant. Verify the patch is load-bearing and remove if not — or track as follow-up.
  • [test-reviewer] mock_ts.assert_called_once_with(recording) sits outside the with patch() block (lines 527, 556). Functionally correct (mock records calls before teardown). Structural preference to move inside; non-blocking.
  • [proof-reviewer] AC evidence table cites ephemeral /tmp/demo_voice_modality.py script (no longer on disk). Low risk — CI test (3.12) covers the same ACs and is green on HEAD — but evidence should cite the CI job for durability.

…engthen AC5b test

- Drop `scenario.modality.{role}.resolved` stamp (was identical to `.tier`);
  keep only `scenario.modality.{role}.tier` which is the canonical key.
- Expand `test_ac5b_stt_bridge_tier_stamped_correctly` to exercise the full
  declaration → `resolve_modality()` → span-stamp path instead of bypassing
  the resolver via direct `_modality_resolutions` injection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/tests/voice/test_modality_stamps.py (1)

131-136: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Baseline test currently doesn’t exercise the “attribute missing” path it describes.

The docstring/comment says _modality_resolutions is intentionally unset, but _new_turn_with_resolutions(executor, {}) sets it to an empty dict on Line 63. This weakens coverage of the real getattr(..., {}) fallback path in _new_turn().

Suggested minimal test fix
-def _new_turn_with_resolutions(
+def _new_turn_with_resolutions(
     executor: ScenarioExecutor,
-    resolutions: dict,
+    resolutions: dict | None,
 ) -> dict:
@@
-        executor._modality_resolutions = resolutions
+        if resolutions is not None:
+            executor._modality_resolutions = resolutions
         executor._new_turn()
@@
-    captured = _new_turn_with_resolutions(executor, {})
+    captured = _new_turn_with_resolutions(executor, None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/tests/voice/test_modality_stamps.py` around lines 131 - 136, The test
is not actually exercising the attribute-missing path because the helper
function _new_turn_with_resolutions sets _modality_resolutions to an empty dict
on line 63, contradicting the test's intent. To test the true fallback behavior,
modify the test to call the actual _new_turn function directly on the executor
(without using _new_turn_with_resolutions) so that the getattr fallback path in
the production code is genuinely invoked when _modality_resolutions is not set.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/tests/voice/test_modality_stamps.py`:
- Line 118: The unpacking assignment in the resolve_modality function call
unpacks a warnings variable that is never used, which triggers Ruff linting
warnings. Rename the unused warnings binding to either _warnings or _
(underscore) to explicitly indicate that this variable is intentionally not
used. This will satisfy the linter and make the intent clear to future readers.

---

Outside diff comments:
In `@python/tests/voice/test_modality_stamps.py`:
- Around line 131-136: The test is not actually exercising the attribute-missing
path because the helper function _new_turn_with_resolutions sets
_modality_resolutions to an empty dict on line 63, contradicting the test's
intent. To test the true fallback behavior, modify the test to call the actual
_new_turn function directly on the executor (without using
_new_turn_with_resolutions) so that the getattr fallback path in the production
code is genuinely invoked when _modality_resolutions is not set.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4f24884-951a-4899-be0e-2967773c958f

📥 Commits

Reviewing files that changed from the base of the PR and between 30aeb1f and 1447c52.

📒 Files selected for processing (2)
  • python/scenario/scenario_executor.py
  • python/tests/voice/test_modality_stamps.py
💤 Files with no reviewable changes (1)
  • python/scenario/scenario_executor.py

Comment thread python/tests/voice/test_modality_stamps.py Outdated
@drewdrewthis drewdrewthis added prove-it-clean All ACs verified by /prove-it at this HEAD ai-reviewed /review was run on this PR (multi-agent: principles, hygiene, test, security) and removed prove-it-clean All ACs verified by /prove-it at this HEAD labels Jun 15, 2026
@drewdrewthis

Copy link
Copy Markdown
Collaborator Author

No description provided.

…nts spy tests (AC5/AC5b/AC9)

Three prove-it gaps closed:
- AC5: stamp scenario.modality.<role>.resolved alongside .tier in _new_turn() — feature
  spec requires both exact keys; previously only .tier was set.
- AC5b: add spy test asserting transcribe_segments is invoked with the judge's
  VoiceRecording when modality='stt-bridge', not just inferred from tier stamp.
- AC9: add spy test asserting transcribe_segments still runs for a text-modality
  gpt-4o judge after the substring-list to resolver change; confirms regression path.

All 59 AC-relevant tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewdrewthis

Copy link
Copy Markdown
Collaborator Author

No description provided.

@drewdrewthis drewdrewthis added prove-it-clean All ACs verified by /prove-it at this HEAD and removed prove-it-clean All ACs verified by /prove-it at this HEAD labels Jun 15, 2026
Ubuntu and others added 2 commits June 15, 2026 19:49
…ve duplicate imports

messages=[audio_message] failed pyright because dict[str, Unknown] is not assignable
to ChatCompletionMessageParam; cast(Any, ...) matches the existing pattern at line 263.
Also removes the unused duplicate VoiceRecording import in both spy test functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes core runtime behavior for voice modality negotiation (declaration resolution, two‑phase validation, adapter compatibility checks) and alters how the Judge and Simulator decide to send audio. It also introduces an external dependency call to litellm for capability advisory and adds validation that can raise runtime errors at setup/connect — changes that affect integrations and business logic and therefore are not low risk under the policy.

This PR requires a manual review before merging.

@drewdrewthis drewdrewthis merged commit 007a69f into main Jun 18, 2026
18 checks passed
@drewdrewthis drewdrewthis deleted the fix/issue666-voice-modality-per-role branch June 18, 2026 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-reviewed /review was run on this PR (multi-agent: principles, hygiene, test, security) in-progress Workflow: in-progress prove-it-clean All ACs verified by /prove-it at this HEAD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Voice modality should be negotiated per role (simulator/judge), not hardwired to text + STT/TTS bridge

2 participants