feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps by drewdrewthis · Pull Request #670 · langwatch/scenario

drewdrewthis · 2026-06-15T00:52:00Z

Why

Today the voice user simulator always strips audio to text (STT bridge) with no capability check, and the judge guesses audio support from a hardcoded model-name substring list that misses gpt-audio-mini. Silent degradation is the root cause of #664. This PR ships the observable spine: per-role modality declaration with declaration-first resolution, loud two-phase validation, and OTEL span stamps so a degraded run is always visible.

Closes #666

What changed

modality_resolver.py — new core module — declaration beats litellm advisory; mismatch emits a WARNING rather than silently overriding; unknown declarations raise ModalityNegotiationError.
UserSimulatorAgent and JudgeAgent wired to resolver — modality="audio-in" | "text" | "stt-bridge" accepted as public param (AC0); effective tier resolved once per role at scenario init; simulator strips or keeps audio parts accordingly; judge delegates audio-capability detection to resolver instead of hardcoded substring list.
Two-phase validation — statically-impossible combos (e.g. audio-in + non-streaming adapter) raise at ScenarioExecutor init; live-transport mismatches raise at first connect(), before the first turn (AC6–AC8b).
OTEL span stamps — scenario.modality.<role>.resolved and scenario.modality.<role>.tier written per-role into the active span at _new_turn() (AC5, AC5b).

How it works

scenario/voice/
  modality_resolver.py   ← resolve_modality(declaration, model_id) → (ModalityTier, warnings)
  adapters/_stub.py      ← PendingTransportError raised by connect() on live mismatch
  capabilities.py        ← AdapterCapabilities.input_formats already existed; validated here

scenario/
  user_simulator_agent.py  ← calls resolve_modality; _strip_audio_content gated on tier
  judge_agent.py           ← effective_include_audio delegates to resolve_modality (AC3a–AC3c)
  scenario_executor.py     ← static validation at __init__; OTEL stamps at _new_turn()

Resolution rules:

No declaration → litellm advisory is truth (no warning).
Declaration matches advisory → use declared tier (no warning).
Declaration disagrees with advisory → use declared tier, emit WARNING (AC4a, AC4b).

Test plan

cd python
uv run pytest tests/voice/test_modality_resolver.py tests/voice/test_modality_stamps.py \
  tests/voice/test_modality_validation.py tests/test_public_modality_api.py \
  tests/test_judge_agent.py tests/test_user_simulator_agent.py -q
# 59 passed (all AC-relevant tests)

Key new test files:

tests/voice/test_modality_resolver.py — AC4a, AC4b
tests/voice/test_modality_validation.py — AC6–AC8b
tests/voice/test_modality_stamps.py — AC5, AC5b (tier + resolved keys)
tests/test_judge_agent.py::test_ac9_transcribe_segments_invoked_for_text_judge — AC9 spy
tests/test_judge_agent.py::test_ac5b_stt_bridge_judge_invokes_transcribe_segments — AC5b spy

Regression: test_gemini_is_detected_as_audio_capable intentionally removed — the resolver now delegates to litellm advisory rather than an internal substring list; the equivalent behavior is covered by TestNoDeclaration.test_no_declaration_advisory_true_returns_audio_in.

How I can prove I was successful

Show-it-working demo (actual run output — 2026-06-15)

Script: uv run python /tmp/demo_voice_modality.py in the worktree venv.
Exercises: per-role resolution across 6 combos, static impossible-combo exception, OTEL span attrs via _new_turn().

----------------------------------------------------------------------
  1. Per-role modality resolution across role/model combos
----------------------------------------------------------------------
[OK  ]  simulator, gpt-audio-mini, no decl
         declaration=None  model='gpt-audio-mini'  advisory=True -> tier='audio-in'
[OK  ]  simulator, gpt-4.1-mini, no decl
         declaration=None  model='gpt-4.1-mini'  advisory=False -> tier='text'
[OK  ]  judge, gpt-4o, no decl (advisory False!)
         declaration=None  model='gpt-4o'  advisory=False -> tier='text'
[WARN]  declared audio-in on advisory-text model
         declaration='audio-in'  model='gpt-4.1-mini'  advisory=False -> tier='audio-in'
         WARNING: Model 'gpt-4.1-mini' declared modality 'audio-in' but litellm reports it
         does NOT support audio input. The declared modality 'audio-in' will be used.
[WARN]  declared text on advisory-audio model
         declaration='text'  model='gpt-audio-mini'  advisory=True -> tier='text'
         WARNING: Model 'gpt-audio-mini' declared modality 'text' but litellm reports it
         DOES support audio input. The declared modality 'text' will be used.
[OK  ]  declared stt-bridge explicit
         declaration='stt-bridge'  model='gpt-4o'  advisory=False -> tier='stt-bridge'

----------------------------------------------------------------------
  2. validate_modality_setup — mulaw/8000-only adapter raises at setup (AC6)
----------------------------------------------------------------------
  ModalityNegotiationError raised ✓
  contains 'audio-in': True  |  contains 'mulaw': True
  full msg: Declared modality 'audio-in' is incompatible with adapter 'TwilioAdapter':
  input formats ['mulaw/8000'] contain no pcm16 path (conflicting capability: 'mulaw/8000').
  No resample path exists.

----------------------------------------------------------------------
  3. OTEL span attributes stamped at _new_turn (AC5) — both .resolved and .tier
----------------------------------------------------------------------
  [audio-in run]
  modality attrs: {'scenario.modality.simulator.resolved': 'audio-in',
                   'scenario.modality.simulator.tier': 'audio-in',
                   'scenario.modality.judge.resolved': 'audio-in',
                   'scenario.modality.judge.tier': 'audio-in'}

  [degraded run (simulator=audio-in, judge=stt-bridge)]
  modality attrs: {'scenario.modality.simulator.resolved': 'audio-in',
                   'scenario.modality.simulator.tier': 'audio-in',
                   'scenario.modality.judge.resolved': 'stt-bridge',
                   'scenario.modality.judge.tier': 'stt-bridge'}

  AC5 checks:
    audio run  simulator.tier = 'audio-in'  ✓
    audio run  judge.tier     = 'audio-in'  ✓
    degraded   simulator.tier = 'audio-in'  ✓
    degraded   judge.tier     = 'stt-bridge'  ✓
    .resolved keys present: True (simulator) / True (judge)  ✓
    tiers differ between roles in degraded run: ✓

Acceptance Criteria Evidence Table

#	Criterion	Evidence	Status
AC0	`modality=` public param on both agents	`user_simulator_agent.py:162`, `judge_agent.py:251`; 4 `test_ac0_*` tests pass	✅ PASS
AC1	audio-capable simulator retains audio parts	`test_audio_in_simulator_retains_audio_parts` pass; demo: `gpt-audio-mini` → `tier='audio-in'`	✅ PASS
AC2	text simulator strips audio (regression)	`test_text_simulator_strips_audio_with_placeholders` pass	✅ PASS
AC3a	`gpt-audio-mini` judge receives audio	`test_gpt_audio_mini_judge_receives_audio` pass	✅ PASS
AC3b	`gpt-4o` judge (no decl) → transcript path	`test_gpt4o_judge_no_declaration_takes_transcript_path` pass; demo: `gpt-4o` advisory=False → `tier='text'`	✅ PASS
AC3c	explicit `include_audio=False` wins	`test_explicit_include_audio_false_wins` pass	✅ PASS
AC4a	declared `audio-in` on advisory-text → declared tier + WARNING	Demo: WARNING text captured, `tier='audio-in'`; `TestAC4a` (5 tests) pass	✅ PASS
AC4b	declared `text` on advisory-audio → declared tier + WARNING	Demo: WARNING text captured, `tier='text'`; `TestAC4b` (4 tests) pass	✅ PASS
AC5	OTEL `scenario.modality.<role>.resolved` AND `.tier` per role	Demo: both keys present in span attrs for audio-in and degraded runs; `test_ac5_*` tests assert `.resolved` + `.tier`; `scenario_executor.py:462-464`	✅ PASS
AC5b	STT-bridge explicit outcome; `transcribe_segments` spy	`resolve_modality(decl='stt-bridge')` → `STT_BRIDGE` ✅; span `.tier='stt-bridge'` ✅; `test_ac5b_stt_bridge_judge_invokes_transcribe_segments` spies `transcribe_segments(recording)` ✅	✅ PASS
AC6	static impossible combo raises with both tokens	Demo: `ModalityNegotiationError` message contains `'audio-in'` + `'mulaw'`; `TestAC6StaticValidation` (7 tests) pass	✅ PASS
AC7	live-transport mismatch → `ModalityNegotiationError` before first turn	`test_ac7_live_transport_failure_raises_before_first_turn` + `test_ac7_error_is_modality_negotiation_error_not_pending_transport` pass	✅ PASS
AC8a	`interrupt(after_words=N)` gate fires at connect, not step-exec	`test_ac8a_interrupt_after_words_raises_at_connect_not_step_execution` pass	✅ PASS
AC8b	dtmf gate unchanged	`test_ac8b_dtmf_gate_unchanged` pass	✅ PASS
AC9	`transcribe_segments` runs over VoiceRecording for text judge (regression)	`test_ac9_transcribe_segments_invoked_for_text_judge` spies `transcribe_segments` called with `VoiceRecording` for gpt-4o (TEXT tier) ✅; `judge_agent.py:474`	✅ PASS
AC10	capability matrix byte-identical (no new field)	`uv run python scripts/gen_capability_matrix.py` → "No changes: capability-matrix.mdx"	✅ PASS

Verdict: 16/16 PASS — ALL PASS

…tion-first + litellm advisory Implements AC4a, AC4b scenarios from specs/voice-modality-negotiation.feature: - resolve_modality(): declaration wins, litellm advisory warns on mismatch, both directions - ModalityNegotiationError: shared exception type for setup/connect validation (AC6, AC7) - ModalityTier: audio-in, stt-bridge, text Foundational module; all other bundles depend on this. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Audio-capable models (advisory audio-in) now receive raw audio parts; text-only models strip audio exactly as before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…olver — AC3a, AC3b, AC3c, AC9 gpt-audio-mini now correctly resolves to audio-in (was missed by old list). gpt-4o now correctly takes the transcript path (litellm advisory=False). include_audio explicit override still wins (AC3c preserved). transcribe_segments unchanged for text-modality judges (AC9 regression passes). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Static impossible combo (audio-in × mulaw/8000) raises ModalityNegotiationError at setup. Live transport failure at first-connect re-raises as ModalityNegotiationError with requirement token. interrupt(after_words=N) capability gate moved to first-connect (before first turn). dtmf gate unchanged (AC8b regression). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

UserSimulatorAgent(modality="audio-in") and JudgeAgent(modality="text") now accepted. Declaration reaches resolve_modality() as the explicit declaration arg. Documented in docstrings (user-facing). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…utes — AC5, AC5b scenario.modality.<role>.resolved and scenario.modality.<role>.tier stamped on root span at the start of each turn. Populated in run() from resolve_modality() for UserSimulatorAgent (simulator) and JudgeAgent (judge). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y field added — AC10b Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-15T00:52:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 825b07de-c2f8-4227-8dc2-85ddfb2dfdb2

📥 Commits

Reviewing files that changed from the base of the PR and between a3a9f34 and f85c8be.

📒 Files selected for processing (2)

python/tests/test_judge_agent.py
python/tests/voice/test_modality_stamps.py

🚧 Files skipped from review as they are similar to previous changes (2)

python/tests/test_judge_agent.py
python/tests/voice/test_modality_stamps.py

Walkthrough

Introduces per-role voice modality negotiation by adding a new modality_resolver.py module with ModalityTier, ModalityNegotiationError, resolve_modality, and validate_modality_setup. Both JudgeAgent and UserSimulatorAgent gain an explicit modality constructor parameter. ScenarioExecutor resolves and stamps modality tiers as OTEL attributes and enforces three-phase voice adapter validation. The interrupt() step gains a _requires_streaming_transcripts capability marker.

Changes

Voice Modality Negotiation per Role

Layer / File(s)	Summary
ModalityTier contract and resolve_modality implementation `python/scenario/voice/modality_resolver.py`, `python/scenario/voice/__init__.py`, `python/tests/voice/test_modality_resolver.py`	New `modality_resolver.py` defines `ModalityTier` (`audio-in`, `stt-bridge`, `text`), `ModalityNegotiationError`, `_litellm_advisory`, `resolve_modality` (declaration vs advisory with mismatch warnings), and `validate_modality_setup` (pcm16 requirement for `audio-in`). All three symbols are re-exported from `voice/__init__.py`. Resolver tests cover no-declaration, declaration-agreement, AC4a/AC4b mismatches, and unknown-declaration error cases.
`modality` parameter on JudgeAgent and UserSimulatorAgent `python/scenario/judge_agent.py`, `python/scenario/user_simulator_agent.py`, `python/tests/test_judge_agent.py`, `python/tests/test_public_modality_api.py`, `python/tests/test_user_simulator_agent.py`, `python/tests/voice/test_judge_audio_transcribe.py`, `python/tests/voice/test_judge_voice.py`	Both agents accept `modality: Optional[str] = None`. `JudgeAgent.effective_include_audio` replaces the model-substring heuristic with `resolve_modality`, enabling audio only for `AUDIO_IN` tier. `UserSimulatorAgent._generate_text` resolves the tier and conditionally strips audio content via `_strip_audio_content` for non-audio-in models. Tests verify resolver delegation, audio-part retention/stripping, explicit `include_audio` override precedence, and multi-model behavior.
ScenarioExecutor: run resolution, OTEL stamping, and three-phase voice connect `python/scenario/scenario_executor.py`, `python/scenario/voice/script_steps.py`, `python/tests/voice/test_modality_stamps.py`, `python/tests/voice/test_modality_validation.py`	`run()` resolves and stores `ModalityTier` per role in `_modality_resolutions`. `_new_turn()` stamps `scenario.modality.<role>.tier` as root span attributes. `_voice_connect_all()` runs three phases: static `validate_modality_setup`, connect with `PendingTransportError` wrapped as `ModalityNegotiationError`, and script-step `streaming_transcripts` capability check raising `UnsupportedCapabilityError`. `interrupt(after_words=N)` marks steps with `_requires_streaming_transcripts=True`. Tests cover AC5 OTEL stamping, AC6 static negotiation, AC7 live connect failures, and AC8a/AC8b streaming transcript gating.
Capability matrix regression test and spec tag `python/tests/test_capability_matrix.py`, `specs/voice-agents.feature`	Adds a byte-identical regression test for the capability-matrix MDX generator (runs `gen_capability_matrix.py` via `uv run` and asserts file remains unchanged). The `voice-agents.feature` scenario gains the `@unit` tag.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(70, 130, 180, 0.5)
        Note over ScenarioExecutor: run() — modality resolution
    end
    ScenarioExecutor->>resolve_modality: declaration=simulator.modality, model_id
    resolve_modality-->>ScenarioExecutor: (ModalityTier, warnings)
    ScenarioExecutor->>resolve_modality: declaration=judge.modality, model_id
    resolve_modality-->>ScenarioExecutor: (ModalityTier, warnings)
    ScenarioExecutor->>ScenarioExecutor: store _modality_resolutions

    rect rgba(60, 179, 113, 0.5)
        Note over ScenarioExecutor: _voice_connect_all() — three-phase validation
    end
    ScenarioExecutor->>validate_modality_setup: tier + adapter.capabilities.input_formats
    validate_modality_setup-->>ScenarioExecutor: OK or ModalityNegotiationError

    ScenarioExecutor->>VoiceAgentAdapter: connect()
    VoiceAgentAdapter-->>ScenarioExecutor: OK or PendingTransportError → ModalityNegotiationError

    ScenarioExecutor->>ScenarioExecutor: scan script steps for _requires_streaming_transcripts
    ScenarioExecutor->>VoiceAgentAdapter: check capabilities.streaming_transcripts
    VoiceAgentAdapter-->>ScenarioExecutor: missing → UnsupportedCapabilityError

    rect rgba(178, 102, 255, 0.5)
        Note over ScenarioExecutor: _new_turn() — OTEL stamping
    end
    ScenarioExecutor->>LangwatchRootSpan: set_attributes(scenario.modality.<role>.tier)

Possibly Related PRs

langwatch/scenario#665: Main PR's ScenarioExecutor now gates interrupt steps on adapters' capabilities.streaming_transcripts, and retrieved PR changes PipecatAgentAdapter.capabilities to advertise streaming_transcripts=False, directly impacting how that gate behaves.

Poem

🐇 A model name substring? How passé!
Now ModalityTier leads the way —
audio-in, stt-bridge, or text,
Each role resolved, no guessing what's next.
Three phases validate before the first turn,
And OTEL spans stamp every tier in turn! 🎙️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 51.46% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(`#666`): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps' accurately summarizes the main architectural change: adding per-role modality negotiation with declaration-first resolution, two-phase validation, and OpenTelemetry stamping as specified in the PR objectives.
Description check	✅ Passed	The description comprehensively explains the rationale (silent degradation), what changed (modality_resolver core, per-agent wiring, two-phase validation, OTEL stamps), and how it works with working examples and acceptance criteria evidence. It directly relates to all changes in the changeset.
Linked Issues check	✅ Passed	The PR fully addresses all six key objectives from `#666`: (1) allows audio-capable models to receive audio directly via tier-conditional stripping [AC1–AC2], (2) adds public modality parameter [AC0], (3) implements two-phase validation [AC6–AC8b], (4) replaces hardcoded judge detection with resolver [AC3a–AC3c], (5) stamps modality into OTEL spans [AC5, AC5b], and (6) adopts declaration-first architecture with litellm-as-advisory [AC4a–AC4b]. All 16 acceptance criteria pass.
Out of Scope Changes check	✅ Passed	All changes are within scope of per-role modality negotiation: modality_resolver core logic, agent wiring, scenario executor validation/stamping, two-phase gating, test coverage for resolver/validation/stamping/judge behavior. Feature flagging (voice/init.py), test regression (script_steps.py marker), and capability matrix regression test are all necessary supporting changes. No unrelated refactoring or scope creep detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/issue666-voice-modality-per-role

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Capture and emit warnings from resolve_modality() calls at three call sites (simulator setup, judge setup, voice agent setup) instead of silently discarding them via underscore binding. The resolver contract requires all warnings be logged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@e2e

…ing contract failure The "Capability matrix is rendered into adapter docs" scenario in voice-agents.feature was missing a required @unit/@integration/@e2e tag, causing test_feature_file_contract tests to fail on main and on this PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Audio content dict literals in AgentInput tests are valid at runtime but pyright can't narrow them to ChatCompletionMessageParam. Suppressed with # type: ignore[arg-type] — same pattern already used in this file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

python/scenario/voice/modality_resolver.py (1)

62-65: 💤 Low value

Consider using raise ... from None for cleaner exception context.

When converting ValueError to ModalityNegotiationError, the original ValueError context isn't useful to users—they only need to know the declaration is invalid and see the valid values. Suppressing the chain makes tracebacks cleaner.

♻️ Proposed fix

     try:
         declared_tier = ModalityTier(declaration)
     except ValueError:
-        raise ModalityNegotiationError(
+        raise ModalityNegotiationError(
             f"Unknown modality declaration {declaration!r}; valid values: "
             + ", ".join(t.value for t in ModalityTier)
-        )
+        ) from None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/scenario/voice/modality_resolver.py` around lines 62 - 65, The
ModalityNegotiationError being raised in the modality validation block does not
suppress the exception context chain, which clutters the traceback with
unnecessary information about the original ValueError. Add `from None` to the
raise statement for ModalityNegotiationError to suppress the exception context
chain and provide a cleaner traceback to users that focuses only on the helpful
information about the invalid declaration and valid values.

python/scenario/scenario_executor.py (1)

462-464: ⚡ Quick win

Redundant OTEL attribute stamping.

Lines 463–464 both set scenario.modality.{role}.resolved and scenario.modality.{role}.tier to the same tier_value. This duplication increases trace size without adding information.

Consider setting only one attribute per role, or clarify in a comment why both are needed.

♻️ Suggested simplification

 attrs = {
     "langwatch.origin": "simulation",
     "scenario.run_id": self._scenario_run_id,
 }
 for role, tier_value in getattr(self, '_modality_resolutions', {}).items():
-    attrs[f"scenario.modality.{role}.resolved"] = tier_value
     attrs[f"scenario.modality.{role}.tier"] = tier_value
 self._trace.root_span.set_attributes(attrs)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/scenario/scenario_executor.py` around lines 462 - 464, The OTEL
attribute setting in the loop over self._modality_resolutions is redundant: both
lines 463 and 464 assign the same tier_value to different attribute keys for the
same role. Remove the duplicate line that sets the scenario.modality.{role}.tier
attribute (keeping only scenario.modality.{role}.resolved), or if both
attributes are actually needed for different purposes, add a clarifying comment
explaining why the duplication is intentional. The loop structure iterating over
the modality resolutions should be preserved, but eliminate the unnecessary
second attrs assignment per role iteration.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/scenario/scenario_executor.py`:
- Around line 772-776: The exception handler in the connect() error path is
catching all exceptions with `except Exception as e:` and wrapping them as
ModalityNegotiationError, which masks the true nature of non-modality failures
like network timeouts or authentication errors. According to AC7, only
PendingTransportError should be caught and wrapped as ModalityNegotiationError.
Change the except clause to specifically catch only PendingTransportError
instead of the broad Exception class, allowing other exceptions to propagate
unchanged so their true nature is visible to the caller.

---

Nitpick comments:
In `@python/scenario/scenario_executor.py`:
- Around line 462-464: The OTEL attribute setting in the loop over
self._modality_resolutions is redundant: both lines 463 and 464 assign the same
tier_value to different attribute keys for the same role. Remove the duplicate
line that sets the scenario.modality.{role}.tier attribute (keeping only
scenario.modality.{role}.resolved), or if both attributes are actually needed
for different purposes, add a clarifying comment explaining why the duplication
is intentional. The loop structure iterating over the modality resolutions
should be preserved, but eliminate the unnecessary second attrs assignment per
role iteration.

In `@python/scenario/voice/modality_resolver.py`:
- Around line 62-65: The ModalityNegotiationError being raised in the modality
validation block does not suppress the exception context chain, which clutters
the traceback with unnecessary information about the original ValueError. Add
`from None` to the raise statement for ModalityNegotiationError to suppress the
exception context chain and provide a cleaner traceback to users that focuses
only on the helpful information about the invalid declaration and valid values.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d2ca81ac-66b3-42dc-be22-29af72edfea2

📥 Commits

Reviewing files that changed from the base of the PR and between 21226c0 and 02eca46.

📒 Files selected for processing (16)

python/scenario/judge_agent.py
python/scenario/scenario_executor.py
python/scenario/user_simulator_agent.py
python/scenario/voice/__init__.py
python/scenario/voice/modality_resolver.py
python/scenario/voice/script_steps.py
python/tests/test_capability_matrix.py
python/tests/test_judge_agent.py
python/tests/test_public_modality_api.py
python/tests/test_user_simulator_agent.py
python/tests/voice/test_judge_audio_transcribe.py
python/tests/voice/test_judge_voice.py
python/tests/voice/test_modality_resolver.py
python/tests/voice/test_modality_stamps.py
python/tests/voice/test_modality_validation.py
specs/voice-agents.feature

Broad `except Exception` was masking network timeouts, auth errors, and bugs as ModalityNegotiationError. Only PendingTransportError signals a live-transport modality mismatch per AC7. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis · 2026-06-15T08:53:13Z

Review verdict: READY

Reviewed at: f85c8be31109bb375d66d4c9871639a8e09fae7c · Run: /review --since 0c84c38 (own-PR, scoped)

Scope: 1 file, 1 line Python — warnings → _warnings lint rename (Ruff RUF059). All checklist reviewers (principles, hygiene, security, test, proof) returned no concerns. Personas: no concerns. Stale-marker refresh justified by CI green on HEAD (14 pass / 0 fail, including test (3.12) and Analyze (python)).

Fixes applied this review pass

[proof-reviewer][hygiene][test-reviewer] test_judge_agent.py:482 pyright error — messages=[audio_message] (dict not assignable to ChatCompletionMessageParam) fixed with cast(Any, [audio_message]). Follows existing pattern at line 263. (commit 0c84c38)
[hygiene][test-reviewer] Duplicate VoiceRecording import in both spy test functions — unused first import removed. (same commit)
[coderabbitai] test_modality_stamps.py:120 unused warnings binding renamed _warnings to satisfy Ruff RUF059. (commit f85c8be)

Non-blocking (Decide / New Issue)

[principles] scenario_executor.py:463-464 stamps the identical tier_value into both .resolved and .tier. Intentional per feature spec ("exact keys") but undocumented — consider a one-line # emit both; .tier is legacy, .resolved is canonical comment if this is a migration step.
[test-reviewer] test_ac5b_stt_bridge_judge_invokes_transcribe_segments patches _litellm_advisory even with explicit modality='stt-bridge'; may be redundant. Verify the patch is load-bearing and remove if not — or track as follow-up.
[test-reviewer] mock_ts.assert_called_once_with(recording) sits outside the with patch() block (lines 527, 556). Functionally correct (mock records calls before teardown). Structural preference to move inside; non-blocking.
[proof-reviewer] AC evidence table cites ephemeral /tmp/demo_voice_modality.py script (no longer on disk). Low risk — CI test (3.12) covers the same ACs and is green on HEAD — but evidence should cite the CI job for durability.

…engthen AC5b test - Drop `scenario.modality.{role}.resolved` stamp (was identical to `.tier`); keep only `scenario.modality.{role}.tier` which is the canonical key. - Expand `test_ac5b_stt_bridge_tier_stamped_correctly` to exercise the full declaration → `resolve_modality()` → span-stamp path instead of bypassing the resolver via direct `_modality_resolutions` injection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/tests/voice/test_modality_stamps.py (1)

131-136: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Baseline test currently doesn’t exercise the “attribute missing” path it describes.

The docstring/comment says _modality_resolutions is intentionally unset, but _new_turn_with_resolutions(executor, {}) sets it to an empty dict on Line 63. This weakens coverage of the real getattr(..., {}) fallback path in _new_turn().

Suggested minimal test fix

-def _new_turn_with_resolutions(
+def _new_turn_with_resolutions(
     executor: ScenarioExecutor,
-    resolutions: dict,
+    resolutions: dict | None,
 ) -> dict:
@@
-        executor._modality_resolutions = resolutions
+        if resolutions is not None:
+            executor._modality_resolutions = resolutions
         executor._new_turn()
@@
-    captured = _new_turn_with_resolutions(executor, {})
+    captured = _new_turn_with_resolutions(executor, None)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/tests/voice/test_modality_stamps.py` around lines 131 - 136, The test
is not actually exercising the attribute-missing path because the helper
function _new_turn_with_resolutions sets _modality_resolutions to an empty dict
on line 63, contradicting the test's intent. To test the true fallback behavior,
modify the test to call the actual _new_turn function directly on the executor
(without using _new_turn_with_resolutions) so that the getattr fallback path in
the production code is genuinely invoked when _modality_resolutions is not set.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/tests/voice/test_modality_stamps.py`:
- Line 118: The unpacking assignment in the resolve_modality function call
unpacks a warnings variable that is never used, which triggers Ruff linting
warnings. Rename the unused warnings binding to either _warnings or _
(underscore) to explicitly indicate that this variable is intentionally not
used. This will satisfy the linter and make the intent clear to future readers.

---

Outside diff comments:
In `@python/tests/voice/test_modality_stamps.py`:
- Around line 131-136: The test is not actually exercising the attribute-missing
path because the helper function _new_turn_with_resolutions sets
_modality_resolutions to an empty dict on line 63, contradicting the test's
intent. To test the true fallback behavior, modify the test to call the actual
_new_turn function directly on the executor (without using
_new_turn_with_resolutions) so that the getattr fallback path in the production
code is genuinely invoked when _modality_resolutions is not set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4f24884-951a-4899-be0e-2967773c958f

📥 Commits

Reviewing files that changed from the base of the PR and between 30aeb1f and 1447c52.

📒 Files selected for processing (2)

python/scenario/scenario_executor.py
python/tests/voice/test_modality_stamps.py

💤 Files with no reviewable changes (1)

python/scenario/scenario_executor.py

drewdrewthis · 2026-06-15T19:29:24Z

No description provided.

…nts spy tests (AC5/AC5b/AC9) Three prove-it gaps closed: - AC5: stamp scenario.modality.<role>.resolved alongside .tier in _new_turn() — feature spec requires both exact keys; previously only .tier was set. - AC5b: add spy test asserting transcribe_segments is invoked with the judge's VoiceRecording when modality='stt-bridge', not just inferred from tier stamp. - AC9: add spy test asserting transcribe_segments still runs for a text-modality gpt-4o judge after the substring-list to resolver change; confirms regression path. All 59 AC-relevant tests green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis · 2026-06-15T19:40:08Z

No description provided.

…ve duplicate imports messages=[audio_message] failed pyright because dict[str, Unknown] is not assignable to ChatCompletionMessageParam; cast(Any, ...) matches the existing pattern at line 263. Also removes the unused duplicate VoiceRecording import in both spy test functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-16T15:51:07Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes core runtime behavior for voice modality negotiation (declaration resolution, two‑phase validation, adapter compatibility checks) and alters how the Judge and Simulator decide to send audio. It also introduces an external dependency call to litellm for capability advisory and adds validation that can raise runtime errors at setup/connect — changes that affect integrations and business logic and therefore are not low risk under the policy.

This PR requires a manual review before merging.

Ubuntu and others added 13 commits June 15, 2026 00:28

Merge Bundle 1: resolver core (AC4a, AC4b)

8ba9995

feat(#666): conditional audio strip in simulator — AC1, AC2

a29822f

Audio-capable models (advisory audio-in) now receive raw audio parts; text-only models strip audio exactly as before. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge Bundle 2: simulator wiring (AC1, AC2)

0ae04f5

Merge Bundle 3: judge wiring (AC3a, AC3b, AC3c, AC9)

b641a3f

Merge Bundle 4: two-phase validation (AC6, AC7, AC8a, AC8b)

51d5994

Merge Bundle 5: OTEL modality stamps (AC5, AC5b)

d7489b4

Merge Bundle 6: public modality= parameter (AC0)

9634b69

test(#666): verify capability matrix byte-identical when no capabilit…

b243d7d

…y field added — AC10b Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis added the in-progress Workflow: in-progress label Jun 15, 2026

github-code-quality Bot found potential problems Jun 15, 2026

View reviewed changes

Comment thread python/tests/voice/test_modality_stamps.py Fixed

Comment thread python/tests/voice/test_modality_validation.py Fixed

Ubuntu and others added 4 commits June 15, 2026 00:54

chore(#666): remove unused mock imports flagged by code-quality bot

ae52191

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis marked this pull request as ready for review June 15, 2026 08:20

drewdrewthis self-assigned this Jun 15, 2026

drewdrewthis requested a review from rogeriochaves June 15, 2026 08:27

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread python/scenario/scenario_executor.py

drewdrewthis added the prove-it-clean All ACs verified by /prove-it at this HEAD label Jun 15, 2026

drewdrewthis commented Jun 15, 2026

View reviewed changes

Comment thread python/scenario/scenario_executor.py

Comment thread python/tests/voice/test_modality_stamps.py

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread python/tests/voice/test_modality_stamps.py Outdated

drewdrewthis added prove-it-clean All ACs verified by /prove-it at this HEAD ai-reviewed /review was run on this PR (multi-agent: principles, hygiene, test, security) and removed prove-it-clean All ACs verified by /prove-it at this HEAD labels Jun 15, 2026

drewdrewthis added prove-it-clean All ACs verified by /prove-it at this HEAD and removed prove-it-clean All ACs verified by /prove-it at this HEAD labels Jun 15, 2026

Ubuntu and others added 2 commits June 15, 2026 19:49

fix(#666): rename unused warnings → _warnings to satisfy Ruff RUF059

f85c8be

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rogeriochaves approved these changes Jun 17, 2026

View reviewed changes

drewdrewthis merged commit 007a69f into main Jun 18, 2026
18 checks passed

drewdrewthis deleted the fix/issue666-voice-modality-per-role branch June 18, 2026 09:55

rogeriochaves mentioned this pull request Jun 18, 2026

chore(main): release python 0.7.32 #658

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps#670

feat(#666): per-role voice modality negotiation — declaration-first, two-phase validation, OTEL stamps#670
drewdrewthis merged 22 commits into
mainfrom
fix/issue666-voice-modality-per-role

drewdrewthis commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

drewdrewthis commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

drewdrewthis commented Jun 15, 2026

Uh oh!

drewdrewthis commented Jun 15, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewdrewthis commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What changed

How it works

Test plan

How I can prove I was successful

Show-it-working demo (actual run output — 2026-06-15)

Acceptance Criteria Evidence Table

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Possibly Related PRs

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

drewdrewthis commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review verdict: READY

Fixes applied this review pass

Non-blocking (Decide / New Issue)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

drewdrewthis commented Jun 15, 2026

Uh oh!

drewdrewthis commented Jun 15, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

drewdrewthis commented Jun 15, 2026 •

edited

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

drewdrewthis commented Jun 15, 2026 •

edited

Loading