fix(voice/#491): diagnose + resolve multi-turn @e2e suite-wedge + tighten VAD tests by drewdrewthis · Pull Request #694 · langwatch/scenario

drewdrewthis · 2026-06-21T11:00:58Z

Why

The 6 multi-turn voice @e2e demos carried @pytest.mark.skip ("Hangs in full suite … multi-turn max_turns demos wedge pytest process"), so a large slice of the @e2e behavioral contract was asserted in the feature file but never actually run. This diagnoses the wedge and removes the skips without re-papering the contract.

Closes #491

What changed

Root-caused the wedge (creds-free). scenario.run() teardown calls event_bus.drain() synchronously, which blocks on an unbounded queue.join() until the daemon=False telemetry worker POSTs every scenario event (30s httpx timeout, drained serially). Teardown cost ≈ events × up-to-30s whenever the LangWatch endpoint is reachable-but-slow; multi-turn voice demos emit the most events (one/turn + base64 audio) so their teardown is the most exposed. Chose to document + isolate rather than re-architect the shared event bus — high blast radius, and the demos need live creds so a fix can't be verified in python-ci anyway. Bounding the drain is recommended as a separate SDK-wide follow-up.
Resolution = per-process isolation (issue's path b), no skip papering. Removed the 6 @pytest.mark.skip; tagged the demos @pytest.mark.voice_multiturn (registered in pytest.ini). voice-integration.yml deselects them from the two bulk single-process steps and runs each in its own pytest process (set discovered by marker, so it can't drift). Fresh process per demo neutralises the telemetry drain and the documented adapter task/subprocess leaks at once.
Docs. TESTING.md gets the confirmed root cause + the supported marker mechanism; specs/voice-agents.feature gets a README block (comment-only — the 127-scenario contract is unchanged).
VAD (P3.2). Added a cross-class case pinning the warning rate-limit on the adapter_name string, not type(self).
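The teardown arithmetic in the root-cause item above can be modelled with nothing but the standard library. This is a minimal sketch, not the SDK's event bus: `run_teardown`, the `None` sentinel, and the `time.sleep` stand-in for a slow telemetry POST are all assumptions made for illustration.

```python
import queue
import threading
import time


def run_teardown(event_count: int, post_seconds: float) -> float:
    """Return how long a synchronous, unbounded drain blocks the caller."""
    events: queue.Queue = queue.Queue()

    def worker() -> None:
        # Serial consumer: one event at a time, like the telemetry worker described above.
        while True:
            item = events.get()
            try:
                if item is None:  # hypothetical sentinel so the sketch can exit cleanly
                    return
                time.sleep(post_seconds)  # stands in for one slow POST
            finally:
                events.task_done()

    thread = threading.Thread(target=worker, daemon=False)
    thread.start()
    for i in range(event_count):
        events.put(i)

    started = time.monotonic()
    events.join()  # unbounded: returns only after every queued event is processed
    elapsed = time.monotonic() - started

    events.put(None)  # stop the non-daemon worker so the process can exit
    thread.join()
    return elapsed
```

With 4 events at 0.05s each, the caller is held for roughly 0.2s; scale the per-event cost up to a 30s timeout and the same shape gives the events × up-to-30s teardown described above.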

Test plan

pytest tests/voice/test_vad.py → 7 passed (incl. new test_vad_fallback_rate_limit_keys_on_name_string_not_python_class).
pytest tests/voice/test_feature_file_contract.py → 5 passed (127-scenario contract intact after the feature-file comment block).
Marker split (creds-free): pytest -m voice_multiturn --collect-only → exactly the 6 demos; pytest -m "integration and not voice_multiturn" --collect-only → the other e2e, none of the 6; no @pytest.mark.skip left on the 6; marker registered (no PytestUnknownMark).
python-ci mirror CI=true pytest tests/ -m "not integration" → 980 passed / 18 skipped. The only 3 failures are pre-existing local-aarch64 test_red_team_agent mock failures (Mock has no attribute 'messages') — unrelated to this diff (touches no scenario/ source) and reproduce identically with origin/main's pytest.ini.

How I can prove I was successful

No playable artifact — pure test/CI infra. The root cause is demonstrated by a creds-free repro: point LANGWATCH_ENDPOINT at a socket that accepts-but-never-responds and run any scenario.run(). The captured hung-thread traceback — worker thread in socket.recv_into, calling thread in event_bus.drain() → queue.join() → all_tasks_done.wait() — is attached to issue #491. The green unit suite + marker-split commands above are the rest of the demonstration.

Human verification

This is a backend-only change: voice recv-loop / scenario.run() teardown logic plus test-suite isolation in CI. There is no UI surface (no screen, page, or rendered artifact to inspect), so live-app visual proof does not apply here.

To confirm by hand (all creds-free):

Repro the wedge → confirm isolation neutralises it. Point LANGWATCH_ENDPOINT at a socket that accepts but never responds (e.g. nc -lk 127.0.0.1 9999), then run any scenario.run(). Without isolation the calling thread hangs in event_bus.drain() → queue.join(); under the per-process split each demo runs in a fresh process so the hung drain can't wedge the shared suite.
Marker split is exact. pytest -m voice_multiturn --collect-only → exactly the 6 multi-turn demos; pytest -m "integration and not voice_multiturn" --collect-only → the rest, none of the 6. No @pytest.mark.skip remains on the 6.
Unit suites green. pytest tests/voice/test_vad.py → 7 passed; pytest tests/voice/test_feature_file_contract.py → 5 passed.
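For the first step above, a Python stand-in for the `nc -lk` listener may be handy where netcat is unavailable. This is a sketch under stated assumptions: `start_black_hole_server` and `probe` are hypothetical helper names, and the request bytes are only a placeholder for whatever the telemetry client would send.

```python
import socket
import threading


def start_black_hole_server() -> tuple[socket.socket, int]:
    """Listen on an ephemeral localhost port; accept connections, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen()
    held: list[socket.socket] = []

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:  # server socket was closed
                return
            held.append(conn)  # keep the connection open and send nothing

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def probe(port: int, timeout: float) -> bool:
    """Return True if the peer stayed silent for the whole timeout."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        client.sendall(b"POST / HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\n\r\n")
        try:
            client.recv(1)
        except socket.timeout:
            return True
    return False
```

Pointing LANGWATCH_ENDPOINT at `http://127.0.0.1:<port>` for the returned port should then reproduce the accepts-but-never-responds condition; `probe` only confirms the listener behaves that way.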

Anything surprising?

The issue estimated "10" skipped wrappers; re-confirmed at fix-time the actual wedge set is 6 @pytest.mark.skip e2e wrappers. The other 4 skips are skipif(CI=='true') ("scenario.run hangs in python-ci; fine locally") — a separate, pre-existing family (very likely the same drain root cause) left out of scope here.
The demos stay creds-bound (OpenAI + others) so they're verified in voice-integration.yml, not python-ci — which is why path (b) (isolation) is the complete creds-free resolution rather than (a) (which the issue gates on "don't require live creds in CI").

…ot python class

The parametrized rate-limit test varied only the adapter_name and used a single class, so it could not distinguish 'keyed on the string' from 'keyed on the (one) class'. Add an explicit cross-class case: a subclass built with a string the base already warned for stays silent (shared _warned_adapters ClassVar), while a fresh string warns, pinning that the dedupe key is the caller-passed string, not type(self).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
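The cross-class case described in this commit can be illustrated with a small stand-in. This is not the real `WebRTCVadFallback`: `FallbackVadSketch`, its subclass, the warning text, and `count_warnings` are hypothetical, and only the shared `_warned_adapters` ClassVar and the keying on `adapter_name` come from the description.

```python
import warnings
from typing import ClassVar


class FallbackVadSketch:
    # One set on the base class; subclasses mutate it rather than getting their own.
    _warned_adapters: ClassVar[set[str]] = set()

    def __init__(self, adapter_name: str) -> None:
        self.adapter_name = adapter_name
        # Dedupe key is the caller-passed string, not type(self).
        if adapter_name not in self._warned_adapters:
            self._warned_adapters.add(adapter_name)
            warnings.warn(
                f"{adapter_name}: no native VAD, using fallback",
                UserWarning,
                stacklevel=2,
            )


class SubclassedFallbackVadSketch(FallbackVadSketch):
    """Inherits the same _warned_adapters set."""


def count_warnings(factory, adapter_name: str) -> int:
    """Construct one instance and return how many warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        factory(adapter_name)
    return len(caught)
```

Had the key been `type(self)`, the subclass would warn again for a string the base class already covered; with the string key it stays silent, and only a fresh string triggers a new warning.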

…isolation

AC1, diagnosis (root cause, confirmed creds-free): scenario.run() offloads each run to a worker thread with a private event loop and, in that thread's finally, calls event_bus.drain() synchronously. drain() does an unbounded queue.join() that returns only once the (daemon=False) event-bus worker has POSTed every scenario event to the LangWatch endpoint; each POST has a 30s httpx timeout and the worker drains serially. So teardown cost scales with event_count x up-to-30s whenever the endpoint is reachable-but-slow. Multi-turn voice demos emit the most events (one/turn + base64 audio snapshots) so their teardown is the most exposed; several in one process compounds it past the 60s per-test timeout -> wedge.

Reproduced creds-free by pointing LANGWATCH_ENDPOINT at a socket that accepts but never responds: the worker blocks in socket.recv_into while the calling thread blocks in drain() -> queue.join(). Locally the default endpoint fast-refuses, so the drain returns instantly, which is why the wedge is invisible in isolation but bites in voice-integration.yml where LANGWATCH_API_KEY is set. Full write-up in TESTING.md.

AC2, resolution (creds-free path b): process-isolation, no skip papering.
- Remove the 6 @pytest.mark.skip markers on the multi-turn demos; tag them @pytest.mark.voice_multiturn (registered in pytest.ini) instead.
- voice-integration.yml deselects voice_multiturn from the two bulk single-process steps and runs each marked demo in its OWN pytest process (discovered by marker, so the set can't drift); isolation neutralises the telemetry drain AND the documented adapter task/subprocess leaks at once.
- TESTING.md: replace the 'not yet isolated / run manually' note with the confirmed root cause + the supported marker mechanism; recommend a bounded event_bus.drain() as a separate SDK-wide follow-up.
- specs/voice-agents.feature: README block documenting the isolation requirement (comment-only; 127-scenario contract unchanged).

python-ci is unaffected (e2e stay deselected via -m 'not integration').

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-21T11:01:11Z

Walkthrough

Six multi-turn voice E2E tests previously marked skip are re-marked with a new voice_multiturn pytest marker registered in pytest.ini. The CI workflow is updated to exclude this marker from bulk runs and execute each marked file in its own uv pytest process. A new VAD unit test is added for warning deduplication by adapter name. Documentation in TESTING.md and specs/voice-agents.feature is updated with root-cause analysis and command guidance.

Changes

voice_multiturn Process Isolation and VAD Test

Layer / File(s)	Summary
`voice_multiturn` marker definition and test re-annotation `python/pytest.ini`, `python/tests/voice/test_accent_loop_e2e.py`, `python/tests/voice/test_emotional_escalation_e2e.py`, `python/tests/voice/test_long_hold_e2e.py`, `python/tests/voice/test_multi_intent_e2e.py`, `python/tests/voice/test_random_interruptions_e2e.py`, `python/tests/voice/test_silence_handling_e2e.py`	Registers `voice_multiturn` as a documented pytest marker in `pytest.ini` and replaces `@pytest.mark.skip(...)` with `@pytest.mark.voice_multiturn` on all six multi-turn E2E test functions.
CI workflow split for isolated execution `.github/workflows/voice-integration.yml`	Excludes `voice_multiturn` from the integration and bulk E2E steps via `-m "not voice_multiturn"` and `-m "integration and not voice_multiturn"`. Adds a new step that discovers matching demo files via `--collect-only`, extracts node IDs, then runs each in a separate `uv run pytest` process with `-p no:cacheprovider` and aggregates exit codes.
VAD warning rate-limit unit test `python/tests/voice/test_vad.py`	Adds `test_vad_fallback_rate_limit_keys_on_name_string_not_python_class` to verify that `WebRTCVadFallback` deduplicates the "no native VAD" `UserWarning` by `adapter_name` string rather than Python class identity, covering base class, subclass same-string, and subclass new-string cases.
Documentation update `TESTING.md`, `specs/voice-agents.feature`	Rewrites the multi-turn section in `TESTING.md` to document `event_bus.drain()` blocking as the root cause (unbounded queue join tied to telemetry POSTs with 30s httpx timeout), updates command examples for per-process discovery and execution, and adds a README-style comment block in `specs/voice-agents.feature` describing the isolation requirement and deselection guidance.

Suggested labels

low-risk-change, prove-it-clean

Suggested reviewers

rogeriochaves

🐇 Six tests once frozen in skip's icy hold,
Now marked voice_multiturn, brave and bold!
Each runs alone — no queue left to wedge,
The drain no longer blocks on telemetry's edge.
One process per demo, the CI sings true,
A fresh pytest world for each voice breakthrough! 🎙️

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main changes: diagnosing and fixing the multi-turn `@e2e` suite wedge issue `#491`, plus tightening VAD tests.
Linked Issues check	✅ Passed	All four acceptance criteria from `#491` are met: wedge diagnosed with reproducible evidence (`#491` criterion 1), resolution path (b) implemented with marker-based isolation (`#491` criterion 2), VAD parametrized test added (`#491` criterion 3), CI green with 980 passed (`#491` criterion 4).
Out of Scope Changes check	✅ Passed	All changes directly support the issue resolution: workflow isolation via marker filtering, test marker updates, documentation of root cause, new VAD test case, and pytest marker registration.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	The PR description comprehensively explains the root cause (event_bus.drain() deadlock), the chosen resolution (per-process isolation with voice_multiturn marker), all changes across 9 files, test verification, and manual repro steps.


…empty

The marker-discovery loop ran zero iterations (rc=0, green) if pytest --collect-only returned nothing: a collection/import error or a marker rename would silently look like 'all multi-turn demos passed'. Guard with an explicit empty-list check that exits 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/tests/voice/test_vad.py`:
- Line 67: The function
test_vad_fallback_rate_limit_keys_on_name_string_not_python_class is missing an
explicit return type annotation required for pyright strict mode compliance. Add
the explicit `-> None` return type annotation to the function definition after
the parameter list and before the colon to indicate that this test function does
not return a value.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73da2472-35e4-4d0f-818a-bedf79edaab6

📥 Commits

Reviewing files that changed from the base of the PR and between b819849 and e7d70fd.

📒 Files selected for processing (11)

.github/workflows/voice-integration.yml
TESTING.md
python/pytest.ini
python/tests/voice/test_accent_loop_e2e.py
python/tests/voice/test_emotional_escalation_e2e.py
python/tests/voice/test_long_hold_e2e.py
python/tests/voice/test_multi_intent_e2e.py
python/tests/voice/test_random_interruptions_e2e.py
python/tests/voice/test_silence_handling_e2e.py
python/tests/voice/test_vad.py
specs/voice-agents.feature

…stency

Addresses review findings on PR #694:
- voice-integration.yml: capture the --collect-only exit code (stderr merged) and fail on non-zero so a PARTIAL collection error (one demo fails to import while others collect) aborts instead of silently running the subset; that silent-subset path would re-open the same asserted-but-never-run hole this step closes. Switch the loop to a mapfile array with quoted expansion so a path with whitespace/metacharacters can't word-split or inject.
- test_random_interruptions_e2e.py: order voice_multiturn above asyncio to match the other five wrappers (marker order is semantically irrelevant but the inconsistency was noise).
- pytest.ini + conftest.py: 'nightly' -> 'on-demand' for the voice-integration workflow (it is workflow_dispatch), matching the TESTING.md wording.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Design-review hardening: pytest-timeout's signal/thread methods can't interrupt the asyncio teardown hangs this PR isolates (pyproject.toml records that both methods let the first hang kill the whole process), so an external per-file 'timeout 600' makes a recurred wedge fail in ~10 min instead of riding the job-level timeout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
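The hardened discovery loop described across these commits (fail on empty collection, one process per demo, external timeout, aggregated exit codes) lives in shell in voice-integration.yml. The following is only a Python rendering of the same idea: `parse_node_ids`, `run_each_isolated`, the injectable `run` parameter, and the 124 exit code (borrowed from GNU `timeout`) are assumptions, not the workflow's actual code.

```python
import subprocess
import sys
from typing import Callable, Sequence


def parse_node_ids(collect_output: str) -> list[str]:
    """Keep only lines that look like pytest node IDs (path::test_name)."""
    return [line.strip() for line in collect_output.splitlines() if "::" in line]


def run_each_isolated(
    node_ids: Sequence[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    per_test_timeout: float = 600.0,
) -> int:
    """Run every node ID in its own pytest process; return the worst exit code."""
    if not node_ids:
        # An empty collection must fail loudly rather than pass with zero iterations.
        return 1
    worst = 0
    for node_id in node_ids:
        # List-form argv: no shell, so odd characters in a path cannot word-split.
        cmd = [sys.executable, "-m", "pytest", "-p", "no:cacheprovider", node_id]
        try:
            rc = run(cmd, timeout=per_test_timeout).returncode
        except subprocess.TimeoutExpired:
            rc = 124  # assumed convention, mirroring GNU timeout
        worst = max(worst, rc)
    return worst
```

The external timeout sits outside the pytest process on purpose: per the commit message, an in-process timeout cannot interrupt the teardown hang, whereas `subprocess.run(..., timeout=...)` kills the child and raises.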

drewdrewthis · 2026-06-21T11:23:13Z

Review verdict: READY

Reviewed at: f4b7400f · Run: /review (own-PR) · 7 reviewers: principles, hygiene, security, test, proof-reviewer, design-soundness, drift

No blocking concerns remain. All review threads resolved at this SHA. Every actionable finding surfaced by the fan-out was applied in-session (commits below); remaining items are non-blocking.

Resolved during this review

[principles][hygiene] voice-integration.yml — the marker-discovery step could silently pass on a partial collection failure (one demo fails to import while others collect) — the exact "asserted-but-never-run" hole this PR closes. Now captures the --collect-only exit code (stderr merged) and fails non-zero. (commit a156bea)
[security] voice-integration.yml — unquoted for f in $files was a latent word-split/injection seam. Switched to mapfile + quoted array expansion. (commit a156bea)
[hygiene] test_random_interruptions_e2e.py — voice_multiturn/asyncio decorator order now matches the other five wrappers. (commit a156bea)
[hygiene] pytest.ini + conftest.py — stale "nightly" → "on-demand" to match the corrected TESTING.md (the workflow is workflow_dispatch). (commit a156bea)
[design-soundness] voice-integration.yml — added an external per-file timeout 600 so a recurred drain-wedge (which pytest-timeout can't interrupt for asyncio hangs) can't ride the job-level timeout. (commit f4b7400)

Non-blocking (Decide / New Issue)

[design-soundness][drift] New Issue — promote the deferred SDK fix (bound ScenarioEventBus.drain() so telemetry can't block test teardown for any scenario.run() caller) from the TESTING.md note to a tracked, linked issue. Filing as a follow-up; this PR is the process-isolation stopgap (the diagnosis confirms a built-in like pytest-forked/xdist would not clear this serial-unbounded-teardown wedge, so the per-process CI loop is justified, not a reinvented wheel).
[CodeRabbit] test_vad.py:67 Decide (declined) — add -> None to the new test. Declined for consistency: 0 of ~385 test functions in tests/voice/ carry return annotations; annotating one would make it the outlier and pyright does not require it.
[proof-reviewer] AC1 (Decide, optional) — the issue-comment traceback is a frame-summarized stack; a raw faulthandler dump would be strictly stronger. The frames verify exactly against HEAD source.

Design soundness & drift

[design-soundness] Verdict: strong foundation, proceed — no blocking design finding. Confirmed build-vs-reuse is sound (pytest-forked/xdist don't solve this wedge + would break the session-bot fixture; both absent from deps).
[drift] Advisory only, zero unratified decisions. The "6 vs 10" skip count is a correct re-confirmation (the issue conflated 6 wedge-skips with 4 unrelated skipif(CI) skips) and is already disclosed in the PR body; the deferred drain() fix is ratified by AC2 path (b) + the plan's "narrowest fix".

Open verification gate (DoD, not a code-quality blocker)

[proof-reviewer] AC4 — python-ci test (3.12) must report green on this SHA before the PR is merge-done. Pending at f4b7400f (creds-free local mirror was 980 passed / 18 skipped; the diff touches no scenario/ source). This is a machine-event wait, tracked separately from this quality verdict.

Verdict is prose, not a GitHub approval. /review never flips approve state.

drewdrewthis · 2026-06-21T11:25:56Z

Filed #696 to track the SDK-wide root-cause fix (bound ScenarioEventBus.drain()) that the design-soundness + drift reviews recommended promoting from the TESTING.md note. This PR remains the narrow per-process-isolation resolution for #491; #696 is the durable fix that would also let the 4 skipif(CI) tests be un-skipped.
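For orientation, one common way to bound a `queue.join()` is to wait on the queue's condition variable with a deadline. This is a sketch of that general technique, not the fix planned in #696: `drain_with_deadline` is a hypothetical name, and `all_tasks_done` and `unfinished_tasks` are attributes of CPython's `queue.Queue` implementation rather than documented API.

```python
import queue
import time


def drain_with_deadline(q: queue.Queue, timeout: float) -> bool:
    """Wait for the queue to empty; return False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    with q.all_tasks_done:  # same condition that Queue.join() waits on
        while q.unfinished_tasks:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # give up; the caller decides whether to drop events
            q.all_tasks_done.wait(remaining)
    return True
```

A caller that gets `False` still has to choose what happens to the undelivered events, which is part of why the change is treated as SDK-wide rather than folded into this PR.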

github-actions · 2026-06-21T11:35:55Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR modifies files in restricted directories that require manual review per policy.

This PR requires a manual review before merging.

drewdrewthis and others added 2 commits June 21, 2026 10:57

drewdrewthis self-assigned this Jun 21, 2026

drewdrewthis requested a review from rogeriochaves June 21, 2026 11:00

drewdrewthis mentioned this pull request Jun 21, 2026

voice: diagnose multi-turn @e2e suite-wedge + tighten VAD warning rate-limit tests #491

Open

4 tasks

coderabbitai Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread python/tests/voice/test_vad.py

drewdrewthis and others added 2 commits June 21, 2026 11:16

drewdrewthis mentioned this pull request Jun 21, 2026

scenario.run() teardown can block indefinitely on a slow telemetry endpoint — bound event_bus.drain() #696

Open

drewdrewthis added the slack-requested Slack PR review request posted label Jun 21, 2026

drewdrewthis requested review from 0xdeafcafe, Aryansharma28 and sergioestebance June 23, 2026 21:29

drewdrewthis wants to merge 5 commits into main from fix/491-voice-suite-wedge-vad
