fix(voice/#491): diagnose + resolve multi-turn @e2e suite-wedge + tighten VAD tests #694
drewdrewthis wants to merge 5 commits into
…ot python class

The parametrized rate-limit test varied only the adapter_name and used a single class, so it could not distinguish 'keyed on the string' from 'keyed on the (one) class'. Add an explicit cross-class case: a subclass built with a string the base already warned for stays silent (shared _warned_adapters ClassVar), while a fresh string warns. This pins that the dedupe key is the caller-passed string, not type(self).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
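The cross-class case described in this commit can be illustrated with a minimal sketch. It assumes a base class that keeps a `_warned_adapters` ClassVar set and dedupes on the string passed by the caller; the class names, the `warn_fallback` method, and the adapter strings are hypothetical and are not taken from the SDK.

```python
import warnings
from typing import ClassVar, Set, Tuple


class VadFallbackWarner:
    """Hypothetical stand-in for the adapter base class (not the SDK's code)."""

    # One set shared by the base and every subclass: subclasses inherit the
    # attribute without rebinding it, so they all mutate the same object.
    _warned_adapters: ClassVar[Set[str]] = set()

    def __init__(self, adapter_name: str) -> None:
        self.adapter_name = adapter_name

    def warn_fallback(self) -> bool:
        """Warn once per adapter_name string; return True if a warning fired."""
        if self.adapter_name in self._warned_adapters:
            return False
        self._warned_adapters.add(self.adapter_name)
        warnings.warn(f"{self.adapter_name}: VAD fallback in use", stacklevel=2)
        return True


class SubclassWarner(VadFallbackWarner):
    """A different Python class that reuses the base's dedupe set."""


def demo() -> Tuple[bool, bool, bool]:
    VadFallbackWarner._warned_adapters.clear()  # start from a clean slate
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        base_first = VadFallbackWarner("adapter-a").warn_fallback()
        sub_same_name = SubclassWarner("adapter-a").warn_fallback()
        sub_new_name = SubclassWarner("adapter-b").warn_fallback()
    return base_first, sub_same_name, sub_new_name
```

If the key were `type(self)`, the second call would warn again because `SubclassWarner` is a new class; with a string key it stays silent, which is the distinction the new test pins.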
…isolation

AC1: diagnosis (root cause, confirmed creds-free)

scenario.run() offloads each run to a worker thread with a private event loop and, in that thread's finally, calls event_bus.drain() synchronously. drain() does an unbounded queue.join() that returns only once the (daemon=False) event-bus worker has POSTed every scenario event to the LangWatch endpoint; each POST has a 30s httpx timeout and the worker drains serially. So teardown cost scales with event_count x up-to-30s whenever the endpoint is reachable-but-slow. Multi-turn voice demos emit the most events (one/turn + base64 audio snapshots) so their teardown is the most exposed; several in one process compounds it past the 60s per-test timeout -> wedge.

Reproduced creds-free by pointing LANGWATCH_ENDPOINT at a socket that accepts but never responds: the worker blocks in socket.recv_into while the calling thread blocks in drain() -> queue.join(). Locally the default endpoint fast-refuses, so the drain returns instantly, which is why the wedge is invisible in isolation but bites in voice-integration.yml where LANGWATCH_API_KEY is set. Full write-up in TESTING.md.

AC2: resolution (creds-free path b): process-isolation, no skip papering.

- Remove the 6 @pytest.mark.skip markers on the multi-turn demos; tag them @pytest.mark.voice_multiturn (registered in pytest.ini) instead.
- voice-integration.yml deselects voice_multiturn from the two bulk single-process steps and runs each marked demo in its OWN pytest process (discovered by marker, so the set can't drift). Isolation neutralises the telemetry drain AND the documented adapter task/subprocess leaks at once.
- TESTING.md: replace the 'not yet isolated / run manually' note with the confirmed root cause + the supported marker mechanism; recommend a bounded event_bus.drain() as a separate SDK-wide follow-up.
- specs/voice-agents.feature: README block documenting the isolation requirement (comment-only; 127-scenario contract unchanged).

python-ci is unaffected (e2e stay deselected via -m 'not integration').

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
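To make the recommended follow-up concrete, here is a rough sketch of what a bounded drain could look like. It is not the SDK's `event_bus.drain()`; `bounded_join` and `simulate_stuck_worker` are hypothetical names. `queue.Queue.join()` accepts no timeout, so the sketch waits on the same `all_tasks_done` condition with a deadline, which relies on CPython's `queue.Queue` attributes (`all_tasks_done`, `unfinished_tasks`). A `threading.Event` stands in for a POST stuck in `socket.recv_into`, and the worker is a daemon thread here only so the example can exit.

```python
import queue
import threading
import time
from typing import Tuple


def bounded_join(q: queue.Queue, timeout_s: float) -> bool:
    """Wait until q has no unfinished tasks, giving up after timeout_s.

    Returns True if the queue drained, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_s
    with q.all_tasks_done:
        while q.unfinished_tasks:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            q.all_tasks_done.wait(remaining)
    return True


def simulate_stuck_worker(timeout_s: float = 0.2) -> Tuple[bool, float]:
    """Reproduce the wedge shape: a serial worker that never finishes a task."""
    events: queue.Queue = queue.Queue()
    never_responds = threading.Event()  # stand-in for an endpoint that never replies

    def worker() -> None:
        while True:
            events.get()
            never_responds.wait()  # blocks like a hung HTTP read
            events.task_done()

    # daemon=True is a choice for this sketch only, so the interpreter can exit.
    threading.Thread(target=worker, daemon=True).start()
    for event_id in range(3):
        events.put(event_id)

    started = time.monotonic()
    drained = bounded_join(events, timeout_s)
    elapsed = time.monotonic() - started
    never_responds.set()  # release the worker so it can finish the backlog
    return drained, elapsed
```

With an unbounded `events.join()` in place of `bounded_join`, the caller would block for as long as the worker does, which is the teardown behaviour the diagnosis describes.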
…empty

The marker-discovery loop ran zero iterations (rc=0, green) if pytest --collect-only returned nothing, so a collection/import error or a marker rename would silently look like 'all multi-turn demos passed'. Guard with an explicit empty-list check that exits 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/tests/voice/test_vad.py`:
- Line 67: The function
test_vad_fallback_rate_limit_keys_on_name_string_not_python_class is missing an
explicit return type annotation required for pyright strict mode compliance. Add
the explicit `-> None` return type annotation to the function definition after
the parameter list and before the colon to indicate that this test function does
not return a value.
📒 Files selected for processing (11)
- .github/workflows/voice-integration.yml
- TESTING.md
- python/pytest.ini
- python/tests/voice/test_accent_loop_e2e.py
- python/tests/voice/test_emotional_escalation_e2e.py
- python/tests/voice/test_long_hold_e2e.py
- python/tests/voice/test_multi_intent_e2e.py
- python/tests/voice/test_random_interruptions_e2e.py
- python/tests/voice/test_silence_handling_e2e.py
- python/tests/voice/test_vad.py
- specs/voice-agents.feature
…stency

Addresses review findings on PR #694:
- voice-integration.yml: capture the --collect-only exit code (stderr merged) and fail on non-zero so a PARTIAL collection error (one demo fails to import while others collect) aborts instead of silently running the subset. That silent-subset path would re-open the same asserted-but-never-run hole this step closes. Switch the loop to a mapfile array with quoted expansion so a path with whitespace/metacharacters can't word-split or inject.
- test_random_interruptions_e2e.py: order voice_multiturn above asyncio to match the other five wrappers (marker order is semantically irrelevant but the inconsistency was noise).
- pytest.ini + conftest.py: 'nightly' -> 'on-demand' for the voice-integration workflow (it is workflow_dispatch), matching the TESTING.md wording.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
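The two guards on the discovery step (fail on a non-zero collection exit code, fail on an empty result) can be sketched in Python. The workflow itself uses a shell `mapfile` loop, so this is a re-expression under assumptions, not the workflow's code: `files_from_collect_output` and `collect_marked_files` are hypothetical names, and the parser assumes the `pytest --collect-only -q` output shape of one `path::test` node ID per line followed by a summary line.

```python
import subprocess
import sys
from typing import List


def files_from_collect_output(returncode: int, output: str) -> List[str]:
    """Turn collect-only output into an ordered, de-duplicated file list.

    Exits non-zero on a collection error or on an empty result, so neither
    case can pass as "all multi-turn demos ran".
    """
    if returncode != 0:
        sys.exit(f"collection failed (rc={returncode}); refusing to run a subset")
    files: List[str] = []
    for line in output.splitlines():
        if "::" not in line:
            continue  # blank lines and the trailing summary line
        path = line.split("::", 1)[0].strip()
        if path and path not in files:
            files.append(path)
    if not files:
        sys.exit("no voice_multiturn tests collected; marker renamed or import error?")
    return files


def collect_marked_files(marker: str = "voice_multiturn") -> List[str]:
    """Run the collection in a child process with stderr merged into stdout."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-m", marker, "--collect-only", "-q"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return files_from_collect_output(proc.returncode, proc.stdout)
```

Passing the paths around as a list, rather than as one space-joined string, plays the same role as the quoted `mapfile` array expansion: a path containing whitespace stays a single argument.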
Design-review hardening: pytest-timeout's signal/thread methods can't interrupt the asyncio teardown hangs this PR isolates (pyproject.toml records that both methods let the first hang kill the whole process), so an external per-file 'timeout 600' keeps a recurred wedge from riding the job-level timeout instead of failing in ~10 min.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
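The idea of an external per-file timeout can be shown with a small sketch: the deadline is enforced by the parent process rather than from inside the hung interpreter. The workflow uses the coreutils `timeout` command; `run_isolated` below is a hypothetical Python analogue, and returning 124 on expiry simply mirrors the exit status that `timeout` reports.

```python
import subprocess
import sys
from typing import Sequence

TIMEOUT_EXIT_CODE = 124  # the status coreutils `timeout` reports on expiry


def run_isolated(cmd: Sequence[str], timeout_s: float) -> int:
    """Run cmd in its own process; kill it and return 124 if it overruns.

    subprocess.run() kills the child and waits for it before raising
    TimeoutExpired, so a wedged teardown cannot outlive the deadline.
    Grandchild processes are not covered by this sketch.
    """
    try:
        return subprocess.run(list(cmd), timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE


def run_each_file(files: Sequence[str], timeout_s: float = 600.0) -> int:
    """Run one pytest process per file and return the worst exit code seen."""
    worst = 0
    for path in files:
        rc = run_isolated([sys.executable, "-m", "pytest", path], timeout_s)
        worst = max(worst, rc)
    return worst
```

Because each file gets a fresh interpreter, a hang in one demo costs at most `timeout_s` and cannot block the demos that follow it.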
Review verdict: READY
Reviewed at:
No blocking concerns remain. All review threads resolved at this SHA. Every actionable finding surfaced by the fan-out was applied in-session (commits below); remaining items are non-blocking.
Resolved during this review
Non-blocking (Decide / New Issue)
Design soundness & drift
Open verification gate (DoD, not a code-quality blocker)
Verdict is prose, not a GitHub approval. /review never flips approve state.
Filed #696 to track the SDK-wide root-cause fix (bound
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging.
Why
The 6 multi-turn voice `@e2e` demos carried `@pytest.mark.skip("Hangs in full suite … multi-turn max_turns demos wedge pytest process")`, so a large slice of the `@e2e` behavioral contract was asserted in the feature file but never actually run. This diagnoses the wedge and removes the skips without re-papering the contract.

Closes #491

What changed
- `scenario.run()` teardown calls `event_bus.drain()` synchronously, which blocks on an unbounded `queue.join()` until the `daemon=False` telemetry worker POSTs every scenario event (30s httpx timeout, drained serially). Teardown cost ≈ `events × up-to-30s` whenever the LangWatch endpoint is reachable-but-slow; multi-turn voice demos emit the most events (one/turn + base64 audio) so their teardown is the most exposed. Chose to document + isolate rather than re-architect the shared event bus: that change has a high blast radius, and the demos need live creds so a fix can't be verified in python-ci anyway. Bounding the drain is recommended as a separate SDK-wide follow-up.
- Removed the 6 `@pytest.mark.skip` markers; tagged the demos `@pytest.mark.voice_multiturn` (registered in `pytest.ini`). `voice-integration.yml` deselects them from the two bulk single-process steps and runs each in its own pytest process (set discovered by marker, so it can't drift). A fresh process per demo neutralises the telemetry drain and the documented adapter task/subprocess leaks at once.
- `TESTING.md` gets the confirmed root cause + the supported marker mechanism; `specs/voice-agents.feature` gets a README block (comment-only; the 127-scenario contract is unchanged).
- The VAD rate-limit test now pins that the dedupe key is the caller-passed `adapter_name` string, not `type(self)`.

Test plan
- `pytest tests/voice/test_vad.py` → 7 passed (incl. new `test_vad_fallback_rate_limit_keys_on_name_string_not_python_class`).
- `pytest tests/voice/test_feature_file_contract.py` → 5 passed (127-scenario contract intact after the feature-file comment block).
- `pytest -m voice_multiturn --collect-only` → exactly the 6 demos; `pytest -m "integration and not voice_multiturn" --collect-only` → the other e2e, none of the 6; no `@pytest.mark.skip` left on the 6; marker registered (no `PytestUnknownMarkWarning`).
- `CI=true pytest tests/ -m "not integration"` → 980 passed / 18 skipped. The only 3 failures are pre-existing local-aarch64 `test_red_team_agent` mock failures (`Mock has no attribute 'messages'`). They are unrelated to this diff (it touches no `scenario/` source) and reproduce identically with origin/main's `pytest.ini`.

How I can prove I was successful
No playable artifact: this is pure test/CI infra. The root cause is demonstrated by a creds-free repro: point `LANGWATCH_ENDPOINT` at a socket that accepts-but-never-responds and run any `scenario.run()`. The captured hung-thread traceback (worker thread in `socket.recv_into`, calling thread in `event_bus.drain()` → `queue.join()` → `all_tasks_done.wait()`) is attached to issue #491. The green unit suite + marker-split commands above are the rest of the demonstration.

Human verification
This is a backend-only change: voice recv-loop / `scenario.run()` teardown logic plus test-suite isolation in CI. There is no UI surface: no screen, page, or rendered artifact to inspect, so live-app visual proof does not apply here.

backend-only, no UI surface.

To confirm by hand (all creds-free):
- Point `LANGWATCH_ENDPOINT` at a socket that accepts but never responds (e.g. `nc -lk 127.0.0.1 9999`), then run any `scenario.run()`. Without isolation the calling thread hangs in `event_bus.drain()` → `queue.join()`; under the per-process split each demo runs in a fresh process so the hung drain can't wedge the shared suite.
- `pytest -m voice_multiturn --collect-only` → exactly the 6 multi-turn demos; `pytest -m "integration and not voice_multiturn" --collect-only` → the rest, none of the 6. No `@pytest.mark.skip` remains on the 6.
- `pytest tests/voice/test_vad.py` → 7 passed; `pytest tests/voice/test_feature_file_contract.py` → 5 passed.

Anything surprising?
- `@pytest.mark.skip` e2e wrappers. The other 4 skips are `skipif(CI=='true')` ("scenario.run hangs in python-ci; fine locally"), a separate, pre-existing family (very likely the same drain root cause) left out of scope here.
- `voice-integration.yml`, not python-ci, which is why path (b) (isolation) is the complete creds-free resolution rather than (a) (which the issue gates on "don't require live creds in CI").