Skip to content

feat(realtime): Semantic VAD EOU token#10444

Open
richiejp wants to merge 2 commits into
mudler:masterfrom
richiejp:feat/realtime-semantic-vad-eou
Open

feat(realtime): Semantic VAD EOU token#10444
richiejp wants to merge 2 commits into
mudler:masterfrom
richiejp:feat/realtime-semantic-vad-eou

Conversation

@richiejp

Copy link
Copy Markdown
Collaborator

Description

Use the EOU token from Parakeet transciption to implement semantic VAD.

Notes for Reviewers

  • feat(grpc): add AudioTranscriptionLive bidirectional RPC and TranscriptResult.eou
  • feat(grpc): wire AudioTranscriptionLive through client, server, base and embed
  • feat(parakeet-cpp): live transcription RPC with per-call engine locking
  • feat(core): live transcription session wrapper and pipeline turn_detection config
  • feat(realtime): EOU-driven semantic_vad turn detection with retranscribe gate
  • fix(traces): make audio clips playable — blob URLs and a live body cap
  • feat(realtime): commit immediately on EOU, drop the extra 0.3s silence window
  • fix(realtime): stop cutting the start of utterances on VAD buffer clears
  • feat(trace): record a model_load trace for every successful backend load
  • feat(realtime): per-turn live transcription traces and commit timing telemetry
  • fix(grpc): release bidi stream conns on terminal Recv; PCM/lookup cleanups
  • feat(parakeet-cpp): bump to parakeet.cpp ABI v5 — incremental mel and EOU/EOB split
  • feat(realtime): live input captions — stream transcription deltas while the user speaks
  • fix(realtime): stop dropping the EOU token to the eagerness timeout
  • feat(realtime): synthesize clauses off the LLM token callback
  • fix(ui): show the realtime pipeline components as a vertical list

Signed commits

  • Yes, I signed my commits.

@richiejp richiejp force-pushed the feat/realtime-semantic-vad-eou branch 2 times, most recently from 447708f to 772bdb3 Compare June 23, 2026 14:11
Comment thread backend/backend.proto
// resets itself after each EOU). Closing the send side finalizes: the
// backend flushes the decoder tail and emits a terminal message carrying
// final_result. A second Config mid-stream resets the decode session.
rpc AudioTranscriptionLive(stream TranscriptLiveRequest) returns (stream TranscriptLiveResponse) {}

@mudler mudler Jun 24, 2026

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if doing tihs uplift, I think would make sense to then deprecate rpc AudioTranscriptionStream(TranscriptRequest) returns (stream TranscriptStreamResponse) {} above, and since at it re-wire the backends to use AudioTranscriptionLive directly. Mainly to avoid code/logic dup split across where it could be unified in a single place

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most transcription backends can't really stream input. The Stream RPC only really means streaming the output and on live both are streamed. So to only use the live RPC you have to buffer input on each backend that doesn't support input streaming. Which would also mean that backends that don't really support bidirectional streaming would implement the RPC instead of returning unsupported which IMO is bad for UX.

I was very confused by what was going on in this code though so have made some changes. Including adding explicit state machines that are formally verified...

Add a `semantic_vad` turn-detection mode to the realtime API that feeds
the transcription model live and decides "the user finished speaking"
from the `<EOU>` end-of-utterance token rather than from silence alone.
When EOU fires the turn commits immediately (~0.3s); otherwise it falls
back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).

Plumbing, bottom to top:

- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof,
  mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus
  `TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream,
  modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking
  (one live stream per turn, finalize+free at commit); bump parakeet.cpp
  to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel
  recompute that delayed EOU on long turns) and the <EOU>/<EOB> split;
  strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline
  `turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as
  transcription deltas while the user speaks, EOU-immediate commit with
  eagerness fallback, optional retranscribe gate (batch re-decode must
  also end in <EOU> to confirm), clause synthesis off the LLM token
  callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.

Docs and tests included; opt-in via the pipeline YAML or per-session
`session.update`. Non-streaming STT backends degrade to silence-only.

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp richiejp force-pushed the feat/realtime-semantic-vad-eou branch from 0bc478f to 45ae61e Compare June 29, 2026 11:10
…streaming driver

The realtime API had several implicit state machines whose state was inferred
from scattered booleans, channels, and five separate mutexes, leaving
illegal/inconsistent states reachable. Make them explicit and keep the
implementation in step with a formal design; rework the parakeet streaming
backend along the same lines.

Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect
with a total, pure Next(state,event)->(state,[]effect) behind a single-writer
Coordinator:

  M1 conncoord    connection lifecycle: VAD toggle + once-only teardown
                  (replaces vadServerStarted + a `done` channel closed from
                  two sites).
  M2 turncoord    turn detection: collapses speechStarted and the live-stream
                  "turn open" flag into one state, so discardTurn can no longer
                  desync them and suppress the next onset.
  M3 respcoord    response coordination: serializes the dual-writer
                  start/cancel so at most one response is live; one
                  response.done per response.create.
  M4 compactcoord conversation compaction: single-flight (replaces the
                  `compacting atomic.Bool` CAS).
  M5 ttscoord     TTS pipeline: open->closing->closed, idempotent wait(),
                  rejects enqueue-after-close (was a silent drop).

The Coordinator/Sink/Next plumbing — only the sealed types and Next differed
per machine — is extracted once into core/http/endpoints/openai/coordinator as
a generic Coordinator[S,E,F]; each machine keeps its public API via type
aliases, so no sink, call-site, or test moved.

Hierarchy. session_lifecycle.fizz models M1 as the parent region with its
children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn
torn => all children terminal, none start after teardown). respcoord and
compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's
teardown drives the children terminal. This closes a compaction teardown gap: a
fire-and-forget compaction could outlive a torn session — compactionSink now
takes a session-scoped cancellable context + WaitGroup and joins the in-flight
summarize+evict on shutdown.

Formal verification. formal-verification/ holds one authoritative FizzBee spec
per machine plus the composition spec, each with an always-assertion and a
documented one-line edit that makes the checker fail (verified non-vacuous).
scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under
-race AND a model-check of every .fizz spec; a missing FizzBee is a hard error
(only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI).
FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into
.tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow,
and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the
repo's forbidigo lint): transition tables + fixed-seed property walks +
concurrent/-race specs, no rapid dependency. Design map:
docs/design/realtime-state-machines.md.

Parakeet streaming backend. The same treatment applied to the parakeet-cpp
streaming paths:
- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models
  instead of decoding offline and emitting it as one delta + final. A client
  that asked for streaming learns the model cannot stream rather than receiving
  a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported
  carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it
  as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance
  latch, replacing three open-coded finalEou toggles. Modelled as a two-valued
  type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) +
  feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail.
  The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed
  {delta,eou,eob,words} the realtime turn detector consumes and a terminal
  FinalResult carrying only Text. Segments/duration/eou are offline-only and no
  longer produced (nor read) on the live path; liveTraceState drops the terminal
  eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function;
  streamSegmenter is generalized to the unified event with a text-only fallback
  that preserves the legacy (no-words) library's per-utterance segmentation.

Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and
parakeet packages under -race, the fail-closed conformance gate green, and
make test-realtime (12 e2e WS+WebRTC).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
@richiejp richiejp force-pushed the feat/realtime-semantic-vad-eou branch from 45ae61e to eb1514d Compare June 29, 2026 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants