feat(stt): multi-provider STT with TranscriptionProvider trait#3614

Merged
theonlyhennygod merged 3 commits into zeroclaw-labs:master from rareba:feat/stt-multi-provider on Mar 17, 2026

rareba commented Mar 15, 2026

Supersedes #2995 (branch prefix correction: feature/ -> feat/)

Summary

  • Base branch target: master
  • Problem: Transcription was hardcoded to a single Groq endpoint, with no way to use alternative STT providers
  • Why it matters: Users need flexibility to choose STT providers based on accuracy, cost, or compliance requirements
  • What changed: Refactored single-endpoint Groq transcription into a multi-provider architecture with TranscriptionProvider trait. Implemented five STT providers: Groq (default, existing), OpenAI Whisper, Deepgram, AssemblyAI, and Google Cloud Speech-to-Text. Added TranscriptionManager for provider routing and the transcribe_with_provider() method for explicit provider selection. Maintains full backward compatibility.
  • What did not change (scope boundary): Existing transcribe_audio() function signature unchanged. Existing config fields (api_url, model, api_key) and credential resolution (GROQ_API_KEY env fallback) preserved. Callers in telegram.rs, discord.rs, whatsapp_web.rs require no changes.
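
The trait-plus-manager shape described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: the method signatures, the `StubProvider` type, the `register` method, and the `String` error type are all assumptions made for the example, and the real trait is presumably async and performs HTTP calls. Only the names `TranscriptionProvider`, `TranscriptionManager`, and `transcribe_with_provider` come from the description.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for the PR's `TranscriptionProvider` trait.
/// Assumption: the real trait is async and talks to a remote STT API.
pub trait TranscriptionProvider {
    /// Short identifier used for routing, e.g. "groq".
    fn name(&self) -> &str;
    /// Turn raw audio bytes into text.
    fn transcribe(&self, audio: &[u8], file_name: &str) -> Result<String, String>;
}

/// Stub provider (hypothetical) that describes its input instead of calling an API.
pub struct StubProvider {
    pub name: &'static str,
}

impl TranscriptionProvider for StubProvider {
    fn name(&self) -> &str {
        self.name
    }

    fn transcribe(&self, audio: &[u8], file_name: &str) -> Result<String, String> {
        Ok(format!("[{}] {} ({} bytes)", self.name, file_name, audio.len()))
    }
}

/// Routes a request to a registered provider by name.
pub struct TranscriptionManager {
    providers: HashMap<String, Box<dyn TranscriptionProvider>>,
    default_provider: String,
}

impl TranscriptionManager {
    pub fn new(default_provider: &str) -> Self {
        Self {
            providers: HashMap::new(),
            default_provider: default_provider.to_string(),
        }
    }

    /// Hypothetical registration hook; the key is the provider's own name.
    pub fn register(&mut self, provider: Box<dyn TranscriptionProvider>) {
        self.providers.insert(provider.name().to_string(), provider);
    }

    /// Use the configured default provider.
    pub fn transcribe(&self, audio: &[u8], file_name: &str) -> Result<String, String> {
        self.transcribe_with_provider(audio, file_name, &self.default_provider)
    }

    /// Use an explicitly named provider; unknown names are an error.
    pub fn transcribe_with_provider(
        &self,
        audio: &[u8],
        file_name: &str,
        provider: &str,
    ) -> Result<String, String> {
        let p = self
            .providers
            .get(provider)
            .ok_or_else(|| format!("unknown transcription provider: {}", provider))?;
        p.transcribe(audio, file_name)
    }
}
```

In this sketch, a caller that never names a provider keeps getting the default one, which is one plausible way the unchanged `transcribe_audio()` entry point could stay backward compatible.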

Files changed

  • src/channels/transcription.rs: Add TranscriptionProvider trait, five provider implementations, TranscriptionManager, shared validate_audio() helper, and parse_whisper_response() utility
  • src/config/schema.rs: Extend TranscriptionConfig with default_provider and optional sub-configs (OpenAiSttConfig, DeepgramSttConfig, AssemblyAiSttConfig, GoogleSttConfig); fix pre-existing sync_directory async/sync mismatch on non-unix platforms
  • src/config/mod.rs: Export new STT config types

Label Snapshot (required)

  • Risk label: risk: medium
  • Size label: size: L
  • Scope labels: channel, config
  • Module labels: channel: transcription
  • Contributor tier label: (auto-managed)
  • If any auto-label is incorrect, note requested correction: N/A

Change Metadata

  • Change type: feature
  • Primary scope: channel

Linked Issue

Supersede Attribution (required when Supersedes # is used)

N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check   # clean
cargo check   # passes (only pre-existing clippy warnings in unrelated files remain)
cargo test   # all 20 transcription unit tests pass (existing + new); config default, roundtrip, and without-transcription tests pass
  • Evidence provided: unit test results, config roundtrip tests
  • If any command is intentionally skipped, explain why: CI pipeline validation pending

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? Yes — four new STT provider endpoints (OpenAI, Deepgram, AssemblyAI, Google)
  • Secrets/tokens handling changed? Yes — new API key fields for each provider sub-config
  • File system access scope changed? No
  • If any Yes, describe risk and mitigation: Each provider's API key is optional and config-gated. Provider sub-configs default to None. Audio validation occurs before any network call. Existing Groq credential resolution unchanged.

Privacy and Data Hygiene (required)

  • Data-hygiene status: pass
  • Redaction/anonymization notes: Audio data sent to external STT APIs for transcription — same privacy model as existing Groq path
  • Neutral wording confirmation: Confirmed

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? Yes — new optional default_provider field and provider sub-configs in [transcription] section (all default to None/Groq)
  • Migration needed? No — existing configs without new fields parse correctly and default to Groq
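
The "defaults to None/Groq" behavior can be illustrated with a plain `Default` implementation. This is a sketch under assumptions, not the PR's schema: the real structs use serde (per the Risks section), the field names inside `DeepgramSttConfig` and the `deepgram` field name are guesses, the `api_url` and `model` legacy fields and the other three sub-configs are omitted for brevity, and `resolve_provider` is a hypothetical helper.

```rust
/// Hypothetical per-provider sub-config; the PR's real struct may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct DeepgramSttConfig {
    pub api_key: Option<String>,
}

/// Simplified stand-in for `TranscriptionConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct TranscriptionConfig {
    /// Legacy Groq credential field, kept as before.
    pub api_key: Option<String>,
    /// New field; a config that never sets it falls back to "groq".
    pub default_provider: String,
    /// New optional sub-config; absent unless the user opts in.
    pub deepgram: Option<DeepgramSttConfig>,
}

impl Default for TranscriptionConfig {
    fn default() -> Self {
        Self {
            api_key: None,
            default_provider: "groq".to_string(),
            deepgram: None,
        }
    }
}

impl TranscriptionConfig {
    /// Hypothetical helper: name the provider that would be used, or explain
    /// why the selected one is unavailable.
    pub fn resolve_provider(&self) -> Result<&str, String> {
        match self.default_provider.as_str() {
            "groq" => Ok("groq"),
            "deepgram" if self.deepgram.is_some() => Ok("deepgram"),
            other => Err(format!("provider '{}' is not configured", other)),
        }
    }
}
```

With these defaults, an existing config that sets none of the new fields resolves to Groq, while selecting a provider without supplying its sub-config yields an error rather than a silent fallback. Whether the PR fails or falls back in that case is not stated in the description.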

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? No — code changes only

Human Verification (required)

  • Verified scenarios: Provider trait implementation for all five providers, manager routing, backward-compatible function preservation, config roundtrip
  • Edge cases checked: Audio validation ordering (size/format errors before missing-key errors), missing config defaults to Groq, invalid audio formats
  • What was not verified: Live API calls to non-Groq providers (requires credentials)
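
The validation ordering noted above (size and format errors surface before missing-key errors) might look like the sketch below. Only the name `validate_audio` comes from this PR. The error enum, the 25 MiB limit, the extension list, and the `prepare_request` helper are assumptions chosen for illustration.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum TranscribeError {
    TooLarge { size: usize, max: usize },
    UnsupportedFormat(String),
    MissingApiKey,
}

/// Assumed limit and format list; the PR's actual values are not shown here.
pub const MAX_AUDIO_BYTES: usize = 25 * 1024 * 1024;
pub const SUPPORTED_EXTENSIONS: &[&str] = &["mp3", "ogg", "wav", "m4a", "flac", "webm"];

/// Check size first, then the file extension (case-insensitive).
pub fn validate_audio(audio: &[u8], file_name: &str) -> Result<(), TranscribeError> {
    if audio.len() > MAX_AUDIO_BYTES {
        return Err(TranscribeError::TooLarge {
            size: audio.len(),
            max: MAX_AUDIO_BYTES,
        });
    }
    // A name without a dot yields an empty extension, which is rejected below.
    let ext = file_name
        .rsplit_once('.')
        .map(|(_, e)| e.to_ascii_lowercase())
        .unwrap_or_default();
    if !SUPPORTED_EXTENSIONS.iter().any(|s| *s == ext.as_str()) {
        return Err(TranscribeError::UnsupportedFormat(ext));
    }
    Ok(())
}

/// Hypothetical pre-flight step: validate the audio before looking at
/// credentials, so a bad file is reported even when no key is set.
pub fn prepare_request<'a>(
    audio: &[u8],
    file_name: &str,
    api_key: Option<&'a str>,
) -> Result<&'a str, TranscribeError> {
    validate_audio(audio, file_name)?;
    api_key.ok_or(TranscribeError::MissingApiKey)
}
```

Running validation first means no network call or credential lookup is needed to reject an oversized or unsupported file, which matches the guardrail described under Security Impact.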

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Transcription subsystem, config schema
  • Potential unintended effects: None expected; existing callers use the unchanged transcribe_audio() function
  • Guardrails/monitoring for early detection: Audio validation runs before network calls; provider selection explicit

Agent Collaboration Notes (recommended)

  • Agent tools used: Claude Code
  • Workflow/plan summary: Extracted trait from existing Groq implementation, replicated pattern for four additional providers
  • Verification focus: Backward compatibility, config serde stability, test coverage
  • Confirmation: naming + architecture boundaries followed

Rollback Plan (required)

  • Fast rollback command/path: git revert <commit>
  • Feature flags or config toggles: default_provider defaults to Groq; reverting preserves existing behavior
  • Observable failure symptoms: Non-Groq STT providers unavailable (Groq continues working)

Risks and Mitigations

  • Risk: New provider implementations untested against live APIs
    • Mitigation: Unit tests validate request construction and response parsing; live testing deferred to integration phase
  • Risk: Config schema expansion could break existing config files
    • Mitigation: All new fields have serde(default) — existing configs parse without changes

Summary by CodeRabbit

Release Notes

  • New Features
    • Transcription now supports multiple providers: OpenAI Whisper, Deepgram, AssemblyAI, Google STT, and Groq
    • Configure and select from multiple transcription providers based on your needs
    • Improved audio validation with format normalization support
    • Existing transcription configurations remain fully backward compatible

rareba and others added 3 commits March 17, 2026 00:27
Refactors single-endpoint transcription to support multiple providers:
Groq (existing), OpenAI Whisper, Deepgram, AssemblyAI, and Google Cloud
Speech-to-Text. Adds TranscriptionManager for provider routing with
backward-compatible config fields.
theonlyhennygod force-pushed the feat/stt-multi-provider branch from 8186038 to b5a4b42 on March 17, 2026 04:33
theonlyhennygod merged commit b099728 into zeroclaw-labs:master on Mar 17, 2026
11 checks passed
lantrinh1999 pushed a commit to lantrinh1999/zeroclaw-1 that referenced this pull request Mar 18, 2026
…law-labs#3614)

* feat(stt): add multi-provider STT with TranscriptionProvider trait

Refactors single-endpoint transcription to support multiple providers:
Groq (existing), OpenAI Whisper, Deepgram, AssemblyAI, and Google Cloud
Speech-to-Text. Adds TranscriptionManager for provider routing with
backward-compatible config fields.

* style: fix cargo fmt + clippy violations

* fix: Box::pin large futures and resolve merge conflicts with master

---------

Co-authored-by: argenis de la rosa <theonlyhennygod@gmail.com>