feat(stt): multi-provider STT with TranscriptionProvider trait#2995
feat(stt): multi-provider STT with TranscriptionProvider trait#2995rareba wants to merge 3 commits intozeroclaw-labs:masterfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. 🗂️ Base branches to auto review (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note
|
| Cohort / File(s) | Summary |
|---|---|
Multi-Provider Transcription System src/channels/transcription.rs |
Added TranscriptionProvider trait and five provider implementations (Groq, OpenAI Whisper, Deepgram, AssemblyAI, Google), each with from_config() and transcribe() methods. Introduced TranscriptionManager to register, query, and route transcription requests to providers. Added validate_audio() for file validation and normalization. Public transcribe_audio() function provides backward compatibility with Groq as default. |
STT Configuration Schemas src/config/schema.rs |
Extended TranscriptionConfig with default_provider field and optional provider-specific configs (openai, deepgram, assemblyai, google). Added new public structs: OpenAiSttConfig, DeepgramSttConfig, AssemblyAiSttConfig, GoogleSttConfig with provider-specific settings. Added default functions for provider and model defaults. |
Config Module Exports src/config/mod.rs |
Updated public re-exports to include new STT provider configuration types (OpenAiSttConfig, DeepgramSttConfig, AssemblyAiSttConfig, GoogleSttConfig). |
Sequence Diagram(s)
sequenceDiagram
actor Client
participant TranscriptionManager
participant Provider A as OpenAI Whisper
participant Provider B as Deepgram
participant External API
Client->>TranscriptionManager: transcribe(audio_data, filename)
TranscriptionManager->>TranscriptionManager: Validate audio format/size
alt Use Default Provider
TranscriptionManager->>Provider A: transcribe(audio_data, filename)
else Use Specific Provider
TranscriptionManager->>Provider B: transcribe_with_provider(audio_data, filename, "deepgram")
end
Provider A->>External API: POST audio + metadata
External API-->>Provider A: JSON response
Provider A->>TranscriptionManager: Return transcribed text
TranscriptionManager-->>Client: Transcription result
sequenceDiagram
participant App
participant GroqProvider
participant GroqAPI as Groq API
participant AudioFile as File System
App->>GroqProvider: from_config(config)
Note over GroqProvider: Resolve API key from config or env
App->>GroqProvider: transcribe(audio_data, filename)
GroqProvider->>AudioFile: Validate file extension & size
GroqProvider->>GroqAPI: Multipart request (audio + metadata)
GroqAPI-->>GroqProvider: Whisper JSON response
GroqProvider->>GroqProvider: parse_whisper_response()
GroqProvider-->>App: Transcribed text string
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
- feat: support config-level api_key for transcription #2112 — Updates transcription credential resolution flow and adds config-level API key handling with environment variable fallbacks.
- fix(discord): transcribe inbound audio attachments #2700 — Integrates Discord channel code with the transcribe_audio public API and TranscriptionConfig.
Suggested labels
size: XL, risk: high, provider, config: core, channel: transcription, tests, core
Suggested reviewers
- theonlyhennygod
- chumyin
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: introducing a multi-provider STT architecture with TranscriptionProvider trait. |
| Description check | ✅ Passed | The PR description provides a clear summary, lists changed files with details, confirms backward compatibility, includes test results, and documents risk/rollback. Most required template sections are addressed substantively. |
| Linked Issues check | ✅ Passed | The PR meets core coding requirements from #2989: implements TranscriptionProvider trait with async transcribe and name(), provides five provider implementations (Groq, OpenAI, Deepgram, AssemblyAI, Google), adds TranscriptionManager for routing, and extends config with provider selection. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly aligned with the issue objectives: provider trait, five implementations, manager, config extension, and helper functions. The sync_directory fix is a pre-existing bug fix within scope of the config refactoring. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Post copyable unit tests in a comment
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
src/channels/transcription.rs (1)
90-539: Consider moving the concrete STT providers intosrc/providers/.These sections put five provider-specific HTTP/JSON integrations under
src/channels/. Keeping only the routing/entrypoint here and moving concrete backends intosrc/providers/would keep the channel boundary smaller as more providers land.As per coding guidelines, "Keep module responsibilities single-purpose: orchestration in
agent/, transport inchannels/, model I/O inproviders/."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/channels/transcription.rs` around lines 90 - 539, The channel module currently contains five concrete STT providers (structs GroqProvider, OpenAiWhisperProvider, DeepgramProvider, AssemblyAiProvider, GoogleSttProvider and their impls of TranscriptionProvider including methods from_config, transcribe, name, supported_formats) which should be moved into a providers module to keep channels focused on transport/orchestration. Create a new providers module for these concrete implementations and move the provider structs and their TranscriptionProvider impls there; keep the TranscriptionProvider trait, validate_audio, parse_whisper_response, and any shared helpers in the channels module (or make them public) so the moved providers can call validate_audio, parse_whisper_response, and crate::config::build_runtime_proxy_client unchanged; update imports/uses in the channel to re-export or construct providers via their from_config constructors and ensure method names (from_config, transcribe, name, supported_formats) and referenced symbols (e.g., TRANSCRIPTION_TIMEOUT_SECS, parse_whisper_response) remain reachable after the move.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/channels/transcription.rs`:
- Around line 468-492: The transcribe() function currently ignores the
normalized filename returned by validate_audio(); change the let (_, _) =
validate_audio(...) call to capture the normalized filename (e.g., let (_,
normalized_name) = validate_audio(...)?), use normalized_name when extracting
the extension to set the encoding variable, and ensure you validate that the
resulting extension is one of supported_formats() (return an error if not) so
formats like .oga, .mp4, .m4a, .mpga don't fall through to
"ENCODING_UNSPECIFIED".
- Around line 581-617: The constructor pub fn new(config: &TranscriptionConfig)
currently swallows provider init errors (calls like GroqProvider::from_config,
OpenAiWhisperProvider::from_config, DeepgramProvider::from_config,
AssemblyAiProvider::from_config, GoogleSttProvider::from_config) and still
returns Ok with config.default_provider set, which hides root causes when the
chosen default failed to initialize; change the logic to surface errors: capture
and propagate the concrete error when a provider factory fails (or at minimum
record failures and, after attempting all providers, if
config.default_provider.is_some() but providers does not contain that key,
return an Err/bail! with a clear message that includes the original provider
error(s)); ensure the new() function returns an error instead of silently
succeeding whenever the configured default provider failed to initialize.
- Around line 392-433: The poll loop using poll_interval, max_polls and a
per-request timeout can exceed the intended ~3 minute cap because each .get()
can await up to timeout; fix by enforcing a total deadline (e.g., record start =
Instant::now() and check elapsed against Duration::from_secs(180) or wrap the
whole polling loop in tokio::time::timeout) and break with a timeout error when
exceeded, and also stop treating non-2xx responses as silent "unknown" — inspect
poll_resp.status(), propagate non-success statuses with the response body or
status code instead of using poll_body["status"].as_str().unwrap_or("unknown"),
and ensure error branches (the "error" match and non-2xx cases) return
informative messages so polling does not continue indefinitely.
In `@src/config/schema.rs`:
- Around line 696-727: Add validation for the transcription section inside
Config::validate(): inspect self.transcription.default_provider (trimmed) and
match it against the allowed values
("groq","openai","deepgram","assemblyai","google"); for each provider require
the corresponding fields are present and non-empty (e.g., for "groq" ensure
self.transcription.model is non-empty, for "openai" ensure
self.transcription.openai is Some and openai.model is non-empty, for "deepgram"
require self.transcription.deepgram Some and necessary fields non-empty, for
"google" require self.transcription.google Some and google.language_code
non-empty, assemblyai can just be allowed) and call anyhow::bail! with clear
messages on missing/empty values or on an unsupported default_provider value so
config loading fails fast.
- Around line 748-787: The STT provider api_key fields (OpenAiSttConfig.api_key,
DeepgramSttConfig.api_key, AssemblyAiSttConfig.api_key, GoogleSttConfig.api_key)
are not wired into the secret-store flow; update the config load/save wiring to
call decrypt_optional_secret(&store, &mut <provider>.api_key,
"config.transcription.<provider>.api_key") during load and
encrypt_optional_secret(&store, &<provider>.api_key,
"config.transcription.<provider>.api_key") during save for each optional
provider present (config.transcription.openai, .deepgram, .assemblyai, .google)
using the same pattern as the existing transcription.api_key handling; add a
regression test that saves a config with nested STT keys and asserts they are
encrypted on disk and correctly decrypted on reload.
---
Nitpick comments:
In `@src/channels/transcription.rs`:
- Around line 90-539: The channel module currently contains five concrete STT
providers (structs GroqProvider, OpenAiWhisperProvider, DeepgramProvider,
AssemblyAiProvider, GoogleSttProvider and their impls of TranscriptionProvider
including methods from_config, transcribe, name, supported_formats) which should
be moved into a providers module to keep channels focused on
transport/orchestration. Create a new providers module for these concrete
implementations and move the provider structs and their TranscriptionProvider
impls there; keep the TranscriptionProvider trait, validate_audio,
parse_whisper_response, and any shared helpers in the channels module (or make
them public) so the moved providers can call validate_audio,
parse_whisper_response, and crate::config::build_runtime_proxy_client unchanged;
update imports/uses in the channel to re-export or construct providers via their
from_config constructors and ensure method names (from_config, transcribe, name,
supported_formats) and referenced symbols (e.g., TRANSCRIPTION_TIMEOUT_SECS,
parse_whisper_response) remain reachable after the move.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5e12dca1-c69d-4e51-b77b-7918656c3fdf
📒 Files selected for processing (3)
src/channels/transcription.rssrc/config/mod.rssrc/config/schema.rs
| pub fn new(config: &TranscriptionConfig) -> Result<Self> { | ||
| let mut providers: HashMap<String, Box<dyn TranscriptionProvider>> = HashMap::new(); | ||
|
|
||
| if let Ok(groq) = GroqProvider::from_config(config) { | ||
| providers.insert("groq".to_string(), Box::new(groq)); | ||
| } | ||
|
|
||
| if let Some(ref openai_cfg) = config.openai { | ||
| if let Ok(p) = OpenAiWhisperProvider::from_config(openai_cfg) { | ||
| providers.insert("openai".to_string(), Box::new(p)); | ||
| } | ||
| } | ||
|
|
||
| if let Some(ref deepgram_cfg) = config.deepgram { | ||
| if let Ok(p) = DeepgramProvider::from_config(deepgram_cfg) { | ||
| providers.insert("deepgram".to_string(), Box::new(p)); | ||
| } | ||
| } | ||
|
|
||
| if let Some(ref assemblyai_cfg) = config.assemblyai { | ||
| if let Ok(p) = AssemblyAiProvider::from_config(assemblyai_cfg) { | ||
| providers.insert("assemblyai".to_string(), Box::new(p)); | ||
| } | ||
| } | ||
|
|
||
| if let Some(ref google_cfg) = config.google { | ||
| if let Ok(p) = GoogleSttProvider::from_config(google_cfg) { | ||
| providers.insert("google".to_string(), Box::new(p)); | ||
| } | ||
| } | ||
|
|
||
| let default_provider = config.default_provider.clone(); | ||
|
|
||
| Ok(Self { | ||
| providers, | ||
| default_provider, | ||
| }) |
There was a problem hiding this comment.
Don’t drop the root-cause error for the selected default provider.
Lines 584-609 silently discard provider init failures, and Line 612 still stores config.default_provider. With transcription enabled and a missing key for the chosen default, TranscriptionManager::new() succeeds and the first transcribe() call only says “not configured”.
🔧 Suggested fix
let default_provider = config.default_provider.clone();
+
+ if config.enabled && !providers.contains_key(&default_provider) {
+ let available: Vec<&str> = providers.keys().map(|k| k.as_str()).collect();
+ bail!(
+ "Default transcription provider '{}' is not configured. Available: {available:?}",
+ default_provider
+ );
+ }
Ok(Self {
providers,
default_provider,
})As per coding guidelines, "Prefer explicit bail!/errors for unsupported or unsafe states; never silently broaden permissions/capabilities; document fallback behavior when fallback is intentional and safe."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/channels/transcription.rs` around lines 581 - 617, The constructor pub fn
new(config: &TranscriptionConfig) currently swallows provider init errors (calls
like GroqProvider::from_config, OpenAiWhisperProvider::from_config,
DeepgramProvider::from_config, AssemblyAiProvider::from_config,
GoogleSttProvider::from_config) and still returns Ok with
config.default_provider set, which hides root causes when the chosen default
failed to initialize; change the logic to surface errors: capture and propagate
the concrete error when a provider factory fails (or at minimum record failures
and, after attempting all providers, if config.default_provider.is_some() but
providers does not contain that key, return an Err/bail! with a clear message
that includes the original provider error(s)); ensure the new() function returns
an error instead of silently succeeding whenever the configured default provider
failed to initialize.
| /// Default STT provider: "groq", "openai", "deepgram", "assemblyai", "google". | ||
| #[serde(default = "default_transcription_provider")] | ||
| pub default_provider: String, | ||
| /// API key used for transcription requests (Groq provider). | ||
| /// | ||
| /// If unset, runtime falls back to `GROQ_API_KEY` for backward compatibility. | ||
| #[serde(default)] | ||
| pub api_key: Option<String>, | ||
| /// Whisper API endpoint URL. | ||
| /// Whisper API endpoint URL (Groq provider). | ||
| #[serde(default = "default_transcription_api_url")] | ||
| pub api_url: String, | ||
| /// Whisper model name. | ||
| /// Whisper model name (Groq provider). | ||
| #[serde(default = "default_transcription_model")] | ||
| pub model: String, | ||
| /// Optional language hint (ISO-639-1, e.g. "en", "ru"). | ||
| /// Optional language hint (ISO-639-1, e.g. "en", "ru") for Groq provider. | ||
| #[serde(default)] | ||
| pub language: Option<String>, | ||
| /// Maximum voice duration in seconds (messages longer than this are skipped). | ||
| #[serde(default = "default_transcription_max_duration_secs")] | ||
| pub max_duration_secs: u64, | ||
| /// OpenAI Whisper STT provider configuration. | ||
| #[serde(default)] | ||
| pub openai: Option<OpenAiSttConfig>, | ||
| /// Deepgram STT provider configuration. | ||
| #[serde(default)] | ||
| pub deepgram: Option<DeepgramSttConfig>, | ||
| /// AssemblyAI STT provider configuration. | ||
| #[serde(default)] | ||
| pub assemblyai: Option<AssemblyAiSttConfig>, | ||
| /// Google Cloud Speech-to-Text provider configuration. | ||
| #[serde(default)] | ||
| pub google: Option<GoogleSttConfig>, |
There was a problem hiding this comment.
Validate transcription.default_provider at config load time.
Config::validate() never inspects the transcription section, so typos like "deppgram" or an empty model in the selected provider block are accepted until the first audio message hits the STT path. This should fail fast during config validation instead of surfacing as a delayed runtime error.
🧪 Suggested validation shape
match self.transcription.default_provider.trim() {
"groq" => {
if self.transcription.model.trim().is_empty() {
anyhow::bail!("transcription.model must not be empty when default_provider=groq");
}
}
"openai" => {
let cfg = self
.transcription
.openai
.as_ref()
.ok_or_else(|| anyhow::anyhow!("transcription.openai is required when default_provider=openai"))?;
if cfg.model.trim().is_empty() {
anyhow::bail!("transcription.openai.model must not be empty");
}
}
"deepgram" => { /* same pattern */ }
"assemblyai" => {}
"google" => {
let cfg = self
.transcription
.google
.as_ref()
.ok_or_else(|| anyhow::anyhow!("transcription.google is required when default_provider=google"))?;
if cfg.language_code.trim().is_empty() {
anyhow::bail!("transcription.google.language_code must not be empty");
}
}
other => anyhow::bail!(
"transcription.default_provider must be one of: groq, openai, deepgram, assemblyai, google (got {other})"
),
}As per coding guidelines, "Prefer explicit bail!/errors for unsupported or unsafe states; keep error paths obvious and localized."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config/schema.rs` around lines 696 - 727, Add validation for the
transcription section inside Config::validate(): inspect
self.transcription.default_provider (trimmed) and match it against the allowed
values ("groq","openai","deepgram","assemblyai","google"); for each provider
require the corresponding fields are present and non-empty (e.g., for "groq"
ensure self.transcription.model is non-empty, for "openai" ensure
self.transcription.openai is Some and openai.model is non-empty, for "deepgram"
require self.transcription.deepgram Some and necessary fields non-empty, for
"google" require self.transcription.google Some and google.language_code
non-empty, assemblyai can just be allowed) and call anyhow::bail! with clear
messages on missing/empty values or on an unsupported default_provider value so
config loading fails fast.
|
Thank you for the contribution! We appreciate the effort. We are closing this PR because it introduces a standalone module with no integration into the existing codebase — the new types/functions are not called by any existing code path. ZeroClaw follows a trait-driven architecture where new features must be wired through factory registration and have at least one active caller. To reopen, please:
We welcome smaller, integrated contributions. See CLAUDE.md §7 for playbooks on adding providers, channels, tools, and peripherals. |
|
Review: This PR introduces a standalone module that is not wired into any existing code path. The new types/functions have no callers in the codebase. To make this mergeable:
See CLAUDE.md §7 for playbooks. Recommend closing and resubmitting as smaller, integrated PRs. |
94993a5 to
742c95f
Compare
|
Rebased onto current master. The PR diff is now 3 files / 936 lines (was 937 files / 198K lines due to divergent base). What changed:
|
4a04704 to
7877a22
Compare
Refactors single-endpoint transcription to support multiple providers: Groq (existing), OpenAI Whisper, Deepgram, AssemblyAI, and Google Cloud Speech-to-Text. Adds TranscriptionManager for provider routing with backward-compatible config fields.
5bf6523 to
184df98
Compare
|
Superseded: reopening from |
Summary
masterTranscriptionProvidertrait. Implemented five STT providers: Groq (default, existing), OpenAI Whisper, Deepgram, AssemblyAI, and Google Cloud Speech-to-Text. AddedTranscriptionManagerfor provider routing and thetranscribe_with_provider()method for explicit provider selection. Maintains full backward compatibility.transcribe_audio()function signature unchanged. Existing config fields (api_url,model,api_key) and credential resolution (GROQ_API_KEYenv fallback) preserved. Callers in telegram.rs, discord.rs, whatsapp_web.rs require no changes.Files changed
src/channels/transcription.rs: AddTranscriptionProvidertrait, five provider implementations,TranscriptionManager, sharedvalidate_audio()helper, andparse_whisper_response()utilitysrc/config/schema.rs: ExtendTranscriptionConfigwithdefault_providerand optional sub-configs (OpenAiSttConfig,DeepgramSttConfig,AssemblyAiSttConfig,GoogleSttConfig); fix pre-existingsync_directoryasync/sync mismatch on non-unix platformssrc/config/mod.rs: Export new STT config typesLabel Snapshot (required)
risk: mediumsize: Lchannel,configchannel: transcriptionChange Metadata
featurechannelLinked Issue
Supersede Attribution (required when
Supersedes #is used)N/A
Validation Evidence (required)
Commands and result summary:
Security Impact (required)
Yes, describe risk and mitigation: Each provider's API key is optional and config-gated. Provider sub-configs default toNone. Audio validation occurs before any network call. Existing Groq credential resolution unchanged.Privacy and Data Hygiene (required)
passCompatibility / Migration
default_providerfield and provider sub-configs in[transcription]section (all default to None/Groq)i18n Follow-Through (required when docs or user-facing wording changes)
Human Verification (required)
Side Effects / Blast Radius (required)
transcribe_audio()functionAgent Collaboration Notes (recommended)
Rollback Plan (required)
git revert <commit>default_providerdefaults to Groq; reverting preserves existing behaviorRisks and Mitigations
serde(default)— existing configs parse without changesSummary by CodeRabbit
Release Notes