feat(whatsapp-web): add voice message transcription support #2920
rareba wants to merge 1 commit into zeroclaw-labs:master
| Cohort / File(s) | Summary |
|---|---|
| Channel Initialization `src/channels/mod.rs` | `WhatsAppWebChannel` construction now chains `with_transcription(config.transcription.clone())` to bind transcription settings at channel creation. |
| WhatsApp Web Transcription Support `src/channels/whatsapp_web.rs` | Adds optional `transcription` field to `WhatsAppWebChannel` struct. Introduces feature-gated `with_transcription()` builder method and updates event handling to download, transcribe, and forward audio messages as text. Includes unit tests for transcription configuration behavior. |
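The builder wiring summarized in the table can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `TranscriptionConfig` fields, the `new()` constructor, the `transcription()` accessor, and the choice to store the config only when `enabled` is true are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the project's transcription settings.
#[derive(Clone, Debug, PartialEq)]
pub struct TranscriptionConfig {
    pub enabled: bool,
    pub max_duration_secs: u64,
}

/// Trimmed-down channel holding only the field relevant to this PR.
#[derive(Default)]
pub struct WhatsAppWebChannel {
    transcription: Option<TranscriptionConfig>,
}

impl WhatsAppWebChannel {
    /// Hypothetical constructor; the real channel takes more arguments.
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder step: keep the config only when transcription is enabled
    /// (an assumption here), so later code can treat `None` as "disabled".
    pub fn with_transcription(mut self, config: TranscriptionConfig) -> Self {
        if config.enabled {
            self.transcription = Some(config);
        }
        self
    }

    /// Hypothetical accessor used by the example and its checks.
    pub fn transcription(&self) -> Option<&TranscriptionConfig> {
        self.transcription.as_ref()
    }
}
```

With this shape, a factory can call `WhatsAppWebChannel::new().with_transcription(cfg)` unconditionally and the event handler only has to match on the `Option`.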
Sequence Diagram
sequenceDiagram
actor User
participant WhatsAppWeb as WhatsApp Web<br/>Channel
participant Download as Audio<br/>Download
participant Transcribe as Transcription<br/>Service
participant Bot as Bot/Agent
User->>WhatsAppWeb: Send voice message
WhatsAppWeb->>WhatsAppWeb: Check for audio content
WhatsAppWeb->>Download: Download audio file
Download-->>WhatsAppWeb: Return audio bytes
WhatsAppWeb->>Transcribe: transcribe_audio(bytes, config)
Transcribe-->>WhatsAppWeb: Return transcribed text
WhatsAppWeb->>Bot: Send ChannelMessage(text)
Bot-->>User: Process transcribed text
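As a rough sketch of the flow in the diagram, the download, transcribe, and forward steps can be chained so that any failure skips the message. The example is synchronous and passes the two services in as closures purely for illustration; the real handler is async, and `handle_voice_note` and this reduced `ChannelMessage` are hypothetical names, not the project's API.

```rust
/// Hypothetical, simplified form of the message forwarded to the agent.
#[derive(Debug, PartialEq)]
pub struct ChannelMessage {
    pub sender: String,
    pub content: String,
}

/// Runs download -> transcribe -> forward for one voice note.
/// Returns `None` when a step fails or the transcript is empty,
/// which mirrors the "skip on failure" behavior described in the PR.
pub fn handle_voice_note<D, T>(sender: &str, download: D, transcribe: T) -> Option<ChannelMessage>
where
    D: FnOnce() -> Result<Vec<u8>, String>,
    T: FnOnce(&[u8]) -> Result<String, String>,
{
    // Step 1: fetch the audio bytes; a download error ends processing.
    let bytes = download().ok()?;
    // Step 2: turn the bytes into text; a transcription error ends processing.
    let text = transcribe(&bytes).ok()?;
    // Step 3: forward only non-empty text to the agent.
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return None;
    }
    Some(ChannelMessage {
        sender: sender.to_string(),
        content: trimmed.to_string(),
    })
}
```

Keeping the two services behind closures also makes the failure branches easy to exercise without network access.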
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
- feat(whatsapp-web): supersede #1992 transcription flow [RMN-205] #2192: Both modify the WhatsApp Web channel to add transcription config binding via `with_transcription()` in channel construction.
- fix(discord): transcribe inbound audio attachments #2700: Similar per-channel transcription support pattern added to the Discord channel with identical builder method and configuration approach.
Suggested labels
size: M, risk: medium, channel
Suggested reviewers
- theonlyhennygod
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main feature addition: adding voice message transcription support to the WhatsApp Web channel. |
| Description check | ✅ Passed | The PR description covers the problem, solution, and changes made with a test plan, though it lacks several required template sections like risk labels, scope labels, and backward compatibility details. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #2918 requirements: voice notes are now detected, downloaded, and transcribed using the existing Whisper API, with transcription wired via builder pattern matching Telegram's approach. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #2918: adding voice transcription support to WhatsApp Web. No unrelated or out-of-scope modifications are present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
Local validation complete on local-master-builder (starting from commit a6102f8, cumulative merge workflow). This PR merged cleanly in sequence and passed: cargo fmt --all -- --check, cargo clippy --all-targets -- -D warnings (no new warnings introduced), and cargo test. Marking this PR as validated and safe-to-merge from local integration testing.
This PR properly integrates transcription into WhatsApp Web via |
c25d411 to b2a880b
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/channels/whatsapp_web.rs (1)
808-824: Please cover the handler path, not just the builder.
These tests only pin `with_transcription()`. The risky logic is in `Event::Message`: allowlist resolution, voice-note filtering, missing-duration handling, download/transcription failures, and successful forwarding. This change therefore still lacks direct coverage where regressions are most likely.
Based on learnings, "Applies to src/channels/**/*.rs : Implement `Channel` trait in `src/channels/`, keep `send`, `listen`, `health_check`, typing semantics consistent, cover auth/allowlist/health behavior with tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/channels/whatsapp_web.rs` around lines 808 - 824, Add tests that exercise the runtime handler path for message events instead of only the builder: write unit tests that invoke the Channel implementation's event handling (the Event::Message path) for the whatsapp_web channel created via make_channel() with with_transcription(), covering allowlist resolution, voice-note filtering, missing-duration handling, download/transcription failures, and the successful-forwarding path; use the Channel trait methods (send/listen/health_check/typing semantics) or the specific handler function used by the whatsapp_web implementation to feed synthetic message events and assert expected outcomes (transcription forwarded when enabled, ignored when disabled or filtered, proper error handling/logging on download/transcription failure, and health/auth behavior).
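One way to make the handler path testable, sketched here under assumptions, is to pull the filtering decision out of `Event::Message` into a pure function that needs no client or network. `AudioDecision` and `decide_audio` are hypothetical names; only the `ptt`, `seconds`, and `max_duration_secs` inputs come from the review.

```rust
/// Hypothetical outcome of inspecting an incoming audio message.
#[derive(Debug, PartialEq)]
pub enum AudioDecision {
    Transcribe { duration_secs: u32 },
    SkipNotVoiceNote,
    SkipMissingDuration,
    SkipTooLong,
}

/// Decides what to do with an audio message before any download happens.
/// Unknown `ptt` or unknown duration is treated as "skip" (fail closed).
pub fn decide_audio(ptt: Option<bool>, seconds: Option<u32>, max_duration_secs: u64) -> AudioDecision {
    // Only push-to-talk voice notes are eligible.
    if ptt != Some(true) {
        return AudioDecision::SkipNotVoiceNote;
    }
    // A missing duration must not be read as zero seconds.
    let Some(duration_secs) = seconds else {
        return AudioDecision::SkipMissingDuration;
    };
    // Enforce the configured limit before downloading anything.
    if u64::from(duration_secs) > max_duration_secs {
        return AudioDecision::SkipTooLong;
    }
    AudioDecision::Transcribe { duration_secs }
}
```

Unit tests can then enumerate the voice-note filtering and missing-duration cases directly, leaving only the download and forwarding glue to be covered with synthetic events.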
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d8483f8a-b790-466f-ba2b-eb31617395d5
📒 Files selected for processing (2)
src/channels/mod.rs
src/channels/whatsapp_web.rs
src/channels/whatsapp_web.rs
Outdated
tracing::warn!(
    "WhatsApp Web: message from {} not in allowed list (candidates: {:?})",
    sender_jid,
    sender_candidates
);
Remove raw phone candidates and transcript bodies from logs.
These new logs write sender identifiers and the full transcription text to normal warn/info logs. That is PII plus user message content, so one voice note now leaks directly into application logs.
As per coding guidelines, "Deny-by-default for access and exposure boundaries; never log secrets, raw tokens, or sensitive payloads; keep network/filesystem/shell scope as narrow as possible unless explicitly justified."
Also applies to: 443-447
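A minimal sketch of the kind of redaction this comment asks for, assuming two hypothetical helpers (`sender_tag` and `transcript_summary`) that are not part of the PR. Note that an unsalted, non-cryptographic hash of a phone number can be brute-forced, so this only keeps raw identifiers out of casual log reads; a keyed hash or a plain "unknown-sender" label would be stronger.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Replaces a raw sender JID with an opaque tag that is stable within one
/// build, so log lines can still be correlated without showing the number.
pub fn sender_tag(sender_jid: &str) -> String {
    let mut hasher = DefaultHasher::new();
    sender_jid.hash(&mut hasher);
    format!("sender#{:016x}", hasher.finish())
}

/// Describes a transcript by its length only; the body never reaches the log.
pub fn transcript_summary(text: &str) -> String {
    format!("transcript_chars={}", text.chars().count())
}
```

A log call would then pass `sender_tag(&sender_jid)` and `transcript_summary(&text)` instead of the raw values.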
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/channels/whatsapp_web.rs` around lines 377 - 381, The warn/info logs in
the WhatsApp Web handler currently include PII and message content (e.g.,
logging sender_jid, sender_candidates and transcription text) — replace those
raw values with redacted or categorical data before logging: update the
tracing::warn!/info! calls in the block that references sender_jid and
sender_candidates (and the similar calls around the section at lines handling
transcription bodies) to log only non-sensitive indicators (e.g.,
"unknown-sender" or hash/boolean flags like sender_in_allowed_list and
transcription_truncated=true) or counts, remove transcript body content
entirely, and ensure any helper functions (e.g., the code that builds
sender_candidates or the transcription logging path) perform the redaction so no
raw PII or full message bodies are emitted to logs.
src/channels/whatsapp_web.rs
Outdated
} else if let Some(ref audio) = msg.get_base_message().audio_message {
    // Voice note / audio message - try transcription
    let duration = audio.seconds.unwrap_or(0);
    tracing::info!(
        "WhatsApp Web audio from {} in {} ({}s, ptt={})",
        sender, chat, duration, audio.ptt.unwrap_or(false)
    );

    let config = match transcription_config.as_ref() {
        Some(c) => c,
        None => {
            tracing::debug!(
                "WhatsApp Web: transcription disabled, ignoring audio from {}",
                normalized
            );
            return;
        }
    };

    if u64::from(duration) > config.max_duration_secs {
        tracing::info!(
            "WhatsApp Web: skipping audio ({}s > {}s limit)",
            duration, config.max_duration_secs
        );
        return;
    }

    if let Err(e) = tx_inner
        .send(ChannelMessage {
            id: uuid::Uuid::new_v4().to_string(),
            channel: "whatsapp".to_string(),
            sender: normalized.clone(),
            // Reply to the originating chat JID (DM or group).
            reply_target: chat,
            content: trimmed.to_string(),
            timestamp: chrono::Utc::now().timestamp() as u64,
            thread_ts: None,
        })
        .await
    let audio_data = match _client.download(audio.as_ref()).await {
        Ok(d) => d,
        Err(e) => {
            tracing::warn!("WhatsApp Web: failed to download audio: {e}");
            return;
Gate this branch to actual voice notes and fail closed when duration is missing.
audio_message also covers regular audio attachments, so this currently sends any caption-less audio file through transcription. On top of that, Line 396 turns an unknown duration into 0, which bypasses the pre-download duration guard that src/channels/transcription.rs expects callers to enforce. That broadens the feature beyond the intended ptt=true voice-note case and can pull unknown-size media into memory before any size check runs.
🔧 Suggested fix
- } else if let Some(ref audio) = msg.get_base_message().audio_message {
- // Voice note / audio message — try transcription
- let duration = audio.seconds.unwrap_or(0);
+ } else if let Some(ref audio) = msg.get_base_message().audio_message {
+ if !audio.ptt.unwrap_or(false) {
+ tracing::debug!(
+ "WhatsApp Web: ignoring non-voice audio from {}",
+ normalized
+ );
+ return;
+ }
+
+ let Some(duration) = audio.seconds else {
+ tracing::warn!(
+ "WhatsApp Web: voice note duration missing; skipping download"
+ );
+ return;
+ };
tracing::info!(
"WhatsApp Web audio from {} in {} ({}s, ptt={})",
sender, chat, duration, audio.ptt.unwrap_or(false)
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/channels/whatsapp_web.rs` around lines 394 - 425, The branch handling
msg.get_base_message().audio_message should only process true voice notes and
must fail closed when duration is missing: replace the current
audio.seconds.unwrap_or(0)/audio.ptt.unwrap_or(false) behavior by first checking
audio.ptt == Some(true) and returning/ignoring when not a PTT, then require
audio.seconds to be Some(duration) and log/warn+return if missing (do not treat
as 0), then compare that duration against transcription_config.max_duration_secs
before calling _client.download; update the tracing messages accordingly and use
early returns on unsupported states so callers never download unknown-size
media.
Note: this PR needs a manual rebase due to structural conflicts: master has since rewritten the WhatsApp Web event handler with a retry/reconnect architecture. The voice transcription additions need to be integrated with the new handler structure. Will rework and re-push.
b25bbba to 505109b
Reworked from scratch: the previous version accidentally removed the retry/reconnect state machine. This new version starts from current master and adds voice transcription cleanly on top. Changes (2 files, ~90 lines added):
Ready for review. Could a maintainer approve the CI workflow run and take a look? cc @rikitrader
Adds audio message detection and transcription to WhatsApp Web channel. Voice messages (PTT) are downloaded, transcribed via the existing transcription subsystem (Groq Whisper), and delivered as text content.
- TranscriptionConfig field with builder pattern
- Duration limit enforcement before download
- MIME type mapping for audio formats
- Graceful error handling (skip on failure)
- Preserves full retry/reconnect state machine from master
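The commit message mentions a MIME type mapping for audio formats without showing it. A hedged sketch of what such a mapping could look like follows; the function name `audio_extension` and the exact set of formats are assumptions for illustration, not the PR's table.

```rust
/// Maps an audio MIME type to a file extension that a transcription
/// upload could use. Parameters such as "; codecs=opus" are ignored.
/// Returns `None` for anything outside this illustrative list.
pub fn audio_extension(mime: &str) -> Option<&'static str> {
    // Keep only the "type/subtype" part and normalize its case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        "audio/ogg" => Some("ogg"),
        "audio/mpeg" => Some("mp3"),
        "audio/mp4" => Some("m4a"),
        "audio/wav" => Some("wav"),
        "audio/webm" => Some("webm"),
        _ => None,
    }
}
```

Returning `None` for unknown types lets the caller skip the message rather than guess a format, in line with the skip-on-failure behavior listed above.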
505109b to 81db398
|
Superseded: reopening from |
Summary
master
ptt=true) were silently dropped because `text_content()` returns empty for audio messages, hitting the `trimmed.is_empty()` guard
`Client::download()`, and transcription via the existing Whisper API pipeline (shared with Telegram channel). Wired `TranscriptionConfig` into `WhatsAppWebChannel` via builder pattern (matching Telegram channel's approach)
Files changed
- `src/channels/whatsapp_web.rs`: Added `transcription` field, `with_transcription()` builder, audio message handling in `Event::Message` with duration limit, download, and transcription
- `src/channels/mod.rs`: Wired `.with_transcription(config.transcription.clone())` in WhatsApp Web factory
Label Snapshot (required)
- risk: medium
- size: S
- channel
- channel: whatsapp-web
Change Metadata
- feature
- channel
Linked Issue
Supersede Attribution (required when Supersedes # is used)
N/A
Validation Evidence (required)
Commands and result summary:
Security Impact (required)
Privacy and Data Hygiene (required)
pass
Compatibility / Migration
`[transcription]` config section)
i18n Follow-Through (required when docs or user-facing wording changes)
Human Verification (required)
Side Effects / Blast Radius (required)
Agent Collaboration Notes (recommended)
Rollback Plan (required)
- git revert <commit>
- `[transcription] enabled = true` gates the feature
Risks and Mitigations
Summary by CodeRabbit
New Features
Tests