Skip to content

feat(whatsapp-web): add voice message transcription support#2920

Closed
rareba wants to merge 1 commit intozeroclaw-labs:masterfrom
rareba:feature/whatsapp-web-media-support
Closed

feat(whatsapp-web): add voice message transcription support#2920
rareba wants to merge 1 commit intozeroclaw-labs:masterfrom
rareba:feature/whatsapp-web-media-support

Conversation

@rareba
Copy link
Contributor

@rareba rareba commented Mar 6, 2026

Summary

  • Base branch target: master
  • Problem: WhatsApp voice notes (audio messages with ptt=true) were silently dropped because text_content() returns empty for audio messages, hitting the trimmed.is_empty() guard
  • Why it matters: Users sending voice messages via WhatsApp got no response from the agent
  • What changed: Added audio message detection, download via Client::download(), and transcription via the existing Whisper API pipeline (shared with Telegram channel). Wired TranscriptionConfig into WhatsAppWebChannel via builder pattern (matching Telegram channel's approach)
  • What did not change (scope boundary): No changes to Telegram transcription, no changes to audio format handling, no changes to existing text message flow

Files changed

  • src/channels/whatsapp_web.rs: Added transcription field, with_transcription() builder, audio message handling in Event::Message with duration limit, download, and transcription
  • src/channels/mod.rs: Wired .with_transcription(config.transcription.clone()) in WhatsApp Web factory

Label Snapshot (required)

  • Risk label: risk: medium
  • Size label: size: S
  • Scope labels: channel
  • Module labels: channel: whatsapp-web
  • Contributor tier label: (auto-managed)
  • If any auto-label is incorrect, note requested correction: N/A

Change Metadata

  • Change type: feature
  • Primary scope: channel

Linked Issue

Supersede Attribution (required when Supersedes # is used)

N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check   # passes
cargo clippy --features whatsapp-web   # no new warnings (3 pre-existing warnings unchanged)
cargo test   # all 14 whatsapp_web tests pass (12 existing + 2 new)
  • Evidence provided: unit test results
  • If any command is intentionally skipped, explain why: N/A

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No (reuses existing Whisper API pipeline)
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Privacy and Data Hygiene (required)

  • Data-hygiene status: pass
  • Redaction/anonymization notes: Audio data is transient, passed to existing transcription pipeline
  • Neutral wording confirmation: Confirmed

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No (reuses existing [transcription] config section)
  • Migration needed? No

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? No — code changes only

Human Verification (required)

  • Verified scenarios: Unit tests for transcription config wiring, audio message detection
  • Edge cases checked: Empty audio, duration limit enforcement
  • What was not verified: Manual end-to-end test with live WhatsApp voice note (marked as TODO in original test plan)

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: WhatsApp Web channel message handling
  • Potential unintended effects: None — audio handling is additive, existing text flow unchanged
  • Guardrails/monitoring for early detection: Duration limit prevents processing excessively long audio

Agent Collaboration Notes (recommended)

  • Agent tools used: Claude Code
  • Workflow/plan summary: Followed existing Telegram channel transcription pattern
  • Verification focus: Builder pattern consistency, test coverage
  • Confirmation: naming + architecture boundaries followed

Rollback Plan (required)

  • Fast rollback command/path: git revert <commit>
  • Feature flags or config toggles: Existing [transcription] enabled = true gates the feature
  • Observable failure symptoms: Voice notes silently dropped (returns to pre-feature behavior)

Risks and Mitigations

  • Risk: Audio download may fail for large files
    • Mitigation: Duration limit enforced before download attempt

Summary by CodeRabbit

  • New Features

    • Added voice transcription support for WhatsApp Web channel, enabling automatic transcription of audio messages to text.
    • Added configurable transcription settings that can be applied when setting up the WhatsApp Web channel.
  • Tests

    • Added unit tests to verify transcription configuration behavior.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 6, 2026

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'tools', 'path_filters', 'review_instructions'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

WhatsApp Web channel gains voice transcription support. The transcription configuration is bound to the channel at creation time via a builder method, enabling downloaded audio messages to be transcribed and forwarded as text instead of being silently dropped.

Changes

Cohort / File(s) Summary
Channel Initialization
src/channels/mod.rs
WhatsAppWebChannel construction now chains with_transcription(config.transcription.clone()) to bind transcription settings at channel creation.
WhatsApp Web Transcription Support
src/channels/whatsapp_web.rs
Adds optional transcription field to WhatsAppWebChannel struct. Introduces feature-gated with_transcription() builder method and updates event handling to download, transcribe, and forward audio messages as text. Includes unit tests for transcription configuration behavior.

Sequence Diagram

sequenceDiagram
    actor User
    participant WhatsAppWeb as WhatsApp Web<br/>Channel
    participant Download as Audio<br/>Download
    participant Transcribe as Transcription<br/>Service
    participant Bot as Bot/Agent

    User->>WhatsAppWeb: Send voice message
    WhatsAppWeb->>WhatsAppWeb: Check for audio content
    WhatsAppWeb->>Download: Download audio file
    Download-->>WhatsAppWeb: Return audio bytes
    WhatsAppWeb->>Transcribe: transcribe_audio(bytes, config)
    Transcribe-->>WhatsAppWeb: Return transcribed text
    WhatsAppWeb->>Bot: Send ChannelMessage(text)
    Bot-->>User: Process transcribed text
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

size: M, risk: medium, channel

Suggested reviewers

  • theonlyhennygod
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main feature addition: adding voice message transcription support to the WhatsApp Web channel.
Description check ✅ Passed The PR description covers the problem, solution, and changes made with a test plan, though it lacks several required template sections like risk labels, scope labels, and backward compatibility details.
Linked Issues check ✅ Passed The PR fully addresses issue #2918 requirements: voice notes are now detected, downloaded, and transcribed using the existing Whisper API, with transcription wired via builder pattern matching Telegram's approach.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #2918: adding voice transcription support to WhatsApp Web. No unrelated or out-of-scope modifications are present.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@willsarg willsarg changed the base branch from main to master March 7, 2026 18:29
@willsarg
Copy link
Collaborator

willsarg commented Mar 7, 2026

Local validation complete on local-master-builder (starting from commit a6102f8, cumulative merge workflow). This PR merged cleanly in sequence and passed: cargo fmt --all -- --check, cargo clippy --all-targets -- -D warnings (no new warnings introduced), and cargo test. Marking this PR as validated and safe-to-merge from local integration testing.

@rikitrader
Copy link

This PR properly integrates transcription into WhatsApp Web via .with_transcription() builder pattern — good approach. However, it may conflict with #3029 which also modifies the transcription system. Please rebase onto main after #3029 is resolved and verify no merge conflicts in whatsapp_web.rs and channels/mod.rs.

@rareba rareba force-pushed the feature/whatsapp-web-media-support branch from c25d411 to b2a880b Compare March 9, 2026 20:55
@rareba
Copy link
Contributor Author

rareba commented Mar 9, 2026

Rebased onto current master. This PR now has a clean 2-file diff (src/channels/whatsapp_web.rs and src/channels/mod.rs).

Re: potential conflict with #3029 — both branches are now based on the same master commit. If #3029 merges first, we'll rebase to resolve any overlap in whatsapp_web.rs.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/channels/whatsapp_web.rs (1)

808-824: Please cover the handler path, not just the builder.

These tests only pin with_transcription(). The risky logic is in Event::Message—allowlist resolution, voice-note filtering, missing-duration handling, download/transcription failures, and successful forwarding—so this change still lacks direct coverage where regressions are most likely.

Based on learnings, "Applies to src/channels/**/*.rs : Implement Channel trait in src/channels/, keep send, listen, health_check, typing semantics consistent, cover auth/allowlist/health behavior with tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/channels/whatsapp_web.rs` around lines 808 - 824, Add tests that exercise
the runtime handler path for message events instead of only the builder: write
unit tests that invoke the Channel implementation's event handling (the
Event::Message path) for the whatsapp_web channel created via make_channel()
with with_transcription(), covering allowlist resolution, voice-note filtering,
missing-duration handling, download/transcription failures, and the
successful-forwarding path; use the Channel trait methods
(send/listen/health_check/typing semantics) or the specific handler function
used by the whatsapp_web implementation to feed synthetic message events and
assert expected outcomes (transcription forwarded when enabled, ignored when
disabled or filtered, proper error handling/logging on download/transcription
failure, and health/auth behavior).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/channels/whatsapp_web.rs`:
- Around line 377-381: The warn/info logs in the WhatsApp Web handler currently
include PII and message content (e.g., logging sender_jid, sender_candidates and
transcription text) — replace those raw values with redacted or categorical data
before logging: update the tracing::warn!/info! calls in the block that
references sender_jid and sender_candidates (and the similar calls around the
section at lines handling transcription bodies) to log only non-sensitive
indicators (e.g., "unknown-sender" or hash/boolean flags like
sender_in_allowed_list and transcription_truncated=true) or counts, remove
transcript body content entirely, and ensure any helper functions (e.g., the
code that builds sender_candidates or the transcription logging path) perform
the redaction so no raw PII or full message bodies are emitted to logs.
- Around line 394-425: The branch handling msg.get_base_message().audio_message
should only process true voice notes and must fail closed when duration is
missing: replace the current
audio.seconds.unwrap_or(0)/audio.ptt.unwrap_or(false) behavior by first checking
audio.ptt == Some(true) and returning/ignoring when not a PTT, then require
audio.seconds to be Some(duration) and log/warn+return if missing (do not treat
as 0), then compare that duration against transcription_config.max_duration_secs
before calling _client.download; update the tracing messages accordingly and use
early returns on unsupported states so callers never download unknown-size
media.

---

Nitpick comments:
In `@src/channels/whatsapp_web.rs`:
- Around line 808-824: Add tests that exercise the runtime handler path for
message events instead of only the builder: write unit tests that invoke the
Channel implementation's event handling (the Event::Message path) for the
whatsapp_web channel created via make_channel() with with_transcription(),
covering allowlist resolution, voice-note filtering, missing-duration handling,
download/transcription failures, and the successful-forwarding path; use the
Channel trait methods (send/listen/health_check/typing semantics) or the
specific handler function used by the whatsapp_web implementation to feed
synthetic message events and assert expected outcomes (transcription forwarded
when enabled, ignored when disabled or filtered, proper error handling/logging
on download/transcription failure, and health/auth behavior).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8483f8a-b790-466f-ba2b-eb31617395d5

📥 Commits

Reviewing files that changed from the base of the PR and between f7fefd4 and b2a880b.

📒 Files selected for processing (2)
  • src/channels/mod.rs
  • src/channels/whatsapp_web.rs

Comment on lines +377 to +381
tracing::warn!(
"WhatsApp Web: message from {} not in allowed list (candidates: {:?})",
sender_jid,
sender_candidates
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove raw phone candidates and transcript bodies from logs.

These new logs write sender identifiers and the full transcription text to normal warn/info logs. That is PII plus user message content, so one voice note now leaks directly into application logs.

As per coding guidelines, "Deny-by-default for access and exposure boundaries; never log secrets, raw tokens, or sensitive payloads; keep network/filesystem/shell scope as narrow as possible unless explicitly justified."

Also applies to: 443-447

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/channels/whatsapp_web.rs` around lines 377 - 381, The warn/info logs in
the WhatsApp Web handler currently include PII and message content (e.g.,
logging sender_jid, sender_candidates and transcription text) — replace those
raw values with redacted or categorical data before logging: update the
tracing::warn!/info! calls in the block that references sender_jid and
sender_candidates (and the similar calls around the section at lines handling
transcription bodies) to log only non-sensitive indicators (e.g.,
"unknown-sender" or hash/boolean flags like sender_in_allowed_list and
transcription_truncated=true) or counts, remove transcript body content
entirely, and ensure any helper functions (e.g., the code that builds
sender_candidates or the transcription logging path) perform the redaction so no
raw PII or full message bodies are emitted to logs.

Comment on lines +394 to +425
} else if let Some(ref audio) = msg.get_base_message().audio_message {
// Voice note / audio message — try transcription
let duration = audio.seconds.unwrap_or(0);
tracing::info!(
"WhatsApp Web audio from {} in {} ({}s, ptt={})",
sender, chat, duration, audio.ptt.unwrap_or(false)
);

let config = match transcription_config.as_ref() {
Some(c) => c,
None => {
tracing::debug!(
"WhatsApp Web: transcription disabled, ignoring audio from {}",
normalized
);
return;
}
};

if u64::from(duration) > config.max_duration_secs {
tracing::info!(
"WhatsApp Web: skipping audio ({}s > {}s limit)",
duration, config.max_duration_secs
);
return;
}

if let Err(e) = tx_inner
.send(ChannelMessage {
id: uuid::Uuid::new_v4().to_string(),
channel: "whatsapp".to_string(),
sender: normalized.clone(),
// Reply to the originating chat JID (DM or group).
reply_target: chat,
content: trimmed.to_string(),
timestamp: chrono::Utc::now().timestamp() as u64,
thread_ts: None,
})
.await
let audio_data = match _client.download(audio.as_ref()).await {
Ok(d) => d,
Err(e) => {
tracing::warn!("WhatsApp Web: failed to download audio: {e}");
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Gate this branch to actual voice notes and fail closed when duration is missing.

audio_message also covers regular audio attachments, so this currently sends any caption-less audio file through transcription. On top of that, Line 396 turns an unknown duration into 0, which bypasses the pre-download duration guard that src/channels/transcription.rs expects callers to enforce. That broadens the feature beyond the intended ptt=true voice-note case and can pull unknown-size media into memory before any size check runs.

🔧 Suggested fix
-                            } else if let Some(ref audio) = msg.get_base_message().audio_message {
-                                // Voice note / audio message — try transcription
-                                let duration = audio.seconds.unwrap_or(0);
+                            } else if let Some(ref audio) = msg.get_base_message().audio_message {
+                                if !audio.ptt.unwrap_or(false) {
+                                    tracing::debug!(
+                                        "WhatsApp Web: ignoring non-voice audio from {}",
+                                        normalized
+                                    );
+                                    return;
+                                }
+
+                                let Some(duration) = audio.seconds else {
+                                    tracing::warn!(
+                                        "WhatsApp Web: voice note duration missing; skipping download"
+                                    );
+                                    return;
+                                };
                                 tracing::info!(
                                     "WhatsApp Web audio from {} in {} ({}s, ptt={})",
                                     sender, chat, duration, audio.ptt.unwrap_or(false)
                                 );
As per coding guidelines, "Prefer explicit `bail!`/errors for unsupported or unsafe states; never silently broaden permissions/capabilities; document fallback behavior when fallback is intentional and safe."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/channels/whatsapp_web.rs` around lines 394 - 425, The branch handling
msg.get_base_message().audio_message should only process true voice notes and
must fail closed when duration is missing: replace the current
audio.seconds.unwrap_or(0)/audio.ptt.unwrap_or(false) behavior by first checking
audio.ptt == Some(true) and returning/ignoring when not a PTT, then require
audio.seconds to be Some(duration) and log/warn+return if missing (do not treat
as 0), then compare that duration against transcription_config.max_duration_secs
before calling _client.download; update the tracing messages accordingly and use
early returns on unsupported states so callers never download unknown-size
media.

@rareba
Copy link
Contributor Author

rareba commented Mar 12, 2026

Note: this PR needs a manual rebase due to structural conflicts — master has since rewritten the WhatsApp Web event handler with retry/reconnect architecture. The voice transcription additions need to be integrated with the new handler structure. Will rework and re-push.

@rareba rareba force-pushed the feature/whatsapp-web-media-support branch from b25bbba to 505109b Compare March 12, 2026 14:30
@rareba
Copy link
Contributor Author

rareba commented Mar 12, 2026

Reworked from scratch — the previous version accidentally removed the retry/reconnect state machine. This new version starts from current master and adds voice transcription cleanly on top.

Changes (2 files, ~90 lines added):

  • src/channels/whatsapp_web.rs:

    • TranscriptionConfig field + .with_transcription() builder
    • Audio detection via msg.get_base_message().audio_message
    • Duration limit enforcement before download
    • Download via client.download(), MIME type mapping
    • Transcription via existing transcription::transcribe_audio() subsystem
    • Graceful error handling (skip on failure, log warnings)
    • 2 new unit tests
    • Full retry/reconnect/backoff/session-purge logic preserved
  • src/channels/mod.rs: wired .with_transcription() into WhatsApp Web channel creation

cargo fmt, cargo clippy -D warnings pass clean on Linux CI.

Ready for review — could a maintainer approve the CI workflow run and take a look? cc @rikitrader

Adds audio message detection and transcription to WhatsApp Web channel.
Voice messages (PTT) are downloaded, transcribed via the existing
transcription subsystem (Groq Whisper), and delivered as text content.

- TranscriptionConfig field with builder pattern
- Duration limit enforcement before download
- MIME type mapping for audio formats
- Graceful error handling (skip on failure)
- Preserves full retry/reconnect state machine from master
@rareba rareba force-pushed the feature/whatsapp-web-media-support branch from 505109b to 81db398 Compare March 15, 2026 14:40
@rareba
Copy link
Contributor Author

rareba commented Mar 15, 2026

Superseded: reopening from feat/whatsapp-web-media-support branch (corrected prefix per CONTRIBUTING.md).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WhatsApp Web: support audio/media messages (voice notes, images, documents)

3 participants