Voice via mobile browser: ASR + TTS on iOS Safari + Android Chrome #896

## Goal
Make voice input + output work reliably in mobile Safari + Chrome browsers — the path users hit when they reach GAIA via the tunnel from their phone. Without this, the mobile-via-tunnel story is text-only.

## Why this matters for consumer adoption
The morning-brief / voice-research / daily-companion use cases all assume voice. Telegram covers async voice (#889 + voice notes). The tunnel covers sync mobile access. **Voice via the tunnel** is the connector — "I'm holding my phone, I open GAIA, I tap mic, I speak" — and mobile browser voice has known platform quirks.

## Scope (single PR, v0.18.2 or v0.19.0)

### A. Audit current state
- [ ] Test the existing Agent UI mic button on:
  - iOS Safari 17+ (mobile)
  - iOS Chrome / Edge (uses WebKit on iOS — same engine, different policies)
  - Android Chrome
  - Android Firefox (different engine)
- [ ] Document what works, what's broken, what's flaky
- [ ] Output: short audit committed to PR description

### B. Known mobile-browser pitfalls to address
- [ ] **MediaRecorder MIME types:** iOS Safari has historically been restricted to `audio/mp4; codecs=mp4a.40.2`; Android Chrome prefers `audio/webm;codecs=opus`. Needs codec detection via `MediaRecorder.isTypeSupported()` plus a fallback.
- [ ] **Microphone permission UX:** the first tap should request permission with a clear in-app explanation; a denied permission must show an actionable recovery path (Settings → Safari → Microphone)
- [ ] **Background tab handling:** recording stops when the tab goes to the background on iOS; document the expected behavior
- [ ] **TTS playback autoplay restrictions:** mobile Safari blocks autoplay without a user gesture. The agent's reply arrives after the tap has finished, so audio playback must be unlocked during the same gesture that sent the message (for example by resuming an `AudioContext` or priming the `<audio>` element inside the tap handler)
- [ ] **Lock-screen / silent mode:** on iOS, audio output can be muted by the silent switch; explicit volume and audio session configuration may help
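
The codec detection and fallback item above could look roughly like the sketch below. It is a minimal illustration, not the Agent UI's actual code: the function name `pickRecorderMimeType`, the candidate list, and its ordering are assumptions to revisit once the audit in section A is done. The support check is injected so the logic can be unit-tested outside a browser; in the browser it would wrap `MediaRecorder.isTypeSupported()`.

```typescript
// Candidate container/codec strings, most preferred first.
// The list and its order are assumptions for illustration only.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus", // typical on Android Chrome
  "audio/mp4;codecs=mp4a.40.2", // AAC-LC in MP4, typical on iOS Safari
  "audio/webm",
  "audio/mp4",
];

// Returns the first candidate the browser reports as recordable, or
// undefined so the caller can let MediaRecorder choose its own default.
function pickRecorderMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  for (const mimeType of candidates) {
    if (isSupported(mimeType)) {
      return mimeType;
    }
  }
  return undefined;
}

// Browser usage (hypothetical call site):
//   const mimeType = pickRecorderMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```

Returning `undefined` instead of throwing keeps recording possible on browsers that support none of the listed strings; the recorder's `mimeType` property can then be read back to learn what was actually chosen.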

### C. Mobile-specific UX improvements
- [ ] Push-to-talk button (long-press to record, release to send) — better than tap-to-start/tap-to-stop on mobile
- [ ] Visible recording indicator + audio level meter
- [ ] Transcript preview before sending (gives confidence the ASR worked)
- [ ] Voice playback controls per agent message (play/pause, speed)
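
One way to keep the push-to-talk behaviour predictable on touch is to model it as a small state machine that the pointer handlers feed. The sketch below is an assumption-laden illustration: the names (`reducePtt`, `PttModel`), the `MIN_HOLD_MS` threshold of 300 ms, and the rule that very short presses are discarded are all hypothetical choices, not existing GAIA behaviour.

```typescript
type PttState = "idle" | "recording" | "sending";

type PttEvent =
  | { type: "press"; atMs: number } // e.g. from pointerdown
  | { type: "release"; atMs: number } // e.g. from pointerup
  | { type: "cancel" } // e.g. from pointercancel or the tab being hidden
  | { type: "sent" }; // upload finished

interface PttModel {
  state: PttState;
  pressedAtMs: number | null;
}

// Assumed threshold: presses shorter than this are treated as accidental.
const MIN_HOLD_MS = 300;

const initialPtt: PttModel = { state: "idle", pressedAtMs: null };

// Pure transition function: given the current model and an event,
// return the next model. Events that are invalid in a state are ignored.
function reducePtt(model: PttModel, event: PttEvent): PttModel {
  switch (event.type) {
    case "press":
      return model.state === "idle"
        ? { state: "recording", pressedAtMs: event.atMs }
        : model;
    case "release": {
      if (model.state !== "recording" || model.pressedAtMs === null) {
        return model;
      }
      const heldMs = event.atMs - model.pressedAtMs;
      // Long enough: send the clip. Too short: drop it and go back to idle.
      return heldMs >= MIN_HOLD_MS
        ? { state: "sending", pressedAtMs: null }
        : initialPtt;
    }
    case "cancel":
      return model.state === "recording" ? initialPtt : model;
    case "sent":
      return model.state === "sending" ? initialPtt : model;
  }
}
```

Keeping the transitions pure means the recording indicator and level meter can be driven from `state` alone, and the same function can be exercised in unit tests without a touch device.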

### D. Backend coordination
- [ ] Confirm Whisper ASR endpoint accepts the codecs mobile browsers actually emit
- [ ] Confirm Kokoro TTS output format works in mobile browsers (`<audio>` element or Web Audio API)
- [ ] Streaming TTS chunks if available (faster perceived latency than wait-for-full-response)
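
When checking which codecs the ASR endpoint accepts, it may help for the client to label each upload with a file extension that matches the recorded container. The sketch below shows one way to derive it from the recorder's MIME type. Whether the Whisper endpoint actually looks at the file name is an assumption to verify; the function names and the `recording` base name are hypothetical.

```typescript
// Maps a recorder MIME type (possibly carrying a codecs parameter) to a
// file extension for the uploaded clip. Unknown types fall back to "bin".
function extensionForMimeType(mimeType: string): string {
  // Drop parameters such as ";codecs=opus" and normalise case.
  const [rawContainer = ""] = mimeType.split(";");
  const container = rawContainer.trim().toLowerCase();
  switch (container) {
    case "audio/webm":
      return "webm";
    case "audio/mp4":
      return "m4a";
    case "audio/ogg":
      return "ogg";
    case "audio/wav":
    case "audio/x-wav":
      return "wav";
    default:
      return "bin";
  }
}

// Builds the file name to attach to the upload (base name is illustrative).
function uploadFileName(mimeType: string, baseName: string = "recording"): string {
  return `${baseName}.${extensionForMimeType(mimeType)}`;
}

// Browser usage (hypothetical call site, field name is an assumption):
//   form.append("file", blob, uploadFileName(recorder.mimeType));
```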

### E. Tests
- [ ] Playwright #883 — add mobile-browser voice scenarios: tap mic → record → assert transcript appears → assert TTS audio element fires `play` event
- [ ] Manual smoke test matrix: 4 browsers × 3 use-case skills
- [ ] Doc: `docs/guides/voice-on-mobile.mdx` with troubleshooting for permission issues

## What this is NOT
- ❌ Not Telegram voice notes (#889 scope expansion)
- ❌ Not a native iOS / Android voice integration
- ❌ Not the broader voice-first parity work in v0.21.0 (#702) — this is specifically the mobile-browser path

## Acceptance criteria
- Tap mic → speak → transcript appears within 3 seconds of the recording ending on iOS Safari + Android Chrome
- Agent voice response plays automatically in response to the tap that sent the message
- Permission rejection shows actionable recovery message
- Push-to-talk works smoothly on touch
- Playwright mobile voice scenarios in #883 pass
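
The permission-recovery criterion above could be met by mapping the error name that `getUserMedia()` rejects with to a platform-specific message. The sketch below is illustrative only: `micRecoveryMessage` and the `Platform` type are hypothetical names, the iOS path comes from section B, and the Android and generic wording are assumptions to confirm during the audit.

```typescript
type Platform = "ios" | "android" | "other";

// Turns the `name` of the error thrown by getUserMedia() into a message
// the user can act on. Unrecognised errors get a generic retry prompt.
function micRecoveryMessage(errorName: string, platform: Platform): string {
  switch (errorName) {
    case "NotAllowedError": // permission denied by the user or by policy
      if (platform === "ios") {
        return "Microphone access is blocked. Open Settings → Safari → Microphone, allow access, then reload this page.";
      }
      if (platform === "android") {
        // Assumed wording; the exact menu path should be checked on device.
        return "Microphone access is blocked. Open this site's settings in your browser, allow Microphone, then reload this page.";
      }
      return "Microphone access is blocked. Allow it in your browser's site settings, then reload this page.";
    case "NotFoundError": // no audio input device available
      return "No microphone was found on this device.";
    case "NotReadableError": // device exists but could not be opened
      return "The microphone could not be opened. Close other apps that may be using it and try again.";
    default:
      return "Recording could not be started. Please try again.";
  }
}

// Browser usage (hypothetical call site):
//   try { await navigator.mediaDevices.getUserMedia({ audio: true }); }
//   catch (err) { showBanner(micRecoveryMessage((err as DOMException).name, platform)); }
```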

## Attribution / prior art
- **MDN MediaRecorder docs** — codec compatibility tables
- **WebKit blog** — iOS Safari audio policies

## Dependencies
- **Adjacent:** #872 (tunnel — entry path), M1 (PWA shell), M2 (responsive UI for the mic button), #883 (Playwright)
- **Reuses:** existing Whisper ASR (`src/gaia/audio/whisper_asr.py`) and Kokoro TTS (`src/gaia/audio/kokoro_tts.py`)
- **Adjacent:** #702 (voice-first parity v0.21.0) — this is the mobile-browser leg of that broader work

