Voice via mobile browser: ASR + TTS on iOS Safari + Android Chrome #896

@kovtcharov

Goal

Make voice input + output work reliably in mobile Safari + Chrome browsers — the path users hit when they reach GAIA via the tunnel from their phone. Without this, the mobile-via-tunnel story is text-only.

Why this matters for consumer adoption

The morning-brief / voice-research / daily-companion use cases all assume voice. Telegram covers async voice (#889 + voice notes). The tunnel covers sync mobile access. Voice via the tunnel is the connector — "I'm holding my phone, I open GAIA, I tap mic, I speak" — and mobile browser voice has known platform quirks.

Scope (single PR, v0.18.2 or v0.19.0)

A. Audit current state

  • Test the existing Agent UI mic button on:
    • iOS Safari 17+ (mobile)
    • iOS Chrome / Edge (uses WebKit on iOS — same engine, different policies)
    • Android Chrome
    • Android Firefox (different engine)
  • Document what works, what's broken, what's flaky
  • Output: short audit committed to PR description

B. Known mobile-browser pitfalls to address

  • MediaRecorder MIME types: iOS Safari has historically been restricted to audio/mp4 (codecs=mp4a.40.2), while Android Chrome prefers audio/webm;codecs=opus. Need codec detection + fallback.
  • Microphone permission UX — first-tap should request permission with a clear in-app explanation; rejected permission must show actionable recovery (Settings → Safari → Microphone)
  • Background tab handling — recording stops when tab goes background on iOS; document expected behavior
  • TTS playback autoplay restrictions — mobile Safari blocks autoplay without user gesture; first agent voice response must be triggered by the same gesture that sent the message
  • Lock-screen / silent mode — on iOS, audio output respects silent switch; explicit volume + audio session config helps
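
The codec detection + fallback item above could be sketched roughly as follows. This is a minimal sketch, not the Agent UI's actual code: `pickRecorderMimeType` and `CANDIDATE_MIME_TYPES` are hypothetical names, and the candidate order is an assumption taken from the MIME types listed in this issue. The support probe is passed in as a parameter so the selection logic can be unit-tested outside a browser; in the browser it would wrap the standard `MediaRecorder.isTypeSupported()`.

```typescript
// Candidate MIME types in preference order. The list and its order are
// assumptions based on the pitfalls above; adjust after the section A audit.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  "audio/webm;codecs=opus", // preferred by Android Chrome
  "audio/mp4;codecs=mp4a.40.2", // AAC-LC in MP4, the historical iOS Safari option
  "audio/webm",
  "audio/mp4",
];

// Returns the first candidate the probe accepts, or undefined so the caller
// can construct MediaRecorder without a mimeType and take the browser default.
function pickRecorderMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: readonly string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  return candidates.find((type) => isSupported(type));
}

// Browser usage (not executed here):
//   const mimeType = pickRecorderMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```

Whatever type is chosen, reading `recorder.mimeType` after construction is the safer source of truth for what gets uploaded, since the browser may normalize the string.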

C. Mobile-specific UX improvements

  • Push-to-talk button (long-press to record, release to send) — better than tap-to-start/tap-to-stop on mobile
  • Visible recording indicator + audio level meter
  • Transcript preview before sending (gives confidence the ASR worked)
  • Voice playback controls per agent message (play/pause, speed)
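
One way to reason about the push-to-talk button is as a small state machine driven by pointer events. The sketch below is illustrative only: `pttStep`, the event names, and the `MIN_HOLD_MS` threshold are all hypothetical and not taken from the GAIA codebase. It treats a very short press as an accidental tap and discards it, and treats a cancelled pointer (for example, the finger sliding off the button) the same way.

```typescript
type PttState = { phase: "idle" } | { phase: "recording"; startedAt: number };

type PttEvent =
  | { type: "press"; at: number } // e.g. pointerdown, timestamp in ms
  | { type: "release"; at: number } // e.g. pointerup
  | { type: "cancel" }; // e.g. pointercancel

type PttAction = "none" | "start" | "send" | "discard";

// Hypothetical threshold: presses shorter than this are treated as accidental.
const MIN_HOLD_MS = 300;

// Pure transition function: returns the next state plus the side effect
// the UI layer should perform (start recording, send, or discard).
function pttStep(state: PttState, event: PttEvent): { state: PttState; action: PttAction } {
  if (state.phase === "idle") {
    if (event.type === "press") {
      return { state: { phase: "recording", startedAt: event.at }, action: "start" };
    }
    return { state, action: "none" };
  }
  switch (event.type) {
    case "press":
      // A second press while already recording is ignored.
      return { state, action: "none" };
    case "cancel":
      return { state: { phase: "idle" }, action: "discard" };
    case "release": {
      const heldMs = event.at - state.startedAt;
      return { state: { phase: "idle" }, action: heldMs >= MIN_HOLD_MS ? "send" : "discard" };
    }
  }
}
```

Keeping the transitions pure means the recording indicator and the transcript preview can be driven from the same state, and the logic can be tested without a device.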

D. Backend coordination

  • Confirm Whisper ASR endpoint accepts the codecs mobile browsers actually emit
  • Confirm Kokoro TTS output format works in mobile browsers (<audio> element or Web Audio API)
  • Streaming TTS chunks if available (faster perceived latency than wait-for-full-response)
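
When checking which codecs the ASR endpoint accepts, it may help to normalize the recorder's MIME type before upload. The sketch below is an assumption-laden illustration: `describeRecording` and the extension table are hypothetical, and whether the Whisper endpoint looks at the filename extension, the Content-Type, or neither is something this issue asks to be confirmed, not something established here.

```typescript
// Hypothetical mapping from container MIME type to an upload file extension.
const EXTENSION_BY_CONTAINER = new Map<string, string>([
  ["audio/webm", "webm"],
  ["audio/mp4", "m4a"],
  ["audio/ogg", "ogg"],
  ["audio/wav", "wav"],
]);

// Strips codec parameters (everything after the first ";") and looks up an
// extension. Returns extension: undefined for containers not in the table,
// so the caller can surface an explicit error instead of uploading blindly.
function describeRecording(mimeType: string): { container: string; extension: string | undefined } {
  const container = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return { container, extension: EXTENSION_BY_CONTAINER.get(container) };
}
```

For example, a recording reported as `audio/webm;codecs=opus` would be described as container `audio/webm` with extension `webm`, which can then be logged alongside the backend's response during the audit.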

E. Tests

What this is NOT

Acceptance criteria

Attribution / prior art

  • MDN MediaRecorder docs — codec compatibility tables
  • WebKit blog — iOS Safari audio policies

Dependencies

Metadata

    Labels

      • audio: Audio (ASR/TTS) changes
      • consumer: Blocks consumer adoption; must ship for the v0.20.0 consumer launch window
      • domain:surfaces: Agent UI, Telegram, WhatsApp, Slack/Discord, mobile
      • enhancement: New feature or request
      • p1: medium priority
      • spec-ready: Issue has implementation spec adequate for coding-agent assignment
      • track:consumer-app: Hermes-competitor consumer product; mobile-first, voice + messaging + memory + skills
