[MNNChat:Feature] Implement real-time vision capabilities for interactive voice chat. Enable "ChatGPT-like" real-time visual dialogue by sending captured frames to LLM. by JedLee6 · Pull Request #4263 · alibaba/MNN

JedLee6 · 2026-03-15T04:41:40Z

😃 Hi, @wangzhaode @Juude. Could you please review and merge this pull request at your convenience? Thanks! This PR implements real-time vision capabilities for interactive voice chat, enabling "ChatGPT-like" real-time visual dialogue by sending captured frames to the LLM.

| Screenshot | Video Demo |
| --- | --- |
| | video6174455438680005488.mp4 |

Key Changes:

- Support live camera preview with front and back camera switching.
- Integrate CameraX for high-performance, low-latency image capture.
- Optimize image processing by combining scaling and rotation into a single operation.
- Enable "ChatGPT-like" real-time visual dialogue by sending captured frames to LLM.
- Add comprehensive Javadoc and documentation for all vision and image utility logic.
- Full-Duplex Interruption: Keeps ASR active during AI responses, allowing users to interrupt the AI by simply speaking.
- Auto-Mute Mode: A software-based fallback for echo cancellation. It automatically toggles the microphone state based on the conversation flow.
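For readers who want to see what "combining scaling and rotation into a single operation" can look like, here is a minimal sketch in plain Java. It is not the PR's `ImageUtils` code: the class `FrameTransform`, the `Frame` holder, the `int[]` pixel layout, and nearest-neighbor sampling are all assumptions chosen so the example runs without Android. On Android the same idea would more likely be expressed through a single transform matrix. The point is that each output pixel is mapped straight back to a source pixel, so no intermediate scaled image is allocated.

```java
/** Hypothetical helper for illustration; not the PR's ImageUtils API. */
final class FrameTransform {

    /** Row-major pixel buffer with its dimensions (assumed layout). */
    static final class Frame {
        final int[] pixels;
        final int width;
        final int height;

        Frame(int[] pixels, int width, int height) {
            if (pixels.length != width * height) {
                throw new IllegalArgumentException("pixel count does not match dimensions");
            }
            this.pixels = pixels;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Scales src to scaledW x scaledH and rotates it clockwise by
     * quarterTurns * 90 degrees in one pass (nearest-neighbor sampling).
     */
    static Frame scaleAndRotate(Frame src, int scaledW, int scaledH, int quarterTurns) {
        int q = ((quarterTurns % 4) + 4) % 4;      // normalize to 0..3
        boolean swap = (q % 2) == 1;               // 90/270 degrees swap width and height
        int outW = swap ? scaledH : scaledW;
        int outH = swap ? scaledW : scaledH;
        int[] out = new int[outW * outH];

        for (int oy = 0; oy < outH; oy++) {
            for (int ox = 0; ox < outW; ox++) {
                // Undo the rotation: find the pixel in the scaled, unrotated image.
                int ux;
                int uy;
                switch (q) {
                    case 1:  ux = oy;               uy = scaledH - 1 - ox; break;
                    case 2:  ux = scaledW - 1 - ox; uy = scaledH - 1 - oy; break;
                    case 3:  ux = scaledW - 1 - oy; uy = ox;               break;
                    default: ux = ox;               uy = oy;               break;
                }
                // Undo the scaling: map back to the source pixel.
                int sx = ux * src.width / scaledW;
                int sy = uy * src.height / scaledH;
                out[oy * outW + ox] = src.pixels[sy * src.width + sx];
            }
        }
        return new Frame(out, outW, outH);
    }
}
```

For example, calling `FrameTransform.scaleAndRotate(frame, 2, 1, 1)` on a 4x2 frame yields a 1x2 frame, because the half-size image is also turned 90 degrees clockwise.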

- Keep ASR recording active during LLM generation and greeting playback.
- Add speech detection listener to interrupt AI output when user starts speaking.
- Improve responsiveness by allowing users to skip/interrupt AI responses.
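As a rough illustration of the interruption flow in this commit, the sketch below shows one way a speech detection listener could stop AI output midway. Everything here is hypothetical: `InterruptibleResponse`, `SpeechListener`, and `speak` are illustrative names, not classes from MNNChat, and the real implementation works with ASR and audio playback rather than a list of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch: AI output that stops as soon as user speech is detected. */
final class InterruptibleResponse {

    /** Callback the ASR side would invoke when it hears the user (assumed shape). */
    interface SpeechListener {
        void onSpeechDetected();
    }

    // Thread-safe flag, since ASR and output usually run on different threads.
    private final AtomicBoolean interrupted = new AtomicBoolean(false);

    /** Listener to register with the (hypothetical) speech detector. */
    SpeechListener speechListener() {
        return () -> interrupted.set(true);
    }

    /**
     * Delivers chunks to the sink until all are sent or the user starts speaking.
     * Returns the chunks that were actually delivered.
     */
    List<String> speak(List<String> chunks, Consumer<String> sink) {
        interrupted.set(false);                // new response, clear any old interruption
        List<String> delivered = new ArrayList<>();
        for (String chunk : chunks) {
            if (interrupted.get()) {
                break;                         // user spoke: drop the rest of the response
            }
            sink.accept(chunk);
            delivered.add(chunk);
        }
        return delivered;
    }
}
```

The flag is checked before each chunk, so an interruption takes effect at the next chunk boundary rather than mid-chunk.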

…tion function.
- Support software-based echo cancellation (Auto-Mute mode)
- Automatically mute mic when AI starts speaking/generating
- Automatically unmute mic when AI finishes speaking or is interrupted
- Maintain full-duplex interruption support in hardware AEC mode
- Refine code comments to clarify speech interruption and ASR state logic
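The mute/unmute rules listed in this commit can be read as a small state machine. The sketch below is one possible reading, with hypothetical names (`MicGate`, `EchoMode`, and the method names are not taken from the PR): in Auto-Mute mode the microphone is closed while the AI is speaking, and in hardware AEC mode it stays open so that speech interruption keeps working.

```java
/** Hypothetical sketch of the microphone state in the two echo-handling modes. */
final class MicGate {

    enum EchoMode { HARDWARE_AEC, AUTO_MUTE }

    private final EchoMode mode;
    private boolean micMuted = false;

    MicGate(EchoMode mode) {
        this.mode = mode;
    }

    /** AI starts speaking or generating: mute only in the software fallback mode. */
    void onAiStarted() {
        if (mode == EchoMode.AUTO_MUTE) {
            micMuted = true;
        }
    }

    /** AI finished or was interrupted: the microphone is open again in both modes. */
    void onAiFinishedOrInterrupted() {
        micMuted = false;
    }

    boolean isMicMuted() {
        return micMuted;
    }

    /** Voice interruption needs an open microphone. */
    boolean canInterruptBySpeech() {
        return !micMuted;
    }
}
```

Under this reading, Auto-Mute gives up voice interruption while the AI is talking in exchange for avoiding echo on devices without usable hardware AEC.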

…tive voice chat.
- Support live camera preview with front and back camera switching.
- Integrate CameraX for high-performance, low-latency image capture.
- Optimize image processing by combining scaling and rotation into a single operation.
- Enable "ChatGPT-like" real-time visual dialogue by sending captured frames to LLM.
- Add comprehensive Javadoc and documentation for all vision and image utility logic.

…moke test

Cherry-picked from pr-4263 (commit 47cb56d) with additional fixes:
- CameraX integration for live camera preview in VoiceChatFragment
- Vision mode: capture and send photos during voice chat
- Camera toggle button (on/off) with front/back switch support
- ImageUtils for image scaling and rotation optimization
- VoiceChatPresenter: add muteMicrophone() helper method
- Smoke test: 19_regress_vision_chat_ui.sh for E2E verification

Tested on device 1b4a0523:
- All 9/9 smoke test checks PASS
- Camera preview, toggle, switch all functional
- Echo cancellation mode preserved

🤖 Generated with [Qoder](https://qoder.com)

JedLee6 added 2 commits March 15, 2026 12:16

JedLee6 changed the title ~~[MNNChat:BugFix] Fix the speech interruption (duplex) function and auto-mute (software AEC) and speech interruption function.~~ [MNNChat:BugFix] Fix the speech interruption (duplex) function and auto-mute (software AEC) function. Mar 15, 2026

wangzhaode requested a review from Juude March 15, 2026 06:16

wangzhaode assigned wangzhaode and Juude Mar 15, 2026

JedLee6 force-pushed the jedlee/ft/master_260316 branch from 81b2409 to 47cb56d March 15, 2026 07:06

JedLee6 changed the title ~~[MNNChat:BugFix] Fix the speech interruption (duplex) function and auto-mute (software AEC) function.~~ [MNNChat:Feature] Implement real-time vision capabilities for interactive voice chat. Mar 15, 2026

JedLee6 changed the title ~~[MNNChat:Feature] Implement real-time vision capabilities for interactive voice chat.~~ [MNNChat:Feature] Implement real-time vision capabilities for interactive voice chat. Enable "ChatGPT-like" real-time visual dialogue by sending captured frames to LLM. Mar 15, 2026

Juude approved these changes Mar 16, 2026

wangzhaode merged commit 95b251b into alibaba:master Mar 16, 2026
2 checks passed

wangzhaode merged 3 commits into alibaba:master from JedLee6:jedlee/ft/master_260316

3 participants
