Skip to content

💡 [REQUEST] - <title>Add Built-in Text-to-Speech (TTS) and Screenshot Input Support to Qwen Windows Client #2059

@tax31

Description

@tax31

起始日期 | Start Date

2025/01/25

实现PR | Implementation PR

No response

相关Issues | Reference Issues

No response

摘要 | Summary

Title: [Feature Request] Add Built-in Text-to-Speech (TTS) and Screenshot Input Support to Qwen Windows Client

Dear Qwen Team,

Thank you for building such a powerful and user-friendly desktop client! As an active Windows user who relies on Qwen for English learning and STEM studies, I’d like to propose two highly valuable features that would significantly enhance the desktop experience:

1. Built-in Text-to-Speech (TTS) / Read-Aloud Function

  • Why it matters: Many users (including students, language learners, and visually impaired individuals) benefit greatly from hearing responses spoken aloud—especially for pronunciation practice, proofreading, or multitasking.
  • Current workaround: Users must copy text and use Windows’ “Read Aloud” (Win + Ctrl + Enter) or Edge browser, which breaks workflow.
  • Suggestion:
    Add a speaker icon 🔊 next to each AI response. Clicking it should read the text using high-quality neural TTS (e.g., Microsoft Azure Neural TTS or Alibaba’s own speech synthesis). Support both Chinese and English voices.

2. Screenshot Input with OCR & Vision Understanding

  • Why it matters: When studying from PDFs, textbooks, or problem sets, users often encounter content as images or scanned documents. Being able to take a screenshot and ask Qwen about it would be transformative.
  • Current limitation: The Windows client lacks any image input capability, forcing users to switch to mobile or web versions.
  • Suggestion:
    Add a “📸” button in the input bar that:
    • Captures screen region (like Snip & Sketch);
    • Extracts text via OCR (for equations, paragraphs, code);
    • Optionally sends the image to Qwen-VL for multimodal understanding (e.g., “Explain this physics diagram”).

Why These Features Belong on Desktop

The Windows client is where users spend hours studying, coding, and writing. Adding TTS and screenshot input would:

  • Close the gap with mobile apps;
  • Enable true multimodal interaction on PC;
  • Make Qwen a complete AI study companion—not just a chatbox.

These features are technically feasible using existing Windows APIs (e.g., Windows.Media.SpeechSynthesis, WinRT OCR) or Alibaba Cloud services, and would set Qwen apart from other desktop AI tools.

Thank you for considering this request. I believe these additions would greatly empower students, educators, and professionals alike!

Best regards,
A Dedicated Qwen Windows User

基本示例 | Basic Example

.

缺陷 | Drawbacks

.

未解决问题 | Unresolved questions

.

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions