feat: add Speech AI MCP server for pronunciation, TTS, and STT

fasuizu-br · Claude DevOps Engineer · commit db52e0949849 · 2026-03-10T22:28:23.000-03:00
diff --git a/plugins/wasm-go/mcp-servers/mcp-speech-ai/README.md b/plugins/wasm-go/mcp-servers/mcp-speech-ai/README.md
@@ -0,0 +1,64 @@
+# Speech AI MCP Server
+
+An MCP server that provides pronunciation assessment, speech-to-text, and text-to-speech capabilities for AI agents. It is built for language learning, accessibility, and voice applications.
+
+## Features
+
+- **Pronunciation Assessment**: Score English pronunciation at phoneme, word, and sentence level (0-100). Uses a 17 MB model with under 300 ms latency; reported to exceed human expert accuracy.
+- **Speech-to-Text (STT)**: Transcribe audio with word-level timestamps and confidence scores.
+- **Text-to-Speech (TTS)**: Generate natural speech with 12 English voices (US and UK accents). Reported to rank #1 on TTS Arena.
+
+Source: [https://github.com/fasuizu-br/speech-ai-examples](https://github.com/fasuizu-br/speech-ai-examples)
+
+Website: [https://brainiall.com](https://brainiall.com)
+
+## Tools
+
+| Tool | Description |
+|------|-------------|
+| `assess_pronunciation` | Score English pronunciation at phoneme, word, and sentence levels (0-100) |
+| `transcribe_audio` | Transcribe audio to text with word-level timestamps |
+| `synthesize_speech` | Generate speech from text with 12 English voices |
+| `list_tts_voices` | List available TTS voices |
+
+# Usage Guide
+
+## Get API Key
+
+1. Visit [Azure Marketplace](https://azuremarketplace.microsoft.com) and search for "Speech AI"
+2. Subscribe to a plan (Free tier available)
+3. Your API key will be provided after subscription
+
+Or contact fasuizu@brainiall.com for a key.
+
+## Generate SSE URL
+
+Log in to the MCP Server interface and enter the API key to generate the SSE URL.
+
+## Configure MCP Client
+
+Add the generated SSE URL to your MCP client configuration:
+
+```json
+"mcpServers": {
+    "speech-ai": {
+      "url": "https://mcp.higress.ai/mcp-speech-ai/{generate_key}"
+    }
+}
+```
+
+## Example: Pronunciation Assessment
+
+Call `assess_pronunciation` with base64-encoded audio and the reference text that was spoken. The response includes:
+
+- **Overall Score**: 0-100 calibrated score
+- **Word Scores**: Individual word pronunciation quality
+- **Phoneme Scores**: Granular phoneme-level feedback with IPA notation
+
+## Supported Audio Formats
+
+WAV, MP3, OGG, FLAC, WebM
+
+## Pricing
+
+$0.02 per API call. A free tier is available via Azure Marketplace.
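
Review note on the "Example: Pronunciation Assessment" section: the README names the inputs but does not show how a caller would assemble them. The sketch below is one possible way to do it in Python, assuming the three arguments declared for `assess_pronunciation` in `mcp-server.yaml` (`audio`, `text`, `format`). The helper name `build_assessment_args` and the idea of inferring the format from a file extension are hypothetical and not part of this commit.

```python
import base64
from pathlib import Path

# Formats listed under "Supported Audio Formats" in the README.
SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "flac", "webm"}


def build_assessment_args(audio_bytes: bytes, text: str, filename: str) -> dict:
    """Build the argument object for `assess_pronunciation` (hypothetical helper)."""
    # Infer the format from the file extension, e.g. "clip.MP3" -> "mp3".
    fmt = Path(filename).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {fmt!r}")
    return {
        # The tool expects the audio as a base64 string, not raw bytes.
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "text": text,
        "format": fmt,
    }


if __name__ == "__main__":
    args = build_assessment_args(b"RIFF....WAVE", "The quick brown fox", "sample.wav")
    print(args["format"], len(args["audio"]))
```

Rejecting unknown extensions on the client side mirrors the `enum` declared for `format` in the tool definition, so a typo fails before any billed API call is made.
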
diff --git a/plugins/wasm-go/mcp-servers/mcp-speech-ai/README_ZH.md b/plugins/wasm-go/mcp-servers/mcp-speech-ai/README_ZH.md
@@ -0,0 +1,56 @@
+# Speech AI MCP Server
+
+MCP 服务器，提供发音评估、语音转文字和文字转语音功能，专为 AI 智能体设计。适用于语言学习、无障碍访问和语音应用场景。
+
+## 功能特性
+
+- **发音评估**：在音素、单词和句子级别对英语发音进行 0-100 分评分。17MB 模型，延迟 <300ms，准确度超过人类专家。
+- **语音转文字（STT）**：将音频转录为文字，提供单词级时间戳和置信度分数。
+- **文字转语音（TTS）**：使用 12 种英语语音（美式和英式口音）生成自然语音。在 TTS Arena 排名第一。
+
+源码：[https://github.com/fasuizu-br/speech-ai-examples](https://github.com/fasuizu-br/speech-ai-examples)
+
+官网：[https://brainiall.com](https://brainiall.com)
+
+## 工具列表
+
+| 工具 | 描述 |
+|------|------|
+| `assess_pronunciation` | 在音素、单词和句子级别评估英语发音（0-100分） |
+| `transcribe_audio` | 将音频转录为文字，提供单词级时间戳 |
+| `synthesize_speech` | 使用 12 种英语语音从文字生成语音 |
+| `list_tts_voices` | 列出可用的 TTS 语音 |
+
+# 使用指南
+
+## 获取 API 密钥
+
+1. 访问 [Azure Marketplace](https://azuremarketplace.microsoft.com) 搜索 "Speech AI"
+2. 订阅计划（提供免费层级）
+3. 订阅后将获得 API 密钥
+
+或联系 fasuizu@brainiall.com 获取密钥。
+
+## 生成 SSE URL
+
+在 MCP Server 界面登录并输入 API 密钥生成 URL。
+
+## 配置 MCP 客户端
+
+将生成的 SSE URL 添加到 MCP 客户端配置中：
+
+```json
+"mcpServers": {
+    "speech-ai": {
+      "url": "https://mcp.higress.ai/mcp-speech-ai/{generate_key}"
+    }
+}
+```
+
+## 支持的音频格式
+
+WAV, MP3, OGG, FLAC, WebM
+
+## 定价
+
+每次 API 调用 $0.02。通过 Azure Marketplace 提供免费层级。
diff --git a/plugins/wasm-go/mcp-servers/mcp-speech-ai/mcp-server.yaml b/plugins/wasm-go/mcp-servers/mcp-speech-ai/mcp-server.yaml
@@ -0,0 +1,130 @@
+server:
+  name: speech-ai-server
+  config:
+    apiKey: ""
+tools:
+- name: assess_pronunciation
+  description: "Evaluate English pronunciation quality by comparing spoken audio against reference text. Returns calibrated 0-100 scores at overall, word, and phoneme levels with IPA notation. 17MB model, <300ms latency."
+  args:
+  - name: audio
+    description: "Base64-encoded audio data (WAV, MP3, OGG, FLAC, or WebM)"
+    type: string
+    required: true
+  - name: text
+    description: "Reference text that was spoken in the audio"
+    type: string
+    required: true
+  - name: format
+    description: "Audio format"
+    type: string
+    required: false
+    default: "wav"
+    enum: ["wav", "mp3", "ogg", "flac", "webm"]
+  requestTemplate:
+    url: "https://api.brainiall.com/v1/pronunciation/assess/base64"
+    method: POST
+    headers:
+    - key: Content-Type
+      value: "application/json"
+    - key: Ocp-Apim-Subscription-Key
+      value: "{{.config.apiKey}}"
+    body: |
+      {
+        "audio": "{{.args.audio}}",
+        "text": "{{.args.text}}",
+        "format": "{{.args.format}}"
+      }
+  responseTemplate:
+    body: |
+      ## Pronunciation Assessment Result
+      - **Overall Score**: {{.overallScore}}/100
+      - **Sentence Score**: {{.sentenceScore}}/100
+      - **Confidence**: {{.confidence}}
+      {{- range $index, $word := .words }}
+      ### Word: {{$word.word}} (Score: {{$word.score}})
+      {{- range $pi, $ph := $word.phonemes }}
+      - {{$ph.phoneme}}: {{$ph.score}}
+      {{- end }}
+      {{- end }}
+
+- name: transcribe_audio
+  description: "Transcribe audio to text with word-level timestamps and confidence scores. Supports WAV, MP3, OGG, FLAC, and WebM formats."
+  args:
+  - name: audio
+    description: "Base64-encoded audio data (WAV, MP3, OGG, FLAC, or WebM)"
+    type: string
+    required: true
+  - name: format
+    description: "Audio format"
+    type: string
+    required: false
+    default: "wav"
+    enum: ["wav", "mp3", "ogg", "flac", "webm"]
+  requestTemplate:
+    url: "https://api.brainiall.com/v1/stt/transcribe/base64"
+    method: POST
+    headers:
+    - key: Content-Type
+      value: "application/json"
+    - key: Ocp-Apim-Subscription-Key
+      value: "{{.config.apiKey}}"
+    body: |
+      {
+        "audio": "{{.args.audio}}",
+        "format": "{{.args.format}}"
+      }
+  responseTemplate:
+    body: |
+      ## Transcription Result
+      - **Text**: {{.text}}
+      {{- range $index, $word := .words }}
+      - {{$word.word}} ({{$word.start}}s - {{$word.end}}s, confidence: {{$word.confidence}})
+      {{- end }}
+
+- name: synthesize_speech
+  description: "Generate natural speech from text with 12 English voices (US and UK accents). Returns base64-encoded audio. Ranked #1 on TTS Arena."
+  args:
+  - name: text
+    description: "Text to synthesize into speech"
+    type: string
+    required: true
+  - name: voice
+    description: "Voice ID to use for synthesis; call list_tts_voices for the available IDs"
+    type: string
+    required: false
+    default: "af_heart"
+  requestTemplate:
+    url: "https://api.brainiall.com/v1/tts/synthesize"
+    method: POST
+    headers:
+    - key: Content-Type
+      value: "application/json"
+    - key: Ocp-Apim-Subscription-Key
+      value: "{{.config.apiKey}}"
+    body: |
+      {
+        "text": "{{.args.text}}",
+        "voice": "{{.args.voice}}"
+      }
+  responseTemplate:
+    body: |
+      ## Speech Synthesis Result
+      - **Voice**: {{.voice}}
+      - **Duration**: {{.duration_ms}}ms
+      - **Audio**: {{.audio_base64}}
+
+- name: list_tts_voices
+  description: "List all available text-to-speech voices with their names, genders, and accent information."
+  args: []
+  requestTemplate:
+    url: "https://api.brainiall.com/v1/tts/voices"
+    method: GET
+    headers:
+    - key: Ocp-Apim-Subscription-Key
+      value: "{{.config.apiKey}}"
+  responseTemplate:
+    body: |
+      ## Available Voices
+      {{- range $index, $voice := .voices }}
+      - **{{$voice.id}}**: {{$voice.name}} ({{$voice.gender}}, {{$voice.accent}})
+      {{- end }}
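
Review note on the `requestTemplate` bodies: `assess_pronunciation` and `synthesize_speech` place the free-form `text` argument directly between JSON quotes. If the template engine performs plain string substitution without JSON escaping (an assumption that has not been verified against the Higress runtime), text containing a double quote or a newline would produce a malformed request body. The Python sketch below only illustrates that failure mode with ordinary string formatting. The names `naive_body`, `escaped_body`, and `is_valid_json` are hypothetical, and the code does not reproduce the gateway's template engine.

```python
import json


def naive_body(text: str, voice: str) -> str:
    """Mimic plain substitution into a hand-written JSON template."""
    return '{"text": "%s", "voice": "%s"}' % (text, voice)


def escaped_body(text: str, voice: str) -> str:
    """Serialize the same fields with proper JSON escaping."""
    return json.dumps({"text": text, "voice": voice})


def is_valid_json(body: str) -> bool:
    """Return True if the body parses as JSON."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    tricky = 'She said "hello" to me'
    print(is_valid_json(naive_body(tricky, "af_heart")))    # False
    print(is_valid_json(escaped_body(tricky, "af_heart")))  # True
```

If the runtime turns out not to escape substituted values, a request option that serializes the tool arguments directly as the JSON body, where the gateway provides one, would avoid the problem without changing the tool interface.
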