An MCP (Model Context Protocol) server for real-time audio analysis, key detection, and pre-session context building in music production environments. Connects to Ableton Live via the ableton-mcp Remote Script for MIDI-based analysis and session data.

    git clone https://github.com/YOUR_USERNAME/audio-analysis-mcp
    cd audio-analysis-mcp
    pip install -r requirements.txt

Install optional dependencies for the features you need:

    pip install -r requirements-optional.txt

| Package | Required for |
|---|---|
| qdrant-client | Vector database write tools, auto-sync |
| openai-whisper + soundfile | Live vocal transcription (start_stt_listener) |
| basic-pitch + soundfile | Polyphonic instrument analysis (analyze_instrument_audio, start_polyphonic_listener) |

Add to your claude_desktop_config.json:

    {
      "mcpServers": {
        "audio-analysis": {
          "command": "python",
          "args": ["/absolute/path/to/server.py"]
        }
      }
    }

Config file locations:

- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json

Restart Claude Desktop after editing.
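
If you would rather script the config change than edit the file by hand, the sketch below shows one way to do it. The helper names `claude_config_path` and `add_server_entry` are hypothetical and not part of this repo; only the two config locations listed above are handled, and the script never writes to disk on its own.

```python
import os
import sys
from pathlib import Path


def claude_config_path(platform: str = sys.platform) -> Path:
    """Return the Claude Desktop config path for macOS or Windows (hypothetical helper)."""
    if platform == "darwin":
        return (Path.home() / "Library" / "Application Support"
                / "Claude" / "claude_desktop_config.json")
    if platform.startswith("win"):
        return Path(os.environ.get("APPDATA", "")) / "Claude" / "claude_desktop_config.json"
    raise RuntimeError(f"No known config location for platform {platform!r}")


def add_server_entry(config: dict, server_path: str) -> dict:
    """Return a copy of config with the audio-analysis entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep any other servers intact
    servers["audio-analysis"] = {"command": "python", "args": [server_path]}
    merged["mcpServers"] = servers
    return merged
```

Load the existing file with `json.loads`, pass it through `add_server_entry`, review the result, and write it back yourself.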
On Windows, enable Stereo Mix (Realtek) in your sound settings to capture system output without additional routing. Use list_audio_devices to find the device index.
On macOS, install BlackHole or Loopback for system audio capture.
Install the ableton-mcp Remote Script and select it as a Control Surface in Ableton Preferences → MIDI. This server connects to it over localhost:9877 for MIDI data, session info, and change events.
Without Ableton connected, all mix analysis and audio file tools still work. MIDI-dependent tools (key detection from MIDI, session context builder, auto-sync) will return connection errors.
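
To check up front whether the Remote Script is listening, rather than waiting for a tool to return a connection error, a quick probe like the one below may help. It is a minimal sketch that assumes the Remote Script accepts plain TCP connections on localhost:9877; the function name `ableton_reachable` is hypothetical and not part of this server.

```python
import socket


def ableton_reachable(host: str = "localhost", port: int = 9877,
                      timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # Open and immediately close a connection; no data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

A `False` result means the MIDI-dependent tools will fail; mix analysis and audio file tools are unaffected.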

    python server.py

The server starts and waits for MCP client connections.

| Tool | Description |
|---|---|
| list_audio_devices | List available audio input devices with index numbers |
| start_capture(device_index, window_seconds) | Start continuous audio capture for analysis. Default window is 8s. Use 30-60s for better key/BPM detection. |
| stop_capture | Stop the capture loop |
| get_mix_analysis | Full spectral analysis: frequency bands, RMS, peak, headroom, stereo field, key candidates, brightness |
| get_mix_report | Plain English mixing notes with actionable suggestions |
| get_frequency_report | Focused frequency balance report with EQ suggestions |
| get_stereo_analysis | Stereo width, L/R balance, mono compatibility, correlation |
| capture_and_analyze_file(file_path) | Analyze any audio file directly (WAV, MP3, FLAC) |

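
For readers unfamiliar with the level and stereo figures these tools report, here is a rough sketch of how RMS, peak, headroom, and L/R correlation can be computed for one capture window. It uses only the standard library and is an illustration under common definitions, not this server's actual code; `dbfs` and `level_stats` are hypothetical names, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def dbfs(value: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def level_stats(left: list, right: list) -> dict:
    """Return RMS, peak, headroom, and L/R correlation for one stereo window."""
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    rms = math.sqrt(sum(s * s for s in mono) / len(mono))
    peak = max(abs(s) for s in left + right)
    # Zero-lag correlation: +1 = identical channels, -1 = fully out of phase.
    energy = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    correlation = sum(l * r for l, r in zip(left, right)) / energy if energy else 0.0
    return {
        "rms_dbfs": dbfs(rms),
        "peak_dbfs": dbfs(peak),
        "headroom_db": -dbfs(peak),  # distance from the peak to 0 dBFS
        "correlation": correlation,
    }
```

A correlation near -1 together with a mono RMS that collapses toward silence is the classic sign of a mono-compatibility problem.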

| Tool | Description |
|---|---|
| get_key_with_voting(vocal_file_path) | Primary key detection tool. Finds bass and melody tracks automatically, runs three-source voting. |
| get_key_from_midi(melodic_track_indices, vocal_file_path) | Detect key from specific MIDI track indices |
| analyze_bounced_instrumental | Detect key from a session containing only a bounced WAV/MP3 with no MIDI |
| set_key_override(key) | Set global key manually. Pass empty string to clear. Format: "E minor", "C major", "A dorian" |
| get_key_override | Check if a manual override is active |
| build_per_section_keys | Run key detection per section; supports songs that modulate between sections |
| set_section_key_override(section_index, key, label) | Override the key for one section. Sets is_override flag so all models trust it. |

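
The key string format used by set_key_override ("E minor", "C major", "A dorian") is simple to validate on the client side before sending it. The parser below is a hypothetical sketch: `parse_key` and `NOTE_TO_PC` are not part of this server, and accepting flats and lowercase tonics is an assumption about what the server tolerates.

```python
# Pitch classes for note names; sharps and flats map to the same number.
NOTE_TO_PC = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}


def parse_key(key: str):
    """Parse 'E minor' into (pitch_class, mode); return None for an empty string."""
    text = key.strip()
    if not text:
        return None  # an empty string is how the override is cleared
    tonic, _, mode = text.partition(" ")
    tonic = tonic[0].upper() + tonic[1:]  # normalize 'e' to 'E', 'bb' to 'Bb'
    mode = mode.strip().lower()
    if tonic not in NOTE_TO_PC or not mode:
        raise ValueError(f"Expected '<tonic> <mode>', got {key!r}")
    return NOTE_TO_PC[tonic], mode
```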

| Tool | Description |
|---|---|
| build_session_context(vocal_file_path) | Build full pre-session context payload: key, tempo, time sig, structure, harmonic rhythm, melodic range, priors. Run once before recording starts. |
| get_current_session_context | Retrieve current context with all progressive updates |
| get_song_context | Simplified musical context object for ML model initialization |
| detect_section_repetitions | Fingerprint MIDI content per section to identify verse/chorus patterns and auto-label sections |
| update_section_heard(section_index, section_label) | Mark a section complete. Updates prior confidence level from genre-only toward full-song context. |
| set_recording_section(section_index) | Tell the STT and pitch listeners which section is currently being recorded |

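
The idea behind detect_section_repetitions can be illustrated with a small fingerprinting sketch. This is not the server's implementation: the note tuple layout, the 16th-note grid, and the letter labels are all assumptions made for the example, and `section_fingerprint` and `label_sections` are hypothetical names.

```python
import hashlib


def section_fingerprint(notes, section_start, grid=0.25):
    """Hash a section's notes relative to its start.

    notes: iterable of (midi_pitch, start_beat, duration_beats).
    Onsets and durations are quantized to `grid` beats, and pitches are
    reduced to pitch classes so octave doublings do not change the result.
    """
    events = sorted(
        (round((start - section_start) / grid), pitch % 12, round(dur / grid))
        for pitch, start, dur in notes
    )
    return hashlib.sha1(repr(events).encode()).hexdigest()[:12]


def label_sections(sections):
    """sections: list of (start_beat, notes). Returns one letter per section."""
    seen = {}
    labels = []
    for start, notes in sections:
        fp = section_fingerprint(notes, start)
        if fp not in seen:
            seen[fp] = chr(ord("A") + len(seen))  # first new pattern is 'A'
        labels.append(seen[fp])
    return labels
```

Sections that share a letter contain the same quantized material; mapping letters to names such as verse or chorus would need an extra heuristic on top.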

| Tool | Description |
|---|---|
| start_stt_listener(device_index, section_index) | Start Whisper transcription on a mic. Writes lyrics with timestamps to the performed layer. Requires openai-whisper and soundfile. |
| stop_stt_listener | Stop STT listener |
| start_pitch_listener(device_index, section_index) | Start pyin vocal pitch tracking. Writes detected notes to the performed layer. |
| stop_pitch_listener | Stop pitch listener |
| analyze_instrument_audio(file_path, track_name, section_index) | Run Basic-Pitch on a saved polyphonic audio file (WAV, MP3, FLAC). Extracts chord and note events and writes to the performed layer. Use this for recorded takes. Requires basic-pitch and soundfile. |
| start_polyphonic_listener(device_index, track_name, section_index, window_seconds) | Live polyphonic chord and harmony detection via Basic-Pitch. Works for piano, guitar, violin, or any instrument playing multiple notes simultaneously. Captures audio in windows (default 2.5s) and writes detected notes to the performed layer in near-real time. There is inherent latency equal to the window size; use Ableton MIDI routing when the instrument can output MIDI. Requires basic-pitch and soundfile. |
| stop_polyphonic_listener | Stop the live polyphonic listener. |
| get_performed_context(section_index) | View what has been captured: lyrics, notes sung, instrument note data. Pass -1 for all sections. |


| Tool | Description |
|---|---|
| write_context_to_qdrant(qdrant_url, collection_name) | Write full session context to Qdrant. One point per section plus a global point. Creates the collection if it doesn't exist. |
| update_section_in_qdrant(section_index, qdrant_url, collection_name) | Update a single section after new performed data arrives. Faster than rewriting everything. |
| start_auto_sync(qdrant_url, collection_name) | Open a persistent connection to Ableton and listen for change events. Tempo, time signature, and clip changes automatically trigger re-analysis and Qdrant updates. |
| stop_auto_sync | Stop auto-sync |
| get_auto_sync_status | Check whether auto-sync is running |

Standard libraries run Krumhansl-Schmuckler directly on raw chroma. This server uses a four-stage approach:
Stage 1: Note classification. Each pitch class is classified as primary, secondary, passing, or accidental based on duration and rhythmic context. Short fast notes (melisma, passing tones) are weighted low. Notes that only appear in fast contexts are flagged as likely accidentals and excluded from key detection. This prevents chromatic lines and modal color notes from skewing the key estimate.
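
A rough sketch of the classification idea follows. The thresholds (`fast_threshold`, `primary_share`, `secondary_share`) and the function name are illustrative assumptions, not the values or names this server uses, and rhythmic context is reduced here to note duration alone.

```python
from collections import defaultdict


def classify_pitch_classes(notes, fast_threshold=0.25,
                           primary_share=0.15, secondary_share=0.05):
    """Label each pitch class as primary, secondary, passing, or accidental.

    notes: iterable of (midi_pitch, duration_beats).
    Only notes longer than `fast_threshold` beats count as sustained.
    """
    sustained = defaultdict(float)  # pitch class -> total sustained duration
    seen = set()
    for pitch, dur in notes:
        pc = pitch % 12
        seen.add(pc)
        if dur > fast_threshold:
            sustained[pc] += dur
    total = sum(sustained.values()) or 1.0
    labels = {}
    for pc in seen:
        if pc not in sustained:
            labels[pc] = "accidental"  # appears only in fast contexts
        elif sustained[pc] / total >= primary_share:
            labels[pc] = "primary"
        elif sustained[pc] / total >= secondary_share:
            labels[pc] = "secondary"
        else:
            labels[pc] = "passing"
    return labels
```

Later stages would then work from the pitch classes labeled primary and ignore the accidentals.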
Stage 2: Tonic detection. The tonic is found from primary notes only, weighted by beat position (beat 1 of the bar is weighted highest), note duration, first/last note position, and velocity.
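
One way to combine these four weights is to multiply them per note and sum per pitch class, as in the sketch below. The multipliers (2.0 for beat 1, 1.25 for other on-beats, 1.5 for the first and last note) are made-up values for illustration, and `detect_tonic` is a hypothetical name.

```python
def detect_tonic(notes, beats_per_bar=4):
    """Return the pitch class (0-11) with the highest weighted score.

    notes: time-ordered list of (midi_pitch, start_beat, duration_beats, velocity),
    already filtered down to primary notes.
    """
    scores = [0.0] * 12
    last = len(notes) - 1
    for i, (pitch, start, dur, vel) in enumerate(notes):
        beat_in_bar = start % beats_per_bar
        if beat_in_bar == 0:
            metric = 2.0    # beat 1 of the bar carries the most weight
        elif beat_in_bar % 1 == 0:
            metric = 1.25   # other on-beat positions
        else:
            metric = 1.0    # off-beat
        edge = 1.5 if i in (0, last) else 1.0  # first and last notes
        scores[pitch % 12] += dur * metric * edge * (vel / 127.0)
    return max(range(12), key=scores.__getitem__)
```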
Stage 3: Scale matching. Primary note distribution is matched against 11 scale templates: major, natural minor, dorian, phrygian, lydian, mixolydian, locrian, harmonic minor, blues, pentatonic minor, pentatonic major. The tonic detection result receives a confidence bonus.
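
Template matching of this kind can be sketched as follows. The interval sets are the standard definitions of the 11 scales named above; the scoring formula, the `tonic_bonus` and `unused_penalty` values, and the name `match_scale` are assumptions for illustration only.

```python
# Semitone offsets from the tonic for each of the 11 templates.
SCALES = {
    "major": (0, 2, 4, 5, 7, 9, 11),
    "natural minor": (0, 2, 3, 5, 7, 8, 10),
    "dorian": (0, 2, 3, 5, 7, 9, 10),
    "phrygian": (0, 1, 3, 5, 7, 8, 10),
    "lydian": (0, 2, 4, 6, 7, 9, 11),
    "mixolydian": (0, 2, 4, 5, 7, 9, 10),
    "locrian": (0, 1, 3, 5, 6, 8, 10),
    "harmonic minor": (0, 2, 3, 5, 7, 8, 11),
    "blues": (0, 3, 5, 6, 7, 10),
    "pentatonic minor": (0, 3, 5, 7, 10),
    "pentatonic major": (0, 2, 4, 7, 9),
}


def match_scale(distribution, detected_tonic, tonic_bonus=0.1, unused_penalty=0.02):
    """Return (tonic, scale_name, score) for the best-fitting template.

    distribution: 12 pitch-class weights built from primary notes.
    """
    total = sum(distribution) or 1.0
    best = None
    for tonic in range(12):
        for name, intervals in SCALES.items():
            weights = [distribution[(tonic + i) % 12] for i in intervals]
            score = sum(weights) / total                 # share of weight inside the scale
            score -= unused_penalty * weights.count(0)   # prefer tighter templates
            if tonic == detected_tonic:
                score += tonic_bonus                     # reward agreement with stage 2
            if best is None or score > best[2]:
                best = (tonic, name, score)
    return best
```

The bonus is what separates modes that share the same seven notes, for example A dorian from G major.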
Stage 4: Multi-source voting. Bass track MIDI and melody track MIDI are analyzed independently and then compared. If they agree, the result is returned with high confidence. If they disagree, a third source breaks the tie (vocal audio if provided, otherwise a bounced WAV or an additional MIDI track). If all sources disagree, the highest-confidence result is returned with a manual_verification_needed flag.
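
The voting rule can be written down compactly. The sketch below follows the description above; the `(key_name, confidence)` tuple layout, the `agreeing_sources` field, and the choice to flag the result when no third source is available are assumptions, and `vote_key` is a hypothetical name.

```python
def vote_key(bass, melody, third=None):
    """Combine per-source key estimates.

    Each argument is a (key_name, confidence) tuple; `third` may be None.
    """
    if bass[0] == melody[0]:
        # Bass and melody agree: accept without consulting a third source.
        return {"key": bass[0], "agreeing_sources": 2,
                "manual_verification_needed": False}
    if third is not None and third[0] in (bass[0], melody[0]):
        # The third source sides with one of the two and breaks the tie.
        return {"key": third[0], "agreeing_sources": 2,
                "manual_verification_needed": False}
    # No agreement at all: fall back to the most confident single estimate.
    candidates = [c for c in (bass, melody, third) if c is not None]
    best = max(candidates, key=lambda c: c[1])
    return {"key": best[0], "agreeing_sources": 1,
            "manual_verification_needed": True}
```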
Per-section key detection follows the same process applied to notes within each section's time range, supporting songs that modulate mid-arrangement.

    point 0   - global context: key, tempo, time sig, section count, priors
    point 1   - section 0: key, mode, scale degrees, performed layer, override flags
    point 2   - section 1: ...
    point N+1 - section N: ...

Each section point vector is the 12-dimensional pitch class distribution for that section. Sections with manual key overrides carry an is_override: true flag in the payload.
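
To make the layout concrete, the sketch below builds the point list as plain dictionaries with `id`, `vector`, and `payload` keys, the same three fields that qdrant-client's `PointStruct` takes. It is an illustration, not this server's code: the payload field names are examples, `build_points` and `pitch_class_vector` are hypothetical, and using the mean of the section vectors for the global point is an assumption, since the vector for point 0 is not specified here.

```python
def pitch_class_vector(midi_pitches):
    """Return a normalized 12-dimensional pitch class distribution."""
    counts = [0.0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def build_points(global_payload, sections):
    """Build point 0 (global) plus one point per section.

    sections: list of dicts with 'payload' (dict) and 'pitches' (MIDI numbers).
    """
    vectors = [pitch_class_vector(s["pitches"]) for s in sections]
    n = len(vectors) or 1
    # Assumption: the global point reuses the mean of the section vectors.
    global_vector = [sum(v[i] for v in vectors) / n for i in range(12)]
    points = [{"id": 0, "vector": global_vector, "payload": dict(global_payload)}]
    for index, (section, vector) in enumerate(zip(sections, vectors)):
        payload = dict(section["payload"], section_index=index)
        points.append({"id": index + 1, "vector": vector, "payload": payload})
    return points
```

Section N always lands at point ID N+1, so a single section can be updated without touching the others.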
The companion repo backing-vocalist-orchestrator provides a Python HTTP client for calling these tools from your own orchestrator model or plugin without going through Claude Desktop.