WildScore is a benchmark for multimodal symbolic music reasoning on real-world score images and questions. Each example pairs a user-generated music theory question with a corresponding score image and multiple-choice answers, enabling rigorous, scalable evaluation of MLLMs in music theory.
Highlights
- 807 MCQ items sourced from real discussions (2012–2022) with score images.
- Five high-level categories (Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form) + 12 subcategories.
- Two modes: Image+Text and Text-only (ablation).
- Ground truth from the community “score” (upvotes minus downvotes), with an LLM tie-break when top comments are tied.
- Overview
- Results (EMNLP 2025)
- Repository Structure
- Setup
- Configuration
- Usage
- Data Card
- Security & Ethics
- License
- Citation
- Contact
What WildScore evaluates. Models must interpret symbolic score images and answer real musicological questions, covering harmony/tonality, rhythm/meter, texture, expression/performance, and form, posed as multiple-choice items. This keeps in-the-wild authenticity while enabling automatic scoring.
Where data comes from. We collected a decade of posts with embedded score images, standardized questions, and paired them with top-level answers. Items were filtered for content/engagement and reformulated as MCQs. Final set: 807 high-quality examples.
Multimodal filtering. A detector fine-tuned on annotated images filtered candidates to symbolic-music images.
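To make the filtering step concrete, here is a minimal sketch of selecting symbolic-score images with a fine-tuned binary image classifier. The checkpoint path, the "score" label, and the threshold are illustrative assumptions, not the actual detector used for WildScore.

```python
# Sketch only: filter candidate images with a fine-tuned binary classifier.
# "path/to/score-detector", the "score" label, and the 0.5 threshold are
# placeholders, not the real WildScore detector.
from pathlib import Path
from transformers import pipeline

classifier = pipeline("image-classification", model="path/to/score-detector")

def is_symbolic_score(image_path, threshold=0.5):
    preds = classifier(image_path)  # list of {"label": ..., "score": ...}
    score = next((p["score"] for p in preds if p["label"] == "score"), 0.0)
    return score >= threshold

candidates = sorted(Path("images/raw").glob("*.png"))
kept = [p for p in candidates if is_symbolic_score(str(p))]
print(f"kept {len(kept)} of {len(candidates)} candidate images")
```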
Two evaluation modes (a prompt-construction sketch follows the list).
- Image+Text (full multimodal)
- Text-only (ablation)
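As a concrete illustration of the two modes, the sketch below builds an Image+Text query and its Text-only counterpart for a vision-capable OpenAI model. The prompt wording, option letters, and file paths are assumptions; the actual prompts live in the per-model scripts (e.g., gpt.py).

```python
# Sketch of the two evaluation modes, assuming the OpenAI Chat Completions API
# and base64-encoded score images; gpt.py's actual prompts may differ.
import base64
from typing import List, Optional
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_messages(question: str, options: List[str], image_path: Optional[str]):
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
        + "\nAnswer with the option letter only."
    )
    content = [{"type": "text", "text": prompt}]
    if image_path is not None:  # Image+Text mode; omit the image for the Text-only ablation
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=build_messages(
        "Which key is tonicized in measure 3?",
        ["C major", "G major", "A minor", "E minor"],
        image_path="images/example.png",  # set to None for Text-only
    ),
)
print(resp.choices[0].message.content)
```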
Overall accuracy (%) on WildScore:
| Model | Image+Text | Text-only |
|---|---|---|
| GPT-4.1-mini | 68.31 | 65.76 |
| Phi-3-Vision | 48.82 | 47.72 |
| Qwen-VL | 49.73 | 49.18 |
| MiniCPM | 45.90 | 52.09 |
| InternVL | 39.34 | 45.54 |
| LLaVA | 32.97 | 37.16 |
Diagnostics
- Perception-only probe (symbol reading): GPT-4.1-mini 52%, InternVL 38%, LLaVA 26%. Many failures are perceptual rather than reasoning errors.
- ABC reconstruction from images: GPT-4.1-mini often produces valid ABC on short/simple excerpts but degrades on longer/denser passages; InternVL and LLaVA frequently degenerate.
Takeaway. Image context helps some models (e.g., GPT-4.1-mini, +2.55 pts) but can hurt others (MiniCPM, InternVL, LLaVA), underscoring notation-perception and alignment gaps.
musictheory/final_code/
├── config.py # Centralized configuration
├── gpt.py # GPT-4.1-mini evaluation
├── phi.py # Phi-3-Vision evaluation
├── qwen.py # Qwen-VL family evaluation
├── internvlm.py # InternVL evaluation
├── llava.py # LLaVA evaluation
├── miniCPM.py # MiniCPM evaluation
├── requirements.txt # Dependencies
├── data/ # Dataset manifests (CSV/JSONL) + splits
└── images/ # Symbolic score crops (if distributed)
Each example contains: score image, MCQ question, candidate answers from comments, and a ground-truth label (community score + LLM tie-break). Modes: Image+Text and Text-only.
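For orientation, a single item might look like the following; the field names are illustrative assumptions and may not match the exact column headers in the released manifests.

```python
# Illustrative shape of one WildScore item (field names are assumptions).
example = {
    "image": "images/0042.png",            # symbolic score crop
    "question": "Why does this cadence sound unresolved?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",                          # highest-scored comment, LLM tie-break if tied
    "category": "Harmony & Tonality",
    "subcategory": "Chord Progressions",
}
```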
- Python 3.8+
- CUDA-compatible GPU (recommended for local VLMs)
- OpenAI API key and/or HuggingFace token (as required by chosen models)
git clone <repo-url>
cd musictheory/final_code
pip install -r requirements.txt
# Copy example env and edit
cp env.example .env
nano .env
# Or set them directly
export OPENAI_API_KEY="your-openai-key"
export HF_TOKEN="your-huggingface-token"
export MUSIC_THEORY_BASE_DIR="/path/to/your/data"
All knobs live in config.py and can be overridden via environment variables (a minimal config.py sketch follows the list below).
- MUSIC_THEORY_BASE_DIR: Base directory for your data
- IMAGE_FOLDER: Path to sheet music images
- DEVICE: CPU/GPU preference
- GPT: GPT_CSV_PATH, GPT_OUTPUT_CSV, GPT_MODEL_NAME
- Phi: PHI_CSV_PATH, PHI_OUTPUT_CSV, PHI_MODEL_ID
- Qwen: QWEN_CSV_PATH, QWEN_OUTPUT_CSV, QWEN_MODEL_NAME
- InternVL: INTERNVLM_CSV_PATH, INTERNVLM_OUTPUT_CSV, INTERNVLM_MODEL_NAME
- LLaVA: LLAVA_CSV_PATH, LLAVA_OUTPUT_CSV, LLAVA_MODEL_ID
- MiniCPM: MINICPM_CSV_PATH, MINICPM_OUTPUT_CSV, MINICPM_MODEL_ID
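A minimal sketch of how config.py might expose these settings with environment-variable overrides; the defaults and paths below are placeholders, not the repository's actual values.

```python
# config.py sketch: each setting has a default and can be overridden via an
# environment variable of the same name (all defaults below are placeholders).
import os

BASE_DIR = os.getenv("MUSIC_THEORY_BASE_DIR", "./data")
IMAGE_FOLDER = os.getenv("IMAGE_FOLDER", os.path.join(BASE_DIR, "images"))
DEVICE = os.getenv("DEVICE", "cuda")  # model scripts can fall back to CPU

# Per-model knobs, shown here for the GPT evaluator:
GPT_CSV_PATH = os.getenv("GPT_CSV_PATH", os.path.join(BASE_DIR, "wildscore.csv"))
GPT_OUTPUT_CSV = os.getenv("GPT_OUTPUT_CSV", os.path.join(BASE_DIR, "gpt_predictions.csv"))
GPT_MODEL_NAME = os.getenv("GPT_MODEL_NAME", "gpt-4.1-mini")
```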
- Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form (12 subcategories, e.g., Chord Progressions, Modal Mixture, Modulation, Metric Structure, Rhythmic Patterns, Dynamics & Articulation)
Run per-model evaluators:
# OpenAI (vision-capable)
python gpt.py
# Phi-3-Vision
python phi.py
# Qwen-VL family
python qwen.py
# InternVL
python internvlm.py
# LLaVA
python llava.py
# MiniCPM
python miniCPM.py
Outputs (a scoring sketch follows this list):
- Predictions CSV with model choices for each item (option letters), with/without images
- Accuracy (overall, per-category/subcategory)
- Cost & token usage for API models (if configured)
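The sketch below scores such a predictions CSV overall and per category; the column names ("answer", "prediction", "category") are assumptions and may not match the exact headers written by the evaluator scripts.

```python
# Scoring sketch: overall and per-category accuracy from a predictions CSV.
# Column names ("answer", "prediction", "category") are illustrative only.
import pandas as pd

df = pd.read_csv("gpt_predictions.csv")
df["correct"] = (
    df["prediction"].str.strip().str.upper() == df["answer"].str.strip().str.upper()
)

print(f"Overall accuracy: {df['correct'].mean():.2%}")
print(df.groupby("category")["correct"].mean().sort_values(ascending=False))
```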
Source & period. Public threads with embedded score images (2012–2022); posts standardized into evaluation format.
Filtering & image detection. A fine-tuned detector was used to select symbolic-score images from ~4k candidates, alongside content/engagement rules (e.g., minimum word count, ≥3 top-level comments). Final set: 807 examples.
Ground truth. The top-level comment with the highest community score (upvotes minus downvotes) is taken as the correct answer; ties are broken with an LLM (sketched below).
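For illustration, ground-truth selection can be thought of as the following sketch; the comment fields and the tie-break flag are assumptions, not the exact pipeline code.

```python
# Sketch of ground-truth selection: take the top-level comment with the highest
# community score (upvotes minus downvotes); tied items go to an LLM tie-breaker.
# The field names and example comments are illustrative.
def select_ground_truth(comments):
    """Return (best_comment, needs_llm_tiebreak)."""
    score = lambda c: c["upvotes"] - c["downvotes"]
    best = max(comments, key=score)
    tied = [c for c in comments if score(c) == score(best)]
    return best, len(tied) > 1

comments = [
    {"body": "It's a deceptive cadence.", "upvotes": 14, "downvotes": 2},
    {"body": "The meter changes to 6/8.", "upvotes": 5, "downvotes": 1},
]
answer, needs_tiebreak = select_ground_truth(comments)
print(answer["body"], "| LLM tie-break needed:", needs_tiebreak)
```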
Taxonomy. 5 high-level categories, 12 subcategories for fine-grained analysis (e.g., Chord Progressions, Modulation, Modal Mixture, Metric Structure, Rhythmic Patterns, Dynamics & Articulation, Texture types, etc.).
Quality review. Multiple annotators validated question quality and removed ambiguous/incorrect items.
Evaluation modes. Image+Text and Text-only (ablation).
If you use WildScore, please cite:
@inproceedings{Mundada2025WildScore,
title = {WildScore: Benchmarking MLLMs in the Wild for Symbolic Music Reasoning},
author = {Mundada, Gagan and Vishe, Yash and Namburi, Amit and Xu, Xin and Novack, Zachary and McAuley, Julian and Wu, Junda},
booktitle = {EMNLP},
year = {2025}
}
Questions or issues? Open a GitHub issue or reach out: [email protected], [email protected].