WildScore is a benchmark for multimodal symbolic music reasoning on real-world score images and questions. Each example pairs a user-generated music theory question with a corresponding score image and multiple-choice answers, enabling rigorous, scalable evaluation of MLLMs in music theory.
Highlights
- 807 MCQ items sourced from real discussions (2012–2022) with score images.
- Five high-level categories (Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form) + 12 subcategories.
- Two modes: Image+Text and Text-only (ablation).
- Ground truth from the community “score” (upvotes minus downvotes), with an LLM tie-break when top comments are tied.
- Overview
- Results (EMNLP 2025)
- Repository Structure
- Setup
- Configuration
- Usage
- Data Card
- Security & Ethics
- License
- Citation
- Contact
What WildScore evaluates. Models must interpret symbolic score images and answer real musicological questions, covering harmony/tonality, rhythm/meter, texture, expression/performance, and form, posed as multiple-choice items. This keeps in-the-wild authenticity while enabling automatic scoring.
Where data comes from. We collected a decade of posts with embedded score images, standardized questions, and paired them with top-level answers. Items were filtered for content/engagement and reformulated as MCQs. Final set: 807 high-quality examples.
Multimodal filtering. A detector fine-tuned on annotated images filtered candidates to symbolic-music images.
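To make the filtering step concrete, here is a minimal sketch of selecting symbolic-score images with a fine-tuned binary image classifier. The checkpoint path, the "score" label, and the threshold are illustrative assumptions, not the actual detector used for WildScore.

```python
# Sketch only: filter candidate images with a fine-tuned binary classifier.
# "path/to/score-detector", the "score" label, and the 0.5 threshold are
# placeholders, not the real WildScore detector.
from pathlib import Path
from transformers import pipeline

classifier = pipeline("image-classification", model="path/to/score-detector")

def is_symbolic_score(image_path, threshold=0.5):
    preds = classifier(image_path)  # list of {"label": ..., "score": ...}
    score = next((p["score"] for p in preds if p["label"] == "score"), 0.0)
    return score >= threshold

candidates = sorted(Path("images/raw").glob("*.png"))
kept = [p for p in candidates if is_symbolic_score(str(p))]
print(f"kept {len(kept)} of {len(candidates)} candidate images")
```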
Two evaluation modes (a prompt-construction sketch follows the list).
- Image+Text (full multimodal)
- Text-only (ablation)
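As a concrete illustration of the two modes, the sketch below builds an Image+Text query and its Text-only counterpart for a vision-capable OpenAI model. The prompt wording, option letters, and file paths are assumptions; the actual prompts live in the per-model scripts (e.g., gpt.py).

```python
# Sketch of the two evaluation modes, assuming the OpenAI Chat Completions API
# and base64-encoded score images; gpt.py's actual prompts may differ.
import base64
from typing import List, Optional
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_messages(question: str, options: List[str], image_path: Optional[str]):
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
        + "\nAnswer with the option letter only."
    )
    content = [{"type": "text", "text": prompt}]
    if image_path is not None:  # Image+Text mode; omit the image for the Text-only ablation
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=build_messages(
        "Which key is tonicized in measure 3?",
        ["C major", "G major", "A minor", "E minor"],
        image_path="images/example.png",  # set to None for Text-only
    ),
)
print(resp.choices[0].message.content)
```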
Overall accuracy (%) on WildScore:
| Model | Image+Text | Text-only |
|---|---|---|
| GPT-4.1-mini | 68.31 | 65.76 |
| Phi-3-Vision | 48.82 | 47.72 |
| Qwen-VL | 49.73 | 49.18 |
| MiniCPM | 45.90 | 52.09 |
| InternVL | 39.34 | 45.54 |
| LLaVA | 32.97 | 37.16 |
Diagnostics
- Perception-only probe (symbol reading): GPT-4.1-mini 52%, InternVL 38%, LLaVA 26%. Many failures are perceptual rather than reasoning errors.
- ABC reconstruction from images: GPT-4.1-mini often produces valid ABC on short/simple excerpts but degrades on longer/denser passages; InternVL and LLaVA frequently degenerate.
Takeaway. Image context helps some models (e.g., GPT-4.1-mini, +2.55 pts) but can hurt others (MiniCPM, InternVL, LLaVA), underscoring notation-perception and alignment gaps.
musictheory/final_code/
├── config.py # Centralized configuration
├── gpt.py # GPT-4.1-mini evaluation
├── phi.py # Phi-3-Vision evaluation
├── qwen.py # Qwen-VL family evaluation
├── internvlm.py # InternVL evaluation
├── llava.py # LLaVA evaluation
├── miniCPM.py # MiniCPM evaluation
├── requirements.txt # Dependencies
├── data/ # Dataset manifests (CSV/JSONL) + splits
└── images/ # Symbolic score crops (if distributed)
Each example contains: score image, MCQ question, candidate answers from comments, and a ground-truth label (community score + LLM tie-break). Modes: Image+Text and Text-only.
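For orientation, a single item might look like the following; the field names are illustrative assumptions and may not match the exact column headers in the released manifests.

```python
# Illustrative shape of one WildScore item (field names are assumptions).
example = {
    "image": "images/0042.png",            # symbolic score crop
    "question": "Why does this cadence sound unresolved?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",                          # highest-scored comment, LLM tie-break if tied
    "category": "Harmony & Tonality",
    "subcategory": "Chord Progressions",
}
```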
- Python 3.8+
- CUDA-compatible GPU (recommended for local VLMs)
- OpenAI API key and/or HuggingFace token (as required by chosen models)
git clone <repo-url>
cd musictheory/final_code
pip install -r requirements.txt
# Copy example env and edit
cp env.example .env
nano .env
# Or set them directly
export OPENAI_API_KEY="your-openai-key"
export HF_TOKEN="your-huggingface-token"
export MUSIC_THEORY_BASE_DIR="/path/to/your/data"
All knobs live in config.py and can be overridden via environment variables (a minimal config.py sketch follows the list below).
- MUSIC_THEORY_BASE_DIR: Base directory for your data
- IMAGE_FOLDER: Path to sheet music images
- DEVICE: CPU/GPU preference
- GPT: GPT_CSV_PATH, GPT_OUTPUT_CSV, GPT_MODEL_NAME
- Phi: PHI_CSV_PATH, PHI_OUTPUT_CSV, PHI_MODEL_ID
- Qwen: QWEN_CSV_PATH, QWEN_OUTPUT_CSV, QWEN_MODEL_NAME
- InternVL: INTERNVLM_CSV_PATH, INTERNVLM_OUTPUT_CSV, INTERNVLM_MODEL_NAME
- LLaVA: LLAVA_CSV_PATH, LLAVA_OUTPUT_CSV, LLAVA_MODEL_ID
- MiniCPM: MINICPM_CSV_PATH, MINICPM_OUTPUT_CSV, MINICPM_MODEL_ID
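A minimal sketch of how config.py might expose these settings with environment-variable overrides; the defaults and paths below are placeholders, not the repository's actual values.

```python
# config.py sketch: each setting has a default and can be overridden via an
# environment variable of the same name (all defaults below are placeholders).
import os

BASE_DIR = os.getenv("MUSIC_THEORY_BASE_DIR", "./data")
IMAGE_FOLDER = os.getenv("IMAGE_FOLDER", os.path.join(BASE_DIR, "images"))
DEVICE = os.getenv("DEVICE", "cuda")  # model scripts can fall back to CPU

# Per-model knobs, shown here for the GPT evaluator:
GPT_CSV_PATH = os.getenv("GPT_CSV_PATH", os.path.join(BASE_DIR, "wildscore.csv"))
GPT_OUTPUT_CSV = os.getenv("GPT_OUTPUT_CSV", os.path.join(BASE_DIR, "gpt_predictions.csv"))
GPT_MODEL_NAME = os.getenv("GPT_MODEL_NAME", "gpt-4.1-mini")
```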
- Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form (12 subcategories, e.g., Chord Progressions, Modal Mixture, Modulation, Metric Structure, Rhythmic Patterns, Dynamics & Articulation)
Run per-model evaluators:
# OpenAI (vision-capable)
python gpt.py
# Phi-3-Vision
python phi.py
# Qwen-VL family
python qwen.py
# InternVL
python internvlm.py
# LLaVA
python llava.py
# MiniCPM
python miniCPM.py
Outputs (a scoring sketch follows this list):
- Predictions CSV with model choices for each item (option letters), with/without images
- Accuracy (overall, per-category/subcategory)
- Cost & token usage for API models (if configured)
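The sketch below scores such a predictions CSV overall and per category; the column names ("answer", "prediction", "category") are assumptions and may not match the exact headers written by the evaluator scripts.

```python
# Scoring sketch: overall and per-category accuracy from a predictions CSV.
# Column names ("answer", "prediction", "category") are illustrative only.
import pandas as pd

df = pd.read_csv("gpt_predictions.csv")
df["correct"] = (
    df["prediction"].str.strip().str.upper() == df["answer"].str.strip().str.upper()
)

print(f"Overall accuracy: {df['correct'].mean():.2%}")
print(df.groupby("category")["correct"].mean().sort_values(ascending=False))
```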
Source & period. Public threads with embedded score images (2012–2022); posts standardized into evaluation format.
Filtering & image detection. A fine-tuned detector was used to select symbolic-score images from ~4k candidates, alongside content/engagement rules (e.g., minimum word count, ≥3 top-level comments). Final set: 807 examples.
Ground truth. The top-level comment with the highest community score (upvotes minus downvotes) is taken as the correct answer; ties are broken with an LLM (sketched below).
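For illustration, ground-truth selection can be thought of as the following sketch; the comment fields and the tie-break flag are assumptions, not the exact pipeline code.

```python
# Sketch of ground-truth selection: take the top-level comment with the highest
# community score (upvotes minus downvotes); tied items go to an LLM tie-breaker.
# The field names and example comments are illustrative.
def select_ground_truth(comments):
    """Return (best_comment, needs_llm_tiebreak)."""
    score = lambda c: c["upvotes"] - c["downvotes"]
    best = max(comments, key=score)
    tied = [c for c in comments if score(c) == score(best)]
    return best, len(tied) > 1

comments = [
    {"body": "It's a deceptive cadence.", "upvotes": 14, "downvotes": 2},
    {"body": "The meter changes to 6/8.", "upvotes": 5, "downvotes": 1},
]
answer, needs_tiebreak = select_ground_truth(comments)
print(answer["body"], "| LLM tie-break needed:", needs_tiebreak)
```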
Taxonomy. 5 high-level categories, 12 subcategories for fine-grained analysis (e.g., Chord Progressions, Modulation, Modal Mixture, Metric Structure, Rhythmic Patterns, Dynamics & Articulation, Texture types, etc.).
Quality review. Multiple annotators validated question quality and removed ambiguous/incorrect items.
Evaluation modes. Image+Text and Text-only (ablation).
If you use WildScore, please cite:
@inproceedings{Mundada2025WildScore,
title = {WildScore: Benchmarking MLLMs in the Wild for Symbolic Music Reasoning},
author = {Mundada, Gagan and Vishe, Yash and Namburi, Amit and Xu, Xin and Novack, Zachary and McAuley, Julian and Wu, Junda},
booktitle = {EMNLP},
year = {2025}
}
Questions or issues? Open a GitHub issue or reach out: [email protected], [email protected].