WildScore: In-the-Wild Symbolic Music Reasoning Benchmark (EMNLP 2025)

WildScore is a benchmark for multimodal symbolic music reasoning on real-world score images and questions. Each example pairs a user-generated music theory question with a corresponding score image and multiple-choice answers, enabling rigorous, scalable evaluation of MLLMs in music theory.

Highlights

  • 807 MCQ items sourced from real discussions (2012–2022) with score images.
  • Five high-level categories (Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form) + 12 subcategories.
  • Two modes: Image+Text and Text-only (ablation).
  • Ground truth from the community score (upvotes minus downvotes), with an LLM judge breaking ties.

Overview

What WildScore evaluates. Models must interpret symbolic score images and answer real musicological questions, covering harmony/tonality, rhythm/meter, texture, expression/performance, and form, posed as multiple-choice items. This keeps in-the-wild authenticity while enabling automatic scoring.

Where data comes from. We collected a decade of posts (2012–2022) with embedded score images, standardized the questions, and paired them with top-level answers. Items were filtered for content and engagement and reformulated as MCQs. Final set: 807 high-quality examples.

Multimodal filtering. A detector fine-tuned on annotated images filtered candidates to symbolic-music images.

Two evaluation modes.

  • Image+Text (full multimodal)
  • Text-only (ablation)
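
As a rough illustration (not the repository's actual code), the two modes can be thought of as one prompt builder that optionally attaches the score image; the OpenAI-style message structure and parameter names below are assumptions.

# Sketch of the two modes: same MCQ prompt, with or without the score image
# (illustrative only; message format follows the OpenAI chat-completions style).
import base64

def build_messages(question, options, image_path=None):
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer with a single option letter."
    content = [{"type": "text", "text": prompt}]
    if image_path is not None:  # Image+Text mode; omit the image for Text-only
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]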

Results (EMNLP 2025)

Overall accuracy (%) on WildScore:

Model           Image+Text   Text-only
GPT-4.1-mini    68.31        65.76
Phi-3-Vision    48.82        47.72
Qwen-VL         49.73        49.18
MiniCPM         45.90        52.09
InternVL        39.34        45.54
LLaVA           32.97        37.16

Diagnostics

  • Perception-only probe (symbol reading): GPT-4.1-mini 52%, InternVL 38%, LLaVA 26%. Many failures are perceptual rather than reasoning.
  • ABC reconstruction from images: GPT-4.1-mini often valid on short/simple excerpts but degrades on longer/denser passages; InternVL/LLaVA frequently degenerate.

Takeaway. Image context helps some models (e.g., GPT-4.1-mini, +2.55 pts) but can hurt others (MiniCPM, InternVL, LLaVA), underscoring notation-perception and alignment gaps.


Repository Structure


musictheory/final_code/
├── config.py               # Centralized configuration
├── gpt.py                  # GPT-4.1-mini evaluation
├── phi.py                  # Phi-3-Vision evaluation
├── qwen.py                 # Qwen-VL family evaluation
├── internvlm.py            # InternVL evaluation
├── llava.py                # LLaVA evaluation
├── miniCPM.py              # MiniCPM evaluation
├── requirements.txt        # Dependencies
├── data/                   # Dataset manifests (CSV/JSONL) + splits
└── images/                 # Symbolic score crops (if distributed)

Each example contains a score image, an MCQ question, candidate answers drawn from comments, and a ground-truth label (community score + LLM tie-break). Modes: Image+Text and Text-only.
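
A single manifest row might look like the sketch below; the field names are illustrative assumptions, not the released schema.

# Hypothetical example record (field names are illustrative only).
example = {
    "image_path": "images/0001.png",           # symbolic score crop
    "question": "To which key does this passage modulate?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",                             # community score + LLM tie-break
    "category": "Harmony & Tonality",
    "subcategory": "Modulation",
}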


Setup

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended for local VLMs)
  • OpenAI API key and/or HuggingFace token (as required by chosen models)

Installation

git clone <repo-url>
cd musictheory/final_code
pip install -r requirements.txt

Environment

# Copy example env and edit
cp env.example .env
nano .env

# Or set them directly
export OPENAI_API_KEY="your-openai-key"
export HF_TOKEN="your-huggingface-token"
export MUSIC_THEORY_BASE_DIR="/path/to/your/data"

Configuration

All settings live in config.py and can be overridden via environment variables; a minimal sketch of this pattern follows the lists below.

Base Configuration

  • MUSIC_THEORY_BASE_DIR: Base directory for your data
  • IMAGE_FOLDER: Path to sheet music images
  • DEVICE: CPU/GPU preference

Model-Specific Configuration

  • GPT: GPT_CSV_PATH, GPT_OUTPUT_CSV, GPT_MODEL_NAME
  • Phi: PHI_CSV_PATH, PHI_OUTPUT_CSV, PHI_MODEL_ID
  • Qwen: QWEN_CSV_PATH, QWEN_OUTPUT_CSV, QWEN_MODEL_NAME
  • InternVL: INTERNVLM_CSV_PATH, INTERNVLM_OUTPUT_CSV, INTERNVLM_MODEL_NAME
  • LLaVA: LLAVA_CSV_PATH, LLAVA_OUTPUT_CSV, LLAVA_MODEL_ID
  • MiniCPM: MINICPM_CSV_PATH, MINICPM_OUTPUT_CSV, MINICPM_MODEL_ID
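
A minimal sketch of the override pattern, assuming config.py reads these variables with os.environ.get (the default values shown are illustrative, not the repository's actual defaults):

# Sketch of a config.py-style override pattern (defaults are illustrative).
import os

MUSIC_THEORY_BASE_DIR = os.environ.get("MUSIC_THEORY_BASE_DIR", "./data")
IMAGE_FOLDER = os.environ.get("IMAGE_FOLDER",
                              os.path.join(MUSIC_THEORY_BASE_DIR, "images"))
DEVICE = os.environ.get("DEVICE", "cuda")

GPT_MODEL_NAME = os.environ.get("GPT_MODEL_NAME", "gpt-4.1-mini")
GPT_CSV_PATH = os.environ.get("GPT_CSV_PATH",
                              os.path.join(MUSIC_THEORY_BASE_DIR, "wildscore.csv"))
GPT_OUTPUT_CSV = os.environ.get("GPT_OUTPUT_CSV", "gpt_predictions.csv")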

Categories for Analysis

  • Harmony & Tonality, Rhythm & Meter, Texture, Expression & Performance, Form (12 subcategories, e.g., Chord Progressions, Modal Mixture, Modulation, Metric Structure, Rhythmic Patterns, Dynamics & Articulation, etc.)

Usage

Run per-model evaluators:

# OpenAI (vision-capable)
python gpt.py

# Phi-3-Vision
python phi.py

# Qwen-VL family
python qwen.py

# InternVL
python internvlm.py

# LLaVA
python llava.py

# MiniCPM
python miniCPM.py

Outputs

  • Predictions CSV with model choices for each item (option letters), with/without images
  • Accuracy (overall, per-category/subcategory)
  • Cost & token usage for API models (if configured)
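
For instance, overall and per-category accuracy can be recomputed from a predictions CSV along these lines (the column names prediction, answer, and category are assumptions about the output format):

# Sketch: scoring a predictions CSV (column names are assumptions).
import pandas as pd

df = pd.read_csv("gpt_predictions.csv")
df["correct"] = df["prediction"].str.upper() == df["answer"].str.upper()
print(f"Overall accuracy: {df['correct'].mean():.2%}")
print(df.groupby("category")["correct"].mean().sort_values(ascending=False))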

Data Card

Source & period. Public threads with embedded score images (2012–2022); posts were standardized into the evaluation format.

Filtering & image detection. A fine-tuned detector selected symbolic-score images from ~4k candidates; content/engagement rules (e.g., minimum word count, ≥3 top-level comments) were also applied. Final set: 807 examples.

Ground truth. The comment with the maximum score $S = \text{upvotes} - \text{downvotes}$ is taken as the answer; ties are resolved by an LLM judge grounded in the thread.
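
Selection therefore reduces to an argmax over comment scores, with the LLM judge used only when the top score is tied; a minimal sketch (the llm_tie_break helper is hypothetical):

# Ground-truth selection sketch: top-scored comment wins, ties go to an
# LLM judge (llm_tie_break is a hypothetical helper).
def select_ground_truth(comments, llm_tie_break):
    # comments: list of dicts with "text", "upvotes", "downvotes"
    scored = [(c["upvotes"] - c["downvotes"], c) for c in comments]
    best = max(score for score, _ in scored)
    top = [c for score, c in scored if score == best]
    if len(top) == 1:
        return top[0]
    return llm_tie_break(top)  # judge grounded in the original thread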

Taxonomy. 5 high-level categories, 12 subcategories for fine-grained analysis (e.g., Chord Progressions, Modulation, Modal Mixture, Metric Structure, Rhythmic Patterns, Dynamics & Articulation, Texture types, etc.).

Quality review. Multiple annotators validated question quality and removed ambiguous/incorrect items.

Evaluation modes. Image+Text and Text-only (ablation).


Citation

If you use WildScore, please cite:

@inproceedings{Mundada2025WildScore,
  title   = {WildScore: Benchmarking MLLMs in the Wild for Symbolic Music Reasoning},
  author  = {Mundada, Gagan and Vishe, Yash and Namburi, Amit and Xu, Xin and Novack, Zachary and McAuley, Julian and Wu, Junda},
  booktitle = {EMNLP},
  year    = {2025}
}

Contact

Questions or issues? Open a GitHub issue or reach out: [email protected], [email protected].

