Music4All

This repository contains our code for the paper: "Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models."

Survey | Model | Paper


We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models -- MusicGen and Mustango, for two underrepresented non-Western music traditions -- Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.

Global Music Generation Analysis

Datasets

The CompMusic dataset contains 120+ hours of Turkish Makam and Hindustani Classical data.

The MTG-Saraga dataset contains 40+ hours of annotated Hindustani Classical data.

For Hindustani Classical, the dataset includes five instrument types (sarangi, harmonium, tabla, violin, and tanpura) along with voice, and comprises 41 ragas across two laya types: Madhya and Vilambit. The Turkish makam portion features 16 makam-specific instruments (oud, tanbur, ney, davul, clarinet, kös, kudüm, yaylı tanbur, tef, kanun, zurna, bendir, darbuka, classical kemençe, rebab, and çevgen), and encompasses 93 different makams and 63 distinct usuls.

Adapter Positioning

Mustango

To enhance this process, a Bottleneck Residual Adapter with convolution layers is integrated into the up-sampling, middle, and down-sampling blocks of the UNet, positioned just after the cross-attention block. This design facilitates cultural adaptation while preserving computational efficiency. The adapters reduce channel dimensions by a factor of 8, using a kernel size of 1 and GeLU activation after the down-projection layers to introduce non-linearity.
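As a rough sketch of this design (not the paper's implementation: the channel count, initialization scale, and the tanh approximation of GeLU are illustrative), a 1x1-conv bottleneck adapter with a residual connection reduces to a per-pixel linear map over channels:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def bottleneck_residual_adapter(h, w_down, w_up):
    """Convolutional bottleneck adapter applied to a UNet feature map.

    h:      (C, H, W) features, e.g. output of a cross-attention block
    w_down: (C // 8, C) 1x1-conv kernel (down-projection, reduction factor 8)
    w_up:   (C, C // 8) 1x1-conv kernel (up-projection)
    """
    z = np.einsum('rc,chw->rhw', w_down, h)  # down-project: C -> C/8
    z = gelu(z)                              # non-linearity after down-projection
    z = np.einsum('cr,rhw->chw', w_up, z)    # up-project: C/8 -> C
    return h + z                             # residual connection

rng = np.random.default_rng(0)
C, H, W = 64, 8, 8
h = rng.standard_normal((C, H, W))
out = bottleneck_residual_adapter(
    h,
    rng.standard_normal((C // 8, C)) * 0.02,
    np.zeros((C, C // 8)),  # zero-init up-projection: adapter starts as identity
)
assert out.shape == h.shape
assert np.allclose(out, h)  # with a zero up-projection the adapter is a no-op
```

Zero-initializing the up-projection is a common adapter trick: training starts from the unmodified pretrained model and the adapter gradually learns a correction.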

MusicGen

In MusicGen, we add roughly 2 million parameters by integrating a Linear Bottleneck Residual Adapter after the transformer decoder, a placement chosen after thorough experimentation with alternatives.
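A hedged sketch of such a linear bottleneck adapter, with hypothetical dimensions chosen so the parameter count lands near the 2 million mentioned (D = 2048 with a 512-dimensional bottleneck is an assumption, not the paper's configuration; the GeLU activation is borrowed from the Mustango adapter description):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def linear_bottleneck_adapter(h, w_down, b_down, w_up, b_up):
    """h: (T, D) hidden states emitted by the transformer decoder."""
    z = gelu(h @ w_down + b_down)  # down-project: D -> bottleneck
    z = z @ w_up + b_up            # up-project: bottleneck -> D
    return h + z                   # residual connection

def adapter_params(d_model, d_bottleneck):
    # two weight matrices plus two bias vectors
    return 2 * d_model * d_bottleneck + d_bottleneck + d_model

T, D, B = 10, 2048, 512  # hypothetical sequence length and dimensions
rng = np.random.default_rng(0)
h = rng.standard_normal((T, D))
out = linear_bottleneck_adapter(
    h,
    rng.standard_normal((D, B)) * 0.02, np.zeros(B),
    np.zeros((B, D)), np.zeros(D),  # zero-init up-projection
)
assert out.shape == (T, D)
assert np.allclose(out, h)                  # identity at initialization
assert adapter_params(D, B) == 2_099_712    # ~2.1 M trainable parameters
```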

The total parameter count of both models is ~2 billion, making the adapter only about 0.1% of the total size (2M parameters). For both models, fine-tuning ran on two RTX A6000 GPUs for around 10 hours. Only the adapter block was fine-tuned, using the AdamW optimizer with an MSE (reconstruction) loss.
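As an illustration of this training setup (a minimal sketch: the shapes, learning rate, and random data are hypothetical stand-ins, not the paper's configuration), everything except a residual adapter matrix is frozen and a reconstruction MSE is minimized with a hand-rolled AdamW update:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update (Adam with decoupled weight decay)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

rng = np.random.default_rng(0)
N, D = 256, 32
x = rng.standard_normal((N, D))       # frozen-model hidden states (stand-in)
target = rng.standard_normal((N, D))  # reconstruction targets (stand-in)

A = np.zeros((D, D))                  # adapter weights: the only trainable tensor
m, v = np.zeros_like(A), np.zeros_like(A)

def mse(A):
    y = x + x @ A                     # residual adapter on top of frozen features
    return np.mean((y - target) ** 2)

loss0 = mse(A)
for t in range(1, 201):
    y = x + x @ A
    grad = x.T @ (2 * (y - target) / (N * D))  # analytic MSE gradient w.r.t. A
    A, m, v = adamw_step(A, grad, m, v, t)
loss1 = mse(A)
assert loss1 < loss0                  # reconstruction loss decreases
```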

Evaluations

Objective Evaluation Metrics for Music Models

The table below presents the objective evaluation metrics for Hindustani Classical Music and Turkish Makam, assessing the quality of generated music based on Fréchet Audio Distance (FAD), Fréchet Distance (FD), Kullback-Leibler Divergence (KLD), and Peak Signal-to-Noise Ratio (PSNR).

Hindustani Classical Music

| Model | FAD ↓ | FD ↓ | KLD ↓ | PSNR ↑ |
|---|---|---|---|---|
| MusicGen Baseline | 40.05 | 75.76 | 6.53 | 16.23 |
| MusicGen Finetuned | 40.04 | 72.65 | 6.12 | 16.18 |
| Mustango Baseline | 6.36 | 45.31 | 2.73 | 16.78 |
| Mustango Finetuned | 5.18 | 22.03 | 1.26 | 17.70 |

Turkish Makam

| Model | FAD ↓ | FD ↓ | KLD ↓ | PSNR ↑ |
|---|---|---|---|---|
| MusicGen Baseline | 39.65 | 57.29 | 7.35 | 14.60 |
| MusicGen Finetuned | 39.68 | 56.71 | 7.21 | 14.46 |
| Mustango Baseline | 8.65 | 75.21 | 6.01 | 16.60 |
| Mustango Finetuned | 2.57 | 20.56 | 4.81 | 16.17 |
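Of these metrics, FAD, FD, and KLD are computed on embeddings from pretrained audio models, so they require model-specific tooling; PSNR, by contrast, is a direct signal-level comparison. A minimal sketch of the standard PSNR formula (the 440 Hz test tone and noise level are illustrative, not from the paper):

```python
import numpy as np

def psnr(reference, generated, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = np.mean((reference - generated) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak ** 2 / mse)

t = np.linspace(0, 1, 16000, endpoint=False)
reference = 0.5 * np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone at 16 kHz
noisy = reference + 0.01 * np.random.default_rng(0).standard_normal(t.shape)

assert psnr(reference, reference) == float('inf')  # identical signals
assert psnr(reference, noisy) > 20                 # small noise -> high PSNR
```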

Human Evaluation (ELO Ratings, ↑)

The table below presents the human evaluation scores (ELO Ratings) for Hindustani Classical Music and Turkish Makam, where higher values indicate better performance.

Hindustani Classical Music - All Queries

| Model | OA ↑ | Inst. ↑ | MC ↑ | RC ↑ | CR ↑ |
|---|---|---|---|---|---|
| MusicGen Baseline | 1525 | 1520 | 1540 | 1552 | 1546 |
| Mustango Baseline | 1449 | 1466 | 1409 | 1470 | 1518 |
| MusicGen Finetuned | 1448 | 1454 | 1428 | 1439 | 1448 |
| Mustango Finetuned | 1577 | 1559 | 1623 | 1538 | 1487 |

Turkish Makam - All Queries

| Model | OA ↑ | Inst. ↑ | MC ↑ | RC ↑ | CR ↑ |
|---|---|---|---|---|---|
| MusicGen Baseline | 1539 | 1562 | 1597 | 1622 | 1603 |
| Mustango Baseline | 1527 | 1531 | 1499 | 1523 | 1560 |
| MusicGen Finetuned | 1597 | 1529 | 1570 | 1570 | 1541 |
| Mustango Finetuned | 1337 | 1377 | 1334 | 1286 | 1297 |

Legend:

  • OA (Overall Accuracy)
  • Inst. (Instrumentation)
  • MC (Melodic Consistency)
  • RC (Rhythmic Consistency)
  • CR (Creativity)
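As background on how such scores arise, each pairwise human preference between two generations updates the models' ELO ratings. A minimal sketch of the standard update rule (the K-factor of 32 and starting rating of 1500 are conventional defaults, not necessarily the paper's settings):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Update two ELO ratings after one pairwise comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # expected score of A
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Equal ratings, A preferred: A gains k/2 points and B loses k/2.
a, b = elo_update(1500, 1500, 1.0)
assert (a, b) == (1516.0, 1484.0)
```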

Citation

Please consider citing the following article if you found our work useful:

@inproceedings{mehta-etal-2025-music,
    title = "Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models",
    author = "Mehta, Atharva  and
      Chauhan, Shivam  and
      Djanibekov, Amirbek  and
      Kulkarni, Atharva  and
      Xia, Gus  and
      Choudhury, Monojit",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.258/",
    doi = "10.18653/v1/2025.findings-naacl.258",
    pages = "4569--4585",
    ISBN = "979-8-89176-195-7",
    abstract = "The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7{\%} of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models {--} MusicGen and Mustango, for two underrepresented non-Western music traditions {--} Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning."
}
