This repository contains the code for our paper: "Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models."

We present a study of datasets and research papers on music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models (MusicGen and Mustango) on two underrepresented non-Western music traditions (Hindustani Classical and Turkish Makam) highlight the promise as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models designed for cross-cultural transfer learning.
The CompMusic dataset contains 120+ hours of Turkish Makam and Hindustani Classical recordings.
The MTG-Saraga dataset contains 40+ hours of annotated Hindustani Classical data.
For Hindustani Classical, the dataset covers five instrument types (sarangi, harmonium, tabla, violin, and tanpura) along with voice, and comprises 41 ragas across two laya types, Madhya and Vilambit. The Turkish Makam portion features 16 instruments (oud, tanbur, ney, davul, clarinet, kös, kudüm, yaylı tanbur, tef, kanun, zurna, bendir, darbuka, classical kemençe, rebab, and çevgen) and encompasses 93 different makams and 63 distinct usuls.
To adapt Mustango, a Bottleneck Residual Adapter with convolutional layers is integrated into the up-sampling, middle, and down-sampling blocks of the UNet, positioned just after each cross-attention block. This design enables cultural adaptation while preserving computational efficiency. The adapters reduce the channel dimension by a factor of 8, using a kernel size of 1, with a GeLU activation after the down-projection layer to introduce non-linearity.
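
For illustration, here is a minimal PyTorch sketch of such a convolutional bottleneck residual adapter. The class name is ours, and the zero-initialized up-projection is a common stabilizing choice assumed here rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ConvBottleneckAdapter(nn.Module):
    """Residual adapter inserted after a UNet cross-attention block.

    Projects the channel dimension down by `reduction` (8 in the paper),
    applies a GeLU non-linearity, and projects back up; the input is
    added back through a residual connection.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        bottleneck = channels // reduction
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        # Zero-init the up-projection so the adapter starts as an identity
        # mapping and does not disturb the pretrained UNet (an assumed,
        # commonly used initialization, not confirmed by the paper).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

One such adapter would be instantiated per up-sampling, middle, and down-sampling block and applied to the hidden states right after the cross-attention output.
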
In MusicGen, we add roughly 2 million parameters by integrating a Linear Bottleneck Residual Adapter after the transformer decoder, a placement chosen after thorough experimentation with alternatives.
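
A sketch of the linear variant follows. The hidden size and bottleneck width are illustrative values chosen only to show how a ~2M-parameter budget might be met, and the GeLU activation is assumed by analogy with the convolutional adapter:

```python
import torch
import torch.nn as nn

class LinearBottleneckAdapter(nn.Module):
    """Residual adapter placed after MusicGen's transformer decoder."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) hidden states from the decoder
        return x + self.up(self.act(self.down(x)))

# Illustrative sizing: with d_model = 1536 and bottleneck = 650, the two
# linear layers hold roughly 2 * 1536 * 650 ≈ 2M weights, matching the
# stated parameter budget.
```
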
Both models have a total parameter count of ~2 billion, so the 2M-parameter adapter accounts for only about 0.1% of the model size. For both models, fine-tuning ran on two RTX A6000 GPUs for around 10 hours. Only the adapter block was fine-tuned, using the AdamW optimizer with an MSE (reconstruction) loss.
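
Schematically, the fine-tuning loop freezes the pretrained weights and updates only the adapter with AdamW against an MSE reconstruction loss. In the sketch below, `model`, `model.adapter`, `dataloader`, and the learning rate are placeholders, not the repository's actual names or hyperparameters:

```python
import torch

# Freeze all pretrained weights; only the adapter remains trainable.
for param in model.parameters():
    param.requires_grad = False
for param in model.adapter.parameters():  # hypothetical attribute name
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.adapter.parameters(), lr=1e-4)  # lr is a placeholder
loss_fn = torch.nn.MSELoss()  # MSE reconstruction loss

for batch in dataloader:  # batches from the Hindustani / Makam fine-tuning set
    optimizer.zero_grad()
    prediction = model(batch["input"])
    loss = loss_fn(prediction, batch["target"])
    loss.backward()
    optimizer.step()
```
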
The tables below present the objective evaluation metrics for Hindustani Classical music and Turkish Makam, assessing the quality of the generated music via Fréchet Audio Distance (FAD), Fréchet Distance (FD), Kullback-Leibler Divergence (KLD), and Peak Signal-to-Noise Ratio (PSNR). A sketch of the FAD computation follows the tables.

**Hindustani Classical Music**

| Model | FAD ↓ | FD ↓ | KLD ↓ | PSNR ↑ |
|---|---|---|---|---|
| MusicGen Baseline | 40.05 | 75.76 | 6.53 | 16.23 |
| MusicGen Finetuned | 40.04 | 72.65 | 6.12 | 16.18 |
| Mustango Baseline | 6.36 | 45.31 | 2.73 | 16.78 |
| Mustango Finetuned | 5.18 | 22.03 | 1.26 | 17.70 |

**Turkish Makam**

| Model | FAD ↓ | FD ↓ | KLD ↓ | PSNR ↑ |
|---|---|---|---|---|
| MusicGen Baseline | 39.65 | 57.29 | 7.35 | 14.60 |
| MusicGen Finetuned | 39.68 | 56.71 | 7.21 | 14.46 |
| Mustango Baseline | 8.65 | 75.21 | 6.01 | 16.60 |
| Mustango Finetuned | 2.57 | 20.56 | 4.81 | 16.17 |
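
For reference, FAD is the Fréchet distance between Gaussians fitted to embeddings of the reference and generated audio sets (typically VGGish features). Below is a minimal numpy/scipy sketch, assuming the embeddings have already been extracted; it illustrates the metric, not the evaluation code used for the tables above:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs have shape (num_clips, embedding_dim), e.g. VGGish features.
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the covariance product; discard the tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```
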
The tables below present the human evaluation scores (Elo ratings) for Hindustani Classical music and Turkish Makam, where higher values indicate better performance. A sketch of the Elo update rule follows the metric key.

**Hindustani Classical Music**

| Model | OA ↑ | Inst. ↑ | MC ↑ | RC ↑ | CR ↑ |
|---|---|---|---|---|---|
| MusicGen Baseline | 1525 | 1520 | 1540 | 1552 | 1546 |
| Mustango Baseline | 1449 | 1466 | 1409 | 1470 | 1518 |
| MusicGen Finetuned | 1448 | 1454 | 1428 | 1439 | 1448 |
| Mustango Finetuned | 1577 | 1559 | 1623 | 1538 | 1487 |

**Turkish Makam**

| Model | OA ↑ | Inst. ↑ | MC ↑ | RC ↑ | CR ↑ |
|---|---|---|---|---|---|
| MusicGen Baseline | 1539 | 1562 | 1597 | 1622 | 1603 |
| Mustango Baseline | 1527 | 1531 | 1499 | 1523 | 1560 |
| MusicGen Finetuned | 1597 | 1529 | 1570 | 1570 | 1541 |
| Mustango Finetuned | 1337 | 1377 | 1334 | 1286 | 1297 |

Metric key:
- OA (Overall Accuracy)
- Inst. (Instrumentation)
- MC (Melodic Consistency)
- RC (Rhythmic Consistency)
- CR (Creativity)
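
The Elo ratings above are derived from pairwise human comparisons between systems. As a reminder of how such ratings evolve, here is a small sketch of the standard Elo update rule; the K-factor and the 1500 starting rating are conventional defaults, not necessarily the exact protocol used in the paper:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: all systems start at 1500; one win nudges the ratings apart.
print(update_elo(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```
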
Please consider citing the following article if you find our work useful:
```bibtex
@inproceedings{mehta-etal-2025-music,
    title = "Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models",
    author = "Mehta, Atharva and
      Chauhan, Shivam and
      Djanibekov, Amirbek and
      Kulkarni, Atharva and
      Xia, Gus and
      Choudhury, Monojit",
    editor = "Chiruzzo, Luis and
      Ritter, Alan and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.258/",
    doi = "10.18653/v1/2025.findings-naacl.258",
    pages = "4569--4585",
    ISBN = "979-8-89176-195-7",
    abstract = "The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7{\%} of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models {--} MusicGen and Mustango, for two underrepresented non-Western music traditions {--} Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning."
}
```

