GitHub - shivam-MBZUAI/cross-cultural-mel-bias

CROSS-CULTURAL BIAS IN MEL-SCALE REPRESENTATIONS: EVIDENCE AND ALTERNATIVES FROM SPEECH AND MUSIC

Abstract

Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We demonstrate that mel-scale features achieve 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% absolute gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: ERB-scale filtering cuts disparities by 31% with only 1% computational overhead, while CQT achieves 52% reduction in music performance gaps.

1. Contributions

Systematic evaluation of seven audio front-ends across 11 languages, 6 musical collections, and 10 European cities
Demonstrating mel-scale bias: 31.2% WER for tonal vs 18.7% for non-tonal languages (12.5% gap)
Revealing critical frequencies: 200-500 Hz where mel resolution is insufficient for tonal languages
Showing alternatives work: CQT (52% music gap reduction), ERB (31% across domains with minimal overhead)
Releasing FairAudioBench: Benchmark for cross-cultural audio evaluation

2. Results

Performance Gaps (Figure 1)

Domain	Mel Baseline Gap	Best Alternative	Reduction
Speech (Tonal vs Non-tonal)	12.5% WER	LEAF: 8.3%	34%
Music (Western vs Non-Western)	15.7% F1	CQT: 7.6%	52%
Scenes (Europe-1 vs Europe-2)	5.6% Acc	ERB: 5.0%	11%

Comprehensive Results (Table 1)

Front-end	Speech WER/CER (%)		Music F1 (%)		Scenes Acc (%)		Overhead
	Tonal	Non-tonal	Non-West	West	Europe-1	Europe-2
mel	31.2±1.2	18.7±0.8	56.7±2.1	72.4±1.5	71.2±1.4	76.8±1.2	1.00×
ERB	26.4±1.0	17.8±0.7	62.8±2.0	73.1±1.4	72.6±1.3	77.2±1.1	1.01×
Bark	27.2±1.0	18.1±0.8	61.9±2.1	72.8±1.5	72.2±1.3	76.9±1.2	1.01×
CQT	28.8±1.1	19.2±0.9	65.3±1.9	72.9±1.4	-	-	1.15×
LEAF	25.8±0.9	17.5±0.7	62.4±2.0	73.5±1.4	72.5±1.3	77.5±1.1	1.08×
SincNet	30.8±1.1	18.5±0.8	58.3±2.1	72.5±1.5	71.4±1.3	76.9±1.2	1.06×
mel+PCEN	28.9±1.1	18.2±0.7	59.2±2.2	72.6±1.5	72.3±1.3	77.1±1.1	1.04×

Fairness Metrics

Metric	Formula	Speech	Music	Scenes
WGS	min(Acc)	68.8→74.2	56.7→65.3	71.2→72.5
Δ	max-min	12.5→8.3	15.7→7.6	5.6→5.0
ρ	min/max	0.85→0.90	0.78→0.90	0.93→0.94

3. Experimental Setup

Datasets

Speech Recognition CommonVoice v17.0

Tonal Languages (5): Mandarin Chinese (4 tones), Vietnamese (6 tones), Thai (5 tones), Punjabi (3 tones), Cantonese (6 tones)
Non-tonal Languages (6): English, Spanish, German, French, Italian, Dutch
Samples: 2,000 test samples per language
Metrics: CER for tonal, WER for non-tonal

Music Analysis

Western Collections:
- GTZAN (10 genres, 1000 tracks)
- FMA-small (8 genres, 8000 tracks)
Non-Western Collections (CompMusic):
- Hindustani (1124 recordings, 195 ragas)
- Carnatic (2380 recordings, 227 ragas)
- Turkish makam (6500 recordings, 155 makams)
- Arab-Andalusian (338 recordings, 11 mizans)
Samples: 300 recordings per collection

Acoustic Scenes TAU Urban Acoustic Scenes 2020 Mobile

Europe-1 (Northern): Helsinki, Stockholm, Amsterdam, London, Vienna
Europe-2 (Southern): Barcelona, Lisbon, Paris, Lyon, Prague
Scene Types: 10 urban acoustic environments
Samples: 100 recordings per city

4. Implemented Front-ends In Codebase

This codebase implements 7 audio front-ends with fixed transformations that can be used as drop-in replacements:

Front-end	Type	Parameters	Description
Mel	Fixed	40 filters, 25ms window, 10ms hop	Standard mel-scale filterbank (baseline)
ERB	Fixed	32 ERB-spaced filters	Equivalent Rectangular Bandwidth scale
Bark	Fixed	24 critical bands	Psychoacoustic Bark scale
CQT	Fixed	84 bins (7 octaves × 12 bins)	Constant-Q Transform for music
Mel+PCEN	Fixed	Mel + per-channel normalization	Adaptive gain normalization
LEAF	Learnable	64 learnable Gabor filters	Data-driven frequency allocation, adapts to task
SincNet	Learnable	64 learnable sinc filters	Learnable bandpass filters, sinc-based

5. Quick Start

git clone https://github.com/shivam-MBZUAI/cross-cultural-mel-bias.git
cd cross-cultural-mel-bias

# Install dependencies
pip install -r requirements.txt

# For HuggingFace datasets (required for speech data)
pip install huggingface_hub
huggingface-cli login  # Login with your HF token

Dataset Preparation

Assuming the data files are already downloaded and are present inside data/ folder.

Preprocessing

### Preprocessing

# Process specific domains
python preprocess_datasets.py --data_dir /path/to/data --output_dir processed_data --domain speech
python preprocess_datasets.py --data_dir /path/to/data --output_dir processed_data --domain music
python preprocess_datasets.py --data_dir /path/to/data --output_dir processed_data --domain scenes

# Parameters:
#   --data_dir: Directory containing raw data with speech/, music/, scenes/ subdirectories
#   --output_dir: Output directory for processed evaluation sets
#   --domain: Which domain to process [speech|music|scenes]

# This creates balanced evaluation sets:
#   - Speech: Max 2,000 samples per language (11 languages)
#   - Music: Max 300 samples per tradition (6 traditions)
#   - Scenes: Max 100 samples per region (2 regions)

Running Experiments

# Run complete evaluation pipeline - Single run
python frontends_eval.py

# Multi-run with significance testing
python frontends_eval.py multi

# This will:
# 1. Load your processed audio data
# 2. Evaluate 7 implemented front-ends on all 3 tasks  
# 3. Calculate fairness metrics (WGS, Gap, DI)

5. FairAudioBench

We introduce FairAudioBench, the first comprehensive benchmark for evaluating cross-cultural bias in audio systems present in file 'preprocess_datasets.py' and 'frontends_eval.py':

Components

Curated Datasets: Balanced evaluation splits across 11 languages, 8 musical traditions, 10 cities ('preprocess_datasets.py')
Evaluation Suite: Automated computation of fairness metrics (WGS, Δ, ρ) with statistical significance ('frontends_eval.py')
Reproducible Pipeline: Complete evaluation in single script ('frontends_eval.py')

Citation

If you use this code or our FairAudioBench dataset in your research, please cite our ICASSP 2026 paper:

@inproceedings{melbias2026,
  title={Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music},
  author={Shivam Chauhan and Ajay Pundhir},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  organization={IEEE}
}

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.gitignore		.gitignore
README.md		README.md
frontends_eval.py		frontends_eval.py
preprocess_datasets.py		preprocess_datasets.py
requirements.txt		requirements.txt
train_models.py		train_models.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CROSS-CULTURAL BIAS IN MEL-SCALE REPRESENTATIONS: EVIDENCE AND ALTERNATIVES FROM SPEECH AND MUSIC

Abstract

1. Contributions

2. Results

Performance Gaps (Figure 1)

Comprehensive Results (Table 1)

Fairness Metrics

3. Experimental Setup

Datasets

Speech Recognition CommonVoice v17.0

Music Analysis

Acoustic Scenes TAU Urban Acoustic Scenes 2020 Mobile

4. Implemented Front-ends In Codebase

5. Quick Start

Dataset Preparation

Preprocessing

Running Experiments

5. FairAudioBench

Components

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CROSS-CULTURAL BIAS IN MEL-SCALE REPRESENTATIONS: EVIDENCE AND ALTERNATIVES FROM SPEECH AND MUSIC

Abstract

1. Contributions

2. Results

Performance Gaps (Figure 1)

Comprehensive Results (Table 1)

Fairness Metrics

3. Experimental Setup

Datasets

Speech Recognition CommonVoice v17.0

Music Analysis

Acoustic Scenes TAU Urban Acoustic Scenes 2020 Mobile

4. Implemented Front-ends In Codebase

5. Quick Start

Dataset Preparation

Preprocessing

Running Experiments

5. FairAudioBench

Components

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages