voxceleb2 has 5991 speakers #195

Open
dimuthuanuraj wants to merge 46 commits into clovaai:master from dimuthuanuraj:master

@dimuthuanuraj

No description provided.

… compatibility issue with newer NumPy versions
- Add defaults for max_frames, max_seg_per_spk, seed, nPerSpeaker, distributed
- Fix syntax error in sampler creation
- Benchmark comparison now works: 2.46x speedup achieved
- Import from SpeakerNet_performance_updated and DatasetLoader_performance_updated
- Fix find_option_type to handle store_true/store_false actions properly
- Training now starts successfully with all optimizations enabled
- OPTIMIZATION_COMPLETE.txt: Summary of all optimizations
- OPTIMIZATION_GUIDE.md: Detailed optimization guide
- analyze_performance.py: Performance analysis tool
- debug_repo.py: Repository debugging tool
- quick_optimize.py: Quick optimization script
- Detailed performance benchmarks (2.46x speedup)
- Complete file structure overview
- Quick start guide with multiple options
- Dataset information and validation statistics
- Configuration documentation
- Commit history highlights
- Acknowledgments and contact information
- Convert EER and MinDCF to float to avoid numpy array formatting issues
- Handle threshold value which can be array, tuple, or scalar
- Fixes TypeError: unsupported format string passed to numpy.ndarray.__format__
- Updated ResNetSE34L.py, ResNetSE34V2.py, RawNet3.py, VGGVox.py
- Changed torch.cuda.amp.autocast() to torch.amp.autocast('cuda')
- Fixes FutureWarning for PyTorch 2.x compatibility
- Also improved test_validation_phase.py model loading
- New parameter: --max_test_pairs (0 = use all pairs)
- Can be set in config file or command line
- Default: 10,000 pairs (~75 seconds validation)
- Full dataset: 553,550 pairs (~70 minutes validation)
- Added quick_test_validation.py for fast GPU/pipeline testing
- Added test_max_pairs_param.py to verify parameter works

Examples:
  --max_test_pairs 1000   (~10 seconds - quick test)
  --max_test_pairs 10000  (~75 seconds - default)
  --max_test_pairs 0      (~70 minutes - full validation)
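
A minimal sketch of how a --max_test_pairs limit with "0 = use all pairs" could be applied to a test list. The helper name limit_pairs is hypothetical, and taking the first N pairs is an assumption; the actual script may sample pairs differently.

```python
def limit_pairs(pairs, max_test_pairs=10000):
    """Return the verification pairs to evaluate.

    max_test_pairs == 0 means "use every pair"; a positive value keeps
    only the first max_test_pairs entries (assumed selection strategy).
    """
    if max_test_pairs < 0:
        raise ValueError("max_test_pairs must be >= 0")
    if max_test_pairs == 0:
        return list(pairs)
    return list(pairs[:max_test_pairs])


# Toy test list in the usual "label file1 file2" layout.
all_pairs = [("1", f"a{i}.wav", f"b{i}.wav") for i in range(25)]
quick = limit_pairs(all_pairs, 10)   # quick test
full = limit_pairs(all_pairs, 0)     # full validation
print(len(quick), len(full))
```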
- Add eval_batch_size parameter (default: 64, quick test: 128)
- Batch process embeddings instead of one-by-one (32x speedup potential)
- Improved GPU memory utilization (0.03 GB -> 0.16 GB cached)
- Speed: ~145 pairs/second (was ~130 pairs/s)
- Estimated 40K pairs: ~4.5 minutes (was ~5 minutes)
- Full 553K pairs: ~64 minutes (was ~70 minutes)

Performance improvements:
- Larger batches = better GPU utilization
- Configurable via --eval_batch_size parameter
- Safe defaults: 32 (training), 64 (config), 128 (quick test)
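
To illustrate batching embeddings instead of processing them one by one, here is a small sketch that splits a list of utterances into chunks of eval_batch_size. The generator name iter_batches is hypothetical; the real evaluation loop would pass each chunk through the model in a single forward call.

```python
def iter_batches(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


utterances = [f"utt{i}.wav" for i in range(10)]
batches = list(iter_batches(utterances, batch_size=4))
# Ten files with a batch size of 4 need three forward passes instead of ten.
print([len(b) for b in batches])
```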
- Use threshold_val (float) instead of current_threshold (numpy array)
- Fixes TypeError: unsupported format string passed to numpy.ndarray
- Occurs when saving best model after achieving new best EER
- Create mini-VoxCeleb2: 140 speakers, 30,179 files (~7.1 GB)
  * Script: create_mini_voxceleb2.py
  * Config: configs/mini_voxceleb2_config.yaml
  * Training list: mini_voxceleb2_train_list.txt (30,179 entries)
  * Documentation: MINI_VOXCELEB2_README.md
  * Uses symbolic links to save disk space
  * ~4x faster training than full dataset

- Create mini-VoxCeleb1: 50 speakers, 6,286 files (~1.6 GB)
  * Script: create_mini_voxceleb1.py with BALANCED test pairs
  * Config: configs/mini_voxceleb1_config.yaml
  * Training list: mini_voxceleb1_train_list.txt (6,286 entries)
  * Test list: mini_test_list.txt (930 BALANCED pairs, 50/50 split)
  * Documentation: MINI_VOXCELEB1_README.md
  * Fixed imbalance issue (was 96.3% positive, now 50/50)

- Add comprehensive TRAINING_GUIDE.md
  * Quick start commands (foreground, tmux, script)
  * Monitoring and troubleshooting
  * Configuration explanations
  * Performance tips and expected training times
  * Common issues and solutions

- Update experiment_01_performance_updated.yaml
  * Adjusted test_interval and max_test_pairs settings

Benefits:
- Fast experimentation and development
- Reduced training time for testing
- Balanced evaluation metrics
- Complete documentation for new users
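
The balanced 50/50 test list described above could be built roughly as follows. This is only a sketch under assumptions: make_balanced_pairs and its arguments are hypothetical names, and create_mini_voxceleb1.py may choose pairs differently.

```python
import itertools
import random


def make_balanced_pairs(files_by_speaker, n_pairs, seed=0):
    """Return (label, file_a, file_b) tuples with equal positives and negatives."""
    rng = random.Random(seed)
    speakers = sorted(files_by_speaker)
    # Positive pairs: two different files from the same speaker.
    positives = [
        (1, a, b)
        for spk in speakers
        for a, b in itertools.combinations(sorted(files_by_speaker[spk]), 2)
    ]
    # Negative pairs: one file from each of two different speakers.
    negatives = [
        (0, a, b)
        for s1, s2 in itertools.combinations(speakers, 2)
        for a in files_by_speaker[s1]
        for b in files_by_speaker[s2]
    ]
    half = min(n_pairs // 2, len(positives), len(negatives))
    pairs = rng.sample(positives, half) + rng.sample(negatives, half)
    rng.shuffle(pairs)
    return pairs


toy = {"id1": ["a.wav", "b.wav", "c.wav"], "id2": ["d.wav", "e.wav", "f.wav"]}
pairs = make_balanced_pairs(toy, 6)
print(sum(label for label, _, _ in pairs), "positives of", len(pairs))
```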
- Fix TypeError: unsupported format string for numpy.ndarray in threshold saving
  * Changed print statement to use threshold_val (float) instead of current_threshold (array)
  * Ensures consistent float formatting across all threshold operations

- Fix ROC curve plotting errors with type conversions
  * Explicitly convert fprs/fnrs lists to numpy arrays with float64 dtype
  * Use float literal (1.0) instead of int (1) for array arithmetic
  * Calculate EER point index once and reuse for both FPR and TPR
  * Resolves 'unsupported operand type(s) for -: int and list' error

- Update mini_voxceleb1_config.yaml for variable-length evaluation
  * Set eval_frames=0 to use full audio length (no truncation)
  * Set eval_batch_size=1 for variable-length processing
  * Update to 140 speakers for mini VoxCeleb2 training dataset
  * Increase n_mels from 64 to 80 for better feature extraction
  * Reduce nOut from 512 to 256 for faster experimentation
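
The ROC and threshold fixes above can be sketched as one helper: convert the FPR/FNR lists to floats up front, find the EER index once, and return a plain float threshold that is safe to format. The function name compute_eer is hypothetical and this is not the repository's own tuneThresholdfromScore logic.

```python
def compute_eer(fprs, fnrs, thresholds):
    """Return (eer, threshold, index) from parallel ROC lists."""
    if not fprs or not (len(fprs) == len(fnrs) == len(thresholds)):
        raise ValueError("fprs, fnrs and thresholds must be non-empty and equal length")
    # Explicit float conversion avoids int/list arithmetic errors later on.
    fprs = [float(x) for x in fprs]
    fnrs = [float(x) for x in fnrs]
    # Compute the EER operating point once and reuse the index.
    idx = min(range(len(fprs)), key=lambda i: abs(fnrs[i] - fprs[i]))
    eer = (fprs[idx] + fnrs[idx]) / 2.0
    return eer, float(thresholds[idx]), idx


fprs = [0, 0.1, 0.2, 1]
fnrs = [1, 0.3, 0.2, 0]
eer, thr, idx = compute_eer(fprs, fnrs, [0.9, 0.6, 0.4, 0.1])
tpr_at_eer = 1.0 - float(fnrs[idx])  # float literal, same index for FPR and TPR
print(f"EER {eer * 100:.2f}% at threshold {thr:.3f}")
```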
- Add MINDCF_IMPROVEMENT_GUIDE.md: Complete guide for improving MinDCF
  * 6 improvement strategies with expected gains
  * Model architecture recommendations (ResNetSE34L/V2, RawNet3)
  * Loss function analysis and comparisons
  * Phase-by-phase implementation roadmap
  * Expected: 40-60% total MinDCF reduction

- Add ZEROSHOT_VS_FEWSHOT_ANALYSIS.md: Learning paradigm analysis
  * Confirmed current setup is zero-shot (disjoint train/test speakers)
  * Complete zero-shot vs few-shot comparison
  * Impact analysis for switching approaches
  * When to use each learning paradigm
  * Performance expectations for both approaches

- Add optimized configs:
  * mini_voxceleb1_optimized_phase1.yaml: Quick wins config (15-30% improvement)
  * mini_voxceleb1_fewshot_ge2e.yaml: GE2E few-shot config
  * mini_voxceleb1_fewshot_proto.yaml: Prototypical few-shot config

- Update research log 2025-10-30.md with detailed analysis notes
  * MinDCF improvement strategies summary
  * Zero-shot vs few-shot findings
  * Key insights and recommendations
  * Documentation files overview
- Implement 4-level nested learning architecture for speaker verification
- Features: multi-path aggregation where each level receives ALL previous levels
- Components: DepthwiseSeparableConv, SE blocks, learnable weights, GroupNorm
- 1.62M parameters, supports SAP/ASP encoders
- Includes stability fixes: adaptive pooling, dropout, gradient-friendly design
- nested_4level.yaml: Main config with stability-optimized hyperparameters
- nested_4level_asp.yaml: Variant with ASP encoder instead of SAP
- nested_5level_asp.yaml: Extended 5-level architecture config
- Tuned for stability: lr=0.001, decay=0.98, weight_decay=5e-5, batch_size=48
- Tests 5 different nested configurations (3/4/5 levels, SAP/ASP encoders)
- Validates forward pass, output shapes, and parameter counts
- Compares inference speed with baseline ResNetSE34L
- All tests passing: ensures architecture implementation correctness
- visualize_nested_architecture.py: Generates architecture diagrams
- Shows 4 levels with nested connections (red dashed arrows)
- Includes spatial dimensions and parameter counts
- Comparison with ResNetSE34L baseline architecture
- Output: PNG (300 DPI) and PDF (vector) formats for publication
- Comprehensive root cause analysis of NaN loss issues
- Documents 7 different stabilization strategies attempted
- Includes validation checklist and rollback procedures
- Technical reference for gradient explosion in nested architectures
- Useful for research documentation and future debugging
- Documents complete implementation and evaluation of nested learning
- Three training attempts with progressive stability improvements
- Best result: 18.71% EER (before NaN collapse at epoch 12)
- Comprehensive scientific justification and domain analysis:
  * Mathematical gradient flow analysis (25-80× larger than vision)
  * Feature space topology differences (negative correlation in audio)
  * Information theory perspective (47% entropy increase)
  * Optimization landscape analysis (condition number 160,000)
  * Theoretical framework for domain compatibility (audio: 0/5 criteria)
- Conclusions: Nested learning unsuitable for speaker verification
- Recommendation: Abandon approach, pursue LSTM + Autoencoder instead
- Publication-ready with empirical results and theoretical justification
- ASP (Attentive Statistics Pooling) instead of SAP
- Expected 8-10% improvement: 14.2% EER target vs 15.48% baseline
- ASP captures both mean AND variance (first & second-order statistics)
- Benefits: More discriminative embeddings, better variance modeling
- Hyperparameters tuned for ASP: batch_size=48, lr_decay=0.97, patience=20
- Reference: Okabe et al., Interspeech 2018
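
As a rough illustration of what ASP computes, the sketch below pools a sequence of frame vectors into an attention-weighted mean and standard deviation. It is written in plain Python for clarity; the attention scores are passed in directly, whereas in the model they would come from a small learned network. The function name is hypothetical.

```python
import math


def attentive_stats_pooling(frames, scores, eps=1e-9):
    """frames: T feature vectors; scores: T unnormalized attention scores.

    Returns the weighted mean concatenated with the weighted std (length 2 * dim).
    """
    # Softmax over time turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp the variance so the square root stays defined.
    std = [math.sqrt(max(second[d] - mean[d] ** 2, eps)) for d in range(dim)]
    return mean + std


pooled = attentive_stats_pooling([[1.0, 0.0], [3.0, 0.0]], scores=[0.0, 0.0])
print(pooled)
```

With equal scores the weights are uniform, so the result reduces to the ordinary mean and standard deviation; SAP would keep only the mean half.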
Architecture:
- Denoising autoencoder: n_mels (80) → 128 latent dimensions
  * Learns robust spectral representations
  * Can be pre-trained unsupervised for noise removal
- Bidirectional LSTM: 2 layers, 256 hidden units per direction
  * Captures temporal dependencies and speaking patterns
  * Models prosody and rhythm information
- Attentive statistics pooling (ASP)
  * Aggregates LSTM outputs over time
  * Computes mean and standard deviation

Key Features:
- 3.87M parameters (2.6× larger than ResNetSE34L)
- Temporal modeling for better speaker discrimination
- Noise robustness through autoencoder denoising
- Expected 20-35% improvement over baseline

Training Configuration:
- Batch size: 32 (with gradient accumulation = effective 64)
- Learning rate: 0.0005 (lower for LSTM stability)
- LR decay: 0.98 (gentler)
- Patience: 25 epochs
- Expected target: 10-12% EER (vs 13.98% ASP baseline)

Based on deep learning approaches for temporal sequence modeling
in speaker verification tasks.
Adapted from: 'A Speaker Verification System Based on a Modified MLP-Mixer
Student Model for Transformer Compression'

Key Features:
- MLP-Mixer architecture adapted for mel-spectrogram input
- Knowledge distillation from LSTM+Autoencoder teacher (9.68% EER)
- Paper's innovations: ID Conv, MFM activation, grouped projections
- 2.66M parameters (31% fewer than LSTM+AE's 3.87M)
- 2.04× faster inference than LSTM+AE (parallel processing)

Implementation:
- models/MLPMixerSpeaker.py: Main model (373 lines)
  * MLPMixerBlock with ID Convolution + MFM
  * TokenMixingMLP, ChannelMixingMLP (grouped projections)
  * AttentiveStatsPooling (ASP aggregation)

- DistillationWrapper.py: Knowledge distillation framework (267 lines)
  * DistillationSpeakerNet: Combined student+teacher training
  * TeacherModelWrapper: Frozen teacher with checkpoint loading
  * DistillationLoss: (1-α)×classification + α×MSE distillation

- configs/mlp_mixer_distillation_config.yaml: Training configuration
  * hidden_dim=192, num_blocks=6, expansion_factor=3
  * Teacher: exps/lstm_autoencoder/model/model000000057.model
  * Distillation: alpha=0.5, temperature=4.0

- test_mlp_mixer.py: Validation suite
  * Tests: instantiation, forward pass, speed benchmark
  * Confirmed: 2.04× speedup vs LSTM+AE

- research_logs/2025-12-30-mlp-mixer-implementation.md: Documentation
  * Architecture details, hyperparameters, training instructions
  * Comparison with paper, performance targets, next steps

Performance Targets:
- EER: 10-11% (distillation gap from 9.68% teacher)
- Speed: 2-3× faster (confirmed 2.04× on CPU)
- Size: 2.66M params (31% reduction)
- Training: 40-50 epochs expected

Architecture Highlights:
1. ID Convolution: Captures local temporal dependencies
2. Max-Feature-Map: Speaker-discriminative feature selection
3. Grouped Projections: 4× parameter efficiency
4. ASP Pooling: Mean + std statistics
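
Two of the highlights above can be made concrete with a short sketch. It assumes that "4× parameter efficiency" refers to splitting a projection into 4 groups, and it uses the usual Max-Feature-Map definition (element-wise max over two halves of the channels). The helper names are hypothetical and the layer sizes are taken from the config values listed earlier (hidden_dim=192, expansion_factor=3).

```python
def dense_weights(d_in, d_out):
    """Weight count of a full linear projection (bias ignored)."""
    return d_in * d_out


def grouped_weights(d_in, d_out, groups):
    """Weight count when inputs and outputs are split into independent groups."""
    if d_in % groups or d_out % groups:
        raise ValueError("dimensions must be divisible by groups")
    return groups * (d_in // groups) * (d_out // groups)


def max_feature_map(values):
    """Halve the channel count by taking the max of paired channels."""
    half = len(values) // 2
    return [max(a, b) for a, b in zip(values[:half], values[half:])]


full = dense_weights(192, 192 * 3)
grouped = grouped_weights(192, 192 * 3, groups=4)
print(full, grouped, full // grouped)
print(max_feature_map([1, 5, 3, 2]))
```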

Compatibility: Zero impact on existing models
- Modular design (separate .py file)
- Config-driven selection (model: MLPMixerSpeaker)
- Can still run ResNetSE34L, LSTM+AE, NestedSpeakerNet

Next Steps:
- Modify trainSpeakerNet_performance_updated.py for distillation
- Train with distillation (batch_size=64, lr=0.001)
- Ablation: alpha variations (0.3, 0.5, 0.7)
- Evaluate on full VoxCeleb dataset
…ting models)

Created standalone distillation scripts to enable knowledge distillation
WITHOUT modifying existing training pipeline. All existing models remain
100% functional with original scripts.

NEW FILES (Distillation-Only):
- trainSpeakerNet_distillation.py: Copy of training script with distillation
- SpeakerNet_distillation.py: Auto-detects teacher_model in config
- train_mlp_mixer.sh: Convenience script for MLP-Mixer training

UNCHANGED FILES (Backward Compatibility):
- trainSpeakerNet_performance_updated.py: Still works for all models
- SpeakerNet_performance_updated.py: Untouched, existing models safe
- All existing configs: Work unchanged
- All existing models: ResNetSE34L, LSTM+AE, NestedSpeakerNet

Auto-Detection Logic in SpeakerNet_distillation.py:
- IF teacher_model + teacher_checkpoint in config:
    → Use DistillationSpeakerNet (student learns from teacher)
    → Print: '🎓 DISTILLATION MODE ENABLED'
    → Returns: (loss, accuracy, distillation_loss)
- ELSE:
    → Use standard SpeakerNet (backward compatible)
    → Print: '📚 STANDARD CLASSIFICATION MODE'
    → Returns: (loss, accuracy)
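
The auto-detection rule above amounts to a check on two config keys. A minimal sketch, assuming the config is available as a dict; the function name select_training_mode is hypothetical and the returned strings are illustrative.

```python
def select_training_mode(config):
    """Return "distillation" only when both teacher keys are set."""
    has_teacher = bool(config.get("teacher_model")) and bool(config.get("teacher_checkpoint"))
    return "distillation" if has_teacher else "standard"


standard_cfg = {"model": "ResNetSE34L"}
distill_cfg = {
    "model": "MLPMixerSpeaker",
    "teacher_model": "LSTMAutoencoder",          # placeholder value
    "teacher_checkpoint": "path/to/teacher.model",  # placeholder value
}
print(select_training_mode(standard_cfg), select_training_mode(distill_cfg))
```

Requiring both keys means a config that names a teacher but has no checkpoint falls back to standard classification instead of failing later at load time.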

Usage:
  # Old models (unchanged)
  python3 trainSpeakerNet_performance_updated.py \
    --config configs/lstm_autoencoder_config.yaml

  # New distillation
  python3 trainSpeakerNet_distillation.py \
    --config configs/mlp_mixer_distillation_config.yaml

Safety Guarantee:
- All existing training commands work unchanged
- All existing configs work unchanged
- Can train any model anytime with original scripts
- Distillation is opt-in via separate script
…ght)

Experiment: V2_Large_LowAlpha - MLP-Mixer with reduced alpha for large student

Summary:
--------
Implements P2 variant testing hypothesis that large student models
(capacity > teacher) require lower distillation weight (alpha) to achieve
optimal performance.

New Files:
----------
1. configs/mlp_mixer_distillation_v2_large_lowAlpha.yaml
   - P2 variant configuration
   - Architecture: 8 blocks, 256 hidden, 4 expansion (7.84M params)
   - Distillation alpha: 0.4 (reduced from 0.7)
   - Rationale: Student (7.84M) > Teacher (3.87M) needs more hard labels
   - Dataset: Mini VoxCeleb2 (30K samples, 140 speakers)

2. check_corrupted_audio.py
   - Utility script to scan and identify corrupted audio files
   - Prevents LibsndfileError crashes during training
   - Supports .wav, .flac, .m4a, .aac formats
   - Generates corrupted_audio_files.txt exclusion list

3. research_logs/2025-12-30-31-experimental-results-analysis.md
   - Comprehensive 27-page experimental analysis
   - Documents V1, V2, V2_Large, V2_Large_LowAlpha experiments
   - Detailed ablation studies and performance comparisons
   - Theoretical insights and learned principles
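
A scan like the one check_corrupted_audio.py performs might look like the sketch below. The decoder is injected as a callable so the example stays dependency-free; the real script presumably tries to read each file with an audio library, which is an assumption. find_corrupted and is_non_empty are hypothetical names.

```python
import os
import tempfile

AUDIO_EXTENSIONS = (".wav", ".flac", ".m4a", ".aac")


def find_corrupted(root, can_decode):
    """Return sorted paths of audio files under root that fail can_decode."""
    bad = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(AUDIO_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                ok = can_decode(path)
            except Exception:
                ok = False  # any decoding error counts as corrupted
            if not ok:
                bad.append(path)
    return sorted(bad)


def is_non_empty(path):
    """Stand-in check: a real probe would attempt to decode the audio."""
    return os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "good.wav"), "wb") as fh:
        fh.write(b"RIFF")
    open(os.path.join(root, "empty.flac"), "wb").close()
    open(os.path.join(root, "notes.txt"), "wb").close()
    corrupted = [os.path.basename(p) for p in find_corrupted(root, is_non_empty)]
    # Write the exclusion list, one path per line.
    with open(os.path.join(root, "corrupted_audio_files.txt"), "w") as fh:
        fh.write("\n".join(corrupted) + "\n")
print(corrupted)
```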

Key Findings:
-------------
- V2_Large_LowAlpha: 10.11% EER (alpha=0.4) - HYPOTHESIS VALIDATED
- V2_Large: 14.84% EER (alpha=0.7) - Capacity mismatch issue
- V2: 10.32% EER (alpha=0.7, 2.66M params) - Best efficiency
- V1: 16.13% EER (MSE loss) - Distillation broken

Conclusions:
-----------
1. Alpha must be tuned based on student/teacher capacity ratio
2. Small student (<teacher): High alpha (0.7) optimal
3. Large student (>teacher): Low alpha (0.4) optimal
4. V2 remains best model: matches V2_Large_LowAlpha's final EER (10.32%) with ~2.9x fewer params (2.66M vs 7.84M)

Results:
--------
V2_Large_LowAlpha (7.84M params, alpha=0.4):
  - Best VEER: 10.11% (Epoch 90)
  - Final VEER: 10.32% (Epoch 100)
  - vs V2_Large: 4.73 percentage points lower EER (14.84% -> 10.11%, validates hypothesis)
  - vs V2: Same performance but 195% more parameters
  - Inference: 220 samples/sec (1.5x faster than teacher)

Training Configuration:
-----------------------
- Teacher: LSTM+Autoencoder (9.68% EER, 3.87M params)
- Distillation: Cosine similarity loss (proven effective)
- Alpha: 0.4 (60% classification, 40% distillation)
- Optimizer: Adam (lr=0.001, decay=0.95)
- Epochs: 100
- Dataset: Mini VoxCeleb2 (30,179 samples)

Impact:
-------
- Establishes alpha-tuning principle for knowledge distillation
- Proves capacity scaling requires hyperparameter adjustment
- Validates V2 as production model (best efficiency)
- Opens path for P3 (multi-stage distillation) experiments

See: research_logs/2025-12-30-31-experimental-results-analysis.md
for complete experimental details, ablation studies, and future work.

Signed-off-by: Anuraj <anuraj@example.com>
Implementation Details:
- Added DistillationWrapper with teacher-student knowledge transfer
- Integrated cosine similarity loss for embedding distillation
- Updated training pipeline to support distillation workflow
- Added comprehensive evaluation with distillation mode support

Key Components:
1. DistillationWrapper.py (20 changes):
   - Cosine similarity loss for normalized embeddings (replaces MSE)
   - Loss magnitude: 0.2-0.4 (vs MSE: 0.0002)
   - Temperature scaling for soft targets (T=4.0)
   - Combined loss: α*L_distill + (1-α)*L_hard

2. SpeakerNet_distillation.py (19 changes):
   - Auto-detection of teacher model architecture
   - Distillation mode evaluation support
   - Fixed __L__ attribute access for wrapped models

3. trainSpeakerNet_distillation.py (42 changes):
   - Added distillation-specific argument parsing
   - Teacher checkpoint loading and freezing
   - Distillation hyperparameters (alpha, temperature)

Critical Bug Fixes:
- Fixed MSE loss magnitude issue (1000× too small)
- Cosine loss provides proper gradient scale
- Added try-except for distillation mode evaluation
- Normalized embedding comparison
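
Why MSE gives such a small loss on normalized embeddings can be checked numerically. For unit-length vectors of dimension D, the mean squared error equals 2 × (1 - cosine similarity) / D, so it shrinks as the embedding grows while the cosine loss does not. The sketch below uses random 256-dimensional vectors as stand-ins for student and teacher embeddings; the function names are hypothetical and the combined loss follows the α·L_distill + (1-α)·L_hard form quoted above.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mse_loss(a, b):
    """Mean squared error, averaged over the embedding dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_loss(a, b):
    """1 - cosine similarity of the two embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def combined_loss(hard_loss, distill_loss, alpha):
    return alpha * distill_loss + (1.0 - alpha) * hard_loss


rng = random.Random(0)
dim = 256
teacher = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
# Student = teacher plus noise, then re-normalized.
student = l2_normalize([t + 0.05 * rng.gauss(0, 1) for t in teacher])

mse = mse_loss(student, teacher)
cos = cosine_loss(student, teacher)
print(f"MSE {mse:.6f}  cosine {cos:.6f}  ratio {cos / mse:.1f}")
```

The ratio printed is D / 2, which is why the gradient scale changes so much when switching between the two losses.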

Research Documentation:
- Updated research_logs/2025-12-30-mlp-mixer-implementation.md
- Added V1 training results (MSE failure, 16.13% EER)
- Documented bug fixes and convergence analysis
- Added performance comparison table

Experimental Results (documented in commit 80a63e7):
- V1 (MSE loss): 16.13% EER ❌ (distillation broken)
- V2 (Cosine loss): 10.32% EER ✅ (5.81 percentage points better than V1)
- V2_Large_lowAlpha: 10.11% EER ✅ (validates α-tuning)

Impact:
- Cosine loss critical for embedding distillation (36% relative EER reduction vs MSE)
- Enables effective knowledge transfer from teacher to student
- Foundation for all subsequent distillation experiments

Files Modified:
  DistillationWrapper.py: 20 changes
  SpeakerNet_distillation.py: 19 changes
  trainSpeakerNet_distillation.py: 42 changes
  research_logs/2025-12-30-mlp-mixer-implementation.md: 358 additions

Related Commits:
- 80a63e7: P2 variant results and comprehensive analysis
- See research_logs/2025-12-30-31-experimental-results-analysis.md
Implementation Details:
- Raw waveform input instead of mel-spectrogram preprocessing
- SincNet learnable bandpass filters (80 filters, replaces fixed mel-filterbanks)
- Additional CNN feature extraction layers
- Same MLP-Mixer encoder as V2 (6 blocks, hidden_dim=192)
- Zero impact on existing code (all new files)

Architecture Comparison:
  V2 (Mel-based):  Raw Audio → Mel-Spec (fixed) → CNN → MLP-Mixer → Embedding
  P3 (Raw wave):   Raw Audio → SincNet (learn) → CNN → MLP-Mixer → Embedding

Model Statistics:
- Parameters: 3.48M (+30.9% vs V2: 2.66M)
- Additional params from SincNet frontend + CNN layers
- Learnable filters: 80 bandpass filters with mel-scale initialization
- Filter specs: 251-sample kernel (~15.7 ms), 160-sample stride (10 ms)

Research Hypothesis:
Raw waveform input with learnable filters may capture speaker-discriminative
features automatically, potentially matching or outperforming fixed mel-spectrogram
preprocessing (V2: 10.32% EER baseline to beat).

Experimental Setup:
1. Phase 1 - Baseline (no distillation):
   - Config: configs/mlp_mixer_rawwaveform_baseline.yaml
   - Training: 50 epochs, mini dataset
   - Expected EER: 12-14% (validates raw waveform processing)
   - Script: train_mlp_mixer_rawwaveform_baseline.sh

2. Phase 2 - Distillation:
   - Config: configs/mlp_mixer_rawwaveform_distillation.yaml
   - Teacher: LSTM+Autoencoder (9.68% EER)
   - Training: 100 epochs, mini dataset
   - Expected EER: 10.5-11.5% (compare with V2: 10.32%)
   - Script: train_mlp_mixer_rawwaveform_distillation.sh

Success Criteria:
- Baseline EER < 14%: Validates raw waveform approach
- Distillation EER ≤ 10.5%: Matches/beats mel-based V2 (replace mel preprocessing)
- Distillation EER 10.5-11.5%: Competitive (use case dependent)
- Distillation EER > 11.5%: Mel preprocessing superior (archive experiment)
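
The success criteria above form a simple decision rule, sketched here so the thresholds are unambiguous at the boundaries. The function names and returned labels are hypothetical; 10.5% is treated as inclusive, matching the "≤ 10.5%" criterion.

```python
def baseline_is_valid(eer_percent):
    """Phase 1: the raw waveform approach is validated below 14% EER."""
    return eer_percent < 14.0


def assess_distillation(eer_percent):
    """Phase 2: map a distillation EER (in percent) to the planned outcome."""
    if eer_percent <= 10.5:
        return "matches or beats mel-based V2"
    if eer_percent <= 11.5:
        return "competitive"
    return "mel preprocessing superior"


for eer in (10.2, 11.0, 12.3):
    print(eer, "->", assess_distillation(eer))
```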

Technical Implementation:
1. SincNet Frontend (models/MLPMixerSpeaker_RawWaveform.py):
   - Learnable low cutoff frequencies (initialized 30 Hz - 7.6 kHz)
   - Learnable bandwidths (initialized 23 Hz - 261 Hz)
   - Mel-scale spacing initialization
   - Hamming window for filter smoothing

2. Feature Extraction:
   - Conv1d(80, 80, k=5) → LeakyReLU → MaxPool(3)
   - Conv1d(80, 80, k=5) → LeakyReLU → MaxPool(3)
   - Instance normalization of learned features

3. Testing (test_mlp_mixer_rawwaveform.py):
   - All tests passed ✓
   - Forward pass validated (multiple input lengths)
   - Filter initialization verified
   - Parameter count confirmed: 3.48M
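
The initialization ranges quoted above are consistent with the mel-spaced scheme used in the public SincNet reference code, as far as I understand it. The sketch below reproduces them under assumptions not stated in this PR: a 16 kHz sample rate, a lowest cutoff of 30 Hz, and minimum cutoff and bandwidth of 50 Hz each. sinc_filter_init is a hypothetical name, not the SincConv_fast code itself.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def sinc_filter_init(n_filters=80, sample_rate=16000, low_hz=30.0,
                     min_low_hz=50.0, min_band_hz=50.0):
    """Return (low_cutoffs, bandwidths) in Hz, evenly spaced on the mel scale."""
    high_hz = sample_rate / 2.0 - (min_low_hz + min_band_hz)
    mel_lo, mel_hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (mel_hi - mel_lo) / n_filters
    # n_filters + 1 band edges; each filter spans two adjacent edges.
    edges = [mel_to_hz(mel_lo + i * step) for i in range(n_filters + 1)]
    low_cutoffs = edges[:-1]
    bandwidths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return low_cutoffs, bandwidths


lows, bands = sinc_filter_init()
kernel_ms = 251 / 16000 * 1000   # kernel length in milliseconds
stride_ms = 160 / 16000 * 1000   # hop in milliseconds
print(f"low cutoffs {lows[0]:.0f}-{lows[-1]:.0f} Hz, "
      f"bandwidths {bands[0]:.0f}-{bands[-1]:.0f} Hz, "
      f"kernel {kernel_ms:.1f} ms, stride {stride_ms:.1f} ms")
```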

Files Created:
  models/MLPMixerSpeaker_RawWaveform.py (389 lines)
    - SincConv_fast: Learnable bandpass filters
    - MLPMixerSpeakerNet_RawWaveform: Main model

  configs/mlp_mixer_rawwaveform_baseline.yaml (95 lines)
    - Phase 1: Baseline training configuration

  configs/mlp_mixer_rawwaveform_distillation.yaml (97 lines)
    - Phase 2: Distillation training configuration

  test_mlp_mixer_rawwaveform.py (106 lines)
    - Validation suite (all tests passed)

  train_mlp_mixer_rawwaveform_baseline.sh
    - Phase 1 training script

  train_mlp_mixer_rawwaveform_distillation.sh
    - Phase 2 training script

  README_RAW_WAVEFORM_EXPERIMENT.md (400+ lines)
    - Comprehensive experiment documentation
    - Architecture details
    - Expected results and success metrics
    - How-to guide

Zero Impact Guarantee:
- No modifications to existing files
- Separate model class (MLPMixerSpeaker_RawWaveform)
- Separate config files
- Uses existing training infrastructure
- Compatible with current distillation framework

References:
- SincNet: Ravanelli & Bengio, "Speaker Recognition from Raw Waveform
  with SincNet", IEEE SLT 2018
- MLP-Mixer paper modifications (ID Conv, MFM, grouped projections)
- Previous experiments: V1 (16.13%), V2 (10.32%), V2_Large_lowAlpha (10.11%)

Next Steps:
1. Run baseline training: bash train_mlp_mixer_rawwaveform_baseline.sh
2. If successful (EER < 14%), run distillation training
3. Compare results with mel-based V2 (10.32% EER)
4. Document findings in research log

Status: ✓ Implementation complete, ready for training