voxceleb2 has 5991 speakers#195
Open
dimuthuanuraj wants to merge 46 commits into
Open
Conversation
… compatibility issue with newer NumPy versions
…ent accumulation, and optimized DataLoader
…on-blocking transfers, and inference mode
…ding, and vectorized operations
…rkers, and mixed precision enabled
…n monitoring and comparison
…with different modes
…ter compatibility
… (max_seg_per_spk, seed)
- Add defaults for max_frames, max_seg_per_spk, seed, nPerSpeaker, distributed - Fix syntax error in sampler creation - Benchmark comparison now works: 2.46x speedup achieved
- Import from SpeakerNet_performance_updated and DatasetLoader_performance_updated - Fix find_option_type to handle store_true/store_false actions properly - Training now starts successfully with all optimizations enabled
- OPTIMIZATION_COMPLETE.txt: Summary of all optimizations - OPTIMIZATION_GUIDE.md: Detailed optimization guide - analyze_performance.py: Performance analysis tool - debug_repo.py: Repository debugging tool - quick_optimize.py: Quick optimization script
- Detailed performance benchmarks (2.46x speedup) - Complete file structure overview - Quick start guide with multiple options - Dataset information and validation statistics - Configuration documentation - Commit history highlights - Acknowledgments and contact information
- Convert EER and MinDCF to float to avoid numpy array formatting issues - Handle threshold value which can be array, tuple, or scalar - Fixes TypeError: unsupported format string passed to numpy.ndarray.__format__
- Updated ResNetSE34L.py, ResNetSE34V2.py, RawNet3.py, VGGVox.py
- Changed torch.cuda.amp.autocast() to torch.amp.autocast('cuda')
- Fixes FutureWarning for PyTorch 2.x compatibility
- Also improved test_validation_phase.py model loading
- New parameter: --max_test_pairs (0 = use all pairs) - Can be set in config file or command line - Default: 10,000 pairs (~75 seconds validation) - Full dataset: 553,550 pairs (~70 minutes validation) - Added quick_test_validation.py for fast GPU/pipeline testing - Added test_max_pairs_param.py to verify parameter works Examples: --max_test_pairs 1000 (~10 seconds - quick test) --max_test_pairs 10000 (~75 seconds - default) --max_test_pairs 0 (~70 minutes - full validation)
- Add eval_batch_size parameter (default: 64, quick test: 128) - Batch process embeddings instead of one-by-one (32x speedup potential) - Improved GPU memory utilization (0.03 GB -> 0.16 GB cached) - Speed: ~145 pairs/second (was ~130 pairs/s) - Estimated 40K pairs: ~4.5 minutes (was ~5 minutes) - Full 553K pairs: ~64 minutes (was ~70 minutes) Performance improvements: - Larger batches = better GPU utilization - Configurable via --eval_batch_size parameter - Safe defaults: 32 (training), 64 (config), 128 (quick test)
- Use threshold_val (float) instead of current_threshold (numpy array) - Fixes TypeError: unsupported format string passed to numpy.ndarray - Occurs when saving best model after achieving new best EER
- Create mini-VoxCeleb2: 140 speakers, 30,179 files (~7.1 GB) * Script: create_mini_voxceleb2.py * Config: configs/mini_voxceleb2_config.yaml * Training list: mini_voxceleb2_train_list.txt (30,179 entries) * Documentation: MINI_VOXCELEB2_README.md * Uses symbolic links to save disk space * ~4x faster training than full dataset - Create mini-VoxCeleb1: 50 speakers, 6,286 files (~1.6 GB) * Script: create_mini_voxceleb1.py with BALANCED test pairs * Config: configs/mini_voxceleb1_config.yaml * Training list: mini_voxceleb1_train_list.txt (6,286 entries) * Test list: mini_test_list.txt (930 BALANCED pairs, 50/50 split) * Documentation: MINI_VOXCELEB1_README.md * Fixed imbalance issue (was 96.3% positive, now 50/50) - Add comprehensive TRAINING_GUIDE.md * Quick start commands (foreground, tmux, script) * Monitoring and troubleshooting * Configuration explanations * Performance tips and expected training times * Common issues and solutions - Update experiment_01_performance_updated.yaml * Adjusted test_interval and max_test_pairs settings Benefits: - Fast experimentation and development - Reduced training time for testing - Balanced evaluation metrics - Complete documentation for new users
- Fix TypeError: unsupported format string for numpy.ndarray in threshold saving * Changed print statement to use threshold_val (float) instead of current_threshold (array) * Ensures consistent float formatting across all threshold operations - Fix ROC curve plotting errors with type conversions * Explicitly convert fprs/fnrs lists to numpy arrays with float64 dtype * Use float literal (1.0) instead of int (1) for array arithmetic * Calculate EER point index once and reuse for both FPR and TPR * Resolves 'unsupported operand type(s) for -: int and list' error - Update mini_voxceleb1_config.yaml for variable-length evaluation * Set eval_frames=0 to use full audio length (no truncation) * Set eval_batch_size=1 for variable-length processing * Update to 140 speakers for mini VoxCeleb2 training dataset * Increase n_mels from 64 to 80 for better feature extraction * Reduce nOut from 512 to 256 for faster experimentation
- Add MINDCF_IMPROVEMENT_GUIDE.md: Complete guide for improving MinDCF * 6 improvement strategies with expected gains * Model architecture recommendations (ResNetSE34L/V2, RawNet3) * Loss function analysis and comparisons * Phase-by-phase implementation roadmap * Expected: 40-60% total MinDCF reduction - Add ZEROSHOT_VS_FEWSHOT_ANALYSIS.md: Learning paradigm analysis * Confirmed current setup is zero-shot (disjoint train/test speakers) * Complete zero-shot vs few-shot comparison * Impact analysis for switching approaches * When to use each learning paradigm * Performance expectations for both approaches - Add optimized configs: * mini_voxceleb1_optimized_phase1.yaml: Quick wins config (15-30% improvement) * mini_voxceleb1_fewshot_ge2e.yaml: GE2E few-shot config * mini_voxceleb1_fewshot_proto.yaml: Prototypical few-shot config - Update research log 2025-10-30.md with detailed analysis notes * MinDCF improvement strategies summary * Zero-shot vs few-shot findings * Key insights and recommendations * Documentation files overview
- Implement 4-level nested learning architecture for speaker verification - Features: multi-path aggregation where each level receives ALL previous levels - Components: DepthwiseSeparableConv, SE blocks, learnable weights, GroupNorm - 1.62M parameters, supports SAP/ASP encoders - Includes stability fixes: adaptive pooling, dropout, gradient-friendly design
- nested_4level.yaml: Main config with stability-optimized hyperparameters - nested_4level_asp.yaml: Variant with ASP encoder instead of SAP - nested_5level_asp.yaml: Extended 5-level architecture config - Tuned for stability: lr=0.001, decay=0.98, weight_decay=5e-5, batch_size=48
- Tests 5 different nested configurations (3/4/5 levels, SAP/ASP encoders) - Validates forward pass, output shapes, and parameter counts - Compares inference speed with baseline ResNetSE34L - All tests passing: ensures architecture implementation correctness
- visualize_nested_architecture.py: Generates architecture diagrams - Shows 4 levels with nested connections (red dashed arrows) - Includes spatial dimensions and parameter counts - Comparison with ResNetSE34L baseline architecture - Output: PNG (300 DPI) and PDF (vector) formats for publication
- Comprehensive root cause analysis of NaN loss issues - Documents 7 different stabilization strategies attempted - Includes validation checklist and rollback procedures - Technical reference for gradient explosion in nested architectures - Useful for research documentation and future debugging
- Documents complete implementation and evaluation of nested learning - Three training attempts with progressive stability improvements - Best result: 18.71% EER (before NaN collapse at epoch 12) - Comprehensive scientific justification and domain analysis: * Mathematical gradient flow analysis (25-80× larger than vision) * Feature space topology differences (negative correlation in audio) * Information theory perspective (47% entropy increase) * Optimization landscape analysis (condition number 160,000) * Theoretical framework for domain compatibility (audio: 0/5 criteria) - Conclusions: Nested learning unsuitable for speaker verification - Recommendation: Abandon approach, pursue LSTM + Autoencoder instead - Publication-ready with empirical results and theoretical justification
- ASP (Attentive Statistics Pooling) instead of SAP - Expected 8-10% improvement: 14.2% EER target vs 15.48% baseline - ASP captures both mean AND variance (first & second-order statistics) - Benefits: More discriminative embeddings, better variance modeling - Hyperparameters tuned for ASP: batch_size=48, lr_decay=0.97, patience=20 - Reference: Okabe et al., Interspeech 2018
Architecture: - Denoising autoencoder: n_mels (80) → 128 latent dimensions * Learns robust spectral representations * Can be pre-trained unsupervised for noise removal - Bidirectional LSTM: 2 layers, 256 hidden units per direction * Captures temporal dependencies and speaking patterns * Models prosody and rhythm information - Attentive statistics pooling (ASP) * Aggregates LSTM outputs over time * Computes mean and standard deviation Key Features: - 3.87M parameters (2.6× larger than ResNetSE34L) - Temporal modeling for better speaker discrimination - Noise robustness through autoencoder denoising - Expected 20-35% improvement over baseline Training Configuration: - Batch size: 32 (with gradient accumulation = effective 64) - Learning rate: 0.0005 (lower for LSTM stability) - LR decay: 0.98 (gentler) - Patience: 25 epochs - Expected target: 10-12% EER (vs 13.98% ASP baseline) Based on deep learning approaches for temporal sequence modeling in speaker verification tasks.
Adapted from: 'A Speaker Verification System Based on a Modified MLP-Mixer Student Model for Transformer Compression' Key Features: - MLP-Mixer architecture adapted for mel-spectrogram input - Knowledge distillation from LSTM+Autoencoder teacher (9.68% EER) - Paper's innovations: ID Conv, MFM activation, grouped projections - 2.66M parameters (31% fewer than LSTM+AE's 3.87M) - 2.04× faster inference than LSTM+AE (parallel processing) Implementation: - models/MLPMixerSpeaker.py: Main model (373 lines) * MLPMixerBlock with ID Convolution + MFM * TokenMixingMLP, ChannelMixingMLP (grouped projections) * AttentiveStatsPooling (ASP aggregation) - DistillationWrapper.py: Knowledge distillation framework (267 lines) * DistillationSpeakerNet: Combined student+teacher training * TeacherModelWrapper: Frozen teacher with checkpoint loading * DistillationLoss: (1-α)×classification + α×MSE distillation - configs/mlp_mixer_distillation_config.yaml: Training configuration * hidden_dim=192, num_blocks=6, expansion_factor=3 * Teacher: exps/lstm_autoencoder/model/model000000057.model * Distillation: alpha=0.5, temperature=4.0 - test_mlp_mixer.py: Validation suite * Tests: instantiation, forward pass, speed benchmark * Confirmed: 2.04× speedup vs LSTM+AE - research_logs/2025-12-30-mlp-mixer-implementation.md: Documentation * Architecture details, hyperparameters, training instructions * Comparison with paper, performance targets, next steps Performance Targets: - EER: 10-11% (distillation gap from 9.68% teacher) - Speed: 2-3× faster (confirmed 2.04× on CPU) - Size: 2.66M params (31% reduction) - Training: 40-50 epochs expected Architecture Highlights: 1. ID Convolution: Captures local temporal dependencies 2. Max-Feature-Map: Speaker-discriminative feature selection 3. Grouped Projections: 4× parameter efficiency 4. ASP Pooling: Mean + std statistics Compatibility: Zero impact on existing models - Modular design (separate .py file) - Config-driven selection (model: MLPMixerSpeaker) - Can still run ResNetSE34L, LSTM+AE, NestedSpeakerNet Next Steps: - Modify trainSpeakerNet_performance_updated.py for distillation - Train with distillation (batch_size=64, lr=0.001) - Ablation: alpha variations (0.3, 0.5, 0.7) - Evaluate on full VoxCeleb dataset
…ting models)
Created standalone distillation scripts to enable knowledge distillation
WITHOUT modifying existing training pipeline. All existing models remain
100% functional with original scripts.
NEW FILES (Distillation-Only):
- trainSpeakerNet_distillation.py: Copy of training script with distillation
- SpeakerNet_distillation.py: Auto-detects teacher_model in config
- train_mlp_mixer.sh: Convenience script for MLP-Mixer training
UNCHANGED FILES (Backward Compatibility):
- trainSpeakerNet_performance_updated.py: Still works for all models
- SpeakerNet_performance_updated.py: Untouched, existing models safe
- All existing configs: Work unchanged
- All existing models: ResNetSE34L, LSTM+AE, NestedSpeakerNet
Auto-Detection Logic in SpeakerNet_distillation.py:
- IF teacher_model + teacher_checkpoint in config:
→ Use DistillationSpeakerNet (student learns from teacher)
→ Print: '🎓 DISTILLATION MODE ENABLED'
→ Returns: (loss, accuracy, distillation_loss)
- ELSE:
→ Use standard SpeakerNet (backward compatible)
→ Print: '📚 STANDARD CLASSIFICATION MODE'
→ Returns: (loss, accuracy)
Usage:
# Old models (unchanged)
python3 trainSpeakerNet_performance_updated.py \
--config configs/lstm_autoencoder_config.yaml
# New distillation
python3 trainSpeakerNet_distillation.py \
--config configs/mlp_mixer_distillation_config.yaml
Safety Guarantee:
All existing training commands work unchanged
All existing configs work unchanged
Can train any model anytime with original scripts
Distillation is opt-in via separate script
…ght) Experiment: V2_Large_LowAlpha - MLP-Mixer with reduced alpha for large student Summary: -------- Implements P2 variant testing hypothesis that large student models (capacity > teacher) require lower distillation weight (alpha) to achieve optimal performance. New Files: ---------- 1. configs/mlp_mixer_distillation_v2_large_lowAlpha.yaml - P2 variant configuration - Architecture: 8 blocks, 256 hidden, 4 expansion (7.84M params) - Distillation alpha: 0.4 (reduced from 0.7) - Rationale: Student (7.84M) > Teacher (3.87M) needs more hard labels - Dataset: Mini VoxCeleb2 (30K samples, 140 speakers) 2. check_corrupted_audio.py - Utility script to scan and identify corrupted audio files - Prevents LibsndfileError crashes during training - Supports .wav, .flac, .m4a, .aac formats - Generates corrupted_audio_files.txt exclusion list 3. research_logs/2025-12-30-31-experimental-results-analysis.md - Comprehensive 27-page experimental analysis - Documents V1, V2, V2_Large, V2_Large_LowAlpha experiments - Detailed ablation studies and performance comparisons - Theoretical insights and learned principles Key Findings: ------------- V2_Large_LowAlpha: 10.11% EER (alpha=0.4) - HYPOTHESIS VALIDATED V2_Large: 14.84% EER (alpha=0.7) - Capacity mismatch issue V2: 10.32% EER (alpha=0.7, 2.66M params) - Best efficiency V1: 16.13% EER (MSE loss) - Distillation broken Conclusions: ----------- 1. Alpha must be tuned based on student/teacher capacity ratio 2. Small student (<teacher): High alpha (0.7) optimal 3. Large student (>teacher): Low alpha (0.4) optimal 4. V2 remains best model: same EER as V2_LA with 2.7x fewer params Results: -------- V2_Large_LowAlpha (7.84M params, alpha=0.4): - Best VEER: 10.11% (Epoch 90) - Final VEER: 10.32% (Epoch 100) - vs V2_Large: -4.73% improvement (validated hypothesis) - vs V2: Same performance but 195% more parameters - Inference: 220 samples/sec (1.5x faster than teacher) Training Configuration: ----------------------- - Teacher: LSTM+Autoencoder (9.68% EER, 3.87M params) - Distillation: Cosine similarity loss (proven effective) - Alpha: 0.4 (60% classification, 40% distillation) - Optimizer: Adam (lr=0.001, decay=0.95) - Epochs: 100 - Dataset: Mini VoxCeleb2 (30,179 samples) Impact: ------- - Establishes alpha-tuning principle for knowledge distillation - Proves capacity scaling requires hyperparameter adjustment - Validates V2 as production model (best efficiency) - Opens path for P3 (multi-stage distillation) experiments See: research_logs/2025-12-30-31-experimental-results-analysis.md for complete experimental details, ablation studies, and future work. Signed-off-by: Anuraj <anuraj@example.com>
Implementation Details: - Added DistillationWrapper with teacher-student knowledge transfer - Integrated cosine similarity loss for embedding distillation - Updated training pipeline to support distillation workflow - Added comprehensive evaluation with distillation mode support Key Components: 1. DistillationWrapper.py (20 changes): - Cosine similarity loss for normalized embeddings (replaces MSE) - Loss magnitude: 0.2-0.4 (vs MSE: 0.0002) - Temperature scaling for soft targets (T=4.0) - Combined loss: α*L_distill + (1-α)*L_hard 2. SpeakerNet_distillation.py (19 changes): - Auto-detection of teacher model architecture - Distillation mode evaluation support - Fixed __L__ attribute access for wrapped models 3. trainSpeakerNet_distillation.py (42 changes): - Added distillation-specific argument parsing - Teacher checkpoint loading and freezing - Distillation hyperparameters (alpha, temperature) Critical Bug Fixes: - Fixed MSE loss magnitude issue (1000× too small) - Cosine loss provides proper gradient scale - Added try-except for distillation mode evaluation - Normalized embedding comparison Research Documentation: - Updated research_logs/2025-12-30-mlp-mixer-implementation.md - Added V1 training results (MSE failure, 16.13% EER) - Documented bug fixes and convergence analysis - Added performance comparison table Experimental Results (documented in commit 80a63e7): - V1 (MSE loss): 16.13% EER ❌ (distillation broken) - V2 (Cosine loss): 10.32% EER ✅ (5.81% improvement) - V2_Large_lowAlpha: 10.11% EER ✅ (validates α-tuning) Impact: - Cosine loss critical for embedding distillation (36% improvement) - Enables effective knowledge transfer from teacher to student - Foundation for all subsequent distillation experiments Files Modified: DistillationWrapper.py: 20 changes SpeakerNet_distillation.py: 19 changes trainSpeakerNet_distillation.py: 42 changes research_logs/2025-12-30-mlp-mixer-implementation.md: 358 additions Related Commits: - 80a63e7: P2 variant results and comprehensive analysis - See research_logs/2025-12-30-31-experimental-results-analysis.md
Implementation Details:
- Raw waveform input instead of mel-spectrogram preprocessing
- SincNet learnable bandpass filters (80 filters, replaces fixed mel-filterbanks)
- Additional CNN feature extraction layers
- Same MLP-Mixer encoder as V2 (6 blocks, hidden_dim=192)
- Zero impact on existing code (all new files)
Architecture Comparison:
V2 (Mel-based): Raw Audio → Mel-Spec (fixed) → CNN → MLP-Mixer → Embedding
P3 (Raw wave): Raw Audio → SincNet (learn) → CNN → MLP-Mixer → Embedding
Model Statistics:
- Parameters: 3.48M (+30.9% vs V2: 2.66M)
- Additional params from SincNet frontend + CNN layers
- Learnable filters: 80 bandpass filters with mel-scale initialization
- Filter specs: 251 samples kernel (~15ms), 160 samples stride (10ms)
Research Hypothesis:
Raw waveform input with learnable filters may capture speaker-discriminative
features automatically, potentially matching or outperforming fixed mel-spectrogram
preprocessing (V2: 10.32% EER baseline to beat).
Experimental Setup:
1. Phase 1 - Baseline (no distillation):
- Config: configs/mlp_mixer_rawwaveform_baseline.yaml
- Training: 50 epochs, mini dataset
- Expected EER: 12-14% (validates raw waveform processing)
- Script: train_mlp_mixer_rawwaveform_baseline.sh
2. Phase 2 - Distillation:
- Config: configs/mlp_mixer_rawwaveform_distillation.yaml
- Teacher: LSTM+Autoencoder (9.68% EER)
- Training: 100 epochs, mini dataset - Distillation:
- Expected EER: 10.5-11.5% (compare with V2: 10.32%)
- Script: train_mlp_mixer_rawwaveform_distillation.sh
Success Criteria:
- Baseline EER < 14%: Validates raw waveform approach
- Distillation EER ≤ 10.5%: Matches/beats mel-based V2 (replace mel preprocessing)
- Distillation EER 10.5-11.5%: Competitive (use case dependent)
- Distillation EER > 11.5%: Mel preprocessing superior (archive experiment)
Technical Implementation:
1. SincNet Frontend (models/MLPMixerSpeaker_RawWaveform.py):
- Learnable low cutoff frequencies (initialized 30 Hz - 7.6 kHz)
- Learnable bandwidths (initialized 23 Hz - 261 Hz)
- Mel-scale spacing initialization
- Hamming window for filter smoothing
2. Feature Extraction:
- Conv1d(80, 80, k=5) → LeakyReLU → MaxPool(3)
- Conv1d(80, 80, k=5) → LeakyReLU → MaxPool(3)
- Instance normalization of learned features
3. Testing (test_mlp_mixer_rawwaveform.py):
- All tests passed ✓
- Forward pass validated (multiple input lengths)
- Filter initialization verified
- Parameter count confirmed: 3.48M
Files Created:
models/MLPMixerSpeaker_RawWaveform.py (389 lines)
- SincConv_fast: Learnable bandpass filters
- MLPMixerSpeakerNet_RawWaveform: Main model
configs/mlp_mixer_rawwaveform_baseline.yaml (95 lines)
- Phase 1: Baseline training configuration
configs/mlp_mixer_rawwaveform_distillation.yaml (97 lines)
- Phase 2: Distillation training configuration
test_mlp_mixer_rawwaveform.py (106 lines)
- Validation suite (all tests passed)
train_mlp_mixer_rawwaveform_baseline.sh
- Phase 1 training script
train_mlp_mixer_rawwaveform_distillation.sh
- Phase 2 training script
README_RAW_WAVEFORM_EXPERIMENT.md (400+ lines)
- Comprehensive experiment documentation
- Architecture details
- Expected results and success metrics
- How-to guide
Zero Impact Guarantee:
- No modifications to existing files
- Separate model class (MLPMixerSpeaker_RawWaveform)
- Separate config files
- Uses existing training infrastructure
- Compatible with current distillation framework
References:
- SincNet: Ravanelli & Bengio, "Speaker Recognition from Raw Waveform
with SincNet", IEEE SLT 2018
- MLP-Mixer paper modifications (ID Conv, MFM, grouped projections)
- Previous experiments: V1 (16.13%), V2 (10.32%), V2_Large_lowAlpha (10.11%)
Next Steps:
1. Run baseline training: bash train_mlp_mixer_rawwaveform_baseline.sh
2. If successful (EER < 14%), run distillation training
3. Compare results with mel-based V2 (10.32% EER)
4. Document findings in research log
Status: ✓ Implementation complete, ready for training
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.