Summary
I trained the Stage1 VQ-VAE model from scratch on the complete VOCASET dataset (314 training samples) and compared it with the official checkpoint. Despite using the full dataset and training for 300 epochs, there remains a 32.1x gap in reconstruction loss. I'd like to understand what factors contribute to this gap and whether this performance is sufficient for Stage2 training.
Environment
- GPU: 2x NVIDIA RTX A6000 (49GB each)
- Framework: PyTorch 2.0.1 with PyTorch Lightning
- Dataset: VOCASET from ModelScope (complete version)
- Training samples: 314 (train) + 53 (val) + 53 (test)
Training Configuration
```yaml
# Stage1 VQ-VAE Training Config
MODEL:
  TYPE: vqvae
  n_vert: 15069
  n_embed: 256
  zquant_dim: 64
  hidden_size: 1024
  num_hidden_layers: 6
  num_attention_heads: 8
TRAIN:
  EPOCHS: 300
  BATCH_SIZE: 2
  LR: 1e-4
  GPUS: 2 (DDP)
  OPTIMIZER: Adam
  LR_SCHEDULER: StepLR
```
- Training time: ~55 minutes
- Total training steps: ~47,000
- Hardware: 2x RTX A6000
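
For context on what the two loss columns in the table below measure, here is a minimal sketch of a standard VQ-VAE reconstruction + quantization (codebook/commitment) loss using the codebook shape from this config. It illustrates the usual formulation only; the repo's actual Stage1 implementation and the weighting `beta` may differ and are assumptions here.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Illustrative only: codebook shape from the config above
# (n_embed=256 entries of dimension zquant_dim=64).
codebook = nn.Embedding(256, 64)

def vq_losses(z_e, x_hat, x, beta=0.25):
    # z_e:   encoder output latents, shape (B, T, 64)
    # x_hat: decoder output (reconstructed vertices), same shape as x
    # x:     ground-truth vertices, e.g. (B, T, 15069 * 3)
    dists = torch.cdist(z_e.reshape(-1, 64), codebook.weight)  # (B*T, 256)
    z_q = codebook(dists.argmin(dim=-1)).view_as(z_e)          # nearest codes
    recon_loss = F.mse_loss(x_hat, x)                           # first table column
    quant_loss = (F.mse_loss(z_q, z_e.detach())                 # codebook term
                  + beta * F.mse_loss(z_e, z_q.detach()))       # commitment term
    return recon_loss, quant_loss
```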
Results
Quantitative Comparison
Evaluated on validation set (50 batches, ~100 samples):
| Model | Reconstruction Loss (MSE) | Quantization Loss | Relative Gap (Recon / Quant) |
|---|---|---|---|
| Self-trained (Epoch 299) | 6.86e-6 ± 4.10e-6 | 4.23e-3 ± 6.15e-5 | 32.1x / 36.7x |
| Official checkpoint | 2.10e-7 ± 1.20e-7 | 1.15e-4 ± 5.34e-6 | 1.0x / 1.0x |
Key findings:
- Reconstruction loss: 32.1x higher than official
- Quantization loss: 36.7x higher than official
- Self-trained model reaches a reconstruction MSE on the order of 1e-6 (6.86e-6)
- Official checkpoint reaches a reconstruction MSE on the order of 1e-7 (2.10e-7)
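
For reference, a rough sketch of how such a side-by-side evaluation over 50 validation batches can be set up; `load_stage1_model`, `val_loader`, and the model's return signature are hypothetical placeholders, not this repo's actual API:

```python
import torch

@torch.no_grad()
def eval_checkpoint(model, val_loader, num_batches=50):
    """Mean/std of reconstruction and quantization loss over the val set.

    Assumes `model(batch)` returns (recon_loss, quant_loss); adapt to the
    repo's actual forward signature.
    """
    recon, quant = [], []
    for i, batch in enumerate(val_loader):
        if i >= num_batches:
            break
        recon_loss, quant_loss = model(batch)   # placeholder signature
        recon.append(recon_loss.item())
        quant.append(quant_loss.item())
    recon_t, quant_t = torch.tensor(recon), torch.tensor(quant)
    return (recon_t.mean(), recon_t.std()), (quant_t.mean(), quant_t.std())

# Hypothetical usage, comparing both checkpoints on the same loader:
# model_self = load_stage1_model('checkpoints/self_trained_epoch299.ckpt')
# model_official = load_stage1_model('checkpoints/voca_vae.ckpt')
# print(eval_checkpoint(model_self, val_loader))
# print(eval_checkpoint(model_official, val_loader))
```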
Training Steps Comparison
```python
import torch

# Official checkpoint metadata
official_checkpoint = torch.load('checkpoints/voca_vae.ckpt', map_location='cpu')
print(f"Epoch: {official_checkpoint['epoch']}")              # 199
print(f"Global step: {official_checkpoint['global_step']}")  # 62,800

# Self-trained checkpoint
# Epoch: 299
# Global step: ~47,000
# Training steps gap: 62,800 vs ~47,000 (~1.33x difference)
```
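
A back-of-envelope check (assuming `global_step` counts one optimizer step per batch, with steps per epoch ≈ number of training samples / batch size) suggests the two step counts are consistent with the configurations involved:

```python
# Assumption (not verified against the repo): one optimizer step per batch,
# steps_per_epoch ~= num_train_samples / batch_size.
num_train_samples = 314

# Self-trained run: batch size 2, 300 epochs
print(num_train_samples / 2 * 300)   # 47100.0 -> matches the reported ~47,000

# Official checkpoint: 62,800 steps over 200 epochs (final epoch index 199)
print(62_800 / 200)                  # 314.0 -> consistent with batch size 1
```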