An implementation of speculative decoding for OpenAI's Whisper ASR models, using Whisper Tiny as the draft model and Whisper Large V3 as the target model. Speculative decoding accelerates inference while maintaining identical output quality.
This project implements the assignment requirements for speculative decoding on Whisper:
- Draft Model: Whisper Tiny generates candidate tokens quickly
- Target Model: Whisper Large V3 verifies and refines the output
- Result: Faster inference with same quality as standard Large V3
- Speedup over standard Whisper Large V3 decoding
- Identical output distribution - maintains exact same quality
- Batch processing for multiple audio files
- WER evaluation to compare accuracy
- Configurable parameters for tuning performance
- GPU/CPU support with auto-detection
- Python 3.8+
- PyTorch 2.0+
- OpenAI Whisper
```bash
# Install dependencies
pip install -r requirements.txt
```

```python
from src import SpeculativeWhisper

# Initialize with Tiny (draft) and Large V3 (target)
sw = SpeculativeWhisper(
    draft_model="tiny",
    final_model="large-v3",
    device="cuda",  # or "cpu"
)

# Transcribe audio files
audio_files = ["audio1.wav", "audio2.wav"]
outputs = sw.transcribe(audio_files, max_tokens=200, batch_size=2)
for audio, text in zip(audio_files, outputs):
    print(f"{audio}: {text}")

# Get detailed performance statistics
results = sw.transcribe(audio_files, return_stats=True)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Acceptance Rate: {result['stats']['overall_acceptance_rate']:.2%}")
    print(f"Speedup: {result['stats']['avg_tokens_per_iteration']:.2f}x")
```

```bash
cd examples
python3 benchmark_comparison.py
```

This compares:
- Standard Whisper Large V3 - baseline performance
- Speculative Decoding (Tiny → Large V3) - accelerated version
Output includes:
- Transcription results
- Time taken for each approach
- Speedup achieved
- Token acceptance rate
```bash
cd examples
python3 evaluate_accuracy.py
```

Calculates Word Error Rate (WER) for both approaches to verify that accuracy is maintained.
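For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between the reference and hypothesis transcripts, divided by the reference word count. A minimal self-contained sketch of the metric (the project's own implementation lives in `src/metrics.py` and may differ in detail):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: word-level Levenshtein distance divided by
    # the number of words in the reference.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[-1][-1] / max(len(ref), 1)

# One dropped word out of seven -> WER of 1/7
print(wer("ask not what your country can do", "ask what your country can do"))
```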
Control the behavior of speculative decoding:

```python
from src import SpeculativeWhisper, SpeculativeConfig

config = SpeculativeConfig(
    gamma=4,                   # Tokens to generate per iteration
    acceptance_threshold=0.8,  # Probability threshold for acceptance
    temperature=0.0,           # 0 = greedy, >0 = sampling
    use_adaptive_gamma=True,   # Auto-adjust gamma based on acceptance
)

sw = SpeculativeWhisper(
    draft_model="tiny",
    final_model="large-v3",
    config=config,
)
```

| Parameter | Description | Default | Tuning Advice |
|---|---|---|---|
| `gamma` | Number of draft tokens per iteration | 4 | Higher = more speculation, lower acceptance |
| `acceptance_threshold` | Min probability ratio to accept | 0.8 | Lower = more aggressive (0.7-0.9) |
| `use_adaptive_gamma` | Auto-adjust gamma | True | Usually helps |
| `temperature` | Sampling temperature | 0.0 | 0 = greedy (deterministic) |
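For illustration, one plausible adaptive-gamma rule (an assumption for this sketch, not necessarily what the project implements) speculates more when draft tokens are being accepted and less when they are being rejected:

```python
def adapt_gamma(gamma: int, acceptance_rate: float,
                lo: float = 0.5, hi: float = 0.9,
                min_gamma: int = 1, max_gamma: int = 8) -> int:
    # Hypothetical rule: grow gamma when the draft model is doing well
    # (high acceptance), shrink it when speculation is mostly wasted.
    # The thresholds lo/hi and the bounds are illustrative defaults.
    if acceptance_rate > hi:
        return min(gamma + 1, max_gamma)
    if acceptance_rate < lo:
        return max(gamma - 1, min_gamma)
    return gamma

print(adapt_gamma(4, 0.95))  # high acceptance -> speculate more
print(adapt_gamma(4, 0.30))  # low acceptance -> speculate less
```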
- Draft Generation: Whisper Tiny quickly generates γ candidate tokens
- Parallel Verification: Whisper Large V3 verifies all candidates in one pass
- Acceptance/Rejection:
  - Accept tokens where the target and draft agree
  - On a mismatch, sample from the adjusted distribution
- Repeat: Continue until end-of-text or the maximum length is reached
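The loop above can be sketched in a few lines. This is a toy greedy (temperature 0) variant with stand-in "models" over a 10-token vocabulary, not the Whisper implementation; the key property it demonstrates is that every emitted token is exactly what greedy decoding with the target alone would produce:

```python
import random

VOCAB = list(range(10))
EOT = 9  # hypothetical end-of-text token for this toy example

def target_next(ctx):
    # Toy "target model": deterministic greedy choice given the context.
    return (sum(ctx) + len(ctx)) % 10

def draft_next(ctx):
    # Toy "draft model": cheap, and agrees with the target most of the time.
    return target_next(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_greedy(prompt, gamma=4, max_tokens=20):
    tokens = list(prompt)
    while len(tokens) < max_tokens and tokens[-1] != EOT:
        # 1. Draft generation: gamma candidate tokens, autoregressively.
        drafts = []
        for _ in range(gamma):
            drafts.append(draft_next(tokens + drafts))
        # 2. Parallel verification: the real target scores all positions
        #    in one forward pass; here we simply recompute each one.
        verified = [target_next(tokens + drafts[:i]) for i in range(gamma)]
        # 3. Accept the longest agreeing prefix, then take the target's
        #    token at the first mismatch (greedy correction).
        n = 0
        while n < gamma and drafts[n] == verified[n]:
            n += 1
        tokens += drafts[:n]
        if n < gamma:
            tokens.append(verified[n])
        else:
            # All drafts accepted: the target pass also yields a bonus token.
            tokens.append(target_next(tokens))
    return tokens

random.seed(0)
print(speculative_greedy([1, 2], gamma=4, max_tokens=15))
```

Each iteration emits at least one token (the correction or bonus token), so a single target pass always makes progress even when every draft token is rejected.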
- Draft model (Tiny) is much faster than target (Large V3)
- Target model verifies multiple tokens in parallel (single forward pass)
- Good acceptance rate means fewer target model calls
- No quality loss - output distribution matches standard decoding exactly
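To make the "fewer target model calls" point concrete: if each draft token is accepted independently with probability alpha (an idealized assumption), one target forward pass yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation, the standard result from the speculative decoding literature:

```python
def expected_tokens_per_target_call(alpha: float, gamma: int) -> float:
    # Expected tokens produced per target forward pass, counting the
    # bonus/correction token, under an i.i.d. acceptance model with
    # per-token acceptance probability alpha.
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    rate = expected_tokens_per_target_call(alpha, gamma=4)
    print(f"alpha={alpha}: {rate:.2f} tokens per target call")
```

With gamma=4 and an 80% acceptance rate this gives about 3.4 tokens per target call, versus exactly 1 for standard decoding; the real-world speedup is lower because the draft model's own passes are not free.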
```bash
cd examples
python3 api_usage_example.py
```

Shows the simple API from the assignment.

```bash
cd examples
python3 basic_usage.py
```

Demonstrates direct use of `SpeculativeWhisperDecoder`.

```bash
cd examples
python3 benchmark_comparison.py
```

Compares standard vs. speculative Large V3 decoding.

```bash
cd examples
python3 evaluate_accuracy.py
```

Measures WER to verify quality is maintained.
```
speculative-whisper/
├── src/
│   ├── api.py                    # SpeculativeWhisper API class
│   ├── speculative_decoder.py    # Core algorithm
│   ├── draft_model.py            # Draft model implementations
│   ├── metrics.py                # WER/CER calculations
│   └── config.py                 # Configuration
├── examples/
│   ├── api_usage_example.py      # API demo (assignment format)
│   ├── benchmark_comparison.py   # Speculative vs Standard comparison
│   ├── evaluate_accuracy.py      # WER evaluation
│   └── basic_usage.py            # Low-level usage
├── tests/
│   ├── test_config.py
│   ├── test_speculative_decoder.py
│   └── test_audio_samples/jfk.flac
└── requirements.txt
```
- OpenAI Whisper team
- Speculative decoding research by Google and DeepMind
- PyTorch team