feat(nemotron): add Nemotron Speech Streaming 0.6B support #254
base: main
Conversation
Claude: I'll analyze this and get back to you. (Claude encountered an error.)
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 4m 10s • 2026-01-15T20:13:40.526Z
ASR Benchmark Results ✅
Status: All benchmarks passed
- Parakeet v3 (multilingual)
- Parakeet v2 (English-optimized)
- Streaming (v3)
- Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming.
25 files per dataset • Test runtime: 5m13s • 01/15/2026, 03:24 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows the HuggingFace Open ASR Leaderboard.
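The RTFx metric used throughout these benchmark comments is just total audio duration divided by total processing time. A minimal sketch of that calculation (the function name is my own; the example figures are the ES2004a diarization numbers reported later in this thread):

```swift
import Foundation

/// Real-Time Factor: audio duration divided by processing time.
/// Higher is better; e.g. RTFx = 28 means 28x faster than real time.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// ES2004a diarization run: 1049.0s of meeting audio processed in 70.3s.
let rtfx = realTimeFactor(audioSeconds: 1049.0, processingSeconds: 70.3)
print(String(format: "%.1f", rtfx)) // ≈ 14.9
```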
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m49s • 01/15/2026, 03:16 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-score above 70%
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: the RTFx shown above is from the GitHub Actions runner, not Apple Silicon with ANE.
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 70.3s diarization time • Test runtime: 2m 46s • 01/15/2026, 03:06 PM EST
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with the Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 321.0s processing • Test runtime: 6m 45s • 01/15/2026, 03:16 PM EST
/// Configuration for Nemotron Speech Streaming 0.6B
/// Based on nvidia/nemotron-speech-streaming-en-0.6b with 1.12s chunks
public struct NemotronStreamingConfig: Sendable {
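The struct above is truncated in the diff view. As a sketch only of what such a config might carry: the PR notes that the real NemotronStreamingConfig loads its values from the model's metadata.json, and these field names are my assumptions, not the PR's actual API:

```swift
import Foundation

/// Sketch only: field names are assumptions; the real config is loaded
/// from the model variant's metadata.json.
public struct NemotronStreamingConfigSketch: Sendable {
    /// Audio fed to the encoder per streaming step (1.12s in the default variant).
    public let chunkSeconds: Double
    /// ASR models in this family typically expect 16 kHz mono input.
    public let sampleRate: Int

    /// Number of samples in one streaming chunk.
    public var samplesPerChunk: Int { Int(chunkSeconds * Double(sampleRate)) }
}

let config = NemotronStreamingConfigSketch(chunkSeconds: 1.12, sampleRate: 16_000)
print(config.samplesPerChunk) // 17920
```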
I'm getting a bit worried about how many different models we have, and whether it makes sense to keep adding different managers for them when the interface should be the same.
It makes this harder to maintain long term, and it's unclear to users which model to use and how.
I'm thinking if Sortformer is better we should just remove the older pyannote model, tbh.
Not sure if there are distinct cases where pyannote dominates Sortformer.
Force-pushed from 1d952d8 to cad45a1
Add streaming ASR support for NVIDIA's Nemotron Speech Streaming 0.6B model converted to CoreML. Features include:
- NemotronStreamingAsrManager: actor-based streaming ASR with encoder cache
- True streaming with 1.12s audio chunks and encoder state carryover
- Support for int8 and float32 encoder variants (int8 default, 4x smaller)
- RNNT greedy decoding with proper decoder LSTM state management
- NemotronBenchmark CLI command for LibriSpeech evaluation

Performance on LibriSpeech test-clean (100 files):
- WER: 1.99%
- RTFx: 8.6x (8.6 times faster than real time)
- Memory: 1.4 GB (with int8 encoder)

Models available at: alexwengg/nemotron-speech-streaming-en-0.6b-coreml
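The commit above describes an actor-based manager that carries encoder state across 1.12s chunks. A heavily simplified sketch of that shape; the CoreML encoder and RNNT decoder calls are stubbed out with placeholders, and every name here is an assumption rather than the PR's actual API:

```swift
import Foundation

/// Sketch of the actor-based streaming pattern described above.
/// The real manager runs a CoreML encoder (with cached state) and an
/// RNNT greedy decoder with LSTM state; both are stubbed here.
actor StreamingAsrSketch {
    private var encoderState: [Float] = []   // carried across chunks
    private var transcript: [String] = []

    /// Feed one chunk of 16 kHz mono samples; returns the running transcript.
    func feed(_ chunk: [Float]) -> String {
        // Real code: encoder(chunk, encoderState) -> (features, newState),
        // then RNNT greedy decode over features.
        encoderState = Array(chunk.suffix(4))          // placeholder carryover
        transcript.append("chunk#\(transcript.count)") // placeholder tokens
        return transcript.joined(separator: " ")
    }
}

// Usage: stream chunks sequentially through the actor.
let asr = StreamingAsrSketch()
let samples = [Float](repeating: 0, count: 17_920) // one 1.12s chunk at 16 kHz
let partial = await asr.feed(samples)
print(partial) // "chunk#0"
```

The actor serializes access to the mutable encoder/decoder state, which is why it fits a streaming ASR pipeline where chunks must be processed in order.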
…gging
- Fix encoder path to use subdirectory structure (encoder/encoder_int8.mlmodelc)
- Fix download destination to avoid double folder nesting
- Add AppLogger.alwaysLogToConsole for CLI release builds
- Include both int8 and float32 encoder variants in required models
- Models auto-download from HuggingFace on first run
Force-pushed from cad45a1 to b802917
Results on full test-clean dataset (2,620 files):
- WER: 2.51%
- RTFx: 5.7x
- Memory: 1.452 GB
Includes CLI commands for running benchmarks.
- Add NemotronChunkSize enum (1120ms, 560ms, 160ms, 80ms variants)
- Update Repo enum with chunk-size-specific variants pointing to FluidInference/nemotron-speech-streaming-en-0.6b-coreml
- NemotronStreamingConfig now loads dynamically from metadata.json
- Support both .mlmodelc and .mlpackage encoder formats
- Add --chunk CLI option to nemotron-benchmark command
- Auto-download correct model variant based on chunk size selection
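A sketch of a chunk-size enum mirroring the four variants named in this commit (1120ms, 560ms, 160ms, 80ms); the case names, raw values, and derived properties are my assumptions, not the PR's actual NemotronChunkSize definition:

```swift
/// Sketch of a chunk-size enum; raw value is the chunk length in milliseconds.
enum NemotronChunkSizeSketch: Int, CaseIterable {
    case ms1120 = 1120
    case ms560 = 560
    case ms160 = 160
    case ms80 = 80

    /// Chunk duration in seconds, e.g. 1.12 for the default variant.
    var seconds: Double { Double(rawValue) / 1000.0 }

    /// Samples per chunk at 16 kHz mono (16 samples per millisecond).
    var samples: Int { rawValue * 16 }
}

print(NemotronChunkSizeSketch.ms1120.seconds) // 1.12
print(NemotronChunkSizeSketch.ms80.samples)   // 1280
```

Tying the repo variant to an enum case like this is one way to make "auto-download the correct model for the selected chunk size" a compile-time-checked mapping rather than string plumbing.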
Simplified NemotronStreamingAsrManager to only support int8 quantized encoders:
- Replaced NemotronEncoderVariant enum with a simple NemotronEncoder filename constant
- Removed encoderVariant parameter from loadModels()
- Removed --encoder CLI flag from benchmark command

All HuggingFace model variants now contain only int8 quantized encoders (~564MB vs ~2.2GB for float32), so the float32 option is no longer needed.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
All HuggingFace variants now only include int8 quantized encoders. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Covers:
- Benchmark results for all chunk sizes (1120ms, 560ms, 160ms, 80ms)
- Quick start guide with code examples
- Architecture overview and streaming pipeline
- CLI benchmark usage
- Comparison with Parakeet TDT

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
- NemotronStreamingAsrManager with true streaming (1.12s chunks, encoder cache)
- nemotron-benchmark CLI command for LibriSpeech evaluation

Performance
On LibriSpeech test-clean (100 files):

Test plan
- Ran nemotron-benchmark --max-files 100 on LibriSpeech test-clean

Usage

🤖 Generated with Claude Code