📝 Background
This bounty is for bringing up the Higgs Audio v2 model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Higgs Audio v2 is a cutting-edge text-audio foundation model from Boson AI that redefines expressiveness in audio generation. Released in 2025, it offers state-of-the-art capabilities:
Exceptional expressiveness: Industry-leading performance on emotion and prosody benchmarks
Multi-speaker dialog: Natural multi-speaker conversations with distinct voices
Voice cloning: Zero-shot voice cloning from reference audio
Diverse audio generation: Speech, sound effects, music, and environmental sounds
Massive training scale: Trained on 10 million hours of audio data (AudioVerse dataset)
Custom tokenizer: Unified semantic and acoustic tokenization
DualFFN architecture: Enhanced LLM for acoustic modeling with minimal overhead
Apache 2.0 license: Fully open source and commercially usable
The model achieves 75.71% win-rate on EmergentTTS-Eval emotions and best-in-class multi-speaker generation quality.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency audio generation across diverse use cases including virtual assistants, audiobooks, content creation, and interactive media.
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement Higgs Audio v2 using TTNN APIs (Python)
- Implement the full generation pipeline:
  - LLM backbone with DualFFN architecture
  - Audio tokenizer (semantic + acoustic features)
  - Audio decoder (token-to-waveform conversion)
- Model runs on either N150 or N300 Tenstorrent hardware with no errors
- Supports multiple generation modes:
  - Text-to-speech: generate speech from text only
  - Voice cloning: generate speech with reference audio
  - Multi-speaker dialog: generate conversations with distinct speakers
- Produces valid audio output on sample texts (English and multilingual)
- Output is verifiable (audio quality assessment, compare with PyTorch reference)
- Achieves baseline throughput targets:
  - At least 30 tokens/second for autoregressive generation
  - Real-time factor (RTF) < 0.5 for typical sentences
- Accuracy evaluation: token-level accuracy > 95% against the PyTorch reference
- Audio quality: Passes intelligibility and expressiveness tests
- Clear instructions for setup and running the model
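The throughput and RTF targets above reduce to two simple ratios, sketched below so submissions measure them consistently. The timed `generate()` call in the comment is a hypothetical stand-in for your TTNN pipeline, not an actual API.

```python
import time  # time.perf_counter() is the usual wall-clock source for these metrics

def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Autoregressive decode rate; the Stage 1 target is >= 30 tok/s."""
    return num_tokens / elapsed_s

def real_time_factor(elapsed_s: float, audio_s: float) -> float:
    """Wall-clock generation time divided by the duration of audio produced.
    RTF < 1.0 means faster than real time; the Stage 1 target is < 0.5."""
    return elapsed_s / audio_s

# Hypothetical usage around your pipeline:
#   start = time.perf_counter()
#   tokens, audio_s = generate(text)              # stand-in for the TTNN pipeline
#   elapsed = time.perf_counter() - start
#   print(tokens_per_second(len(tokens), elapsed), real_time_factor(elapsed, audio_s))
```

For example, 120 tokens generated in 4 s is 30 tok/s, and 4 s spent producing 10 s of audio is an RTF of 0.4 — just inside both Stage 1 targets.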
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for LLM layers
- Implement an efficient sharding strategy for:
  - Token embeddings (text + audio tokens)
  - DualFFN transformer layers
  - Multi-head attention mechanisms
  - Audio tokenizer encoder/decoder
- Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal LLM flows
- Leverage TT library of fused ops for attention and MLP blocks
- Optimize DualFFN architecture (dual feed-forward networks)
- Efficient KV-cache management for autoregressive generation
- Optimize audio tokenizer integration
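Independent of the TTNN calls themselves, the shard-shape arithmetic behind a sharded memory config can be sanity-checked in plain Python. The helper below is a hypothetical illustration of width sharding (each core keeps the full height and an even slice of the width), assuming the common constraint that dimensions stay aligned to TTNN's 32x32 tiles.

```python
def width_shard_shape(height: int, width: int, num_cores: int, tile: int = 32):
    """Width sharding: each core holds the full height and width/num_cores
    columns. TTNN tensors are tile-based, so shard dimensions should remain
    multiples of the 32x32 tile size."""
    if width % num_cores != 0:
        raise ValueError("pad the width to a multiple of the core count")
    shard_w = width // num_cores
    if shard_w % tile != 0:
        raise ValueError("shard width must stay tile-aligned (multiple of 32)")
    return (height, shard_w)
```

For instance, a (32, 4096) activation width-sharded over a 64-core grid yields (32, 64) shards — one tile-aligned column slice per core. The same arithmetic, with roles swapped, applies to height and block sharding.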
Stage 3 — Deeper Optimization
- Maximize core counts used per inference
- Implement deeper TT-specific optimizations:
  - Efficient KV-cache management for long sequences
  - Optimized DualFFN computation (parallel FFN paths)
  - Flash Attention or equivalent for attention layers
- Minimize token generation latency
- Batch processing for multiple utterances/speakers
- Efficient sampling strategies (temperature, top-p, top-k)
- Pipeline audio encoding/decoding with LLM generation
- Minimize memory and TM (tensor manipulation) overheads
- Explore speculative decoding or other acceleration techniques
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals:
  - 60+ tokens/second generation speed
  - RTF < 0.2 for real-time applications
- Support for longer contexts (multi-turn dialog)
- Efficient multi-speaker handling
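The temperature/top-p/top-k sampling called for above is host-side logic that can be prototyped before any device work. The sketch below is one common formulation (temperature-scaled softmax, then a top-k cut, then a nucleus cut), written in NumPy as a reference, not as the Higgs Audio sampler.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature -> softmax -> optional top-k truncation -> nucleus (top-p) cut."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(z - z.max())               # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    if top_k > 0:
        order = order[:top_k]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1  # smallest prefix covering top_p
    keep = order[:max(1, cutoff)]
    p = probs[keep] / probs[keep].sum()       # renormalize surviving mass
    return int(rng.choice(keep, p=p))
```

With `top_k=1` this degenerates to greedy decoding, which is also the mode to use when checking token-level accuracy against the PyTorch reference.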
🧭 Guidance & Starting Points
- Use the TTNN model bring-up tech report as your primary reference
- Reference LLM implementations in tt-metal for LLM-based model patterns
- Use the official Higgs Audio v2 repository for model architecture details
- Refer to the Higgs Audio v2 blog for technical details
- Refer to tokenizer blog for audio tokenizer details
- Refer to DualFFN architecture blog for architecture innovations
- Refer to TT Fused ops PR #29236 for optimization opportunities
- The model architecture consists of:
  - Audio tokenizer: unified semantic and acoustic feature extraction
  - LLM backbone: transformer with DualFFN architecture
  - DualFFN layers: dual feed-forward networks for text and audio tokens
  - Audio decoder: token-to-waveform conversion (24 kHz output)
- Key challenges:
  - DualFFN architecture optimization (dual FFN paths)
  - Large vocabulary (text + audio tokens)
  - Autoregressive generation latency
  - Audio tokenizer integration
  - Multi-speaker context management
- Ask for help or file issues if ops are missing in TTNN
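The core DualFFN idea — attention is shared across all positions, but each position's feed-forward path is chosen by token type — can be sketched independently of TTNN. Here `text_ffn` and `audio_ffn` are stand-ins for the two FFN sub-networks; the boolean routing mask is the piece a TTNN port ultimately has to express efficiently.

```python
import numpy as np

def dual_ffn(hidden, is_audio, text_ffn, audio_ffn):
    """Route each sequence position through the FFN matching its token type.
    hidden: (seq, d_model) activations after shared attention.
    is_audio: boolean mask of shape (seq,), True for audio-token positions."""
    out = np.empty_like(hidden)
    out[~is_audio] = text_ffn(hidden[~is_audio])   # text positions -> text FFN
    out[is_audio] = audio_ffn(hidden[is_audio])    # audio positions -> audio FFN
    return out
```

On device, naive boolean gathering like this is expensive; practical ports typically run both FFNs and blend with the mask, or batch same-type positions together — a trade-off worth documenting in Stage 2/3.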
🔎 Possible Approaches
- Start from the official Higgs Audio v2 repository and port components sequentially:
  - End-to-end pipeline integration
  - Validate each submodule's output against the PyTorch reference before integration
- For example, for the LLM backbone:
  - Start with standard transformer layers
  - Add the DualFFN architecture (the key innovation)
  - Optimize attention mechanisms
  - Implement an efficient KV-cache
- Experiment with different sharding strategies
- Use TTNN profiling tools to identify bottlenecks
- Test diverse use cases
- Open a draft PR early to get feedback on your approach
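Per-submodule validation against the PyTorch reference can reuse one small comparator; Stage 1's >95% token-level agreement target reduces to a positional match rate over greedy-decoded outputs:

```python
def token_accuracy(reference, candidate):
    """Fraction of positions where the TTNN output token matches the PyTorch
    reference token; compared over the shorter of the two sequences."""
    n = min(len(reference), len(candidate))
    if n == 0:
        return 0.0
    return sum(r == c for r, c in zip(reference, candidate)) / n
```

For floating-point intermediates (attention scores, FFN activations), an analogous per-tensor check such as PCC or max-abs-error is the usual complement, since bit-exact agreement is not expected across backends.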
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you would like us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report + header for final review
📚 Resources
Model Resources
- Higgs Audio v2 Official Repository
- Higgs Audio v2 Release Blog
- Audio Tokenizer Technical Blog
- DualFFN Architecture Blog
- Boson AI Website
Evaluation Resources
TT-Metal Resources