
[Bounty $1500] Higgs Audio v2 bring up using TTNN APIs #32068

@tvardhineniTT

Description


📝 Background

This bounty is for bringing up the Higgs Audio v2 model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Higgs Audio v2 is a cutting-edge text-audio foundation model from Boson AI that redefines expressiveness in audio generation. Released in 2025, it offers state-of-the-art capabilities:
  • Exceptional expressiveness: Industry-leading performance on emotion and prosody benchmarks
  • Multi-speaker dialog: Natural multi-speaker conversations with distinct voices
  • Voice cloning: Zero-shot voice cloning from reference audio
  • Diverse audio generation: Speech, sound effects, music, and environmental sounds
  • Massive training scale: Trained on 10 million hours of audio data (the AudioVerse dataset)
  • Custom tokenizer: Unified semantic and acoustic tokenization
  • DualFFN architecture: Enhanced LLM for acoustic modeling with minimal overhead
  • Apache 2.0 license: Fully open source and commercially usable
The model achieves 75.71% win-rate on EmergentTTS-Eval emotions and best-in-class multi-speaker generation quality.
The goal is to enable this model to run on TT hardware for high-throughput, low-latency audio generation across diverse use cases including virtual assistants, audiobooks, content creation, and interactive media.

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement Higgs Audio v2 using TTNN APIs (Python)
  • Implement the full generation pipeline:
    • LLM backbone with DualFFN architecture
    • Audio tokenizer (semantic + acoustic features)
    • Audio decoder (token-to-waveform conversion)
  • Run on either N150 or N300 Tenstorrent hardware with no errors
  • Support multiple generation modes:
    • Text-to-speech: generate speech from text only
    • Voice cloning: generate speech conditioned on reference audio
    • Multi-speaker dialog: generate conversations with distinct speakers
  • Produce valid audio output on sample texts (English and multilingual)
  • Make output verifiable (audio quality assessment; comparison against the PyTorch reference)
  • Achieve the baseline throughput targets (see the measurement sketch after this list):
    • At least 30 tokens/second for autoregressive generation
    • Real-time factor (RTF) < 0.5 for typical sentences
  • Accuracy evaluation: token-level accuracy > 95% against the PyTorch reference
  • Audio quality: passes intelligibility and expressiveness tests
  • Provide clear instructions for setup and running the model
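
For reference, a minimal sketch of how the throughput, RTF, and token-accuracy targets above could be measured. `generate_fn` and its return values are placeholders for your own pipeline entry point, not an API from the Higgs Audio or tt-metal repos:

```python
import time

def measure_throughput_and_rtf(generate_fn, text, sample_rate=24_000):
    # `generate_fn` is a placeholder for your TTNN generation entry point;
    # assumed to return (audio_token_ids, waveform_samples).
    start = time.perf_counter()
    token_ids, waveform = generate_fn(text)
    elapsed = time.perf_counter() - start

    tokens_per_s = len(token_ids) / elapsed        # target: >= 30
    audio_seconds = len(waveform) / sample_rate    # 24 kHz decoder output
    rtf = elapsed / audio_seconds                  # target: < 0.5
    return tokens_per_s, rtf

def token_accuracy(tt_ids, ref_ids):
    # Token-level match rate against the PyTorch reference (target: > 95%).
    matches = sum(int(a == b) for a, b in zip(tt_ids, ref_ids))
    return matches / max(len(ref_ids), 1)
```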

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for LLM layers (see the sketch after this list)
  • Implement an efficient sharding strategy for:
    • Token embeddings (text + audio tokens)
    • DualFFN transformer layers
    • Multi-head attention mechanisms
    • Audio tokenizer encoder/decoder
  • Fuse simple ops where possible (e.g., layer normalization, attention patterns, activation functions)
  • Store intermediate activations in L1 where beneficial
  • Use the recommended TTNN/tt-metal LLM flows
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize the DualFFN architecture (dual feed-forward networks)
  • Implement efficient KV-cache management for autoregressive generation
  • Optimize audio tokenizer integration
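
As one illustration of the memory-config items above, a minimal sketch of a width-sharded FFN block kept in L1. It assumes the ttnn Python API of recent tt-metal releases (`create_sharded_memory_config`, `to_memory_config`, `linear` with a fused activation); check the TTNN docs for the exact signatures on your version. The shapes and the `gelu` activation are illustrative, not the model's actual MLP:

```python
import ttnn

def sharded_ffn(x, w1, b1, w2, b2):
    # Width-shard the activation across an 8x8 core grid so the matmuls
    # read their operand from L1 rather than DRAM.
    mem_cfg = ttnn.create_sharded_memory_config(
        shape=x.shape,  # e.g. [1, 1, 32, 4096] (hypothetical)
        core_grid=ttnn.CoreGrid(y=8, x=8),
        strategy=ttnn.ShardStrategy.WIDTH,
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
    )
    x = ttnn.to_memory_config(x, mem_cfg)
    # Fused matmul + activation keeps the intermediate activation in L1.
    h = ttnn.linear(x, w1, bias=b1, activation="gelu",
                    memory_config=ttnn.L1_MEMORY_CONFIG)
    return ttnn.linear(h, w2, bias=b2, memory_config=ttnn.L1_MEMORY_CONFIG)
```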

Stage 3 — Deeper Optimization

  • Maximize the number of cores used per inference
  • Implement deeper TT-specific optimizations:
    • Efficient KV-cache management for long sequences
    • Optimized DualFFN computation (parallel FFN paths)
    • Flash Attention or equivalent for attention layers
  • Minimize token generation latency
  • Batch processing for multiple utterances/speakers
  • Efficient sampling strategies (temperature, top-p, top-k; see the sketch after this list)
  • Pipeline audio encoding/decoding with LLM generation
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore speculative decoding or other acceleration techniques
  • Document any advanced tuning, known limitations, or trade-offs
  • Stretch goals:
    • 60+ tokens/second generation speed
    • RTF < 0.2 for real-time applications
  • Support longer contexts (multi-turn dialog)
  • Efficient multi-speaker handling
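
For the sampling item above, a host-side sketch of temperature, top-k, and top-p filtering over a single logit row (plain PyTorch; on device, the same filtering could be staged with ttnn ops before a single readback):

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95):
    # logits: 1-D tensor over the combined (text + audio) vocabulary.
    logits = logits / max(temperature, 1e-5)

    # Top-k: keep only the k largest logits.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p (nucleus): drop the tail once cumulative probability passes p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    exclusive_cum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_probs[exclusive_cum > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```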

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference LLM implementations in tt-metal for LLM-based model patterns
  • Use the official Higgs Audio v2 repository for model architecture details
  • Refer to the Higgs Audio v2 blog for technical details
  • Refer to tokenizer blog for audio tokenizer details
  • Refer to DualFFN architecture blog for architecture innovations
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The model architecture consists of:
    • Audio tokenizer: Unified semantic and acoustic feature extraction
    • LLM backbone: Transformer with DualFFN architecture
    • DualFFN layers: dual feed-forward networks for text and audio tokens (see the sketch after this list)
    • Audio decoder: Token-to-waveform conversion (24 kHz output)
  • Key challenges:
    • DualFFN architecture optimization (dual FFN paths)
    • Large vocabulary (text + audio tokens)
    • Autoregressive generation latency
    • Audio tokenizer integration
    • Multi-speaker context management
  • Ask for help or file issues if ops are missing in TTNN
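
To make the DualFFN challenge concrete, a minimal PyTorch sketch of the idea as publicly described: each decoder layer routes audio tokens through a dedicated FFN while text tokens keep the original FFN, adding acoustic capacity with little extra compute per token. Module names, dimensions, and the plain SiLU MLP are hypothetical, not the reference implementation:

```python
import torch
import torch.nn as nn

class DualFFN(nn.Module):
    def __init__(self, d_model=2048, d_ff=8192):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.audio_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, hidden, is_audio):
        # hidden: [batch, seq, d_model]; is_audio: [batch, seq] bool mask.
        # Audio positions take the audio FFN path, text positions the
        # text path; both paths share the layer's attention output.
        out = torch.empty_like(hidden)
        out[~is_audio] = self.text_ffn(hidden[~is_audio])
        out[is_audio] = self.audio_ffn(hidden[is_audio])
        return out
```

On device, computing both FFN paths unconditionally and selecting per position with a mask may map better to tiled execution than gather/scatter; which wins is an optimization question for Stages 2-3.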

🔎 Possible Approaches

  • Start from the official Higgs Audio v2 repository and port components sequentially, finishing with end-to-end pipeline integration
  • Validate each submodule's output against the PyTorch reference before integration (see the PCC sketch after this list)
  • For example, for the LLM backbone:
    • Start with standard transformer layers
    • Add the DualFFN architecture (the key innovation)
    • Optimize attention mechanisms
    • Implement an efficient KV-cache
  • Experiment with different sharding strategies
  • Use TTNN profiling tools to identify bottlenecks
  • Test diverse use cases
  • Open a draft PR early to get feedback on your approach
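
A per-submodule validation sketch using the Pearson correlation coefficient (PCC), the metric commonly used in tt-metal model tests; the 0.99 threshold and the commented usage are illustrative:

```python
import torch

def pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    # Pearson correlation between flattened reference and TTNN outputs.
    a, b = a.flatten().float(), b.flatten().float()
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

# Illustrative usage for one ported submodule:
#   ref_out = torch_layer(x)
#   tt_out = ttnn.to_torch(tt_layer(ttnn.from_torch(x, layout=ttnn.TILE_LAYOUT)))
#   assert pcc(ref_out, tt_out) >= 0.99
```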

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review
