[Bounty $1500] Bark Small bring up using TTNN APIs #32069


Description

@tvardhineniTT

📝 Background

This bounty is for bringing up the Bark Small model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Bark is a transformer-based text-to-audio model created by Suno that goes beyond simple text-to-speech. Released in April 2023 under the MIT license, it offers unique capabilities:

  • Multilingual speech: Highly realistic speech generation in 13 languages
  • Beyond speech: Can generate music, background noise, and sound effects
  • Expressive audio: Produces nonverbal communication (laughing, sighing, crying)
  • Three-stage architecture: Text → Semantic → Coarse → Fine tokens
  • High-quality output: 24 kHz mono audio with natural prosody
  • EnCodec integration: Uses Facebook's EnCodec codec with 8 codebooks
  • Voice presets: Consistent speaker characteristics across generations
  • Open source: MIT license, commercially usable

This bounty targets the Small version: 80M parameters per stage (240M total), making it more efficient than the full Bark model while maintaining strong audio quality.
The model is applicable to accessibility tools, audiobook narration, content creation, interactive media, game audio, podcasting, and other creative audio applications.
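
For orientation, here is a minimal sketch of the PyTorch reference path via HuggingFace transformers; the checkpoint name, processor usage, and voice preset are based on the public Bark-small release and may need adjusting to your transformers version:

```python
# Sketch: generating audio with the PyTorch reference (HuggingFace transformers).
import scipy.io.wavfile
import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

# Voice presets keep speaker characteristics consistent across generations.
inputs = processor("Hello, my dog is cooler than you! [laughs]",
                   voice_preset="v2/en_speaker_6")

with torch.no_grad():
    audio = model.generate(**inputs)  # runs all three stages plus the EnCodec decode

sample_rate = model.generation_config.sample_rate  # 24 kHz mono output
scipy.io.wavfile.write("bark_reference.wav", rate=sample_rate,
                       data=audio.cpu().numpy().squeeze())
```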

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement all three Bark Small models using TTNN APIs (Python):
    Text-to-semantic model (80M parameters)
    Semantic-to-coarse model (80M parameters)
    Coarse-to-fine model (80M parameters)
  • Models run on either N150 or N300 Tenstorrent hardware with no errors
  • Integrates EnCodec decoder for token-to-waveform conversion
  • Produces valid 24 kHz audio output on sample texts (multiple languages)
  • Generates expressive audio with nonverbal elements (using annotations like [laughs], [sighs])
  • Output is verifiable (audio quality assessment and comparison with the PyTorch reference)
  • Achieves baseline throughput target:
    At least 20 tokens/second for semantic generation
    At least 60 tokens/second for coarse/fine generation
  • Overall real-time factor (RTF) < 0.8 for typical sentences (a measurement sketch follows this list)
  • Accuracy evaluation: token-level accuracy > 95% against the HuggingFace reference
  • Audio quality: Passes intelligibility and naturalness tests
  • Clear instructions for setup and running the model
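
As a rough illustration of the throughput, RTF, and accuracy targets above, here is a hedged measurement sketch; `generate_with_ttnn` is a hypothetical wrapper around your three-stage TTNN pipeline, not an existing API:

```python
# Sketch: measuring real-time factor (RTF) and per-stage token throughput.
import time

SAMPLE_RATE = 24_000  # Bark outputs 24 kHz mono audio

def measure(text: str):
    start = time.perf_counter()
    # Hypothetical wrapper around the TTNN pipeline; assumed to return the decoded
    # waveform plus {stage_name: (num_tokens, seconds)} stats collected per stage.
    waveform, stage_stats = generate_with_ttnn(text)
    elapsed = time.perf_counter() - start

    rtf = elapsed / (len(waveform) / SAMPLE_RATE)  # target: < 0.8 (stretch: < 0.4)
    throughput = {name: tokens / seconds           # semantic >= 20 tok/s,
                  for name, (tokens, seconds) in stage_stats.items()}  # coarse/fine >= 60 tok/s
    return rtf, throughput
```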

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for all three transformer models (a memory-config sketch follows this list)
  • Implement efficient sharding strategy for each stage:
    Text-to-semantic: BERT tokenizer input, causal attention, 10k vocab output
    Semantic-to-coarse: Semantic input, causal attention, 2×1024 codebook output
    Coarse-to-fine: Coarse input, non-causal attention, 6×1024 codebook output
  • Fuse simple ops where possible (e.g., layer normalization, attention softmax)
  • Store intermediate activations in L1 where beneficial
  • Use recommended TTNN/tt-metal transformer flows
  • Leverage the TT library of fused ops for attention and MLP blocks
  • Optimize the multi-stage pipeline (minimize data transfer between stages)
  • Efficient handling of multiple codebook outputs (2 then 6 codebooks)
  • Optimize causal vs. non-causal attention patterns
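
As one concrete illustration of the memory-config and L1 items above, here is a minimal sketch of height-sharding an activation tensor into L1 and running a layer norm on it; shapes, core grid, and sharding strategy are placeholders rather than tuned values, and the ttnn calls shown are assumptions based on the current TTNN API:

```python
# Sketch: shard activations across the core grid and keep them in L1.
import ttnn

def sharded_layer_norm(device, x_torch, gamma_torch, beta_torch):
    # Start from an interleaved tensor on device.
    x = ttnn.from_torch(x_torch, dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)

    # Height-shard the activations across an 8x8 core grid so they live in L1.
    sharded_config = ttnn.create_sharded_memory_config(
        x_torch.shape,
        core_grid=ttnn.CoreGrid(y=8, x=8),
        strategy=ttnn.ShardStrategy.HEIGHT,
        orientation=ttnn.ShardOrientation.ROW_MAJOR,
    )
    x = ttnn.to_memory_config(x, sharded_config)

    gamma = ttnn.from_torch(gamma_torch, dtype=ttnn.bfloat16,
                            layout=ttnn.TILE_LAYOUT, device=device)
    beta = ttnn.from_torch(beta_torch, dtype=ttnn.bfloat16,
                           layout=ttnn.TILE_LAYOUT, device=device)

    # Fused layer norm that reads and writes the sharded L1 tensor.
    return ttnn.layer_norm(x, weight=gamma, bias=beta, epsilon=1e-5,
                           memory_config=sharded_config)
```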

Stage 3 — Deeper Optimization

  • Maximize the number of cores used per inference
  • Implement deeper TT-specific optimizations:
    Pipeline all three models efficiently (overlap computation)
    Optimize causal attention for stages 1-2 (autoregressive)
    Optimize non-causal attention for stage 3 (can parallelize)
  • Optimize EnCodec decoder integration
  • Batch processing for multiple audio segments
  • Efficient voice preset loading and switching
  • Minimize memory and TM (tensor manipulation) overheads
  • Explore faster sampling strategies (a host-side sampling sketch follows this list)
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretch goals:
    RTF < 0.4 for real-time applications
    Support for longer text inputs (500+ characters)
    Efficient voice preset switching
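
On the sampling point above, here is a host-side sketch of temperature plus top-k sampling over logits read back from the device; the parameter values are placeholders, though the reference Bark generation exposes similar temperature and top-k/top-p knobs:

```python
# Sketch: host-side temperature + top-k sampling over a 1-D logits tensor.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7,
                      top_k: int = 50) -> int:
    logits = logits / max(temperature, 1e-5)

    # Keep only the top-k logits; everything else is dropped before softmax.
    topk_vals, topk_idx = torch.topk(logits, k=min(top_k, logits.shape[-1]))
    probs = torch.softmax(topk_vals, dim=-1)

    # Draw one token from the truncated distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])
```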

🧭 Guidance & Starting Points

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference existing transformer implementations in tt-metal for common patterns
  • Use the HuggingFace Bark-small model as the PyTorch reference implementation
  • Use the original Bark repository for architecture details
  • Reference EnCodec repository for codec integration
  • Refer to TT Fused ops PR #29236 for optimization opportunities
  • The three-stage architecture:
    • Stage 1 - Text to Semantic: BERT tokenizer → 80M transformer (causal) → 10k semantic vocab
    • Stage 2 - Semantic to Coarse: Semantic tokens → 80M transformer (causal) → 2×1024 EnCodec codebooks
    • Stage 3 - Coarse to Fine: Coarse tokens → 80M transformer (non-causal) → 6×1024 EnCodec codebooks
    • EnCodec Decoder: 8 codebooks → 24 kHz waveform (a decode sketch follows this list)
  • Model architecture per stage (Small version):
    • Parameters: 80M per stage (240M total)
    • Attention: Causal (stages 1-2), Non-causal (stage 3)
    • Output: 24 kHz mono audio
  • Key challenges:
    • Multi-stage pipeline optimization
    • Multiple codebook outputs (8 total: 2 then 6)
    • Causal vs. non-causal attention patterns
    • Integration with external codec (EnCodec)
    • Handling diverse output types (speech, music, effects, nonverbal)
  • Input format:
    • Plain text: "Hello, my dog is cooler than you!"
    • With emotions: "Hello [laughs] this is amazing [sighs]"
    • Voice presets available for consistent speakers
  • Ask for help or file issues if ops are missing in TTNN
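
To make the final EnCodec step concrete, here is a minimal decode sketch assuming the facebookresearch/encodec package; Bark's 8 codebooks correspond to the 6 kbps setting of the 24 kHz model, and `fine_tokens` below is a placeholder for the output of the coarse-to-fine stage:

```python
# Sketch: decode 8 EnCodec codebooks into a 24 kHz waveform.
import torch
from encodec import EncodecModel

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks at 24 kHz

# Placeholder for the [batch, 8 codebooks, n_frames] long tensor from stage 3.
fine_tokens = torch.zeros(1, 8, 750, dtype=torch.long)

with torch.no_grad():
    # decode() takes a list of (codes, scale) frames; Bark uses unscaled codes.
    waveform = codec.decode([(fine_tokens, None)])  # [batch, 1, n_samples]
```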

🔎 Possible Approaches

  • Start from the HuggingFace Bark-small implementation and port components sequentially
  • Validate each stage's output against the PyTorch reference before moving to the next stage (a comparison sketch follows this list)
  • Optimize attention patterns (causal vs. non-causal)
  • Minimize data transfer between stages
  • Use TTNN profiling tools to identify bottlenecks
  • Test diverse use cases:
    Plain speech in multiple languages (English, Spanish, Chinese, etc.)
    Speech with emotions using annotations ([laughs], [sighs], [gasps])
    Background music/sound effect generation
    Voice presets for consistent speakers
  • Open a draft PR early to get feedback on your approach
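
For the per-stage validation step above, here is a hedged sketch of a token-level comparison against the PyTorch reference; `run_reference_stage` and `run_ttnn_stage` are hypothetical per-stage wrappers that return 1-D token-id tensors for the same input under greedy or identically seeded decoding:

```python
# Sketch: token-level agreement between the TTNN port and the PyTorch reference.
import torch

def token_accuracy(stage_name: str, stage_input) -> float:
    ref_tokens = run_reference_stage(stage_name, stage_input)  # hypothetical wrapper
    tt_tokens = run_ttnn_stage(stage_name, stage_input)        # hypothetical wrapper

    # Compare over the overlapping length in case the sequence lengths diverge.
    n = min(len(ref_tokens), len(tt_tokens))
    matches = (ref_tokens[:n] == tt_tokens[:n]).float().mean().item()
    print(f"{stage_name}: {matches:.2%} token match over {n} tokens")
    return matches  # Stage 1 target: > 95%
```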

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you would like us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review

📚 Resources

Model Resources

TT-Metal Resources
