
[Bounty $1500] Granite Timeseries TTM-R1 (Tiny Time Mixer) Bring-Up Using TTNN APIs #32142

@sdawle-TT

Description


📝 Background

This bounty is for bringing up Granite Timeseries TTM-R1 (Tiny Time Mixer) using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).

Granite Timeseries TTM-R1 is a compact pre-trained foundation model developed by IBM Research for multivariate time-series forecasting. With fewer than 1 million parameters, it introduces the concept of "tiny" pre-trained models in the time-series domain, achieving zero-shot and few-shot forecasting performance that rivals models with billions of parameters.

Key Capabilities:

  • Ultra-Lightweight Foundation Model: < 1 million parameters
    • Smallest foundation model in time series forecasting
    • 500x smaller than TimesFM (500M params)
    • Efficient deployment on edge devices and resource-constrained environments
    • Fast inference with minimal memory footprint
  • Pre-trained on Massive Scale: 250 million public time-series samples
    • Diverse domains: energy, weather, finance, transportation, etc.
    • Various augmentation techniques for robustness
    • Zero-shot and few-shot forecasting capabilities
  • Tiny Time Mixer Architecture: Lightweight MLP-Mixer variant
    • Adaptive patching strategy
    • Lightweight mixing layers (time and channel)
    • Residual connections
    • Efficient normalization
  • Optimized for Specific Settings: Focused pre-training
    • Context length: 512
    • Forecast length: 96
    • Ideal for minutely to hourly resolutions (10 min, 15 min, 1 hour)
  • Point Forecasting: Direct prediction (not probabilistic)
    • Mean squared error (MSE) loss
    • Fast, deterministic predictions
    • Suitable for real-time applications
  • Zero-Shot and Few-Shot: Minimal data requirements
    • Zero-shot: Works out-of-the-box without fine-tuning
    • Few-shot: Fine-tune with minimal data (< 5% of dataset)
    • Rapid adaptation to new domains

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement Granite TTM-R1 model using TTNN APIs (Python)
  • Implement the Tiny Time Mixer architecture:
    • Adaptive patching layer (learns optimal patch size)
    • Patch embedding with lightweight projection
    • Lightweight Time-Mixing layers (MLP-Mixer style)
    • Lightweight Channel-Mixing layers (cross-variate dependencies)
    • Residual connections throughout
    • Normalization layers (efficient LayerNorm or similar)
    • Forecasting head for point predictions
  • Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
  • Supports zero-shot and few-shot forecasting:
    • Zero-shot: Use pre-trained weights directly without fine-tuning
    • Few-shot: Fine-tune with minimal data (< 5% of dataset)
    • Context length: 512 (optimized for this setting)
    • Forecast length: 96 (optimized for this setting)
  • Loads pre-trained weights from HuggingFace:
    • ibm-granite/granite-timeseries-ttm-r1 (< 1M parameter model)
  • Produces valid predictions on standard benchmarks (ETT, Weather, Electricity)
  • Output is verifiable: compare prediction accuracy against the PyTorch/HuggingFace reference (see the sketch after this list)
  • Achieves baseline performance targets:
    • Inference throughput: At least 500 sequences/second (tiny model advantage)
    • Latency: < 10ms for single sequence prediction (batch size 1)
    • Memory footprint: < 10MB model size
    • Zero-shot accuracy: Within 10% of published results
  • Accuracy evaluation:
    • MSE and MAE within 5% of PyTorch reference implementation
    • Zero-shot performance on multiple datasets
    • Few-shot performance with limited training data
  • Clear instructions for setup, loading pre-trained weights, and inference
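
For the output-verification bullet above, a minimal sketch of how Stage 1 accuracy could be checked is shown below. It assumes a hypothetical `ttnn_ttm_forward` entry point for the TTNN port and a reference model loaded as in the HuggingFace section later in this issue; the `prediction_outputs` field name follows the HuggingFace reference and may differ between package versions. This is an illustration, not a required harness.

```python
import torch

def dataset_mse(preds, targets):
    """Plain MSE over all elements."""
    return torch.mean((preds - targets) ** 2).item()

def verify_against_reference(ref_model, ttnn_ttm_forward, past_values, targets, rtol=0.05):
    """Compare benchmark MSE of the TTNN port against the PyTorch reference.

    `ttnn_ttm_forward` is a placeholder for the TTNN implementation's inference
    entry point; `targets` are the ground-truth future values for `past_values`.
    """
    with torch.no_grad():
        ref_preds = ref_model(past_values=past_values).prediction_outputs
    tt_preds = ttnn_ttm_forward(past_values)

    ref_mse = dataset_mse(ref_preds, targets)
    tt_mse = dataset_mse(tt_preds, targets)
    print(f"reference MSE={ref_mse:.4f}  ttnn MSE={tt_mse:.4f}")

    # Stage 1 target: MSE within ~5% of the PyTorch reference.
    assert abs(tt_mse - ref_mse) <= rtol * ref_mse
```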

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs for:
    • Tiny model (< 1M parameters, very efficient)
    • Adaptive patching layers
    • Lightweight mixing layers (time and channel)
    • Embedding layers
    • Forecasting head
  • Implement efficient sharding strategy for:
    • Lightweight MLP-Mixer blocks
    • Time-mixing operations
    • Channel-mixing operations
    • Residual connections
  • Fuse simple ops where possible:
    • Patching + embedding
    • Mixing layers (time and channel)
    • Normalization + linear layers
    • Residual connections
    • Activation functions
  • Store intermediate activations in L1 where beneficial (see the sketch after this list)
  • Use recommended TTNN/tt-metal MLP flows
  • Leverage TT library of fused ops for:
    • MLP blocks (lightweight version)
    • Normalization layers
    • Residual operations
  • Optimize patch-specific operations:
    • Adaptive patching strategy
    • Patch embedding
    • Efficient patch processing
  • Optimize mixing operations:
    • Lightweight time-mixing
    • Lightweight channel-mixing
    • Minimize transpose overhead
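
As a hedged illustration of the L1 and fusion bullets above, the sketch below keeps one lightweight mixing MLP sub-block resident in L1 and fuses the activation into the matmul. All shapes and weight names are illustrative, and the exact `ttnn` signatures (e.g. the `activation` argument of `ttnn.linear`) should be checked against your tt-metal version.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

def to_l1(t: torch.Tensor) -> ttnn.Tensor:
    # Move a torch tensor to device, tiled, resident in L1.
    return ttnn.from_torch(
        t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT,
        device=device, memory_config=ttnn.L1_MEMORY_CONFIG,
    )

# Illustrative shapes: (batch, num_patches, d_model) with a small hidden size.
x  = to_l1(torch.randn(1, 64, 192))
w1 = to_l1(torch.randn(192, 384))
w2 = to_l1(torch.randn(384, 192))

residual = x
h = ttnn.layer_norm(x)  # elementwise affine omitted for brevity
# Fused matmul + GELU with the output kept in L1 (assumes the `activation`
# argument is available in your build; otherwise apply ttnn.gelu separately).
h = ttnn.linear(h, w1, activation="gelu", memory_config=ttnn.L1_MEMORY_CONFIG)
h = ttnn.linear(h, w2, memory_config=ttnn.L1_MEMORY_CONFIG)
out = ttnn.add(h, residual)
# For time-mixing, the same pattern applies after permuting the patch axis
# into the last dimension (ttnn.permute), which is where transpose overhead
# should be watched.

print(ttnn.to_torch(out).shape)
ttnn.close_device(device)
```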

Stage 3 — Deeper Optimization

  • Maximize core counts used per inference
  • Implement deeper TT-specific optimizations:
    • Parallel processing of patches
    • Parallel time-mixing and channel-mixing
    • Efficient residual connections
    • Optimized normalization
    • Minimize memory movement (tiny model advantage)
  • Minimize prediction latency for ultra-fast inference
  • Batch processing for massive throughput (see the benchmark sketch after this list)
  • Optimize for tiny model characteristics:
    • Leverage < 1M parameters for extreme efficiency
    • Minimize weight loading overhead
    • Optimize for frequent model swaps (multi-tenant scenarios)
    • Cache-friendly inference patterns
  • Optimize adaptive patching:
    • Efficient patch size computation
    • Dynamic patching strategies
    • Minimize overhead
  • Pipeline mixing operations:
    • Overlap time-mixing and channel-mixing
    • Efficient sequential processing
  • Minimize memory and TM (tensor manipulation) overheads
  • Support for streaming inference (online forecasting)
  • Explore multi-model deployment (serve 1000s of TTM instances)
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretched goals:
    • 2000+ sequences/second throughput (tiny model enables this!)
    • < 5ms latency for single sequence prediction
    • < 5MB memory footprint for model
    • Support for edge deployment scenarios
    • Multi-model serving (100+ instances simultaneously)
  • Zero-shot performance within 5% of reference
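
A simple way to track the throughput and latency targets above is a warm-up-plus-steady-state timing loop like the sketch below. `ttnn_ttm_forward` is again a placeholder for your implementation's host-side entry point, and the warm-up iterations are there so program compilation and cache population are excluded from the measurement.

```python
import time
import torch

def benchmark(ttnn_ttm_forward, batch_size, context_len=512, channels=7,
              warmup=5, iters=50):
    """Report average latency and sequences/second for one batch size."""
    x = torch.randn(batch_size, context_len, channels)
    for _ in range(warmup):          # exclude compile/program-cache time
        ttnn_ttm_forward(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        ttnn_ttm_forward(x)
    dt = (time.perf_counter() - t0) / iters
    print(f"batch={batch_size:4d}  latency={dt * 1e3:7.2f} ms  "
          f"throughput={batch_size / dt:9.1f} seq/s")

# Example sweep (placeholder entry point):
# for bs in (1, 32, 256):
#     benchmark(ttnn_ttm_forward, bs)
```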

🧭 Guidance & Starting Points

Primary Resources

  • Use the TTNN model bring-up tech report as your primary reference
  • Reference MLP-Mixer implementations in tt-metal for mixing patterns
  • Use the HuggingFace Granite Timeseries TTM-R1 (model card) as the reference implementation
  • Use the IBM TSFM (Time Series Foundation Models) repository for architecture details
  • Refer to the TT fused-ops PR "Fuse YoloV4 leaky ReLU activations with convolution layers" (#29236) for optimization opportunities

HuggingFace Implementation Reference

The IBM Granite Timeseries TTM-R1 is available on HuggingFace:

Model Details:

  • Parameters: < 1 million (ultra-lightweight)
  • Pre-training: 250 million time-series samples
  • Architecture: Tiny Time Mixer (lightweight MLP-Mixer variant)
  • Optimized for: Context 512, Forecast 96
  • Resolution: Minutely to hourly (10 min, 15 min, 1 hour)
  • Type: Point forecasting (not probabilistic)
  • License: Apache 2.0

Key Features:

  • Zero-Shot Forecasting: Works out-of-the-box without fine-tuning (see the sketch after this list)
  • Few-Shot Learning: Fine-tune with < 5% of data
  • Adaptive Patching: Learns optimal patch size for input
  • Lightweight Mixing: Efficient time and channel mixing
  • Fast Inference: Minimal parameters enable rapid predictions
  • Small Memory: < 5MB model size
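
A minimal zero-shot run of the reference model might look like the sketch below. It assumes the `tsfm_public` package from IBM's granite-tsfm repository provides `TinyTimeMixerForPrediction` with a `past_values` input and a `prediction_outputs` output field, as described on the model card; verify the exact import path and field names against the card for your package version.

```python
import torch
from tsfm_public.models.tinytimemixer import TinyTimeMixerForPrediction

CONTEXT_LEN, FORECAST_LEN, NUM_CHANNELS = 512, 96, 7  # e.g. ETTh1 has 7 channels

model = TinyTimeMixerForPrediction.from_pretrained(
    "ibm-granite/granite-timeseries-ttm-r1"
)
model.eval()

# Multivariate history: (batch, context_length, num_channels)
past_values = torch.randn(1, CONTEXT_LEN, NUM_CHANNELS)

with torch.no_grad():
    out = model(past_values=past_values)

# Expected forecast shape: (batch, forecast_length, num_channels)
print(out.prediction_outputs.shape)  # -> torch.Size([1, 96, 7])
```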

🔎 Possible Approaches

Sequential Implementation Strategy

  1. Start from HuggingFace/IBM TSFM implementation and port components sequentially:

    • Begin with adaptive patching layer
    • Implement patch embedding
    • Implement single Tiny Time Mixer layer (time + channel mixing)
    • Replicate for all layers
    • Add forecasting head
    • Test zero-shot inference
    • Optionally add few-shot fine-tuning
  2. Leverage lightweight patterns:

    • Use efficient MLP implementations
    • Optimize for small hidden dimensions
    • Minimize overhead (the model is so small that framework overhead dominates)
    • Cache-friendly access patterns
  3. Progressive testing:

    • Start with synthetic data
    • Test zero-shot on standard benchmarks
    • Test few-shot with limited data
    • Validate against PyTorch reference
    • Measure inference speed (should be very fast!)
  4. Validate each component against the PyTorch reference (see the PCC sketch after this list):

    • Test adaptive patching outputs
    • Validate patch embedding
    • Check time-mixing layer
    • Check channel-mixing layer
    • Validate full model output
    • Compare zero-shot performance
  5. Test on standard benchmarks:

    • ETT datasets (optimized for hourly)
    • Weather dataset
    • Electricity dataset
    • Test zero-shot (no fine-tuning)
    • Test few-shot (< 5% data)
    • Compare with published results
  6. Optimize for tiny model:

    • Minimize per-inference overhead
    • Optimize weight loading (should be negligible)
    • Maximize throughput (tiny model enables high throughput)
    • Test multi-model serving scenarios
  7. Use TTNN profiling tools to identify bottlenecks:

    • Measure overhead vs. compute (overhead should be minimal)
    • Profile mixing layers
    • Identify any inefficiencies
    • Optimize for ultra-low latency
  8. Open a draft PR early to get feedback on your approach
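
For step 4, a per-component check can be as small as the sketch below: run a reference sub-module and its TTNN counterpart on the same random input and compare them with a Pearson correlation coefficient (PCC). `ref_block` and `ttnn_block_forward` are placeholders; use `ref_model.named_modules()` to locate the real layer names, and note that some reference blocks may return tuples rather than plain tensors.

```python
import torch

def pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation between two tensors, flattened."""
    a, b = a.flatten().float(), b.flatten().float()
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

def check_block(ref_block, ttnn_block_forward, shape, threshold=0.99):
    x = torch.randn(*shape)
    with torch.no_grad():
        ref_out = ref_block(x)          # may need unpacking for some blocks
    tt_out = ttnn_block_forward(x)      # placeholder for your TTNN layer
    score = pcc(ref_out, tt_out)
    print(f"{type(ref_block).__name__}: PCC = {score:.5f}")
    assert score >= threshold, "component output diverges from the reference"
```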

Alternative Approaches

  • Modular testing (see the module-summary sketch after this list):
    • Implement Tiny Time Mixer layer as standalone
    • Test and optimize
    • Scale to full model
  • Start simple:
    • Test with fewer layers initially
    • Gradually scale to full model
  • Leverage existing code:
    • Use PatchTSMixer as a starting point (similar architecture)
    • Adapt for Tiny Time Mixer specifics
    • Add adaptive patching
  • Progressive features:
    • Start with zero-shot inference
    • Add few-shot fine-tuning (optional)
    • Add multi-model serving (stretch goal)
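
For the modular and incremental approaches above, it helps to dump the reference model's sub-module tree and parameter counts first and pick a porting order from that. A small hedged helper, reusing a reference model loaded as in the zero-shot sketch earlier:

```python
def summarize_modules(ref_model, max_depth=3):
    """Print sub-module names, types, and parameter counts up to max_depth."""
    for name, module in ref_model.named_modules():
        if not name or name.count(".") >= max_depth:
            continue
        n_params = sum(p.numel() for p in module.parameters())
        print(f"{name:<60s} {type(module).__name__:<30s} {n_params:>10,d}")

# summarize_modules(model)   # `model` from the zero-shot sketch above
```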

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review

Links:


📚 Resources

Model Resources

Related Models

  • Granite TTM Family: IBM's Tiny Time Mixer models
    • TTM-R1: < 1M params, 512→96
    • Other variants for different settings
  • PatchTSMixer: IBM's larger model with a similar architecture
  • TimesFM: Google's large foundation model (500M params)

Datasets & Benchmarks

Primary Source (Recommended):

Individual Datasets:

Benchmark Scripts:

  • TSLib provides standard evaluation scripts
  • Consistent train/val/test splits across all datasets
  • MSE, MAE metrics computed uniformly

Tiny Model Resources

  • Advantages of Tiny Models:
    • Edge deployment
    • Multi-model serving
    • Low latency
    • Minimal resources
    • Cost-effective
  • Optimization Techniques:
    • Knowledge distillation
    • Efficient architectures
    • Pre-training strategies
    • Few-shot learning

Academic Resources

  • Original Paper: IBM Research, "Tiny Time Mixers", 2024
  • Key Insights:
    • < 1M params achieves SOTA zero-shot
    • Adaptive patching improves efficiency
    • Few-shot with < 5% data
    • Lightweight mixing is sufficient
  • Related Work:
    • MLP-Mixer (vision)
    • PatchTSMixer (larger time series model)
    • Efficient transformers

TT-Metal Resources

Helpful Tools

  • Weight Loading: HuggingFace from_pretrained integration
  • Visualization:
    • Zero-shot vs. supervised comparison
    • Few-shot learning curves
    • Model size comparison
  • Profiling: TTNN profiler for tiny model
  • Testing: pytest framework (see the scaffold sketch below)
  • Benchmarking: IBM's benchmark notebooks
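
As a final hedged example, the pieces above can be wired into a small pytest suite; `my_ttm_ttnn` and `ttnn_ttm_forward` are hypothetical names for the submission's module and entry point.

```python
import pytest
import torch

from my_ttm_ttnn import ttnn_ttm_forward  # hypothetical module/entry point

@pytest.mark.parametrize("batch", [1, 8])
def test_forecast_shape(batch):
    # Smoke test: context 512 in, forecast 96 out, channel count preserved.
    x = torch.randn(batch, 512, 7)
    out = ttnn_ttm_forward(x)
    assert tuple(out.shape) == (batch, 96, 7)
```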
