📝 Background
This bounty is for bringing up Granite Timeseries TTM-R1 (Tiny Time Mixer) using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Granite Timeseries TTM-R1 is a revolutionary compact pre-trained foundation model developed by IBM Research for multivariate time-series forecasting. With less than 1 million parameters, it introduces the concept of "tiny" pre-trained models in the time series domain, achieving state-of-the-art performance that rivals models with billions of parameters in zero-shot and few-shot forecasting scenarios.
Key Capabilities:
- Ultra-Lightweight Foundation Model: < 1 million parameters
- Smallest foundation model in time series forecasting
- 500x smaller than TimesFM (500M params)
- Efficient deployment on edge devices and resource-constrained environments
- Fast inference with minimal memory footprint
- Pre-trained on Massive Scale: 250 million public time-series samples
- Diverse domains: energy, weather, finance, transportation, etc.
- Various augmentation techniques for robustness
- Zero-shot and few-shot forecasting capabilities
- Tiny Time Mixer Architecture: Lightweight MLP-Mixer variant
- Adaptive patching strategy
- Lightweight mixing layers (time and channel)
- Residual connections
- Efficient normalization
- Optimized for Specific Settings: Focused pre-training
- Context length: 512
- Forecast length: 96
- Ideal for minutely to hourly resolutions (10 min, 15 min, 1 hour)
- Point Forecasting: Direct prediction (not probabilistic)
- Mean squared error (MSE) loss
- Fast, deterministic predictions
- Suitable for real-time applications
- Zero-Shot and Few-Shot: Minimal data requirements
- Zero-shot: Works out-of-the-box without fine-tuning
- Few-shot: Fine-tune with minimal data (< 5% of dataset)
- Rapid adaptation to new domains
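To make the architecture bullets above concrete, here is a minimal PyTorch sketch of one TTM-style mixer block: time mixing across patches and channel mixing across variates, each behind a pre-norm residual. The tensor layout, dimension names, and expansion factor are illustrative assumptions rather than the exact IBM implementation; the IBM TSFM repository linked in Resources has the authoritative layer definitions.

```python
import torch
import torch.nn as nn

class MixerMLP(nn.Module):
    """Two-layer MLP applied along the last dimension (expansion factor is illustrative)."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):
        return self.net(x)

class TinyMixerBlock(nn.Module):
    """One TTM-style block: time mixing over patches, then channel mixing over variates.

    Assumed input layout: (batch, channels, num_patches, d_model).
    """
    def __init__(self, num_channels: int, num_patches: int, d_model: int):
        super().__init__()
        self.norm_time = nn.LayerNorm(d_model)
        self.time_mix = MixerMLP(num_patches)   # mixes along the patch (time) axis
        self.norm_chan = nn.LayerNorm(d_model)
        self.chan_mix = MixerMLP(num_channels)  # mixes along the variate (channel) axis

    def forward(self, x):
        # Time mixing: move num_patches to the last dim, mix, move back, add residual.
        y = self.norm_time(x).transpose(-1, -2)    # (B, C, D, P)
        x = x + self.time_mix(y).transpose(-1, -2)
        # Channel mixing: move channels to the last dim, mix, move back, add residual.
        y = self.norm_chan(x).permute(0, 3, 2, 1)  # (B, D, P, C)
        x = x + self.chan_mix(y).permute(0, 3, 2, 1)
        return x

# Example: 8 series, 7 variates (ETT-style), 64 patches, hidden size 16 (all illustrative).
block = TinyMixerBlock(num_channels=7, num_patches=64, d_model=16)
print(block(torch.randn(8, 7, 64, 16)).shape)  # torch.Size([8, 7, 64, 16])
```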
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement the Granite TTM-R1 model using TTNN APIs (Python)
- Implement the Tiny Time Mixer architecture:
- Adaptive patching layer (learns optimal patch size)
- Patch embedding with lightweight projection
- Lightweight Time-Mixing layers (MLP-Mixer style)
- Lightweight Channel-Mixing layers (cross-variate dependencies)
- Residual connections throughout
- Normalization layers (efficient LayerNorm or similar)
- Forecasting head for point predictions
- Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
- Supports zero-shot and few-shot forecasting:
- Zero-shot: Use pre-trained weights directly without fine-tuning
- Few-shot: Fine-tune with minimal data (< 5% of dataset)
- Context length: 512 (optimized for this setting)
- Forecast length: 96 (optimized for this setting)
- Loads pre-trained weights from HuggingFace:
ibm-granite/granite-timeseries-ttm-r1 (< 1M parameter model); see the weight-loading sketch after this list
- Produces valid predictions on standard benchmarks (ETT, Weather, Electricity)
- Output is verifiable (prediction accuracy compared against the PyTorch/HuggingFace reference)
- Achieves baseline performance targets:
- Inference throughput: At least 500 sequences/second (tiny model advantage)
- Latency: < 10ms for single sequence prediction (batch size 1)
- Memory footprint: < 10MB model size
- Zero-shot accuracy: Within 10% of published results
- Accuracy evaluation:
- MSE and MAE within 5% of PyTorch reference implementation
- Zero-shot performance on multiple datasets
- Few-shot performance with limited training data
- Clear instructions for setup, loading pre-trained weights, and inference
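The hedged sketch below shows one way to satisfy the weight-loading requirement above: pull the pre-trained state dict through the reference implementation, then convert each tensor to a TTNN tensor on device. The tsfm_public import is an assumption taken from the model card; ttnn.open_device, ttnn.from_torch, and ttnn.close_device are standard TTNN entry points, but per-tensor dtype and layout choices will need tuning for your kernels.

```python
import torch
import ttnn
# Assumption: the IBM tsfm repository exposes this class as shown on the model card;
# verify the exact import path there.
from tsfm_public import TinyTimeMixerForPrediction

device = ttnn.open_device(device_id=0)

# Load the pre-trained PyTorch weights once, then convert them to TTNN tensors.
reference = TinyTimeMixerForPrediction.from_pretrained("ibm-granite/granite-timeseries-ttm-r1")
state_dict = reference.state_dict()

tt_weights = {}
for name, tensor in state_dict.items():
    # bfloat16 + tile layout is a common starting point for matmul/linear weights;
    # small 1-D tensors (biases, norm params) may be better off in row-major layout.
    tt_weights[name] = ttnn.from_torch(
        tensor,
        dtype=ttnn.bfloat16,
        layout=ttnn.TILE_LAYOUT,
        device=device,
    )

print(f"Converted {len(tt_weights)} tensors to TTNN")
ttnn.close_device(device)
```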
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for:
- Tiny model (< 1M parameters, very efficient)
- Adaptive patching layers
- Lightweight mixing layers (time and channel)
- Embedding layers
- Forecasting head
- Implement efficient sharding strategy for:
- Lightweight MLP-Mixer blocks
- Time-mixing operations
- Channel-mixing operations
- Residual connections
- Fuse simple ops where possible:
- Patching + embedding
- Mixing layers (time and channel)
- Normalization + linear layers
- Residual connections
- Activation functions
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal MLP flows
- Leverage TT library of fused ops for:
- MLP blocks (lightweight version)
- Normalization layers
- Residual operations
- Optimize patch-specific operations:
- Adaptive patching strategy
- Patch embedding
- Efficient patch processing
- Optimize mixing operations:
- Lightweight time-mixing
- Lightweight channel-mixing
- Minimize transpose overhead
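To illustrate the kind of fusion and memory-config decisions this stage asks for, the fragment below keeps a small mixing-layer MLP resident in L1 and folds the GELU into the first linear. Shapes are placeholders; memory_config and activation are standard ttnn.linear options, but the right sharding and program configs depend on your actual tensor sizes and should be checked against the TTNN documentation.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Placeholder mixing-layer weights (d_model and expansion are illustrative).
d_model = 64
w1 = ttnn.from_torch(torch.randn(d_model, 2 * d_model), dtype=ttnn.bfloat16,
                     layout=ttnn.TILE_LAYOUT, device=device)
w2 = ttnn.from_torch(torch.randn(2 * d_model, d_model), dtype=ttnn.bfloat16,
                     layout=ttnn.TILE_LAYOUT, device=device)
x = ttnn.from_torch(torch.randn(32, d_model), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# First linear with the GELU fused into the matmul; activations stay in L1.
h = ttnn.linear(x, w1, activation="gelu", memory_config=ttnn.L1_MEMORY_CONFIG)
# Second linear plus the residual add, also resident in L1.
y = ttnn.linear(h, w2, memory_config=ttnn.L1_MEMORY_CONFIG)
y = ttnn.add(y, x, memory_config=ttnn.L1_MEMORY_CONFIG)

print(ttnn.to_torch(y).shape)
ttnn.close_device(device)
```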
Stage 3 — Deeper Optimization
- Maximize the number of cores used per inference
- Implement deeper TT-specific optimizations:
- Parallel processing of patches
- Parallel time-mixing and channel-mixing
- Efficient residual connections
- Optimized normalization
- Minimize memory movement (tiny model advantage)
- Minimize prediction latency for ultra-fast inference
- Batch processing for massive throughput
- Optimize for tiny model characteristics:
- Leverage < 1M parameters for extreme efficiency
- Minimize weight loading overhead
- Optimize for frequent model swaps (multi-tenant scenarios)
- Cache-friendly inference patterns
- Optimize adaptive patching:
- Efficient patch size computation
- Dynamic patching strategies
- Minimize overhead
- Pipeline mixing operations:
- Overlap time-mixing and channel-mixing
- Efficient sequential processing
- Minimize memory and TM (tensor manipulation) overheads
- Support for streaming inference (online forecasting)
- Explore multi-model deployment (serve 1000s of TTM instances)
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals:
- 2000+ sequences/second throughput (tiny model enables this!)
- < 5ms latency for single sequence prediction
- < 5MB memory footprint for model
- Support for edge deployment scenarios
- Multi-model serving (100+ instances simultaneously)
- Zero-shot performance within 5% of reference
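As a starting point for reporting against the throughput and latency targets listed above, a host-side timing loop like the one below usually gives the first numbers; TTNN's profiler and the performance report header linked in Resources should back them up. ttm_forward is a hypothetical name for your end-to-end TTNN inference call.

```python
import time
import torch

def benchmark(run_fn, batch, warmup: int = 10, iters: int = 100):
    """Host-side latency/throughput measurement around an inference callable."""
    for _ in range(warmup):          # warm up program compilation, caches, etc.
        run_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        run_fn(batch)
    elapsed = time.perf_counter() - start
    seqs = batch.shape[0] * iters
    print(f"latency/iter: {1e3 * elapsed / iters:.2f} ms, "
          f"throughput: {seqs / elapsed:.0f} sequences/s")

# Example usage with a hypothetical end-to-end TTNN forward pass:
#   benchmark(lambda b: ttm_forward(device, tt_weights, b), torch.randn(64, 512, 7))
```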
🧭 Guidance & Starting Points
Primary Resources
- Use the TTNN model bring-up tech report as your primary reference
- Reference MLP-Mixer implementations in tt-metal for mixing patterns
- Use the HuggingFace Granite Timeseries TTM-R1 (model card) as the reference implementation
- Use the IBM TSFM (Time Series Foundation Models) repository for architecture details
- Refer to the TT fused ops PR #29236 (Fuse YoloV4 leaky ReLU activations with convolution layers) for optimization opportunities
HuggingFace Implementation Reference
The IBM Granite Timeseries TTM-R1 is available on HuggingFace:
Model Details:
- Parameters: < 1 million (ultra-lightweight)
- Pre-training: 250 million time-series samples
- Architecture: Tiny Time Mixer (lightweight MLP-Mixer variant)
- Optimized for: Context 512, Forecast 96
- Resolution: Minutely to hourly (10 min, 15 min, 1 hour)
- Type: Point forecasting (not probabilistic)
- License: Apache 2.0
Key Features:
- Zero-Shot Forecasting: Works out-of-the-box without fine-tuning
- Few-Shot Learning: Fine-tune with < 5% of data
- Adaptive Patching: Learns optimal patch size for input
- Lightweight Mixing: Efficient time and channel mixing
- Fast Inference: Minimal parameters enable rapid predictions
- Small Memory: < 5MB model size
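For a reference baseline to validate the TTNN port against, zero-shot inference with the HuggingFace checkpoint looks roughly like the sketch below. The TinyTimeMixerForPrediction import path and the prediction_outputs field are assumptions taken from the model card and the IBM TSFM repository; confirm them there before relying on this.

```python
import torch
# Assumption: installed from the IBM tsfm repository; check the model card for the exact package.
from tsfm_public import TinyTimeMixerForPrediction

model = TinyTimeMixerForPrediction.from_pretrained("ibm-granite/granite-timeseries-ttm-r1")
model.eval()

# Dummy multivariate input: batch of 2 series, context length 512, 7 channels (ETT-style).
past_values = torch.randn(2, 512, 7)

with torch.no_grad():
    outputs = model(past_values=past_values)

# Assumed output field and shape: (batch, forecast_length, channels) = (2, 96, 7).
print(outputs.prediction_outputs.shape)
```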
🔎 Possible Approaches
Sequential Implementation Strategy
- Start from the HuggingFace/IBM TSFM implementation and port components sequentially:
- Begin with adaptive patching layer
- Implement patch embedding
- Implement single Tiny Time Mixer layer (time + channel mixing)
- Replicate for all layers
- Add forecasting head
- Test zero-shot inference
- Optionally add few-shot fine-tuning
- Leverage lightweight patterns:
- Use efficient MLP implementations
- Optimize for small hidden dimensions
- Minimize overhead (model is so small, overhead matters!)
- Cache-friendly access patterns
- Progressive testing:
- Start with synthetic data
- Test zero-shot on standard benchmarks
- Test few-shot with limited data
- Validate against PyTorch reference
- Measure inference speed (should be very fast!)
- Validate each component against the PyTorch reference (see the comparison sketch after this list):
- Test adaptive patching outputs
- Validate patch embedding
- Check time-mixing layer
- Check channel-mixing layer
- Validate full model output
- Compare zero-shot performance
- Test on standard benchmarks:
- ETT datasets (optimized for hourly)
- Weather dataset
- Electricity dataset
- Test zero-shot (no fine-tuning)
- Test few-shot (< 5% data)
- Compare with published results
- Optimize for the tiny model:
- Minimize per-inference overhead
- Optimize weight loading (should be negligible)
- Maximize throughput (tiny model enables high throughput)
- Test multi-model serving scenarios
- Use TTNN profiling tools to identify bottlenecks:
- Measure overhead vs. compute (overhead should be minimal)
- Profile mixing layers
- Identify any inefficiencies
- Optimize for ultra-low latency
- Open a draft PR early to get feedback on your approach
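For the per-component validation step above, a small comparison helper keeps the TTNN-versus-PyTorch check uniform across layers. The PCC threshold here is an illustrative choice; the binding accuracy targets are the Stage 1 MSE/MAE bounds.

```python
import torch

def compare_with_reference(tt_out: torch.Tensor, ref_out: torch.Tensor, pcc_threshold: float = 0.99):
    """Compare a TTNN layer output (converted back to torch) with the PyTorch reference."""
    a, b = tt_out.flatten().float(), ref_out.flatten().float()
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()   # Pearson correlation
    max_abs = (a - b).abs().max().item()
    print(f"PCC={pcc:.5f}  max|diff|={max_abs:.5f}")
    assert pcc >= pcc_threshold, "output diverges from the PyTorch reference"

# Example (module names are placeholders for your own code):
#   ref = reference_block(x)                      # PyTorch reference
#   tt  = ttnn.to_torch(ttnn_block(tt_x))         # TTNN implementation
#   compare_with_reference(tt, ref)
```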
Alternative Approaches
- Modular testing:
- Implement Tiny Time Mixer layer as standalone
- Test and optimize
- Scale to full model
- Start simple:
- Test with fewer layers initially
- Gradually scale to full model
- Leverage existing code:
- Use PatchTSMixer as a starting point (similar architecture; see the sketch after this list)
- Adapt for Tiny Time Mixer specifics
- Add adaptive patching
- Progressive features:
- Start with zero-shot inference
- Add few-shot fine-tuning (optional)
- Add multi-model serving (stretch goal)
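As referenced from the "leverage existing code" item above, the PatchTSMixer model that ships with transformers is a convenient structural reference, since TTM is a lighter variant of the same mixer idea. The snippet only instantiates a default-configured model and prints its parameter shapes for study; the defaults are PatchTSMixer's, not TTM's.

```python
# PatchTSMixer is available in recent versions of transformers (4.36+).
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

config = PatchTSMixerConfig()              # default config, used only to study the layer structure
model = PatchTSMixerForPrediction(config)

total = 0
for name, param in model.named_parameters():
    total += param.numel()
    print(f"{name:70s} {tuple(param.shape)}")
print(f"total parameters: {total:,}")
```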
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all 3 stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report (using the performance report header linked in Resources) for final review
Links:
📚 Resources
Model Resources
- HuggingFace Model Card: https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1
- IBM TSFM Repository: https://github.com/IBM/tsfm (Time Series Foundation Models)
- Research Paper: "Tiny Time Mixers (TTM): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting"
- Authors: IBM Research
- arXiv: https://arxiv.org/abs/2401.03955
- Blog Post: IBM Research announcement
- Demo Notebooks: https://github.com/IBM/tsfm/tree/main/notebooks/hfdemo/tinytimemixer
Related Models
- Granite TTM Family: IBM's Tiny Time Mixer models
- TTM-R1: < 1M params, 512→96
- Other variants for different settings
- PatchTSMixer: IBM's larger model with a similar architecture
- TimesFM: Google's large foundation model (500M params)
Datasets & Benchmarks
Primary Source (Recommended):
- Time Series Library (TSLib): https://github.com/thuml/Time-Series-Library
- Contains all preprocessed datasets in consistent format
- Used by most recent papers for benchmarking
- Includes train/val/test splits
Individual Datasets:
- ETT (Electricity Transformer Temperature):
- ETTh1, ETTh2 (hourly, 7 features)
- ETTm1, ETTm2 (15-minute, 7 features)
- GitHub: https://github.com/zhouhaoyi/ETDataset
- Also in TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/ETT
- Weather Dataset:
- 21 meteorological indicators from 21 weather stations
- 10-minute intervals, 2020 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/weather
- Traffic Dataset:
- Road occupancy rates from 862 Bay Area sensors
- Hourly data, 2015-2016
- Source: PeMS (http://pems.dot.ca.gov/)
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/traffic
- Electricity (ECL) Dataset:
- Hourly consumption from 321 clients
- 2012-2014 data
- UCI: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/electricity
- Exchange Rate Dataset:
- Daily exchange rates for 8 countries
- 1990-2016 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/exchange_rate
Benchmark Scripts:
- TSLib provides standard evaluation scripts
- Consistent train/val/test splits across all datasets
- MSE, MAE metrics computed uniformly
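For the benchmark evaluation itself, windowing and scoring can be as simple as the sketch below: slide a context of 512 over the series and score the 96-step forecast with MSE and MAE. The CSV handling assumes the TSLib ETT layout (a date column followed by feature columns) and skips the per-channel standardization most papers apply; treat both as assumptions to adjust for your setup.

```python
import numpy as np
import pandas as pd

CONTEXT, HORIZON = 512, 96

def make_windows(values: np.ndarray, stride: int = HORIZON):
    """Yield (context, target) pairs from a (time, channels) array."""
    for start in range(0, len(values) - CONTEXT - HORIZON + 1, stride):
        yield (values[start:start + CONTEXT],
               values[start + CONTEXT:start + CONTEXT + HORIZON])

def mse_mae(pred: np.ndarray, target: np.ndarray):
    err = pred - target
    return float((err ** 2).mean()), float(np.abs(err).mean())

# ETTh1.csv from TSLib: a timestamp column named "date" plus 7 feature columns.
df = pd.read_csv("ETTh1.csv")
values = df.drop(columns=["date"]).to_numpy(dtype=np.float32)

# Example scoring loop with a placeholder `forecast` callable (your TTNN model wrapper):
#   scores = [mse_mae(forecast(ctx), tgt) for ctx, tgt in make_windows(values)]
#   print(np.mean(scores, axis=0))  # [MSE, MAE]
```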
Tiny Model Resources
- Advantages of Tiny Models:
- Edge deployment
- Multi-model serving
- Low latency
- Minimal resources
- Cost-effective
- Optimization Techniques:
- Knowledge distillation
- Efficient architectures
- Pre-training strategies
- Few-shot learning
Academic Resources
- Original Paper: IBM Research, "Tiny Time Mixers", 2024
- Key Insights:
- < 1M params achieves SOTA zero-shot
- Adaptive patching improves efficiency
- Few-shot with < 5% data
- Lightweight mixing is sufficient
- Related Work:
- MLP-Mixer (vision)
- PatchTSMixer (larger time series model)
- Efficient transformers
TT-Metal Resources
- TTNN Model Bring-up Tech Report: [Link to tech report]
- MLP-Mixer Implementations: Check for MLP-based models
- Lightweight Model Patterns: Optimization for small models
- TT Fused Ops PR #29236: Fuse YoloV4 leaky ReLU activations with convolution layers
- Performance Report Header: https://github.com/tenstorrent/tt-metal/blob/main/tests/docs/perf_header.md
- TTNN Documentation: https://github.com/tenstorrent/tt-metal/tree/main/ttnn
Helpful Tools
- Weight Loading: HuggingFace from_pretrained integration
- Visualization:
- Zero-shot vs. supervised comparison
- Few-shot learning curves
- Model size comparison
- Profiling: TTNN profiler for tiny model
- Testing: pytest framework
- Benchmarking: IBM's benchmark notebooks