📝 Background
This bounty is for bringing up Granite Timeseries TTM-R1 (Tiny Time Mixer) using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
Granite Timeseries TTM-R1 is a revolutionary compact pre-trained foundation model developed by IBM Research for multivariate time-series forecasting. With less than 1 million parameters, it introduces the concept of "tiny" pre-trained models in the time series domain, achieving state-of-the-art performance that rivals models with billions of parameters in zero-shot and few-shot forecasting scenarios.
Key Capabilities:
- Ultra-Lightweight Foundation Model: < 1 million parameters
- Smallest foundation model in time series forecasting
- 500x smaller than TimesFM (500M params)
- Efficient deployment on edge devices and resource-constrained environments
- Fast inference with minimal memory footprint
- Pre-trained on Massive Scale: 250 million public time-series samples
- Diverse domains: energy, weather, finance, transportation, etc.
- Various augmentation techniques for robustness
- Zero-shot and few-shot forecasting capabilities
- Tiny Time Mixer Architecture: Lightweight MLP-Mixer variant
- Adaptive patching strategy
- Lightweight mixing layers (time and channel)
- Residual connections
- Efficient normalization
- Optimized for Specific Settings: Focused pre-training
- Context length: 512
- Forecast length: 96
- Ideal for minutely to hourly resolutions (10 min, 15 min, 1 hour)
- Point Forecasting: Direct prediction (not probabilistic)
- Mean squared error (MSE) loss
- Fast, deterministic predictions
- Suitable for real-time applications
- Zero-Shot and Few-Shot: Minimal data requirements
- Zero-shot: Works out-of-the-box without fine-tuning
- Few-shot: Fine-tune with minimal data (< 5% of dataset)
- Rapid adaptation to new domains
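To make the architecture bullets above concrete, here is a minimal PyTorch sketch of one TTM-style mixer block: time mixing across patches and channel mixing across variates, each behind a pre-norm residual. The tensor layout, dimension names, and expansion factor are illustrative assumptions rather than the exact IBM implementation; the IBM TSFM repository linked in Resources has the authoritative layer definitions.

```python
import torch
import torch.nn as nn

class MixerMLP(nn.Module):
    """Two-layer MLP applied along the last dimension (expansion factor is illustrative)."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):
        return self.net(x)

class TinyMixerBlock(nn.Module):
    """One TTM-style block: time mixing over patches, then channel mixing over variates.

    Assumed input layout: (batch, channels, num_patches, d_model).
    """
    def __init__(self, num_channels: int, num_patches: int, d_model: int):
        super().__init__()
        self.norm_time = nn.LayerNorm(d_model)
        self.time_mix = MixerMLP(num_patches)   # mixes along the patch (time) axis
        self.norm_chan = nn.LayerNorm(d_model)
        self.chan_mix = MixerMLP(num_channels)  # mixes along the variate (channel) axis

    def forward(self, x):
        # Time mixing: move num_patches to the last dim, mix, move back, add residual.
        y = self.norm_time(x).transpose(-1, -2)    # (B, C, D, P)
        x = x + self.time_mix(y).transpose(-1, -2)
        # Channel mixing: move channels to the last dim, mix, move back, add residual.
        y = self.norm_chan(x).permute(0, 3, 2, 1)  # (B, D, P, C)
        x = x + self.chan_mix(y).permute(0, 3, 2, 1)
        return x

# Example: 8 series, 7 variates (ETT-style), 64 patches, hidden size 16 (all illustrative).
block = TinyMixerBlock(num_channels=7, num_patches=64, d_model=16)
print(block(torch.randn(8, 7, 64, 16)).shape)  # torch.Size([8, 7, 64, 16])
```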
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement the Granite TTM-R1 model using TTNN APIs (Python)
- Implement the Tiny Time Mixer architecture:
- Adaptive patching layer (learns optimal patch size)
- Patch embedding with lightweight projection
- Lightweight Time-Mixing layers (MLP-Mixer style)
- Lightweight Channel-Mixing layers (cross-variate dependencies)
- Residual connections throughout
- Normalization layers (efficient LayerNorm or similar)
- Forecasting head for point predictions
- Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
- Supports zero-shot and few-shot forecasting:
- Zero-shot: Use pre-trained weights directly without fine-tuning
- Few-shot: Fine-tune with minimal data (< 5% of dataset)
- Context length: 512 (optimized for this setting)
- Forecast length: 96 (optimized for this setting)
- Loads pre-trained weights from HuggingFace:
ibm-granite/granite-timeseries-ttm-r1 (< 1M parameter model); see the weight-loading sketch after this list
- Produces valid predictions on standard benchmarks (ETT, Weather, Electricity)
- Output is verifiable (prediction accuracy compared against the PyTorch/HuggingFace reference)
- Achieves baseline performance targets:
- Inference throughput: At least 500 sequences/second (tiny model advantage)
- Latency: < 10ms for single sequence prediction (batch size 1)
- Memory footprint: < 10MB model size
- Zero-shot accuracy: Within 10% of published results
- Accuracy evaluation:
- MSE and MAE within 5% of PyTorch reference implementation
- Zero-shot performance on multiple datasets
- Few-shot performance with limited training data
- Clear instructions for setup, loading pre-trained weights, and inference
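The hedged sketch below shows one way to satisfy the weight-loading requirement above: pull the pre-trained state dict through the reference implementation, then convert each tensor to a TTNN tensor on device. The tsfm_public import is an assumption taken from the model card; ttnn.open_device, ttnn.from_torch, and ttnn.close_device are standard TTNN entry points, but per-tensor dtype and layout choices will need tuning for your kernels.

```python
import torch
import ttnn
# Assumption: the IBM tsfm repository exposes this class as shown on the model card;
# verify the exact import path there.
from tsfm_public import TinyTimeMixerForPrediction

device = ttnn.open_device(device_id=0)

# Load the pre-trained PyTorch weights once, then convert them to TTNN tensors.
reference = TinyTimeMixerForPrediction.from_pretrained("ibm-granite/granite-timeseries-ttm-r1")
state_dict = reference.state_dict()

tt_weights = {}
for name, tensor in state_dict.items():
    # bfloat16 + tile layout is a common starting point for matmul/linear weights;
    # small 1-D tensors (biases, norm params) may be better off in row-major layout.
    tt_weights[name] = ttnn.from_torch(
        tensor,
        dtype=ttnn.bfloat16,
        layout=ttnn.TILE_LAYOUT,
        device=device,
    )

print(f"Converted {len(tt_weights)} tensors to TTNN")
ttnn.close_device(device)
```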
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for:
- Tiny model (< 1M parameters, very efficient)
- Adaptive patching layers
- Lightweight mixing layers (time and channel)
- Embedding layers
- Forecasting head
- Implement efficient sharding strategy for:
- Lightweight MLP-Mixer blocks
- Time-mixing operations
- Channel-mixing operations
- Residual connections
- Fuse simple ops where possible:
- Patching + embedding
- Mixing layers (time and channel)
- Normalization + linear layers
- Residual connections
- Activation functions
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal MLP flows
- Leverage TT library of fused ops for:
- MLP blocks (lightweight version)
- Normalization layers
- Residual operations
- Optimize patch-specific operations:
- Adaptive patching strategy
- Patch embedding
- Efficient patch processing
- Optimize mixing operations:
- Lightweight time-mixing
- Lightweight channel-mixing
- Minimize transpose overhead
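To illustrate the kind of fusion and memory-config decisions this stage asks for, the fragment below keeps a small mixing-layer MLP resident in L1 and folds the GELU into the first linear. Shapes are placeholders; memory_config and activation are standard ttnn.linear options, but the right sharding and program configs depend on your actual tensor sizes and should be checked against the TTNN documentation.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Placeholder mixing-layer weights (d_model and expansion are illustrative).
d_model = 64
w1 = ttnn.from_torch(torch.randn(d_model, 2 * d_model), dtype=ttnn.bfloat16,
                     layout=ttnn.TILE_LAYOUT, device=device)
w2 = ttnn.from_torch(torch.randn(2 * d_model, d_model), dtype=ttnn.bfloat16,
                     layout=ttnn.TILE_LAYOUT, device=device)
x = ttnn.from_torch(torch.randn(32, d_model), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# First linear with the GELU fused into the matmul; activations stay in L1.
h = ttnn.linear(x, w1, activation="gelu", memory_config=ttnn.L1_MEMORY_CONFIG)
# Second linear plus the residual add, also resident in L1.
y = ttnn.linear(h, w2, memory_config=ttnn.L1_MEMORY_CONFIG)
y = ttnn.add(y, x, memory_config=ttnn.L1_MEMORY_CONFIG)

print(ttnn.to_torch(y).shape)
ttnn.close_device(device)
```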
Stage 3 — Deeper Optimization
- Maximize the number of cores used per inference
- Implement deeper TT-specific optimizations:
- Parallel processing of patches
- Parallel time-mixing and channel-mixing
- Efficient residual connections
- Optimized normalization
- Minimize memory movement (tiny model advantage)
- Minimize prediction latency for ultra-fast inference
- Batch processing for massive throughput
- Optimize for tiny model characteristics:
- Leverage < 1M parameters for extreme efficiency
- Minimize weight loading overhead
- Optimize for frequent model swaps (multi-tenant scenarios)
- Cache-friendly inference patterns
- Optimize adaptive patching:
- Efficient patch size computation
- Dynamic patching strategies
- Minimize overhead
- Pipeline mixing operations:
- Overlap time-mixing and channel-mixing
- Efficient sequential processing
- Minimize memory and TM (tensor manipulation) overheads
- Support for streaming inference (online forecasting)
- Explore multi-model deployment (serve 1000s of TTM instances)
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals:
- 2000+ sequences/second throughput (tiny model enables this!)
- < 5ms latency for single sequence prediction
- < 5MB memory footprint for model
- Support for edge deployment scenarios
- Multi-model serving (100+ instances simultaneously)
- Zero-shot performance within 5% of reference
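As a starting point for reporting against the throughput and latency targets listed above, a host-side timing loop like the one below usually gives the first numbers; TTNN's profiler and the performance report header linked in Resources should back them up. ttm_forward is a hypothetical name for your end-to-end TTNN inference call.

```python
import time
import torch

def benchmark(run_fn, batch, warmup: int = 10, iters: int = 100):
    """Host-side latency/throughput measurement around an inference callable."""
    for _ in range(warmup):          # warm up program compilation, caches, etc.
        run_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        run_fn(batch)
    elapsed = time.perf_counter() - start
    seqs = batch.shape[0] * iters
    print(f"latency/iter: {1e3 * elapsed / iters:.2f} ms, "
          f"throughput: {seqs / elapsed:.0f} sequences/s")

# Example usage with a hypothetical end-to-end TTNN forward pass:
#   benchmark(lambda b: ttm_forward(device, tt_weights, b), torch.randn(64, 512, 7))
```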
🧭 Guidance & Starting Points
Primary Resources
- Use the TTNN model bring-up tech report as your primary reference
- Reference MLP-Mixer implementations in tt-metal for mixing patterns
- Use the HuggingFace Granite Timeseries TTM-R1 (model card) as the reference implementation
- Use the IBM TSFM (Time Series Foundation Models) repository for architecture details
- Refer to the TT fused ops PR #29236 (Fuse YoloV4 leaky ReLU activations with convolution layers) for optimization opportunities
HuggingFace Implementation Reference
The IBM Granite Timeseries TTM-R1 is available on HuggingFace:
Model Details:
- Parameters: < 1 million (ultra-lightweight)
- Pre-training: 250 million time-series samples
- Architecture: Tiny Time Mixer (lightweight MLP-Mixer variant)
- Optimized for: Context 512, Forecast 96
- Resolution: Minutely to hourly (10 min, 15 min, 1 hour)
- Type: Point forecasting (not probabilistic)
- License: Apache 2.0
Key Features:
- Zero-Shot Forecasting: Works out-of-the-box without fine-tuning
- Few-Shot Learning: Fine-tune with < 5% of data
- Adaptive Patching: Learns optimal patch size for input
- Lightweight Mixing: Efficient time and channel mixing
- Fast Inference: Minimal parameters enable rapid predictions
- Small Memory: < 5MB model size
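For a reference baseline to validate the TTNN port against, zero-shot inference with the HuggingFace checkpoint looks roughly like the sketch below. The TinyTimeMixerForPrediction import path and the prediction_outputs field are assumptions taken from the model card and the IBM TSFM repository; confirm them there before relying on this.

```python
import torch
# Assumption: installed from the IBM tsfm repository; check the model card for the exact package.
from tsfm_public import TinyTimeMixerForPrediction

model = TinyTimeMixerForPrediction.from_pretrained("ibm-granite/granite-timeseries-ttm-r1")
model.eval()

# Dummy multivariate input: batch of 2 series, context length 512, 7 channels (ETT-style).
past_values = torch.randn(2, 512, 7)

with torch.no_grad():
    outputs = model(past_values=past_values)

# Assumed output field and shape: (batch, forecast_length, channels) = (2, 96, 7).
print(outputs.prediction_outputs.shape)
```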
🔎 Possible Approaches
Sequential Implementation Strategy
- Start from the HuggingFace/IBM TSFM implementation and port components sequentially:
- Begin with adaptive patching layer
- Implement patch embedding
- Implement single Tiny Time Mixer layer (time + channel mixing)
- Replicate for all layers
- Add forecasting head
- Test zero-shot inference
- Optionally add few-shot fine-tuning
- Leverage lightweight patterns:
- Use efficient MLP implementations
- Optimize for small hidden dimensions
- Minimize overhead (model is so small, overhead matters!)
- Cache-friendly access patterns
- Progressive testing:
- Start with synthetic data
- Test zero-shot on standard benchmarks
- Test few-shot with limited data
- Validate against PyTorch reference
- Measure inference speed (should be very fast!)
- Validate each component against the PyTorch reference (see the comparison sketch after this list):
- Test adaptive patching outputs
- Validate patch embedding
- Check time-mixing layer
- Check channel-mixing layer
- Validate full model output
- Compare zero-shot performance
- Test on standard benchmarks:
- ETT datasets (optimized for hourly)
- Weather dataset
- Electricity dataset
- Test zero-shot (no fine-tuning)
- Test few-shot (< 5% data)
- Compare with published results
- Optimize for the tiny model:
- Minimize per-inference overhead
- Optimize weight loading (should be negligible)
- Maximize throughput (tiny model enables high throughput)
- Test multi-model serving scenarios
- Use TTNN profiling tools to identify bottlenecks:
- Measure overhead vs. compute (overhead should be minimal)
- Profile mixing layers
- Identify any inefficiencies
- Optimize for ultra-low latency
- Open a draft PR early to get feedback on your approach
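For the per-component validation step above, a small comparison helper keeps the TTNN-versus-PyTorch check uniform across layers. The PCC threshold here is an illustrative choice; the binding accuracy targets are the Stage 1 MSE/MAE bounds.

```python
import torch

def compare_with_reference(tt_out: torch.Tensor, ref_out: torch.Tensor, pcc_threshold: float = 0.99):
    """Compare a TTNN layer output (converted back to torch) with the PyTorch reference."""
    a, b = tt_out.flatten().float(), ref_out.flatten().float()
    pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()   # Pearson correlation
    max_abs = (a - b).abs().max().item()
    print(f"PCC={pcc:.5f}  max|diff|={max_abs:.5f}")
    assert pcc >= pcc_threshold, "output diverges from the PyTorch reference"

# Example (module names are placeholders for your own code):
#   ref = reference_block(x)                      # PyTorch reference
#   tt  = ttnn.to_torch(ttnn_block(tt_x))         # TTNN implementation
#   compare_with_reference(tt, ref)
```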
Alternative Approaches
- Modular testing:
- Implement Tiny Time Mixer layer as standalone
- Test and optimize
- Scale to full model
- Start simple:
- Test with fewer layers initially
- Gradually scale to full model
- Leverage existing code:
- Use PatchTSMixer as a starting point (similar architecture; see the sketch after this list)
- Adapt for Tiny Time Mixer specifics
- Add adaptive patching
- Progressive features:
- Start with zero-shot inference
- Add few-shot fine-tuning (optional)
- Add multi-model serving (stretch goal)
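As referenced from the "leverage existing code" item above, the PatchTSMixer model that ships with transformers is a convenient structural reference, since TTM is a lighter variant of the same mixer idea. The snippet only instantiates a default-configured model and prints its parameter shapes for study; the defaults are PatchTSMixer's, not TTM's.

```python
# PatchTSMixer is available in recent versions of transformers (4.36+).
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

config = PatchTSMixerConfig()              # default config, used only to study the layer structure
model = PatchTSMixerForPrediction(config)

total = 0
for name, param in model.named_parameters():
    total += param.numel()
    print(f"{name:70s} {tuple(param.shape)}")
print(f"total parameters: {total:,}")
```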
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all 3 stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report (using the performance report header linked in Resources) for final review
Links:
📚 Resources
Model Resources
- HuggingFace Model Card: https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1
- IBM TSFM Repository: https://github.com/IBM/tsfm (Time Series Foundation Models)
- Research Paper: "Tiny Time Mixers (TTM): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting"
- Authors: IBM Research
- arXiv: https://arxiv.org/abs/2401.03955
- Blog Post: IBM Research announcement
- Demo Notebooks: https://github.com/IBM/tsfm/tree/main/notebooks/hfdemo/tinytimemixer
Related Models
- Granite TTM Family: IBM's Tiny Time Mixer models
- TTM-R1: < 1M params, 512→96
- Other variants for different settings
- PatchTSMixer: IBM's larger model with a similar architecture
- TimesFM: Google's large foundation model (500M params)
Datasets & Benchmarks
Primary Source (Recommended):
- Time Series Library (TSLib): https://github.com/thuml/Time-Series-Library
- Contains all preprocessed datasets in consistent format
- Used by most recent papers for benchmarking
- Includes train/val/test splits
Individual Datasets:
- ETT (Electricity Transformer Temperature):
- ETTh1, ETTh2 (hourly, 7 features)
- ETTm1, ETTm2 (15-minute, 7 features)
- GitHub: https://github.com/zhouhaoyi/ETDataset
- Also in TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/ETT
- Weather Dataset:
- 21 meteorological indicators from 21 weather stations
- 10-minute intervals, 2020 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/weather
- Traffic Dataset:
- Road occupancy rates from 862 Bay Area sensors
- Hourly data, 2015-2016
- Source: PeMS (http://pems.dot.ca.gov/)
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/traffic
- Electricity (ECL) Dataset:
- Hourly consumption from 321 clients
- 2012-2014 data
- UCI: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/electricity
- Exchange Rate Dataset:
- Daily exchange rates for 8 countries
- 1990-2016 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/exchange_rate
Benchmark Scripts:
- TSLib provides standard evaluation scripts
- Consistent train/val/test splits across all datasets
- MSE, MAE metrics computed uniformly
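For the benchmark evaluation itself, windowing and scoring can be as simple as the sketch below: slide a context of 512 over the series and score the 96-step forecast with MSE and MAE. The CSV handling assumes the TSLib ETT layout (a date column followed by feature columns) and skips the per-channel standardization most papers apply; treat both as assumptions to adjust for your setup.

```python
import numpy as np
import pandas as pd

CONTEXT, HORIZON = 512, 96

def make_windows(values: np.ndarray, stride: int = HORIZON):
    """Yield (context, target) pairs from a (time, channels) array."""
    for start in range(0, len(values) - CONTEXT - HORIZON + 1, stride):
        yield (values[start:start + CONTEXT],
               values[start + CONTEXT:start + CONTEXT + HORIZON])

def mse_mae(pred: np.ndarray, target: np.ndarray):
    err = pred - target
    return float((err ** 2).mean()), float(np.abs(err).mean())

# ETTh1.csv from TSLib: a timestamp column named "date" plus 7 feature columns.
df = pd.read_csv("ETTh1.csv")
values = df.drop(columns=["date"]).to_numpy(dtype=np.float32)

# Example scoring loop with a placeholder `forecast` callable (your TTNN model wrapper):
#   scores = [mse_mae(forecast(ctx), tgt) for ctx, tgt in make_windows(values)]
#   print(np.mean(scores, axis=0))  # [MSE, MAE]
```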
Tiny Model Resources
- Advantages of Tiny Models:
- Edge deployment
- Multi-model serving
- Low latency
- Minimal resources
- Cost-effective
- Optimization Techniques:
- Knowledge distillation
- Efficient architectures
- Pre-training strategies
- Few-shot learning
Academic Resources
- Original Paper: IBM Research, "Tiny Time Mixers", 2024
- Key Insights:
- < 1M params achieves SOTA zero-shot
- Adaptive patching improves efficiency
- Few-shot with < 5% data
- Lightweight mixing is sufficient
- Related Work:
- MLP-Mixer (vision)
- PatchTSMixer (larger time series model)
- Efficient transformers
TT-Metal Resources
- TTNN Model Bring-up Tech Report: [Link to tech report]
- MLP-Mixer Implementations: Check for MLP-based models
- Lightweight Model Patterns: Optimization for small models
- TT Fused Ops PR #29236: Fuse YoloV4 leaky ReLU activations with convolution layers
- Performance Report Header: https://github.com/tenstorrent/tt-metal/blob/main/tests/docs/perf_header.md
- TTNN Documentation: https://github.com/tenstorrent/tt-metal/tree/main/ttnn
Helpful Tools
- Weight Loading: HuggingFace from_pretrained integration
- Visualization:
- Zero-shot vs. supervised comparison
- Few-shot learning curves
- Model size comparison
- Profiling: TTNN profiler for tiny model
- Testing: pytest framework
- Benchmarking: IBM's benchmark notebooks