📝 Background
This bounty is for bringing up the PatchTSMixer time-series forecasting model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
PatchTSMixer is a state-of-the-art, lightweight time-series forecasting model based on the MLP-Mixer architecture from computer vision. Developed by IBM Research and presented at ICLR 2024, it achieves superior performance with significantly lower computational costs compared to transformer-based models.
Key Capabilities:
- Patch-Based Architecture: Divides time series into patches and processes them efficiently
- Channel-Mixing and Time-Mixing: Dual mixing strategy for multivariate time series
- Channel-Mixing: Models dependencies across different variables
- Time-Mixing: Captures temporal patterns within each variable
- Hybrid Channel Modeling: Combines channel-independent and channel-mixing approaches
- Gated Attention Mechanism: Optional attention for enhanced feature selection
- Online Reconciliation Head: Ensures hierarchical forecast consistency
- Lightweight Design: MLP-based architecture (no self-attention overhead)
- Transfer Learning Support: Pre-trained models available for fine-tuning
- Multi-Task Support: Forecasting, classification, pre-training, and regression
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement the PatchTSMixer model using TTNN APIs (Python)
- Implement the full forecasting pipeline:
- Input patching layer (divides time series into patches)
- Patch normalization (instance normalization or batch normalization)
- Time-Mixing MLP layers (processes temporal patterns)
- Channel-Mixing MLP layers (processes cross-variate patterns)
- Optional gated attention mechanism
- Head module for forecasting/classification/regression
- Optional online reconciliation head
- Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
- Supports multiple task modes:
- Time-series forecasting: Multi-horizon prediction
- Classification: Time-series classification tasks
- Pre-training: Self-supervised pre-training for transfer learning
- Regression: Direct regression tasks
- Supports multiple channel modeling modes:
- Channel-independent: Each variable processed separately
- Channel-mixing: Cross-variate dependencies modeled
- Hybrid: Combination of both approaches
- Produces valid predictions on standard benchmarks (ETT datasets or Weather dataset)
- Output is verifiable: prediction accuracy compared against the PyTorch/HuggingFace reference (see the validation sketch after this list)
- Achieves baseline performance targets:
- Inference throughput: At least 200 sequences/second for 512-step input
- Latency: < 30ms for single sequence prediction (batch size 1)
- Accuracy evaluation:
- MSE and MAE within 5% of PyTorch reference implementation
- Prediction correlation coefficient > 0.90 against reference
- Clear instructions for setup and running the model
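The accuracy targets above can be checked with a small comparison script. A minimal sketch, assuming the TTNN output has already been converted back to a torch tensor; `tt_predictions`, `ref_predictions`, and `targets` are placeholder names:

```python
import torch

def compare_to_reference(tt_predictions, ref_predictions, targets):
    """Check the Stage 1 accuracy targets: MSE/MAE within 5% of the PyTorch
    reference and prediction correlation > 0.90.
    All tensors are assumed to be [batch, prediction_length, num_channels]."""
    mse_tt = torch.mean((tt_predictions - targets) ** 2).item()
    mse_ref = torch.mean((ref_predictions - targets) ** 2).item()
    mae_tt = torch.mean(torch.abs(tt_predictions - targets)).item()
    mae_ref = torch.mean(torch.abs(ref_predictions - targets)).item()

    # Pearson correlation between flattened TT and reference predictions.
    stacked = torch.stack([tt_predictions.flatten(), ref_predictions.flatten()])
    corr = torch.corrcoef(stacked)[0, 1].item()

    assert abs(mse_tt - mse_ref) / mse_ref < 0.05, f"MSE deviates >5%: {mse_tt:.4f} vs {mse_ref:.4f}"
    assert abs(mae_tt - mae_ref) / mae_ref < 0.05, f"MAE deviates >5%: {mae_tt:.4f} vs {mae_ref:.4f}"
    assert corr > 0.90, f"prediction correlation {corr:.3f} <= 0.90"
    return {"mse": mse_tt, "mae": mae_tt, "corr": corr}
```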
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for:
- Patch embedding layers
- Time-Mixing MLP layers
- Channel-Mixing MLP layers
- Gated attention computation
- Head projection layers
- Implement efficient sharding strategy for:
- Patch-based processing (parallel patch computation)
- Channel-independent operations
- Cross-channel mixing operations
- Multi-head outputs (for forecasting multiple horizons)
- Fuse simple ops where possible (see the fused-MLP sketch after this list):
- Patching + normalization
- MLP layers (Linear + Activation + Dropout)
- Gated attention computation
- Residual connections
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal MLP flows
- Leverage TT library of fused ops for:
- MLP blocks (Linear layers + activations)
- Normalization layers (instance norm, batch norm, layer norm)
- Gating mechanisms
- Optimize patch-specific operations:
- Efficient patch extraction from time series
- Patch reordering and transpose operations
- Patch normalization strategies
- Efficient channel mixing implementation:
- Transpose operations for channel dimension
- Channel-wise MLP computation
- Hybrid channel modeling logic
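As one illustration of the fusion and L1-residency points in this list, a hedged sketch of a mixer MLP sub-block built from `ttnn.linear`. The weight/bias names are hypothetical, and whether `ttnn.linear` accepts a fused `activation=` argument depends on your tt-metal version, so treat this as a starting point rather than a drop-in implementation:

```python
import ttnn

def mlp_block(x, w1, b1, w2, b2):
    """Mixer MLP sub-block: Linear -> GELU -> Linear, keeping the intermediate in L1.
    x, w1, b1, w2, b2 are ttnn tensors already on device in TILE_LAYOUT."""
    # First projection; if supported by the installed tt-metal, the GELU is fused
    # into the matmul via the activation argument (otherwise call ttnn.gelu separately).
    hidden = ttnn.linear(x, w1, bias=b1, activation="gelu", memory_config=ttnn.L1_MEMORY_CONFIG)
    # Second projection back to the original width.
    out = ttnn.linear(hidden, w2, bias=b2, memory_config=ttnn.L1_MEMORY_CONFIG)
    ttnn.deallocate(hidden)  # free L1 as soon as the intermediate is no longer needed
    return out
```

For sharded configurations, the same calls take a sharded `memory_config` (for example one produced by `ttnn.create_sharded_memory_config`); which sharding scheme pays off per layer is something to determine with the profiler.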
Stage 3 — Deeper Optimization
- Maximize core counts used per inference
- Implement deeper TT-specific optimizations:
- Parallel processing of patches across cores
- Efficient MLP layer fusion (multi-layer MLPs as single kernel)
- Optimized transpose operations for channel mixing
- Efficient gated attention implementation
- Pipeline time-mixing and channel-mixing stages
- Minimize prediction latency for real-time forecasting
- Batch processing for multiple time series
- Optimize patch processing:
- Parallel patch extraction and normalization
- Minimize transpose overhead for patch dimensions
- Efficient stride operations for overlapping patches
- Optimize channel operations:
- Efficient channel-independent parallel processing
- Optimized channel-mixing transpose and computation
- Minimize memory movement for hybrid channel modeling
- Pipeline different model stages:
- Overlap patch extraction with computation
- Pipeline time-mixing and channel-mixing operations
- Efficient head computation
- Minimize memory and TM (tensor manipulation) overheads
- Support for streaming inference (online forecasting)
- Explore techniques for very long context (2048+ patches)
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals (a simple measurement sketch follows this list):
- 1000+ sequences/second throughput for batch inference
- < 10ms latency for single sequence prediction
- Support for 2048+ patch inputs (very long context)
- Efficient handling of high-dimensional multivariate data (100+ channels)
- Multi-task parallel inference (forecasting + classification simultaneously)
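The throughput and latency numbers in this bounty (both the Stage 1 baselines and these stretch goals) can be measured with a simple host-side harness. A minimal sketch; `run_model` is a placeholder for your TTNN forward pass and does not account for device-side tracing or async execution:

```python
import time

def measure(run_model, batch, batch_size, warmup=5, iters=100):
    """Rough wall-clock latency/throughput measurement around an inference callable."""
    for _ in range(warmup):
        run_model(batch)  # warm-up iterations (kernel compilation, program cache)
    start = time.perf_counter()
    for _ in range(iters):
        run_model(batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = iters * batch_size / elapsed
    print(f"latency: {latency_ms:.2f} ms/iter, throughput: {throughput:.1f} sequences/s")
    return latency_ms, throughput
```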
🧭 Guidance & Starting Points
Primary Resources
- Use the TTNN model bring-up tech report as your primary reference
- Reference Transformer implementations in tt-metal for transformer patterns
- Use the HuggingFace Transformers PatchTSMixer (documentation) as the reference implementation
- Use the IBM PatchTSMixer implementation on GitHub for architecture details
- Refer to TT Fused Ops PR #29236 (Fuse YoloV4 leaky ReLU activations with convolution layers) for optimization opportunities
HuggingFace Implementation Reference
The HuggingFace implementation provides multiple model classes:
- PatchTSMixerModel: The bare PatchTSMixer encoder outputting raw hidden states
- PatchTSMixerForPrediction: PatchTSMixer for time-series forecasting with distribution head
- PatchTSMixerForTimeSeriesClassification: For classification tasks
- PatchTSMixerForPretraining: For masked pre-training
- PatchTSMixerForRegression: For regression tasks
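A quick way to get a working reference is to instantiate these classes directly. A minimal sketch; parameter and attribute names follow the HuggingFace documentation linked in Resources, so verify them against your installed transformers version:

```python
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

# Option 1: load a pretrained checkpoint listed in the Resources section.
model = PatchTSMixerForPrediction.from_pretrained("ibm/patchtsmixer-etth1-forecasting")

# Option 2: build a small model from scratch for shape and parity experiments.
config = PatchTSMixerConfig(
    context_length=512,       # input window length
    prediction_length=96,     # forecast horizon
    num_input_channels=7,     # e.g. ETTh1 has 7 variables
    patch_length=16,
    patch_stride=8,           # overlapping patches
    mode="common_channel",    # or "mix_channel" for channel-mixing
    gated_attn=True,
)
small_model = PatchTSMixerForPrediction(config)

past_values = torch.randn(4, config.context_length, config.num_input_channels)
with torch.no_grad():
    out = small_model(past_values=past_values)
print(out.prediction_outputs.shape)  # expected: [4, prediction_length, num_input_channels]
```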
Key features to implement:
- Patching Strategy: Divides the input sequence into patches of fixed length (see the patching sketch after this list)
- patch_length: Length of each patch (e.g., 16)
- stride: Stride for patch extraction (e.g., 8 for overlapping patches)
- Normalization: Instance normalization or batch normalization applied to patches
- MLP-Mixer Architecture:
- Time-Mixing: MLP operates on time dimension (across patches)
- Channel-Mixing: MLP operates on channel dimension (across variables)
- Gated Attention: Optional attention mechanism for feature selection
- Channel Modeling Modes:
- channel_consistent_masking: For pre-training
- unmasked_channel_indices: Specify channels to keep unmasked
- Mode selection: "common_channel", "mix_channel"
- Configurable Inputs:
- past_values: Historical time series values [batch, seq_len, num_channels]
- future_values: Target values (for training)
- past_observed_mask: Mask for missing values
- output_hidden_states: Return all layer outputs
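To make the patching step concrete, here is a shapes-only sketch of patch extraction using `torch.Tensor.unfold`. The HuggingFace module additionally handles padding and masking, so this is illustrative rather than an exact re-implementation:

```python
import torch

def extract_patches(past_values, patch_length=16, stride=8):
    """past_values: [batch, seq_len, num_channels]
    returns:        [batch, num_channels, num_patches, patch_length]"""
    x = past_values.transpose(1, 2)                          # [batch, num_channels, seq_len]
    patches = x.unfold(dimension=-1, size=patch_length, step=stride)
    return patches  # num_patches = (seq_len - patch_length) // stride + 1

x = torch.randn(2, 512, 7)          # e.g. a 512-step context with 7 channels (ETTh1)
print(extract_patches(x).shape)     # torch.Size([2, 7, 63, 16])
```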
🔎 Possible Approaches
Sequential Implementation Strategy
- Start from HuggingFace implementation and port components sequentially:
- Begin with patching layer (critical component)
- Implement time-mixing MLP block
- Add channel-mixing MLP block
- Implement gated attention (optional)
- Add forecasting head
- Test end-to-end pipeline
- Validate each component against the PyTorch reference before integration (a per-component check is sketched after this list):
- Test patching operation output shapes and values
- Validate time-mixing MLP on small examples
- Validate channel-mixing with known inputs
- Check normalization layer outputs
- Compare full model outputs
- Validate end-to-end predictions
- Start with channel-independent mode first:
- Simpler architecture (no channel-mixing)
- Easier to parallelize
- Validate basic functionality
- Then add channel-mixing capability
- Test on standard benchmarks:
- Start with ETTh1 dataset (7 channels, hourly data)
- Test different prediction horizons (96, 192, 336, 720)
- Validate on Weather dataset (21 channels)
- Compare MSE/MAE metrics with published results
- Experiment with optimizations:
- Different sharding strategies for patches
- Fused MLP layers (multi-layer fusion)
- Efficient transpose operations
- Parallel channel processing (for channel-independent mode)
- Pipeline time-mixing and channel-mixing
- Use TTNN profiling tools to identify bottlenecks:
- Measure patching operation time
- Profile MLP computation
- Identify transpose overhead
- Optimize memory movement
- Profile different batch sizes
- Open a draft PR early to get feedback on your approach
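For the per-component validation step referenced above, one convenient check (commonly used in tt-metal model tests) is a shape assertion plus a Pearson correlation coefficient (PCC) against the PyTorch reference. The 0.99 threshold here is a suggested default, not a bounty requirement:

```python
import torch

def assert_with_pcc(tt_out, ref_out, threshold=0.99):
    """Per-component parity check: shapes must match and the Pearson correlation
    between flattened outputs must exceed the threshold."""
    assert tt_out.shape == ref_out.shape, f"shape mismatch: {tt_out.shape} vs {ref_out.shape}"
    stacked = torch.stack([tt_out.flatten().float(), ref_out.flatten().float()])
    pcc = torch.corrcoef(stacked)[0, 1].item()
    assert pcc >= threshold, f"PCC {pcc:.5f} below {threshold}"
    return pcc
```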
Alternative Approaches
- Modular testing: Implement and optimize the MLP-Mixer block as a standalone module first (a simplified reference block is sketched below)
- Progressive complexity: Start with univariate (channel-independent), then add channel-mixing
- Ablation studies: Compare channel-independent vs. channel-mixing performance on TT hardware
- Multi-scale ensemble: Implement multiple patch lengths and ensemble predictions
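For the modular-testing approach, a simplified PyTorch reference of the core mixer block (time-mixing across patches, then channel-mixing across variables) is useful as a standalone target before porting to TTNN. This is a sketch of the idea, not the exact HuggingFace module (which also includes feature mixing, dropout, and optional gated attention):

```python
import torch
import torch.nn as nn

class SimpleMixerBlock(nn.Module):
    """Simplified mixer block. Input/output shape: [batch, num_channels, num_patches, d_model]."""
    def __init__(self, num_patches, num_channels, d_model, expansion=2):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(num_patches, expansion * num_patches), nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches),
        )
        self.channel_mlp = nn.Sequential(
            nn.Linear(num_channels, expansion * num_channels), nn.GELU(),
            nn.Linear(expansion * num_channels, num_channels),
        )
        self.norm_time = nn.LayerNorm(d_model)
        self.norm_channel = nn.LayerNorm(d_model)

    def forward(self, x):
        # Time-mixing: put the patch axis last so the MLP mixes across patches.
        y = self.norm_time(x).transpose(2, 3)            # [b, c, d_model, num_patches]
        x = x + self.time_mlp(y).transpose(2, 3)         # residual connection
        # Channel-mixing: put the channel axis last so the MLP mixes across variables.
        y = self.norm_channel(x).permute(0, 3, 2, 1)     # [b, d_model, num_patches, c]
        x = x + self.channel_mlp(y).permute(0, 3, 2, 1)
        return x
```

Validating a TTNN port of this block against the PyTorch version (for example with the PCC check above) gives an early signal on transpose handling, which is where most of the channel-mixing cost lives.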
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report + header for final review
Links:
📚 Resources
Model Resources
- HuggingFace PatchTSMixer Documentation: https://huggingface.co/docs/transformers/en/model_doc/patchtsmixer
- HuggingFace Blog Post: PatchTSMixer for Time-Series Forecasting
- Pretrained Models on HuggingFace:
- ibm/patchtsmixer-etth1-forecasting - ETTh1 forecasting model
- ibm/patchtsmixer-etth2-forecasting - ETTh2 forecasting model
- ibm-granite/granite-timeseries-patchtsmixer - General pre-trained model
- More models: https://huggingface.co/models?search=patchtsmixer
- Original PatchTSMixer Paper (ICLR 2024): "TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting"
- arXiv: https://arxiv.org/abs/2306.09364
- GitHub: https://github.com/IBM/tsfm (Time Series Foundation Models)
- IBM Research Blog: PatchTSMixer announcement
Datasets & Benchmarks
Primary Source (Recommended):
- Time Series Library (TSLib): https://github.com/thuml/Time-Series-Library
- Contains all preprocessed datasets in consistent format
- Used by most recent papers for benchmarking
- Includes train/val/test splits
Individual Datasets:
- ETT (Electricity Transformer Temperature):
- ETTh1, ETTh2 (hourly, 7 features)
- ETTm1, ETTm2 (15-minute, 7 features)
- GitHub: https://github.com/zhouhaoyi/ETDataset
- Also in TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/ETT
- Weather Dataset:
- 21 meteorological indicators from 21 weather stations
- 10-minute intervals, 2020 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/weather
- Traffic Dataset:
- Road occupancy rates from 862 Bay Area sensors
- Hourly data, 2015-2016
- Source: PeMS (http://pems.dot.ca.gov/)
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/traffic
- Electricity (ECL) Dataset:
- Hourly consumption from 321 clients
- 2012-2014 data
- UCI: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/electricity
- Exchange Rate Dataset:
- Daily exchange rates for 8 countries
- 1990-2016 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/exchange_rate
Benchmark Scripts:
- TSLib provides standard evaluation scripts
- Consistent train/val/test splits across all datasets
- MSE, MAE metrics computed uniformly
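For quick sanity checks before wiring up the TSLib loaders, here is a minimal windowing sketch over an ETT-style CSV (first column is the timestamp). Note that this naive split is not the canonical TSLib train/val/test split, which should be used for reported metrics:

```python
import numpy as np
import pandas as pd

def make_windows(csv_path, context_length=512, prediction_length=96):
    """Build non-overlapping context/target windows from an ETT-style CSV."""
    df = pd.read_csv(csv_path)
    values = df.iloc[:, 1:].to_numpy(dtype=np.float32)   # drop timestamp -> [T, num_channels]
    past, future = [], []
    for start in range(0, len(values) - context_length - prediction_length + 1, prediction_length):
        past.append(values[start : start + context_length])
        future.append(values[start + context_length : start + context_length + prediction_length])
    return np.stack(past), np.stack(future)  # [N, context_length, C], [N, prediction_length, C]

past, future = make_windows("dataset/ETT/ETTh1.csv")     # path is illustrative
print(past.shape, future.shape)
```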
Academic Resources
- Original Paper: Chen et al., "TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting", ICLR 2024
- Related MLP-Mixer Work:
- MLP-Mixer (NeurIPS'21) - Original vision model
- DLinear (AAAI'23) - Simple linear baseline
- TimesNet (ICLR'23) - 2D vision perspective
- Comparison Models:
- PatchTST (comparable patch-based transformer)
- Informer, Autoformer, FEDformer
TT-Metal Resources
- TTNN Model Bring-up Tech Report: [Link to tech report]
- MLP Implementations in tt-metal:
- models/demos/ - Check for MLP-based models
- Look for MLP-Mixer or similar architectures
- TT Fused Ops PR #29236: Fuse YoloV4 leaky ReLU activations with convolution layers
- Performance Report Header: https://github.com/tenstorrent/tt-metal/blob/main/tests/docs/perf_header.md
- TTNN Documentation: https://github.com/tenstorrent/tt-metal/tree/main/ttnn
- MLP Optimization Examples: Check existing MLP implementations
Helpful Tools
- Visualization: Use TensorBoard or Weights & Biases for:
- Prediction visualization
- Loss curves
- Patch attention visualization
- Profiling: TTNN profiler for performance analysis
- Testing: pytest framework for model testing
- Dataset Loading: Use HuggingFace datasets library