📝 Background
This bounty is for bringing up the PatchTSMixer time-series forecasting model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).
PatchTSMixer is a state-of-the-art, lightweight time-series forecasting model based on the MLP-Mixer architecture from computer vision. Developed by IBM Research and presented at ICLR 2024, it achieves superior performance with significantly lower computational costs compared to transformer-based models.
Key Capabilities:
- Patch-Based Architecture: Divides time series into patches and processes them efficiently
- Channel-Mixing and Time-Mixing: Dual mixing strategy for multivariate time series
- Channel-Mixing: Models dependencies across different variables
- Time-Mixing: Captures temporal patterns within each variable
- Hybrid Channel Modeling: Combines channel-independent and channel-mixing approaches
- Gated Attention Mechanism: Optional attention for enhanced feature selection
- Online Reconciliation Head: Ensures hierarchical forecast consistency
- Lightweight Design: MLP-based architecture (no self-attention overhead)
- Transfer Learning Support: Pre-trained models available for fine-tuning
- Multi-Task Support: Forecasting, classification, pre-training, and regression
🎯 What Success Looks Like
A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.
Stage 1 — Bring-Up
- Implement the PatchTSMixer model using TTNN APIs (Python)
- Implement the full forecasting pipeline:
- Input patching layer (divides time series into patches)
- Patch normalization (instance normalization or batch normalization)
- Time-Mixing MLP layers (processes temporal patterns)
- Channel-Mixing MLP layers (processes cross-variate patterns)
- Optional gated attention mechanism
- Head module for forecasting/classification/regression
- Optional online reconciliation head
- Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
- Supports multiple task modes:
- Time-series forecasting: Multi-horizon prediction
- Classification: Time-series classification tasks
- Pre-training: Self-supervised pre-training for transfer learning
- Regression: Direct regression tasks
- Supports multiple channel modeling modes:
- Channel-independent: Each variable processed separately
- Channel-mixing: Cross-variate dependencies modeled
- Hybrid: Combination of both approaches
- Produces valid predictions on standard benchmarks (ETT datasets or Weather dataset)
- Output is verifiable: prediction accuracy compared against the PyTorch/HuggingFace reference (see the validation sketch after this list)
- Achieves baseline performance targets:
- Inference throughput: At least 200 sequences/second for 512-step input
- Latency: < 30ms for single sequence prediction (batch size 1)
- Accuracy evaluation:
- MSE and MAE within 5% of PyTorch reference implementation
- Prediction correlation coefficient > 0.90 against reference
- Clear instructions for setup and running the model
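The accuracy targets above can be checked with a small comparison script. A minimal sketch, assuming the TTNN output has already been converted back to a torch tensor; `tt_predictions`, `ref_predictions`, and `targets` are placeholder names:

```python
import torch

def compare_to_reference(tt_predictions, ref_predictions, targets):
    """Check the Stage 1 accuracy targets: MSE/MAE within 5% of the PyTorch
    reference and prediction correlation > 0.90.
    All tensors are assumed to be [batch, prediction_length, num_channels]."""
    mse_tt = torch.mean((tt_predictions - targets) ** 2).item()
    mse_ref = torch.mean((ref_predictions - targets) ** 2).item()
    mae_tt = torch.mean(torch.abs(tt_predictions - targets)).item()
    mae_ref = torch.mean(torch.abs(ref_predictions - targets)).item()

    # Pearson correlation between flattened TT and reference predictions.
    stacked = torch.stack([tt_predictions.flatten(), ref_predictions.flatten()])
    corr = torch.corrcoef(stacked)[0, 1].item()

    assert abs(mse_tt - mse_ref) / mse_ref < 0.05, f"MSE deviates >5%: {mse_tt:.4f} vs {mse_ref:.4f}"
    assert abs(mae_tt - mae_ref) / mae_ref < 0.05, f"MAE deviates >5%: {mae_tt:.4f} vs {mae_ref:.4f}"
    assert corr > 0.90, f"prediction correlation {corr:.3f} <= 0.90"
    return {"mse": mse_tt, "mae": mae_tt, "corr": corr}
```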
Stage 2 — Basic Optimizations
- Use optimal sharded/interleaved memory configs for:
- Patch embedding layers
- Time-Mixing MLP layers
- Channel-Mixing MLP layers
- Gated attention computation
- Head projection layers
- Implement efficient sharding strategy for:
- Patch-based processing (parallel patch computation)
- Channel-independent operations
- Cross-channel mixing operations
- Multi-head outputs (for forecasting multiple horizons)
- Fuse simple ops where possible (see the fused-MLP sketch after this list):
- Patching + normalization
- MLP layers (Linear + Activation + Dropout)
- Gated attention computation
- Residual connections
- Store intermediate activations in L1 where beneficial
- Use recommended TTNN/tt-metal MLP flows
- Leverage TT library of fused ops for:
- MLP blocks (Linear layers + activations)
- Normalization layers (instance norm, batch norm, layer norm)
- Gating mechanisms
- Optimize patch-specific operations:
- Efficient patch extraction from time series
- Patch reordering and transpose operations
- Patch normalization strategies
- Efficient channel mixing implementation:
- Transpose operations for channel dimension
- Channel-wise MLP computation
- Hybrid channel modeling logic
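As one illustration of the fusion and L1-residency points in this list, a hedged sketch of a mixer MLP sub-block built from `ttnn.linear`. The weight/bias names are hypothetical, and whether `ttnn.linear` accepts a fused `activation=` argument depends on your tt-metal version, so treat this as a starting point rather than a drop-in implementation:

```python
import ttnn

def mlp_block(x, w1, b1, w2, b2):
    """Mixer MLP sub-block: Linear -> GELU -> Linear, keeping the intermediate in L1.
    x, w1, b1, w2, b2 are ttnn tensors already on device in TILE_LAYOUT."""
    # First projection; if supported by the installed tt-metal, the GELU is fused
    # into the matmul via the activation argument (otherwise call ttnn.gelu separately).
    hidden = ttnn.linear(x, w1, bias=b1, activation="gelu", memory_config=ttnn.L1_MEMORY_CONFIG)
    # Second projection back to the original width.
    out = ttnn.linear(hidden, w2, bias=b2, memory_config=ttnn.L1_MEMORY_CONFIG)
    ttnn.deallocate(hidden)  # free L1 as soon as the intermediate is no longer needed
    return out
```

For sharded configurations, the same calls take a sharded `memory_config` (for example one produced by `ttnn.create_sharded_memory_config`); which sharding scheme pays off per layer is something to determine with the profiler.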
Stage 3 — Deeper Optimization
- Maximize core counts used per inference
- Implement deeper TT-specific optimizations:
- Parallel processing of patches across cores
- Efficient MLP layer fusion (multi-layer MLPs as single kernel)
- Optimized transpose operations for channel mixing
- Efficient gated attention implementation
- Pipeline time-mixing and channel-mixing stages
- Minimize prediction latency for real-time forecasting
- Batch processing for multiple time series
- Optimize patch processing:
- Parallel patch extraction and normalization
- Minimize transpose overhead for patch dimensions
- Efficient stride operations for overlapping patches
- Optimize channel operations:
- Efficient channel-independent parallel processing
- Optimized channel-mixing transpose and computation
- Minimize memory movement for hybrid channel modeling
- Pipeline different model stages:
- Overlap patch extraction with computation
- Pipeline time-mixing and channel-mixing operations
- Efficient head computation
- Minimize memory and TM (tensor manipulation) overheads
- Support for streaming inference (online forecasting)
- Explore techniques for very long context (2048+ patches)
- Document any advanced tuning, known limitations, or trade-offs
- Target stretch goals (a simple measurement sketch follows this list):
- 1000+ sequences/second throughput for batch inference
- < 10ms latency for single sequence prediction
- Support for 2048+ patch inputs (very long context)
- Efficient handling of high-dimensional multivariate data (100+ channels)
- Multi-task parallel inference (forecasting + classification simultaneously)
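The throughput and latency numbers in this bounty (both the Stage 1 baselines and these stretch goals) can be measured with a simple host-side harness. A minimal sketch; `run_model` is a placeholder for your TTNN forward pass and does not account for device-side tracing or async execution:

```python
import time

def measure(run_model, batch, batch_size, warmup=5, iters=100):
    """Rough wall-clock latency/throughput measurement around an inference callable."""
    for _ in range(warmup):
        run_model(batch)  # warm-up iterations (kernel compilation, program cache)
    start = time.perf_counter()
    for _ in range(iters):
        run_model(batch)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = iters * batch_size / elapsed
    print(f"latency: {latency_ms:.2f} ms/iter, throughput: {throughput:.1f} sequences/s")
    return latency_ms, throughput
```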
🧭 Guidance & Starting Points
Primary Resources
- Use the TTNN model bring-up tech report as your primary reference
- Reference Transformer implementations in tt-metal for transformer patterns
- Use the HuggingFace Transformers PatchTSMixer (documentation) as the reference implementation
- Use the IBM PatchTSMixer implementation on GitHub for architecture details
- Refer to TT Fused Ops PR #29236 (Fuse YoloV4 leaky ReLU activations with convolution layers) for optimization opportunities
HuggingFace Implementation Reference
The HuggingFace implementation provides multiple model classes:
- PatchTSMixerModel: The bare PatchTSMixer encoder outputting raw hidden states
- PatchTSMixerForPrediction: PatchTSMixer for time-series forecasting with distribution head
- PatchTSMixerForTimeSeriesClassification: For classification tasks
- PatchTSMixerForPretraining: For masked pre-training
- PatchTSMixerForRegression: For regression tasks
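A quick way to get a working reference is to instantiate these classes directly. A minimal sketch; parameter and attribute names follow the HuggingFace documentation linked in Resources, so verify them against your installed transformers version:

```python
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

# Option 1: load a pretrained checkpoint listed in the Resources section.
model = PatchTSMixerForPrediction.from_pretrained("ibm/patchtsmixer-etth1-forecasting")

# Option 2: build a small model from scratch for shape and parity experiments.
config = PatchTSMixerConfig(
    context_length=512,       # input window length
    prediction_length=96,     # forecast horizon
    num_input_channels=7,     # e.g. ETTh1 has 7 variables
    patch_length=16,
    patch_stride=8,           # overlapping patches
    mode="common_channel",    # or "mix_channel" for channel-mixing
    gated_attn=True,
)
small_model = PatchTSMixerForPrediction(config)

past_values = torch.randn(4, config.context_length, config.num_input_channels)
with torch.no_grad():
    out = small_model(past_values=past_values)
print(out.prediction_outputs.shape)  # expected: [4, prediction_length, num_input_channels]
```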
Key features to implement:
- Patching Strategy: Divides the input sequence into patches of fixed length (see the patching sketch after this list)
- patch_length: Length of each patch (e.g., 16)
- stride: Stride for patch extraction (e.g., 8 for overlapping patches)
- Normalization: Instance normalization or batch normalization applied to patches
- MLP-Mixer Architecture:
- Time-Mixing: MLP operates on time dimension (across patches)
- Channel-Mixing: MLP operates on channel dimension (across variables)
- Gated Attention: Optional attention mechanism for feature selection
- Channel Modeling Modes:
- channel_consistent_masking: For pre-training
- unmasked_channel_indices: Specify channels to keep unmasked
- Mode selection: "common_channel", "mix_channel"
- Configurable Inputs:
- past_values: Historical time series values [batch, seq_len, num_channels]
- future_values: Target values (for training)
- past_observed_mask: Mask for missing values
- output_hidden_states: Return all layer outputs
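To make the patching step concrete, here is a shapes-only sketch of patch extraction using `torch.Tensor.unfold`. The HuggingFace module additionally handles padding and masking, so this is illustrative rather than an exact re-implementation:

```python
import torch

def extract_patches(past_values, patch_length=16, stride=8):
    """past_values: [batch, seq_len, num_channels]
    returns:        [batch, num_channels, num_patches, patch_length]"""
    x = past_values.transpose(1, 2)                          # [batch, num_channels, seq_len]
    patches = x.unfold(dimension=-1, size=patch_length, step=stride)
    return patches  # num_patches = (seq_len - patch_length) // stride + 1

x = torch.randn(2, 512, 7)          # e.g. a 512-step context with 7 channels (ETTh1)
print(extract_patches(x).shape)     # torch.Size([2, 7, 63, 16])
```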
🔎 Possible Approaches
Sequential Implementation Strategy
- Start from HuggingFace implementation and port components sequentially:
- Begin with patching layer (critical component)
- Implement time-mixing MLP block
- Add channel-mixing MLP block
- Implement gated attention (optional)
- Add forecasting head
- Test end-to-end pipeline
- Validate each component against the PyTorch reference before integration (a per-component check is sketched after this list):
- Test patching operation output shapes and values
- Validate time-mixing MLP on small examples
- Validate channel-mixing with known inputs
- Check normalization layer outputs
- Compare full model outputs
- Validate end-to-end predictions
- Start with channel-independent mode first:
- Simpler architecture (no channel-mixing)
- Easier to parallelize
- Validate basic functionality
- Then add channel-mixing capability
- Test on standard benchmarks:
- Start with ETTh1 dataset (7 channels, hourly data)
- Test different prediction horizons (96, 192, 336, 720)
- Validate on Weather dataset (21 channels)
- Compare MSE/MAE metrics with published results
- Experiment with optimizations:
- Different sharding strategies for patches
- Fused MLP layers (multi-layer fusion)
- Efficient transpose operations
- Parallel channel processing (for channel-independent mode)
- Pipeline time-mixing and channel-mixing
- Use TTNN profiling tools to identify bottlenecks:
- Measure patching operation time
- Profile MLP computation
- Identify transpose overhead
- Optimize memory movement
- Profile different batch sizes
- Open a draft PR early to get feedback on your approach
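For the per-component validation step referenced above, one convenient check (commonly used in tt-metal model tests) is a shape assertion plus a Pearson correlation coefficient (PCC) against the PyTorch reference. The 0.99 threshold here is a suggested default, not a bounty requirement:

```python
import torch

def assert_with_pcc(tt_out, ref_out, threshold=0.99):
    """Per-component parity check: shapes must match and the Pearson correlation
    between flattened outputs must exceed the threshold."""
    assert tt_out.shape == ref_out.shape, f"shape mismatch: {tt_out.shape} vs {ref_out.shape}"
    stacked = torch.stack([tt_out.flatten().float(), ref_out.flatten().float()])
    pcc = torch.corrcoef(stacked)[0, 1].item()
    assert pcc >= threshold, f"PCC {pcc:.5f} below {threshold}"
    return pcc
```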
Alternative Approaches
- Modular testing: Implement and optimize the MLP-Mixer block as a standalone module first (a simplified reference block is sketched below)
- Progressive complexity: Start with univariate (channel-independent), then add channel-mixing
- Ablation studies: Compare channel-independent vs. channel-mixing performance on TT hardware
- Multi-scale ensemble: Implement multiple patch lengths and ensemble predictions
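For the modular-testing approach, a simplified PyTorch reference of the core mixer block (time-mixing across patches, then channel-mixing across variables) is useful as a standalone target before porting to TTNN. This is a sketch of the idea, not the exact HuggingFace module (which also includes feature mixing, dropout, and optional gated attention):

```python
import torch
import torch.nn as nn

class SimpleMixerBlock(nn.Module):
    """Simplified mixer block. Input/output shape: [batch, num_channels, num_patches, d_model]."""
    def __init__(self, num_patches, num_channels, d_model, expansion=2):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(num_patches, expansion * num_patches), nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches),
        )
        self.channel_mlp = nn.Sequential(
            nn.Linear(num_channels, expansion * num_channels), nn.GELU(),
            nn.Linear(expansion * num_channels, num_channels),
        )
        self.norm_time = nn.LayerNorm(d_model)
        self.norm_channel = nn.LayerNorm(d_model)

    def forward(self, x):
        # Time-mixing: put the patch axis last so the MLP mixes across patches.
        y = self.norm_time(x).transpose(2, 3)            # [b, c, d_model, num_patches]
        x = x + self.time_mlp(y).transpose(2, 3)         # residual connection
        # Channel-mixing: put the channel axis last so the MLP mixes across variables.
        y = self.norm_channel(x).permute(0, 3, 2, 1)     # [b, d_model, num_patches, c]
        x = x + self.channel_mlp(y).permute(0, 3, 2, 1)
        return x
```

Validating a TTNN port of this block against the PyTorch version (for example with the PCC check above) gives an early signal on transpose handling, which is where most of the channel-mixing cost lives.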
📊 Result Submission Guidelines
Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you want us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.
Deliverables:
- Functional model implementation
- Validation logs (output correctness)
- Performance report + header for final review
Links:
📚 Resources
Model Resources
- HuggingFace PatchTSMixer Documentation: https://huggingface.co/docs/transformers/en/model_doc/patchtsmixer
- HuggingFace Blog Post: PatchTSMixer for Time-Series Forecasting
- Pretrained Models on HuggingFace:
- ibm/patchtsmixer-etth1-forecasting - ETTh1 forecasting model
- ibm/patchtsmixer-etth2-forecasting - ETTh2 forecasting model
- ibm-granite/granite-timeseries-patchtsmixer - General pre-trained model
- More models: https://huggingface.co/models?search=patchtsmixer
- Original PatchTSMixer Paper (ICLR 2024): "TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting"
- arXiv: https://arxiv.org/abs/2306.09364
- GitHub: https://github.com/IBM/tsfm (Time Series Foundation Models)
- IBM Research Blog: PatchTSMixer announcement
Datasets & Benchmarks
Primary Source (Recommended):
- Time Series Library (TSLib): https://github.com/thuml/Time-Series-Library
- Contains all preprocessed datasets in consistent format
- Used by most recent papers for benchmarking
- Includes train/val/test splits
Individual Datasets:
- ETT (Electricity Transformer Temperature):
- ETTh1, ETTh2 (hourly, 7 features)
- ETTm1, ETTm2 (15-minute, 7 features)
- GitHub: https://github.com/zhouhaoyi/ETDataset
- Also in TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/ETT
- Weather Dataset:
- 21 meteorological indicators from 21 weather stations
- 10-minute intervals, 2020 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/weather
- Traffic Dataset:
- Road occupancy rates from 862 Bay Area sensors
- Hourly data, 2015-2016
- Source: PeMS (http://pems.dot.ca.gov/)
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/traffic
- Electricity (ECL) Dataset:
- Hourly consumption from 321 clients
- 2012-2014 data
- UCI: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/electricity
- Exchange Rate Dataset:
- Daily exchange rates for 8 countries
- 1990-2016 data
- TSLib: https://github.com/thuml/Time-Series-Library/tree/main/dataset/exchange_rate
Benchmark Scripts:
- TSLib provides standard evaluation scripts
- Consistent train/val/test splits across all datasets
- MSE, MAE metrics computed uniformly
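For quick sanity checks before wiring up the TSLib loaders, here is a minimal windowing sketch over an ETT-style CSV (first column is the timestamp). Note that this naive split is not the canonical TSLib train/val/test split, which should be used for reported metrics:

```python
import numpy as np
import pandas as pd

def make_windows(csv_path, context_length=512, prediction_length=96):
    """Build non-overlapping context/target windows from an ETT-style CSV."""
    df = pd.read_csv(csv_path)
    values = df.iloc[:, 1:].to_numpy(dtype=np.float32)   # drop timestamp -> [T, num_channels]
    past, future = [], []
    for start in range(0, len(values) - context_length - prediction_length + 1, prediction_length):
        past.append(values[start : start + context_length])
        future.append(values[start + context_length : start + context_length + prediction_length])
    return np.stack(past), np.stack(future)  # [N, context_length, C], [N, prediction_length, C]

past, future = make_windows("dataset/ETT/ETTh1.csv")     # path is illustrative
print(past.shape, future.shape)
```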
Academic Resources
- Original Paper: Chen et al., "TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting", ICLR 2024
- Related MLP-Mixer Work:
- MLP-Mixer (NeurIPS'21) - Original vision model
- DLinear (AAAI'23) - Simple linear baseline
- TimesNet (ICLR'23) - 2D vision perspective
- Comparison Models:
- PatchTST (comparable patch-based transformer)
- Informer, Autoformer, FEDformer
TT-Metal Resources
- TTNN Model Bring-up Tech Report: [Link to tech report]
- MLP Implementations in tt-metal:
- models/demos/ - Check for MLP-based models
- Look for MLP-Mixer or similar architectures
- TT Fused Ops PR #29236: Fuse YoloV4 leaky ReLU activations with convolution layers
- Performance Report Header: https://github.com/tenstorrent/tt-metal/blob/main/tests/docs/perf_header.md
- TTNN Documentation: https://github.com/tenstorrent/tt-metal/tree/main/ttnn
- MLP Optimization Examples: Check existing MLP implementations
Helpful Tools
- Visualization: Use TensorBoard or Weights & Biases for:
- Prediction visualization
- Loss curves
- Patch attention visualization
- Profiling: TTNN profiler for performance analysis
- Testing: pytest framework for model testing
- Dataset Loading: Use HuggingFace datasets library