[Bounty $1500] PatchTSMixer Time-Series Model Bring-Up Using TTNN APIs #32138

@sdawle-TT

Description

📝 Background

This bounty is for bringing up the PatchTSMixer time-series forecasting model using TTNN APIs on Tenstorrent hardware (Wormhole or Blackhole).

PatchTSMixer is a state-of-the-art, lightweight time-series forecasting model based on the MLP-Mixer architecture from computer vision. Developed by IBM Research and presented at KDD 2023, it achieves strong accuracy at significantly lower computational cost than transformer-based models.

Key Capabilities:

  • Patch-Based Architecture: Divides time series into patches and processes them efficiently
  • Channel-Mixing and Time-Mixing: Dual mixing strategy for multivariate time series (see the sketch after this list)
    • Channel-Mixing: Models dependencies across different variables
    • Time-Mixing: Captures temporal patterns within each variable
  • Hybrid Channel Modeling: Combines channel-independent and channel-mixing approaches
  • Gated Attention Mechanism: Optional attention for enhanced feature selection
  • Online Reconciliation Head: Ensures hierarchical forecast consistency
  • Lightweight Design: MLP-based architecture (no self-attention overhead)
  • Transfer Learning Support: Pre-trained models available for fine-tuning
  • Multi-Task Support: Forecasting, classification, pre-training, and regression
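
The dual mixing idea above can be summarized in a short reference sketch. The following is a minimal PyTorch-style illustration only (the actual HuggingFace modules add per-block normalization, optional gated attention, and dropout); all names here are illustrative.

```python
import torch
import torch.nn as nn

class MixerBlockSketch(nn.Module):
    """Minimal sketch of PatchTSMixer's dual mixing idea (not the full HF module)."""

    def __init__(self, num_patches: int, num_channels: int, expansion: int = 2):
        super().__init__()
        # Time-mixing MLP: mixes information across patches (temporal dimension)
        self.time_mlp = nn.Sequential(
            nn.Linear(num_patches, expansion * num_patches),
            nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches),
        )
        # Channel-mixing MLP: mixes information across variables (channel dimension)
        self.channel_mlp = nn.Sequential(
            nn.Linear(num_channels, expansion * num_channels),
            nn.GELU(),
            nn.Linear(expansion * num_channels, num_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, num_channels, num_patches, d_model]
        # Time-mixing: put the patch dimension last, mix, restore layout, add residual
        x = x + self.time_mlp(x.transpose(2, 3)).transpose(2, 3)
        # Channel-mixing: put the channel dimension last, mix, restore layout, add residual
        x = x + self.channel_mlp(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return x

x = torch.randn(4, 7, 64, 8)  # batch=4, channels=7, patches=64, d_model=8
print(MixerBlockSketch(num_patches=64, num_channels=7)(x).shape)  # [4, 7, 64, 8]
```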

🎯 What Success Looks Like

A successful submission will fulfill all requirements in the following stages. Payout is made after all three stages are completed.

Stage 1 — Bring-Up

  • Implement the PatchTSMixer model using TTNN APIs (Python)
  • Implement the full forecasting pipeline:
    • Input patching layer (divides time series into patches)
    • Patch normalization (instance normalization or batch normalization)
    • Time-Mixing MLP layers (processes temporal patterns)
    • Channel-Mixing MLP layers (processes cross-variate patterns)
    • Optional gated attention mechanism
    • Head module for forecasting/classification/regression
    • Optional online reconciliation head
  • Model runs on Tenstorrent hardware (Wormhole or Blackhole) with no errors
  • Supports multiple task modes:
    • Time-series forecasting: Multi-horizon prediction
    • Classification: Time-series classification tasks
    • Pre-training: Self-supervised pre-training for transfer learning
    • Regression: Direct regression tasks
  • Supports multiple channel modeling modes:
    • Channel-independent: Each variable processed separately
    • Channel-mixing: Cross-variate dependencies modeled
    • Hybrid: Combination of both approaches
  • Produces valid predictions on standard benchmarks (ETT datasets or Weather dataset)
  • Output is verifiable (prediction accuracy, compare with PyTorch/HuggingFace reference)
  • Achieves baseline performance targets:
    • Inference throughput: At least 200 sequences/second for 512-step input
    • Latency: < 30ms for single sequence prediction (batch size 1)
  • Accuracy evaluation (see the verification sketch after this list):
    • MSE and MAE within 5% of PyTorch reference implementation
    • Prediction correlation coefficient > 0.90 against reference
  • Clear instructions for setup and running the model
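
The accuracy gate above can be checked with a small helper along these lines. This is a plain-PyTorch sketch that assumes you already have the reference predictions, the TTNN port's predictions (e.g. via ttnn.to_torch), and the ground-truth targets as tensors of the same shape; the function name is illustrative.

```python
import torch

def compare_to_reference(ref: torch.Tensor, out: torch.Tensor, targets: torch.Tensor):
    """Compare a TTNN port's predictions against the PyTorch/HuggingFace reference.

    ref, out, targets: [batch, prediction_length, num_channels]
    """
    ref, out, targets = ref.float(), out.float(), targets.float()
    # Error metrics for reference and device outputs against ground truth
    mse_ref = torch.mean((ref - targets) ** 2).item()
    mse_out = torch.mean((out - targets) ** 2).item()
    mae_ref = torch.mean(torch.abs(ref - targets)).item()
    mae_out = torch.mean(torch.abs(out - targets)).item()
    # Pearson correlation between flattened reference and device predictions
    pcc = torch.corrcoef(torch.stack([ref.flatten(), out.flatten()]))[0, 1].item()
    report = {
        "mse_rel_diff": abs(mse_out - mse_ref) / mse_ref,
        "mae_rel_diff": abs(mae_out - mae_ref) / mae_ref,
        "pcc": pcc,
    }
    # Stage 1 targets: MSE/MAE within 5% of reference, correlation > 0.90
    assert report["mse_rel_diff"] < 0.05 and report["mae_rel_diff"] < 0.05, report
    assert report["pcc"] > 0.90, report
    return report
```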

Stage 2 — Basic Optimizations

  • Use optimal sharded/interleaved memory configs (illustrated in the sketch after this list) for:
    • Patch embedding layers
    • Time-Mixing MLP layers
    • Channel-Mixing MLP layers
    • Gated attention computation
    • Head projection layers
  • Implement efficient sharding strategy for:
    • Patch-based processing (parallel patch computation)
    • Channel-independent operations
    • Cross-channel mixing operations
    • Multi-head outputs (for forecasting multiple horizons)
  • Fuse simple ops where possible:
    • Patching + normalization
    • MLP layers (Linear + Activation + Dropout)
    • Gated attention computation
    • Residual connections
  • Store intermediate activations in L1 where beneficial
  • Use recommended TTNN/tt-metal MLP flows
  • Leverage the TT library of fused ops for:
    • MLP blocks (Linear layers + activations)
    • Normalization layers (instance norm, batch norm, layer norm)
    • Gating mechanisms
  • Optimize patch-specific operations:
    • Efficient patch extraction from time series
    • Patch reordering and transpose operations
    • Patch normalization strategies
  • Efficient channel mixing implementation:
    • Transpose operations for channel dimension
    • Channel-wise MLP computation
    • Hybrid channel modeling logic

Stage 3 — Deeper Optimization

  • Maximize core counts used per inference
  • Implement deeper TT-specific optimizations:
    • Parallel processing of patches across cores
    • Efficient MLP layer fusion (multi-layer MLPs as single kernel)
    • Optimized transpose operations for channel mixing
    • Efficient gated attention implementation
    • Pipeline time-mixing and channel-mixing stages
  • Minimize prediction latency for real-time forecasting
  • Batch processing for multiple time series
  • Optimize patch processing:
    • Parallel patch extraction and normalization
    • Minimize transpose overhead for patch dimensions
    • Efficient stride operations for overlapping patches
  • Optimize channel operations:
    • Efficient channel-independent parallel processing
    • Optimized channel-mixing transpose and computation
    • Minimize memory movement for hybrid channel modeling
  • Pipeline different model stages:
    • Overlap patch extraction with computation
    • Pipeline time-mixing and channel-mixing operations
    • Efficient head computation
  • Minimize memory and TM (tensor manipulation) overheads
  • Support for streaming inference (online forecasting; see the sketch after this list)
  • Explore techniques for very long context (2048+ patches)
  • Document any advanced tuning, known limitations, or trade-offs
  • Target stretch goals:
    • 1000+ sequences/second throughput for batch inference
    • < 10ms latency for single sequence prediction
    • Support for 2048+ patch inputs (very long context)
    • Efficient handling of high-dimensional multivariate data (100+ channels)
  • Multi-task parallel inference (forecasting + classification simultaneously)
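
For the streaming-inference item, a minimal host-side sliding-window loop is sketched below. run_patchtsmixer is a hypothetical stand-in for whatever entry point the TTNN port exposes; the point is only the rolling context window.

```python
from collections import deque
import torch

def stream_forecast(run_patchtsmixer, sample_stream, context_length=512, num_channels=7):
    """Maintain a rolling context window and forecast after each new observation.

    run_patchtsmixer: callable taking past_values [1, context_length, num_channels]
                      and returning predictions (hypothetical TTNN port entry point).
    sample_stream: iterable of per-step observations, each of shape [num_channels].
    """
    window = deque(maxlen=context_length)
    for step in sample_stream:
        window.append(torch.as_tensor(step, dtype=torch.float32))
        if len(window) < context_length:
            continue  # wait until the context window is full
        past_values = torch.stack(list(window)).unsqueeze(0)  # [1, ctx, channels]
        assert past_values.shape == (1, context_length, num_channels)
        yield run_patchtsmixer(past_values)
```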

🧭 Guidance & Starting Points

Primary Resources

HuggingFace Implementation Reference

The HuggingFace implementation provides multiple model classes:

  1. PatchTSMixerModel: The bare PatchTSMixer encoder outputting raw hidden states
  2. PatchTSMixerForPrediction: PatchTSMixer for time-series forecasting with distribution head
  3. PatchTSMixerForTimeSeriesClassification: For classification tasks
  4. PatchTSMixerForPretraining: For masked pre-training
  5. PatchTSMixerForRegression: For regression tasks

Key features to implement:

  • Patching Strategy: Divides the input sequence into patches of fixed length (see the patching sketch after this list)
    • patch_length: Length of each patch (e.g., 16)
    • stride: Stride for patch extraction (e.g., 8 for overlapping patches)
  • Normalization: Instance normalization or batch normalization applied to patches
  • MLP-Mixer Architecture:
    • Time-Mixing: MLP operates on time dimension (across patches)
    • Channel-Mixing: MLP operates on channel dimension (across variables)
    • Gated Attention: Optional attention mechanism for feature selection
  • Channel Modeling Modes:
    • channel_consistent_masking: For pre-training
    • unmasked_channel_indices: Specify channels to keep unmasked
    • Mode selection: "common_channel", "mix_channel"
  • Configurable Inputs:
    • past_values: Historical time series values [batch, seq_len, num_channels]
    • future_values: Target values (for training)
    • past_observed_mask: Mask for missing values
    • output_hidden_states: Return all layer outputs
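
As a shape reference for the patching step, the sliding-window extraction can be expressed with torch.Tensor.unfold. The HuggingFace implementation additionally handles sequence alignment and mask propagation, so this is only a minimal sketch using the example patch_length/stride values from this list.

```python
import torch

def extract_patches(past_values: torch.Tensor, patch_length: int = 16, stride: int = 8):
    """past_values: [batch, seq_len, num_channels]
    returns: [batch, num_channels, num_patches, patch_length]"""
    x = past_values.transpose(1, 2)  # [batch, num_channels, seq_len]
    # Sliding windows of length patch_length taken every `stride` steps
    return x.unfold(dimension=-1, size=patch_length, step=stride)

past_values = torch.randn(2, 512, 7)        # batch=2, 512 time steps, 7 channels
print(extract_patches(past_values).shape)   # torch.Size([2, 7, 63, 16])
```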

🔎 Possible Approaches

Sequential Implementation Strategy

  1. Start from HuggingFace implementation and port components sequentially:

    • Begin with patching layer (critical component)
    • Implement time-mixing MLP block
    • Add channel-mixing MLP block
    • Implement gated attention (optional)
    • Add forecasting head
    • Test end-to-end pipeline
  2. Validate each component against the PyTorch reference before integration (a comparison sketch follows this list):

    • Test patching operation output shapes and values
    • Validate time-mixing MLP on small examples
    • Validate channel-mixing with known inputs
    • Check normalization layer outputs
    • Compare full model outputs
    • Validate end-to-end predictions
  3. Start with channel-independent mode first:

    • Simpler architecture (no channel-mixing)
    • Easier to parallelize
    • Validate basic functionality
    • Then add channel-mixing capability
  4. Test on standard benchmarks:

    • Start with ETTh1 dataset (7 channels, hourly data)
    • Test different prediction horizons (96, 192, 336, 720)
    • Validate on Weather dataset (21 channels)
    • Compare MSE/MAE metrics with published results
  5. Experiment with optimizations:

    • Different sharding strategies for patches
    • Fused MLP layers (multi-layer fusion)
    • Efficient transpose operations
    • Parallel channel processing (for channel-independent mode)
    • Pipeline time-mixing and channel-mixing
  6. Use TTNN profiling tools to identify bottlenecks:

    • Measure patching operation time
    • Profile MLP computation
    • Identify transpose overhead
    • Optimize memory movement
    • Profile different batch sizes
  7. Open a draft PR early to get feedback on your approach
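
For step 2, per-component and end-to-end checks can follow a pattern like the one below, using the HuggingFace model as the reference. The transformers class and config names are real, but ttnn_patchtsmixer is a hypothetical stand-in for your port; adjust config fields to the checkpoint or dataset you target and verify parameter names against your installed transformers version.

```python
import torch
from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction

# Small randomly initialized reference model (swap in a pretrained checkpoint if available)
config = PatchTSMixerConfig(
    context_length=512,
    prediction_length=96,
    num_input_channels=7,
    patch_length=16,
)
reference = PatchTSMixerForPrediction(config).eval()

past_values = torch.randn(1, config.context_length, config.num_input_channels)
with torch.no_grad():
    ref_out = reference(past_values=past_values).prediction_outputs

# ttnn_out = ttnn_patchtsmixer(past_values)  # hypothetical TTNN port entry point
# pcc = torch.corrcoef(torch.stack([ref_out.flatten(), ttnn_out.flatten()]))[0, 1]
# assert pcc > 0.90

print(ref_out.shape)  # expected: [1, prediction_length, num_input_channels]
```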

Alternative Approaches

  • Modular testing: Implement and optimize MLP-Mixer block as standalone module first
  • Progressive complexity: Start with univariate (channel-independent), then add channel-mixing
  • Ablation studies: Compare channel-independent vs. channel-mixing performance on TT hardware
  • Multi-scale ensemble: Implement multiple patch lengths and ensemble predictions

📊 Result Submission Guidelines

Beyond the model implementation itself, contributors must submit the following material as proof of work.
However, feel free to open a PR at any time if you would like us to check that you are on the right track.
Just understand that payout is only made after all three stages are completed.

Deliverables:

  • Functional model implementation
  • Validation logs (output correctness)
  • Performance report + header for final review

Links:


📚 Resources

Model Resources

Datasets & Benchmarks

Primary Source (Recommended):

Individual Datasets:

Benchmark Scripts:

  • TSLib provides standard evaluation scripts
  • Consistent train/val/test splits across all datasets
  • MSE, MAE metrics computed uniformly

Academic Resources

  • Original Paper: Ekambaram et al., "TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting", KDD 2023
  • Related MLP-Mixer Work:
    • MLP-Mixer (NeurIPS'21) - Original vision model
    • DLinear (AAAI'23) - Simple linear baseline
    • TimesNet (ICLR'23) - 2D vision perspective
  • Comparison Models:
    • PatchTST (comparable patch-based transformer)
    • Informer, Autoformer, FEDformer

TT-Metal Resources

Helpful Tools

  • Visualization: Use TensorBoard or Weights & Biases for:
    • Prediction visualization
    • Loss curves
    • Patch attention visualization
  • Profiling: TTNN profiler for performance analysis
  • Testing: pytest framework for model testing
  • Dataset Loading: Use HuggingFace datasets library
