[Feature Request] Support for xIELU activation function for Apertus-70B model #37865

@pandeashwary

Description

Is your feature request related to a problem? Please describe

We are looking to run the Apertus-70B-Instruct-2509 model on Tenstorrent hardware using the tt-metal stack. While the model architecture is similar to Llama 3 70B (80 layers, 8192 hidden dimension), it utilizes a different activation function.

Currently, tt-metal appears to be missing the xIELU (Expanded Integral of the Exponential Linear Unit) activation operator, which prevents this model from running successfully.

Describe the solution you'd like

I would like to request the implementation and support of the xIELU activation operator within the tt-metal ops library.

Key Details for Implementation:

  • Model Reference: swiss-ai/Apertus-70B-Instruct-2509
  • Operation Type: xIELU is a trainable activation function that involves parameters (alpha_p and alpha_n) learned during training, rather than being a static function like ReLU or SwiGLU.
  • Goal: Native support for this op would give tt-metal parity with Llama-style architectures that have adopted modern, trainable activation functions.

Mathematical Definition:
If x > 0:  f(x) = alpha_p * x^2 + 0.5 * x
If x <= 0: f(x) = alpha_n * (exp(x) - 1) - alpha_n * x + 0.5 * x
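
For concreteness, here is a minimal PyTorch sketch of this definition, intended only as a numerical reference for validating a native kernel. It implements exactly the piecewise formulas above with beta fixed at 0.5; the parameter initialization values are placeholders, and any reparameterization or clamping used in the official Apertus implementation is not reflected here.

```python
import torch
import torch.nn as nn

class XIELU(nn.Module):
    """Reference xIELU following the piecewise definition above (beta fixed at 0.5)."""

    def __init__(self, alpha_p_init: float = 0.8, alpha_n_init: float = 0.8):
        super().__init__()
        # Per-layer trainable scalars; the init values here are placeholders.
        self.alpha_p = nn.Parameter(torch.tensor(alpha_p_init))
        self.alpha_n = nn.Parameter(torch.tensor(alpha_n_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = self.alpha_p * x * x + 0.5 * x
        # expm1(x) == exp(x) - 1, used here for numerical stability
        neg = self.alpha_n * torch.expm1(x) - self.alpha_n * x + 0.5 * x
        return torch.where(x > 0, pos, neg)
```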

Describe alternatives you've considered

  1. Decomposition into existing ops: We considered decomposing xIELU into a sequence of standard elementwise operations (Exp, Mul, Add, etc.; see the sketch after this list). However, for a 70B parameter model, this approach would significantly increase kernel launch overhead and memory bandwidth pressure, leading to poor performance compared to a fused native op.

  2. Model Architecture Modification: Replacing xIELU with a standard supported activation like SwiGLU or ReLU is not feasible. This would require re-training the entire 70B model from scratch to maintain accuracy, which is computationally prohibitive.

  3. CPU Fallback: Running this specific activation on the host CPU would create a massive bottleneck due to the constant PCIe data transfer between the Tenstorrent device and the host for every layer of the 80-layer architecture.
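
For illustration, the decomposed path from option 1 might look roughly like the following, assuming ttnn exposes elementwise exp/multiply/add/subtract/where/gtz ops under these names (the exact op names and signatures are assumptions; the point is the number of separate passes over the activation tensor, not the spelling):

```python
import ttnn

def xielu_decomposed(x, alpha_p, alpha_n):
    # Each call below is a separate elementwise kernel that reads and writes
    # the full activation tensor, instead of a single fused pass.
    pos = ttnn.add(ttnn.multiply(ttnn.multiply(x, x), alpha_p),
                   ttnn.multiply(x, 0.5))
    em1 = ttnn.subtract(ttnn.exp(x), 1.0)
    neg = ttnn.add(ttnn.subtract(ttnn.multiply(em1, alpha_n),
                                 ttnn.multiply(x, alpha_n)),
                   ttnn.multiply(x, 0.5))
    return ttnn.where(ttnn.gtz(x), pos, neg)
```

That is roughly a dozen elementwise launches per invocation, repeated for every one of the 80 decoder layers, which is why a fused native xIELU op is being requested instead.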

Additional context

The xIELU (Expanded Integral of the Exponential Linear Unit) activation function was recently introduced as part of the Swiss AI Initiative's Apertus model series. It is specifically designed to stabilize large-scale training (70B+ parameters) and improve convergence compared to standard static activations like SwiGLU.

Technical References:

  • Research Paper: "Deriving Activation Functions Using Integration" (https://arxiv.org/abs/2411.13010)
  • Model Weights: swiss-ai/Apertus-70B-Instruct-2509 (Hugging Face)
  • Mathematical Implementation: xIELU uses two trainable scalar parameters (alpha_p and alpha_n) learned independently for each layer.

As more "truly open" models like Apertus gain popularity in the research community, having native support for these modern, trainable activation functions in tt-metal will be critical for users looking to deploy high-performance, non-Llama-standard architectures on Tenstorrent hardware.
