Is your feature request related to a problem? Please describe
We are looking to run the Apertus-70B-Instruct-2509 model on Tenstorrent hardware using the tt-metal stack. While the model architecture is similar to Llama 3 70B (80 layers, 8192 hidden dimension), it utilizes a different activation function.
Currently, tt-metal appears to be missing the xIELU (Expanded Integral of the Exponential Linear Unit) activation operator, which prevents this model from running successfully.
Describe the solution you'd like
I would like to request the implementation and support of the xIELU activation operator within the tt-metal ops library.
Key Details for Implementation:
- Model Reference: swiss-ai/Apertus-70B-Instruct-2509
- Operation Type: xIELU is a trainable activation function that involves parameters (alpha_p and alpha_n) learned during training, rather than being a static function like ReLU or SwiGLU.
- Goal: Native support for this op would extend tt-metal's existing Llama-class coverage to Llama-style architectures that have transitioned to modern, trainable activation functions.
Mathematical Definition:
- If x > 0:  f(x) = alpha_p * x^2 + 0.5 * x
- If x <= 0: f(x) = alpha_n * (exp(x) - 1) - alpha_n * x + 0.5 * x
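For reference, here is a minimal PyTorch sketch of this definition as written above. The initialization values and the clamp on the negative branch are my own assumptions, not necessarily what the official Apertus/HuggingFace implementation does:

```python
import torch
import torch.nn as nn


class XIELU(nn.Module):
    """Reference sketch of xIELU based on the piecewise formula quoted above.

    alpha_p and alpha_n are trainable scalars, learned independently per layer.
    Initialization values here are placeholders, not the values used by Apertus.
    """

    def __init__(self, alpha_p_init: float = 0.8, alpha_n_init: float = 0.8):
        super().__init__()
        self.alpha_p = nn.Parameter(torch.tensor(alpha_p_init))
        self.alpha_n = nn.Parameter(torch.tensor(alpha_n_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Positive branch: alpha_p * x^2 + 0.5 * x
        pos = self.alpha_p * x * x + 0.5 * x
        # Negative branch: alpha_n * (exp(x) - 1) - alpha_n * x + 0.5 * x
        # Clamp the exp input at 0 so the unused branch cannot overflow for large positive x.
        neg = self.alpha_n * torch.expm1(torch.clamp(x, max=0.0)) - self.alpha_n * x + 0.5 * x
        return torch.where(x > 0, pos, neg)
```

Note that for an 80-layer model this adds only two trainable scalars per layer (160 scalars total), so essentially all of the op's cost is in the elementwise math rather than in parameter storage.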
Describe alternatives you've considered
- Decomposition into existing ops: We considered decomposing xIELU into a sequence of standard elementwise operations (Exp, Mul, Add, etc.); a rough sketch of such a decomposition is shown after this list. However, for a 70B-parameter model this approach would significantly increase kernel launch overhead and memory bandwidth pressure, leading to poor performance compared to a fused native op.
- Model Architecture Modification: Replacing xIELU with a standard, already-supported activation such as SwiGLU or ReLU is not feasible; maintaining accuracy would require re-training the entire 70B model from scratch, which is computationally prohibitive.
- CPU Fallback: Running this activation on the host CPU would create a massive bottleneck, since every one of the 80 layers would require PCIe data transfers between the Tenstorrent device and the host.
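To make the overhead concern in the first alternative concrete, here is a rough sketch of what a non-fused decomposition looks like, written in plain PyTorch as a stand-in for the analogous device-side elementwise ops (no specific ttnn op names or signatures are assumed). Each numbered step would correspond to roughly one extra kernel launch and one extra pass over the activation tensor, in every one of the 80 decoder layers, per forward pass:

```python
import torch


def xielu_decomposed(x: torch.Tensor, alpha_p: float, alpha_n: float) -> torch.Tensor:
    """Naive, non-fused xIELU: each step maps to a separate elementwise kernel."""
    half_x = 0.5 * x                      # 1: scalar multiply
    pos = alpha_p * (x * x)               # 2, 3: square, then scalar multiply
    pos = pos + half_x                    # 4: add
    neg = alpha_n * torch.expm1(x)        # 5, 6: expm1, then scalar multiply
    neg = neg - alpha_n * x               # 7, 8: scalar multiply, then subtract
    neg = neg + half_x                    # 9: add
    mask = x > 0                          # 10: comparison
    return torch.where(mask, pos, neg)    # 11: ternary select
```

A fused native op would collapse all of this into a single pass over the data, which is why we would prefer a dedicated xIELU op over building it from existing primitives.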
Additional context
The xIELU (Expanded Integral of the Exponential Linear Unit) activation function was introduced in the paper referenced below and adopted by the Swiss AI Initiative's Apertus model series. It is designed to stabilize large-scale training (70B+ parameters) and improve convergence compared to standard static activations like SwiGLU.
Technical References:
- Research Paper: "Deriving Activation Functions Using Integration" (https://arxiv.org/abs/2411.13010)
- Model Weights: swiss-ai/Apertus-70B-Instruct-2509 (Hugging Face)
- Mathematical Implementation: xIELU uses two trainable scalar parameters (alpha_p and alpha_n) learned independently for each layer.
As more "truly open" models like Apertus gain popularity in the research community, having native support for these modern, trainable activation functions in tt-metal will be critical for users looking to deploy high-performance, non-Llama-standard architectures on Tenstorrent hardware.