Softmax

The softmax function is a mathematical function commonly used in machine learning, especially in classification tasks. It transforms a vector of real-valued scores (often called logits) into a probability distribution. The resulting probabilities are positive and sum up to 1, making them suitable for representing categorical distributions.

Key Characteristics

  • Exponential Normalization: The softmax function applies the exponential function to each element of the input vector and then normalizes these values by dividing by the sum of all these exponentials. This has the effect of amplifying the differences between the elements of the input vector, making the highest values stand out more prominently.

  • Formula: For a vector,

    $$\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \cdots & z_n \end{bmatrix}$$

    the softmax function for each element is,

    $$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$

    where e is the base of the natural logarithm.

  • Output as Probabilities: The output of the softmax function is a vector where each component is between 0 and 1, and the sum of all components is 1. This makes it useful for interpreting the outputs as probabilities.
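The formula above can be checked with a few lines of plain Python. This is a minimal reference sketch, not the AIE implementation described later. The function name `softmax` and the max-subtraction step are choices made here for illustration; subtracting the maximum cancels in the ratio, so the result is unchanged, but it keeps the exponentials from overflowing.

```python
import math

def softmax(z):
    """Reference softmax over a list of real-valued logits."""
    # Shift by the maximum so every exponent is <= 0 (avoids overflow).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    # Normalize so the outputs sum to 1.
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
```

The largest logit receives about two thirds of the probability mass even though it is only one unit above the next, which is the amplifying effect of the exponential.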

Compilation details

The softmax function employs the exponential function $e^x$, similar to the example found here. As in that example, a lookup table approximation is used to implement $e^x$ efficiently.
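To illustrate the general idea of a lookup table approximation, here is a small sketch in plain Python. It is not the table layout used by the AIE2 runtime library; the range `X_MIN`, the spacing `STEP`, the linear interpolation, and the name `exp_lut` are all assumptions made for this example. It also assumes the inputs are non-positive, as they would be if the maximum were subtracted first.

```python
import math

# Hypothetical table parameters, chosen only for illustration.
X_MIN = -16.0        # smallest input covered by the table
STEP = 1.0 / 64      # spacing between table entries
TABLE = [math.exp(X_MIN + i * STEP) for i in range(int(-X_MIN / STEP) + 1)]

def exp_lut(x):
    """Approximate e**x for x in [X_MIN, 0] by table lookup plus interpolation."""
    x = min(0.0, max(X_MIN, x))          # clamp to the covered range
    pos = (x - X_MIN) / STEP             # fractional index into the table
    i = min(int(pos), len(TABLE) - 2)    # left neighbour, kept in bounds
    frac = pos - i
    # Linear interpolation between the two neighbouring entries.
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
```

With this spacing the approximation stays within about 1e-4 of `math.exp` over the covered range. A finer table or a different split of the input trades memory for accuracy.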

In addition, and unlike any of the other current design examples, this example takes MLIR dialects as direct input, including the vector, affine, arith and math dialects, as shown in the source. This MLIR is intended to be generated from a higher-level description, but it is shown here as an example of how other MLIR dialects can be used as input.

The compilation process is different from the other design examples, and is shown in the Makefile.

  1. The input MLIR is first vectorized into chunks of size 16, and a C++ file is produced which maps the various MLIR dialects onto AIE intrinsics, including vector loads and stores, vectorized arithmetic on those registers, and the $e^x$ approximation using lookup tables.
  2. This generated C++ is compiled into a first object file.
  3. A file called lut_based_ops.cpp from the AIE2 runtime library is compiled into a second object file. This file contains the lookup table contents used to approximate the $e^x$ function.
  4. A wrapper file is also compiled into an object file. The wrapper prevents C++ name mangling and allows the wrapped C function to be called from the structural Python.
  5. These three object files are combined into a single .a file, which is then referenced inside the softmax.py structural Python.

This is a slightly more complex process than the rest of the examples, which typically only use a single object file containing the wrapped C++ function call, but is provided to show how a library-based flow can also be used.
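The chunked structure mentioned in step 1 can be pictured with a plain Python sketch that walks the input 16 elements at a time. This is only an illustration of the loop shape, not the generated C++: the three-pass layout, the name `softmax_chunked`, and the requirement that the length be a multiple of 16 are assumptions made here.

```python
import math

LANES = 16  # chunk size named in step 1

def softmax_chunked(z):
    """Softmax computed chunk by chunk, mimicking a 16-wide vector loop."""
    if len(z) % LANES != 0:
        raise ValueError("length must be a multiple of 16 in this sketch")
    chunks = [z[i:i + LANES] for i in range(0, len(z), LANES)]

    # Pass 1: reduce each chunk to its maximum, then across chunks.
    m = max(max(c) for c in chunks)

    # Pass 2: exponentiate each chunk and accumulate the running sum.
    exps, total = [], 0.0
    for c in chunks:
        e = [math.exp(v - m) for v in c]
        exps.append(e)
        total += sum(e)

    # Pass 3: scale every chunk by the reciprocal of the sum.
    inv = 1.0 / total
    return [v * inv for e in exps for v in e]
```

Each inner list operation stands in for one vector instruction operating on 16 lanes; on the AIE the `math.exp` call would be replaced by the lookup table approximation.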

  1. softmax.py: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using aiecc to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI).

  2. softmax_placed.py: An alternative version of the design in softmax.py, expressed in a lower-level version of IRON.

  3. softmax_whole_array_placed.py: This Python script extends the design to utilize the entire AIE array, scaling up from the two cores used in softmax_placed.py. The number of cores used (n_cores) is configured via the n_col and n_cores_per_col variables.
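As a rough illustration of how n_col and n_cores_per_col relate to the work each core receives, the sketch below computes n_cores and an even, contiguous split of the input. The helper `split_across_cores` is hypothetical and is not taken from softmax_whole_array_placed.py; the real design may tile the data differently, and this sketch assumes the size divides evenly across cores.

```python
def split_across_cores(size, n_col, n_cores_per_col):
    """Return one (start, stop) element range per core (hypothetical helper)."""
    n_cores = n_col * n_cores_per_col
    if size % n_cores != 0:
        raise ValueError("size must divide evenly across cores in this sketch")
    per_core = size // n_cores
    return [(c * per_core, (c + 1) * per_core) for c in range(n_cores)]

# One column of four cores over 262144 elements.
ranges = split_across_cores(262144, n_col=1, n_cores_per_col=4)
print(len(ranges), ranges[0])  # 4 (0, 65536)
```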

Usage

For a quick reference of all available options, run:

make help

Build and Run

Build and run with default settings:

make run

Build and run with custom runtime parameters:

make run size=524288 n_iterations=100 n_warmup=20

Placement Modes

There are three placement modes available:

Default mode - Uses softmax.py:

make run

Manual placement mode - Uses softmax_placed.py:

make run use_placed=1

Whole array placement mode - Uses softmax_whole_array_placed.py:

make run use_whole_array=1
make run use_whole_array=1 whole_array_cols=4 whole_array_rows=4
make run use_whole_array=1 whole_array_cols=2 whole_array_rows=2

Configuration Variables

| Variable | Default | Description |
|---|---|---|
| size | 262144 | Input data size (number of elements) |
| n_iterations | 20 | Number of benchmark iterations |
| n_warmup | 10 | Number of warmup iterations |
| use_placed | 0 | Enable manual placement mode |
| use_whole_array | 0 | Enable whole array placement mode |
| whole_array_cols | 1 | Number of columns (when use_whole_array=1) |
| whole_array_rows | 4 | Number of cores per column (when use_whole_array=1) |
| devicename | npu | Target device (npu or npu2) |

Note: Configuration changes are detected automatically, so there is no need to run make clean when changing parameters.

Profiling

To run with profiling (outputs to results.csv):

make profile

Hardware Tracing

To generate a trace file:

make use_placed=1 trace

Note: Tracing is currently supported with the use_placed=1 mode.