See Vitis™ AI Development Environment on amd.com |
Version: Vitis 2025.2
The Vitis DSP Library currently implements the Fast Fourier Transform (FFT) using a Stockham algorithm. This algorithm computes the transform using a pair of buffers rather than the more popular in-place Cooley-Tukey approach, and maps more efficiently to the vector/SIMD architecture of the AI Engine. This approach yields excellent results, particularly for small to moderate point sizes.
You can implement a 1D FFT on the AI Engine array using a 2D FFT algorithm with higher overall efficiency. This alternative "divide & conquer" approach provides a better solution on the AI Engine array because it is less reliant on "butterfly routing" and allows breaking large transforms into many smaller ones that can be distributed across the array.
The 2D architecture is not new to AI Engine. XAPP1356 first explored its use several years ago to achieve 3.7 GSPS with 10 AI Engine tiles. This tutorial explores using the 2D architecture to implement large point transforms in the SSR > 1 regime, specifically a 64K-pt IFFT @ 2 GSPS. In this scenario, the design requires both AI Engine tile resources for compute and programmable logic (PL) resources for sample reordering as shown in the following section.
A MATLAB® model of the 64K-pt IFFT implemented as a $256 \times 256$ 2D algorithm performs the following five steps:
- Load the input data in column-major order into a $256 \times 256$ matrix.
- Take 256-pt transforms along each row.
- Perform a point-wise multiplication of the data matrix with a "twiddle matrix" (see the following code for its detailed definition).
- Take 256-pt transforms along each column.
- Extract the output data in row-major order.
function [ifft_o] = ifft64k_256x256_model( ifft_i )
N = numel(ifft_i);
Nhalf = sqrt(N);
assert( N == 65536 );
% Need to use DIT when computing transforms up front:
twid = complex(zeros(Nhalf,Nhalf),zeros(Nhalf,Nhalf));
for rr = 0 : Nhalf-1
for cc = 0 : Nhalf-1
twid(1+rr,1+cc) = exp(+1i*2*pi*rr*cc/N);
end
end
% First compute transforms along rows:
data0 = reshape(ifft_i,Nhalf,Nhalf);
data1 = Nhalf*ifft(data0,Nhalf,2);
% Apply pointwise twiddle matrix:
% --> Note: first row is multiplication by unity
data2 = data1 .* twid;
% Second, compute 256-pt transforms along columns:
data3 = Nhalf*ifft(data2,Nhalf,1);
ifft_o = reshape(data3.',1,[]);
end

This section provides an overview of the 64K-pt IFFT @ 2 GSPS design using both AI Engine and PL resources.
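The same 2D decomposition can be cross-checked in Python. The sketch below is a NumPy port of the MATLAB model above (the function name and scaling convention mirror the MATLAB reference; NumPy's `ifft` applies a 1/N factor, so each stage is rescaled by `Nhalf` to match the unscaled-sum convention):

```python
import numpy as np

def ifft64k_256x256_model(x):
    """2D decomposition of a 64K-pt IFFT (NumPy port of the MATLAB model)."""
    N = x.size
    Nh = int(np.sqrt(N))
    assert N == 65536
    # Load the input in column-major (Fortran) order into a 256 x 256 matrix.
    d0 = x.reshape(Nh, Nh, order='F')
    # Row transforms; multiply by Nh to undo NumPy's 1/Nh scaling.
    d1 = Nh * np.fft.ifft(d0, axis=1)
    # Point-wise twiddle matrix: twid[r, c] = exp(+2j*pi*r*c/N).
    r = np.arange(Nh).reshape(-1, 1)
    c = np.arange(Nh).reshape(1, -1)
    d2 = d1 * np.exp(2j * np.pi * r * c / N)
    # Column transforms.
    d3 = Nh * np.fft.ifft(d2, axis=0)
    # Row-major flatten of d3 equals reshape(data3.', 1, []) in MATLAB.
    return d3.reshape(-1)
```

A quick numerical check against a direct full-size transform (`65536 * np.fft.ifft(x)`, the unscaled 64K-pt IDFT) confirms that the five steps reproduce the 1D result exactly.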
The design combines AI Engine tiles and PL resources as follows. Based on the five algorithm steps identified earlier, the row and column 256-pt transforms are partitioned to the AI Engine array, along with the point-wise twiddle multiplication in the third step. The goal is to minimize compute resources by using the fewest tiles possible while still meeting the 2 GSPS throughput requirement.

For a streaming design such as this, performing "row" and "column" transforms translates into a requirement for a "memory transpose" operation: between the two sets of transforms, the design must stream the samples into a storage buffer in row-major order and then extract them in column-major order. This occurs over a number of parallel streams, chosen to meet the overall throughput requirement.

Based on these concepts, the design consists of a "front-end" AI Engine subgraph performing "row" transforms, a "back-end" AI Engine subgraph performing "column" transforms, and a "memory transpose" operation located in the PL between them.
Based on this system architecture concept, it only remains to identify how many AI Engine resources the design requires to implement the following:
- "row" transforms
- "point-wise twiddle" multiplications
- "column" transforms
Early prototyping of the AI Engine subgraphs identifies the number of instances required. The design identifies two separate subgraphs to consider:
- One "front-end" subgraph performing a "row" IFFT-256 followed by a "pointwise-twiddle" multiplication of the samples on that row, followed by zero-insertion.
- One "back-end" subgraph performing a "column" IFFT-256 followed by zero-insertion.
The zero-insertion allows simplifying the design of the "memory transpose" in PL, which the following section outlines in detail.
The throughput of prototypes of these two subgraphs identifies the number of instances of each required to achieve the overall throughput target of 2 GSPS. The following figure shows traces in Vitis Analyzer for the front-end subgraph. This design is hand-coded using the AIE API, combining all three functions into a single-tile design with a small memory footprint. Its throughput is 592 ns per transform, or ~430 MSPS. Based on a target throughput of 2 GSPS, the design must include five instances of this subgraph.
The following figure shows traces in Vitis Analyzer for the back-end subgraph. This design is also hand-coded using the AIE API and combines the IFFT-256 with zero-padding into a second single-tile design with a small memory footprint. Its throughput is 422.4 ns, or ~600 MSPS. Based on a target throughput of 2 GSPS, the design must include four instances of this subgraph.
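The instance counts follow directly from the measured per-transform latencies. A small sketch of the arithmetic (assuming a front-end latency of 592 ns, consistent with the quoted ~430 MSPS for a 256-sample transform):

```python
import math

TARGET_GSPS = 2.0
SAMPLES_PER_TRANSFORM = 256

def instances_needed(latency_ns):
    """Round up the number of subgraph instances needed to hit the target."""
    # Per-instance throughput in GSPS (samples per nanosecond).
    gsps = SAMPLES_PER_TRANSFORM / latency_ns
    return math.ceil(TARGET_GSPS / gsps)

# Front-end: 256 / 592 ns  ~ 0.43 GSPS -> 5 instances required.
# Back-end:  256 / 422.4 ns ~ 0.61 GSPS -> 4 instances required.
```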
Based on these prototyping efforts, the final design architecture is shown in the following diagram. The design uses five instances each of the front-end and back-end subgraphs; five back-end instances are used even though only four are required, as this simplifies the overall design architecture. These instances are time-shared over all transform operations required by the 2D algorithm. However, the algorithm requires only 256 transforms in each "row" and "column" dimension, and 256 is not divisible by 5. Consequently, the 2D data cube is zero-padded by appending four rows at the bottom and four columns at the right to create a 260 x 260 data cube. Each front-end and back-end tile instance then performs 52 transforms. This also means the design uses five I/O streams into and out of each subgraph, and the same applies to the "memory transpose" operation in the PL. An important side effect of this zero-padding is that it simplifies the construction of that PL design: the design can be implemented using a 5-bank memory architecture, outlined in detail in the following section.
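The padding arithmetic above can be written out explicitly. A minimal sketch (the names are illustrative, not from the design sources):

```python
import math

NUM_STREAMS = 5    # parallel subgraph instances / I/O streams
TRANSFORMS = 256   # "row" (and "column") transforms in the 2D algorithm

# Pad 256 up to the next multiple of 5 so the work divides evenly.
padded = NUM_STREAMS * math.ceil(TRANSFORMS / NUM_STREAMS)  # 260
per_instance = padded // NUM_STREAMS                        # 52 transforms each
pad_lines = padded - TRANSFORMS                             # 4 zero rows/columns
```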
The front-end IFFT-256 kernel performs three functions: a 256-pt IFFT, a point-wise twiddle multiplication, and zero-padding with four samples. The IFFT is built using a Stockham algorithm hand-coded using the AIE API. The point-wise twiddle multiplication is computed using the vectorized aie::mul() API. The twiddle factors themselves are generated using the non-linear hardware accelerator connected to the scalar unit and vectorized by multiplying each scalar twiddle by an 8-sample vector containing consecutive twiddle phase steps. Finally, the zero-padding is performed by simply writing a null vector to the output buffer.
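The twiddle vectorization scheme can be illustrated numerically. The sketch below mirrors the idea in NumPy (the function and constants are illustrative): for row `r`, one scalar twiddle per 8-sample group is multiplied by a fixed vector of consecutive phase steps, reproducing `exp(+2j*pi*r*c/N)` for all 256 columns with one vector multiply per group.

```python
import numpy as np

N = 65536   # full transform size
NH = 256    # samples per row
VEC = 8     # vector lane count used for the multiply

def row_twiddles(r):
    """Twiddles exp(+2j*pi*r*c/N) for one row, built group-by-group."""
    # Vector of consecutive phase steps within a group (on the AI Engine
    # these scalars would come from the non-linear hardware accelerator).
    step = np.exp(2j * np.pi * r * np.arange(VEC) / N)
    out = np.empty(NH, dtype=complex)
    for g in range(NH // VEC):
        scalar = np.exp(2j * np.pi * r * (g * VEC) / N)  # group base twiddle
        out[g * VEC:(g + 1) * VEC] = scalar * step       # one vector multiply
    return out
```

Because `scalar * step[k] = exp(2j*pi*r*(8g + k)/N)`, the result matches the direct point-wise definition of the twiddle matrix row.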
The following figure shows the AI Engine graph for the front-end IFFT-256 kernel. The design uses I/O buffers for simplicity and a scratchpad to act as a temporary buffer for the transform computation. Twiddle factors are stored in arrays as part of the system memory. A single tile is required. The full design uses five instances of this graph overall.
The memory transpose PL kernel accepts input samples from the front-end transforms in row-major order and then delivers samples to the back-end transforms in column-major order. This must be sustained through streaming over five I/O streams at full speed to achieve the 2.0 GSPS throughput target. The key challenge for this kernel is partitioning the sample storage across memory banks so that all five streams can be written and read every cycle without bank conflicts.
The following figure shows the design concept used by the memory transpose PL kernel, using a smaller example for illustration.
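One conventional way to realize such a 5-bank architecture is a diagonal bank mapping; this is a sketch of the general technique, not necessarily the exact addressing used by this kernel. With `bank(r, c) = (r + c) mod 5` on the padded 260 x 260 matrix, five consecutive samples along a row (write phase) or along a column (read phase) always land in five different banks:

```python
# Diagonal bank mapping for a 260 x 260 padded matrix with five banks.
BANKS = 5
DIM = 260  # padded dimension (a multiple of 5)

def bank(r, c):
    """Bank assignment for matrix element (r, c)."""
    return (r + c) % BANKS

def check_conflict_free():
    """Verify both access patterns touch 5 distinct banks every cycle."""
    for base in range(0, DIM, BANKS):
        for fixed in range(DIM):
            # Write phase: 5 consecutive columns of one row per cycle.
            assert len({bank(fixed, base + k) for k in range(BANKS)}) == BANKS
            # Read phase: 5 consecutive rows of one column per cycle.
            assert len({bank(base + k, fixed) for k in range(BANKS)}) == BANKS
    return True
```

The exhaustive check passes because shifting either index by one rotates the bank assignment by one, so any five consecutive elements in a row or column cover all five banks.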
The Memory Transpose PL kernel implementation uses HLS @ 312.5 MHz. The following figure gives the resource utilization and timing from out-of-context synthesis and place-and-route.
The design of the back-end IFFT-256 is a simplified version of the front-end design. It implements only the transform and the zero-padding; there is no point-wise twiddle multiplication. The graph looks almost identical to the front-end design shown earlier. The following figure shows the physical array view of the back-end design. It also uses I/O buffers and requires a memory footprint that spills over to the neighboring tile.
The following figure summarizes the AI Engine and PL resources required to implement the design in the VC1902 device on the VCK190 eval board. The design uses 10 AI Engine tiles for compute and 17 tiles for buffering. The PL design includes the resources required to implement the DMA Source, Memory Transpose, and DMA Sink kernels. This design shows how you can combine Versal AI Engine and PL resources to create a high-performance, tightly-coupled custom datapath tailored directly to the algorithm of interest.
IMPORTANT: Before beginning the tutorial, install the Vitis™ 2025.2 software platform. Also, download the Common Images for Embedded Vitis Platforms from the AMD website.
Set the environment variable COMMON_IMAGE_VERSAL to the full path where you have downloaded the Common Images. Then set the environment variable PLATFORM_REPO_PATHS to the value $XILINX_VITIS/base_platforms. You can find more information on this process in the Vitis documentation.
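For example, the two variables can be set as follows (the download path is a placeholder; substitute your own location):

```shell
# Point COMMON_IMAGE_VERSAL at the extracted Common Images download.
export COMMON_IMAGE_VERSAL=/path/to/xilinx-versal-common-v2025.2
# Use the base platforms shipped with the Vitis install.
export PLATFORM_REPO_PATHS=$XILINX_VITIS/base_platforms
```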
The remaining environment variables are configured in the top-level Makefile, <path-to-design>/12-IFFT64K-2D/Makefile.
RELEASE=2025.2
TOP_DIR ?= $(shell readlink -f .)
PLATFORM_NAME = xilinx_vck190_base_202520_1
PLATFORM_PATH = ${PLATFORM_REPO_PATHS}
export PLATFORM = ${PLATFORM_PATH}/${PLATFORM_NAME}/${PLATFORM_NAME}.xpfm
export SYSROOT = ${COMMON_IMAGE_VERSAL}/sysroots/cortexa72-cortexa53-amd-linux
export KERNEL_IMAGE = ${COMMON_IMAGE_VERSAL}/Image
export ROOTFS = ${COMMON_IMAGE_VERSAL}/rootfs.ext4
export PREBUILT_LINUX_PATH = ${COMMON_IMAGE_VERSAL}
[shell]% cd <path-to-design>/12-IFFT64K-2D
[shell]% make all TARGET=hw_emu

This takes about 90 minutes to run. The build process generates a folder 12-IFFT64K-2D/package containing all the files required for hardware emulation. You can run the emulation as shown below. You can also pass an optional -g to the launch_hw_emu.sh command to launch the Vivado waveform GUI and observe the top-level AXI signal ports in the design.
[shell]% cd <path-to-design>/12-IFFT64K-2D/package
[shell]% ./launch_hw_emu.sh -run-app embedded_exec.sh

Build the design for the VCK190 board using the Makefile as follows:
[shell]% cd <path-to-design>/12-IFFT64K-2D
[shell]% make all TARGET=hw

The build process generates the SD card image in the <path-to-design>/12-IFFT64K-2D/package/sd_card folder.
[1]: Vitis DSP Library
[2]: XAPP1356: "Block-by-Block Configurable Fast Fourier Transform Implementation on AI Engine"
GitHub issues are used to track requests and bugs. For questions, go to support.amd.com.
Copyright © 2023-2026 Advanced Micro Devices, Inc.