See Vitis™ AI Development Environment on amd.com |
Version: Vitis 2025.2
The Prime Factor Algorithm (PFA) [1] is a Fast Fourier Transform (FFT) algorithm [2] discovered by Good and Thomas before the more popular Cooley-Tukey algorithm with some interesting properties. The PFA is another "divide and conquer" approach for computing a Discrete Fourier Transform (DFT) of size
A second advantage of the PFA approach is that unlike the popular Cooley-Tukey FFT, no extra multiplications by "twiddle factors" need be performed between stages. This fact falls out of the DFT factorization when
This tutorial illustrates these concepts by designing a PFA-1008 transform in Versal using both AI Engine and PL elements working cooperatively. The PFA approach mcan apply here since
The following figure shows a block diagram of a 2D PFA transform. It's corresponding MATLAB® model is shown immediately in the following figure. The algorithm consists of the following five steps:
- An input permutation is applied to the input data. The specific input permutation depends on the values of the relative prime factors
$N_1$ and$N_2$ as outlined in the following figure. - The data is organized into an
$N_2 \times N_1$ matrix and 1D FFT's are performed along the rows of that matrix. - A "matrix transpose" operation converts the data flow from "row-wise" to "column-wise" ordering for the subsequent column-based transforms. This amounts to delivering the data using a "strided" addressing pattern.
- A second set of 1D FFT's are performed along the columns of the original matrix.
- An output permutation is applied to the output data after being read column-wise out of its 2D matrix form. The specific output permutation depends on the values of
$N_1$ and$N_2$ as outlined in the following figure.
The following MATLAB code shows all of these five steps clearly. The routine compute_perm_2d() computes the input permutation P_i and output permutation P_o applied based on the values of
function [sig_o] = fft_pfa_2d( sig_i, N1, N2 )
[P_i,P_o,N] = compute_perm_2d(N1,N2);
% Input permutation
data = reshape(sig_i(1+P_i),N2,N1);
% Row transform
for rr = 1 : N2
data(rr,:) = fft(data(rr,:));
end
% Col transform
for cc = 1 : N1
data(:,cc) = fft(data(:,cc));
end
% Output permutation
data_o = reshape(data,1,[]);
sig_o(1+P_o) = data_o;
endThe following figure shows a block diagram of a 3D PFA transform. It's corresponding MATLAB model is shown immediately in the following figure. The algorithm consists of the same steps similar to the 2D case above with the following differences:
- The I/O permutations now depend on three relatively prime factors
$N_1$ ,$N_2$ and$N_3$ . The mathematics specific to these 2D and 3D permutations is given in detail in the following figure. - The data is now organized in a
$N_1 \times N_2 \times N_3$ cube instead of an$N_1 \times N_2$ rectangle. Transforms are taken in 1D along each of these dimensions in order, first along$N_1$ , then along$N_2$ and finally along$N_3$ . - The "matrix transpose" operations required to extract data in the
$N_2$ and$N_3$ dimensions involve slightly more complicated "stride" patterns. These patterns are computed by the MATLAB routinecompute_addr_3d.m.
function [sig_o] = fft_pfa_3d( sig_i, N1, N2, N3 )
[P_i,P_o,N] = compute_perm_3d(N1,N2,N3);
% Input permutation:
data = reshape(sig_i(1+P_i),N1,N2,N3);
% Transforms
for cc = 1 : N2
for dd = 1 : N3
data(:,cc,dd) = fft(data(:,cc,dd));
end
end
for rr = 1 : N1
for dd = 1 : N3
data(rr,:,dd) = fft(data(rr,:,dd));
end
end
for rr = 1 : N1
for cc = 1 : N2
data(rr,cc,:) = fft(data(rr,cc,:));
end
end
% Output permutation:
data_o = reshape(data,1,[]);
sig_o(1+P_o) = data_o;
endThe full suite of MATLAB models illustrating the operation of the PFA transforms is given in the matlab folder of the repo.
The following figure illustrates how to compute the I/O permutations required for a 2D PFA solution. The input permutation relies on a simple modulo computation with the relatively prime factors
The output permutation mapping relies on a similar modulo computation but with factors
The I/O permutations for a 3D PFA solution are solved in a manner similar to the 2D case as shown in the following figure. Essentially, the modulo computations are extended to include a third dimension, and three modulo-inverses are required to solve for the output permutation. These permutations can be expressed in a 3D matrix form, or as a 1D address permutation
The figure below shows a block diagram of a 3D PFA-1008 hardware design implemented in Versal using AI Engines and PL. The design targets a 2 GSPS throughput (SSR=2). AI Engines implement the three DFT kernels, specifically DFT-7, DFT-9 and DFT-16, using a vector-matrix multiplication approach. The design implements the I/O permutation and matrix transpose kernels using Vitis HLS targeting in PL.
Some details on each kernel design is given in the sections below.
This PL kernel is implemented in HLS @ 312.5 MHz (SSR=8). Samples arriving in polyphase order on two 128-bit streams are written are written into a ping/pong buffer in 4X duplicate fashion. This is required since each buffer supports two R/W ports and we must read or write 8 samples per cycle. The input permutation
This AI Engine kernel implements the DFT-7 using a vector-matrix approach based on mac8() intrinsics to compute two [1x1] x [1x4] OPs in parallel in a single cycle. Two compute tiles and 1 combiner tile are required. The first compute tile is 100% efficient whereas the second compute tile is 75% efficient. Input samples for two separate transforms arrive over two streams, one transform on each stream, over a period of seven cycles. The output samples are produced alternately for each transform on the two output streams over the same period. The graph for this AI Engine kernel is shown below.
This PL kernel implements in PL the matrix transpose operation required to feed the proper 9-point input samples to the DFT-9 on the second dimension of the 3D cube. The design uses HLS @ 312.5 MHz (SSR=8). The input 7-pt transforms arrive over two streams as outlined above. Samples are written linearly into a ping-pong buffer and read back using a strided order (seven points apart) producing output 9-pt transforms alternately on each of two output streams. The design has a latency of 1008/8+2 cycles.
This AI Engine kernel implements the DFT-9 using a similar approach to the DFT-7 kernel above. This kernel requires 3 compute tiles and 1 combiner tiles and operates in a manner very similar to the DFT-7 kernel. I/O transforms are transmitted in alternating fashion over two streams as above. The first two compute kernels are 100% efficient, whereas the third tile is only 25% efficient. The graph for this AI Engine kernel is shown below.
This PL kernel implements in PL the matrix transpose operation required to feed the proper 16-point input samples to the DFT-16 on the third dimension of the 3D cube. The design uses HLS @ 312.5 MHz (SSR=8). The 9-pt transforms arrive on alternate streams. Samples are written into a ping/pong buffer. Samples are read back in a transposed (stride-63) order. The kernel produces four samples alternately on two output streams. The design has a latency of 1008/8+1 cycles.
This AI Engine kernel implements the DFT-16 using a similar approach to the DFT-7 and DFT-9 kernels above. Unlike those kernels, the I/O samples for the DFT-16 are transmitted four consecutive samples each on alternate streams (as compared to full transforms transmitted on alternate streams in the earlier kernels). The graph for this AI Engine kernel is shown below.
This PL kernel is implemented in HLS @ 312.5 MHz (SSR=8). Samples arriving in 4 samples alternately on two 128-bit streams are written are written into a ping/pong buffer in 4X duplicate fashion in a manner similar to the INPUT PERMUTE kernel. The output permutation
The figure below summarizes the AI Engine and PL resources required to implement the design in the VC1902 device on the VCK190 eval board. The design is using 12 AI Engine tiles for compute and 19 tiles for buffeg. The PL portion requires ~100 BRAMs to provide the I/O permutes and matrix transpose operations. This design illustrates how Versal AI Engine and PL can be crafted to create a high performance tightly coupled custom data path tailored directly to the algorithm of interest.
IMPORTANT: Before beginning the tutorial ensure you have installed Vitis™ 2025.2 software. Ensure you have downloaded the Common Images for Embedded Vitis Platforms from this link.
Set the environment variable COMMON_IMAGE_VERSAL to the full path where you have downloaded the Common Images. Then set the environment variable PLATFORM_REPO_PATHS to the value $XILINX_VITIS/base_platforms. Additional information on this process can be found here.
The remaining environment variables are configured in the top level Makefile <path-to-design>/05-Prime-Factor-FFT/Makefile file.
RELEASE=2025.2
TOP_DIR ?= $(shell readlink -f .)
PLATFORM_NAME = xilinx_vck190_base_202520_1
PLATFORM_PATH = ${PLATFORM_REPO_PATHS}
export PLATFORM = ${PLATFORM_PATH}/${PLATFORM_NAME}/${PLATFORM_NAME}.xpfm
export SYSROOT = ${COMMON_IMAGE_VERSAL}/sysroots/cortexa72-cortexa53-amd-linux
export KERNEL_IMAGE = ${COMMON_IMAGE_VERSAL}/Image
export ROOTFS = ${COMMON_IMAGE_VERSAL}/rootfs.ext4
export PREBUILT_LINUX_PATH = ${COMMON_IMAGE_VERSAL}
[shell]% cd <path-to-design>/05-Prime-Factor-FFT
[shell]% make all TARGET=hw_emuThis takes about 90 minutes to run. The build process generates a folder 05-Prime-Factor-FFT/package containing all the files required for hardware emulation. This can be run as shown below. An optional -g can be applied to the launch_hw_emu.sh command to launch the Vivado waveform GUI to observe the top-level AXI signal ports in the design.
[shell]% cd <path-to-design>/05-Prime-Factor-FFT/package
[shell]% ./launch_hw_emu.sh -run-app embedded_exec.shYou can build this design for the VCK190 board using the Makefile as follows:
[shell]% cd <path-to-design>/05-Prime-Factor-FFT
[shell]% make all TARGET=hwThe build process generates the SD card image in the <path-to-design>/05-Prime-Factor-FFT/package/sd_card folder.
[1] Wikipedia, "Prime Factor FFT Algorithm"
[2] C. Sidney Burrus, "Fast Fourier Transforms"
Requests and bugs are tracked using GitHub issues. For questions, go to support.xilinx.com.
Copyright © 2023-2025 Advanced Micro Devices, Inc








