This sample demonstrates the process of Shannonization (named after Claude Shannon) for a simple FPGA design. This optimization improves the fMAX/II of a design by precomputing operations in a loop to remove them from the critical path.
Area | Description |
---|---|
What you will learn | How to make FPGA-specific optimizations to remove computation from the critical path and improve fMAX/II |
Time to complete | 45 minutes |
Category | Code Optimization |
Demonstrate a loop optimization to improve the fMAX/II of an FPGA design.
Optimized for | Description |
---|---|
OS | Ubuntu* 20.04 RHEL*/CentOS* 8 SUSE* 15 Windows* 10, 11 Windows Server* 2019 |
Hardware | Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs |
Software | Intel® oneAPI DPC++/C++ Compiler |
Note: Even though the Intel® oneAPI DPC++/C++ Compiler is enough to compile for emulation, generating reports, generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
For using the simulator flow, you must have Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators installed and accessible through your PATH:
- Questa*-Intel® FPGA Edition
- Questa*-Intel® FPGA Starter Edition
- ModelSim SE
When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.
Warning Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
This sample is part of the FPGA code samples. It is categorized as a Tier 3 sample that demonstrates a design pattern.
flowchart LR
tier1("Tier 1: Get Started")
tier2("Tier 2: Explore the Fundamentals")
tier3("Tier 3: Explore the Advanced Techniques")
tier4("Tier 4: Explore the Reference Designs")
tier1 --> tier2 --> tier3 --> tier4
style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier2 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier3 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. You can also find more information about troubleshooting build errors, links to selected documentation, and more.
Shannonization is the process of removing operations from the critical path of a circuit by precomputation. To demonstrate, consider the trivial case below, which counts the number of elements in an array A
that are less than some runtime value v
.
int A[SIZE] = {/*...*/};
int v = /*some dynamic value*/
int c = 0;
for (int i = 0; i < SIZE; i++) {
if (A[i] < v) {
c++;
}
}
A possible circuit diagram for this algorithm is shown in the image below, where the dotted red line represents a possible critical path in the circuit.
The goal of the shannonization optimization is to remove operations from the critical path. In this case, we can precompute the next value of c
(fittingly named c_next
) for a later iteration of the loop to use when needed (i.e., the next time A[i] < v
). This optimization is shown in the code below.
int A[SIZE] = {/*...*/};
int v = /*some dynamic value*/
int c = 0;
int c_next = 1;
for (int i = 0; i < SIZE; i++) {
if (A[i] < v) {
// these operations can happen in parallel!
c = c_next;
c_next++;
}
}
A possible circuit diagram for this optimized algorithm is shown in the image below, where the dotted red line represents a possible critical path in the circuit. Notice that we have removed the +
operation from the critical path. This assumes that the critical path delay through the multiplexer is higher than through the adder. This may not be the case, and the critical path could be from the c
register to the c_next
register through the adder, in which case we would have removed the multiplexer from the critical path. Regardless of which operation has the longer critical path delay (the adder or the multiplexer), we have removed an operation from the critical path by precomputing and storing the next value of c
. This allows us to reduce the critical path delay at the expense of area (in this case, a single 32-bit register).
The purpose of this tutorial is to show methods for improving the fMAX/II of a design by removing computation from the critical path. This optimization is widely applicable.
To demonstrate, we will optimize a simple function (useful in database processing) that finds the size of the intersection (denoted by the ∩
symbol) between two sorted arrays. We look at a special case where one array (A
) cannot have duplicates, the second array (B
) can have duplicates, and the output intersection contains the entire intersection (including the duplicates). An example is shown below:
A = {2,5,6,7,9}
B = {2,4,6,6,8,9,10}
A ∩ B = {2,6,6,9}
|A ∩ B| = 4
For the FPGA, we will create three kernels: two kernels that stream array A
and B
from global memory through SYCL pipes and a third kernel that performs the intersection logic. The kernel diagram is shown below:
----------------- ------------------------
| ProduceA |------------->| |
----------------- | |
| Intersection |
----------------- | |
| ProduceB |------------->| |
----------------- ------------------------
The FPGA pseudocode for the Intersection
kernel is shown below:
void intersection(int A_size, int B_size, int& intersection_size) {
int a, b;
int A_count, B_count;
intersection_size = 0;
// initialize first elements
a = APipe::read();
b = BPipe::read();
A_count = 1;
B_count = 1;
while (A_count < A_size || B_count < B_size) {
// values match increment counter
if (a == b) {
intersection_size++;
}
// read in new element
if (a < b && A_count < A_size) {
a = APipe::read();
A_count++;
} else if (B_count < B_size) {
b = BPipe::read();
B_count++;
}
};
// check the last values
if (a == b) {
intersection_size++;
}
}
To achieve an II of 1 for the main while
loop in the FPGA code shown above, the compiler must schedule three 32-bit Compare Operations, a 32-bit Add Operation, a 32-bit Select Operation (i.e., a pipe read), and a 1-bit And Operation into a single cycle. This is necessary since the actions of the next iteration of the loop depend on the result of the loop's current iteration. More specifically, the current iteration must: compare the current values of a
and b
, compare the number of elements read from the pipes (i.e. A_count < A_size
and B_count < B_size
), increment A_count
or B_count
, and then update either a
or b
by reading the respective pipe before the next iteration of the loop can enter the same block of code. This creates a long critical path that requires a tradeoff in fMAX or II (i.e., either fMAX must decrease or II must increase). This tutorial will explain optimizations that remove these operations from the critical path (at the expense of some area) and improve the fMAX/II tradeoff and, therefore, the throughput.
Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the
setvars
script in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.Linux*:
- For system wide installations:
. /opt/intel/oneapi/setvars.sh
- For private installations:
. ~/intel/oneapi/setvars.sh
- For non-POSIX shells, like csh, use the following command:
bash -c 'source <install-dir>/setvars.sh ; exec csh'
Windows*:
C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat
- Windows PowerShell*, use the following command:
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.
-
Change to the sample directory.
-
Build the program for Intel® Agilex® 7 device family, which is the default.
mkdir build cd build cmake ..
Note: You can change the default target by using the command:
cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
The build system will try to infer the FPGA family from the BSP name. If it can't, an extra option needs to be passed to
cmake
:-DDEVICE_FLAG=[A10|S10|CycloneV|Agilex5|Agilex7]
Note: You can poll your system for available BSPs using theaoc -list-boards
command. The board list that is printed out will be of the form$> aoc -list-boards Board list: <board-variant> Board Package: <path/to/board/package>/board-support-package <board-variant2> Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
-
Compile the design. (The provided targets match the recommended development flow.)
- Compile for emulation (fast compile time, targets emulated FPGA device):
make fpga_emu
- Generate the optimization report:
make report
The report resides at
shannonization.report.prj/reports/report.html
. See the Reading the Reports section below to understand the report contents.- Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size):
make fpga_sim
- Compile for FPGA hardware (longer compile time, targets FPGA device):
make fpga
- Compile for emulation (fast compile time, targets emulated FPGA device):
-
Change to the sample directory.
-
Build the program for the Intel® Agilex® 7 device family, which is the default.
mkdir build cd build cmake -G "NMake Makefiles" ..
Note: You can change the default target by using the command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
The build system will try to infer the FPGA family from the BSP name. If it can't, an extra option needs to be passed to
cmake
:-DDEVICE_FLAG=[A10|S10|CycloneV|Agilex7]
Note: You can poll your system for available BSPs using theaoc -list-boards
command. The board list that is printed out will be of the form$> aoc -list-boards Board list: <board-variant> Board Package: <path/to/board/package>/board-support-package <board-variant2> Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
-
Compile the design. (The provided targets match the recommended development flow.)
-
Compile for emulation (fast compile time, targets emulated FPGA device):
nmake fpga_emu
-
Generate the optimization report:
nmake report
The report resides at
shannonization.report.prj.a/reports/report.html
. See the Reading the Reports section below to understand the report contents. -
Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size):
nmake fpga_sim
-
Compile for FPGA hardware (longer compile time, targets FPGA device):
nmake fpga
-
Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your 'build' directory in a shorter path, for example c:\samples\build. You can then run cmake from that directory, and provide cmake with the full path to your sample directory, for example:
C:\samples\build> cmake -G "NMake Makefiles" C:\long\path\to\code\sample\CMakeLists.txt
This section will walk through how the HTML reports show the result of the optimizations we made in each version of the kernel, the definition of which can be found in src/IntersectionKernel.hpp
. The fMAX numbers mentioned in these sections assume that the Arria® 10 FPGA is the target. However, the discussion is similar for the other targets.
The first version of the kernel, Intersection<0>
, is the baseline implementation of the intersection kernel. Check the Details pane in the Loop Analysis tab for the while
loop in the Intersection<0>
kernel. You will notice that the Block Scheduled fMAX for the Intersection<0>
kernel is far lower than the target (e.g., ~140 MHz). The Details pane shows that the most critical path contains the operations mentioned earlier at the end of the Algorithm Details Section.
The second version of the kernel, Intersection<1>
, uses the shannonization optimization to remove the increment of A_count
and B_count
from the critical path. To do this, we create two new variables, A_count_next
and B_count_next
, which will store the value of A_count
and B_count
for the next iteration of the loop. The code snippet below shows how A_count
and B_count
are updated using A_count_next
and B_count_next
:
...
if (a < b && A_count < A_size) {
a = APipe::read();
A_count = A_count_next;
A_count_next++;
} else if (B_count < B_size) {
b = BPipe::read();
B_count = B_count_next;
B_count_next++;
}
...
Now, the assignments of A_count = A_count_next
and B_count = B_count_next
can be done in parallel to the increment of the counts for the next iteration of the loop (i.e. in parallel to A_count_next++
and B_count_next++
). This removes the 32-bit Integer Add Operation from the critical path, as can be seen in the Details pane of the Loop Analysis report for the Intersection<1>
kernel. The Loop Analysis pane will show an increase in the Block Scheduled fMAX (e.g. ~190 MHz).
The third version of the kernel, Intersection<2>
, extends the previous optimization by removing the 32-bit Integer Compare Operation from the critical path. The first step is to precompute the comparisons for the next loop iteration (A_count_inrange
and B_count_inrange
) and the next iteration (A_count_next_inrange
and B_count_next_inrange
), as we did for the accumulations. This is shown in the code snippet below:
...
if (a < b && A_count_inrange) {
a = APipe::read();
A_count = A_count_next;
A_count_next++;
A_count_inrange = A_count_next_inrange;
A_count_next_inrange = A_count_next < A_size;
} else if (B_count_inrange) {
b = BPipe::read();
B_count = B_count_next;
B_count_next++;
B_count_inrange = B_count_next_inrange;
B_count_next_inrange = B_count_next < B_size;
}
...
However, this places a 32-bit Integer Add Operation back into the critical path (e.g. A_count_next++
must be computed before computing A_count_next_inrange = A_count_next < A_size
). To remove this addition from the critical path, we do the same optimization as version 1. We now precompute the additions for the next two iterations of the loop (A_count_next_next
and B_count_next_next
), which again removes the addition from the critical path. This is shown in the code snippet below:
...
if (a < b && A_count_inrange) {
a = APipe::read();
A_count_inrange = A_count_next_inrange;
A_count_next_inrange = A_count_next_next < A_size;
A_count = A_count_next;
A_count_next = A_count_next_next;
A_count_next_next++;
} else if (B_count_inrange) {
b = BPipe::read();
B_count_inrange = B_count_next_inrange;
B_count_next_inrange = B_count_next_next < B_size;
B_count = B_count_next;
B_count_next = B_count_next_next;
B_count_next_next++;
}
...
In general, these shannonization optimizations create a shift-register that precomputes and passes values (additions and comparisons) to the loop's later iterations. The size of the shift-register determines how many future iterations we precompute for. In version 1, we precompute for one iteration; in this version, we precompute for 2 iterations. The reports for the Intersection<2>
should show a critical path with: a single 32-bit Integer Compare Operation (a < b
), a 32-bit Select Operation (::read
), and a 1-bit And Operation (a < b && A_count_inrange
). Note the removal of two 32-bit Compare Operations and one 32-bit Add Operation from the critical path. Looking at the Loop Analysis pane, you will see that the Block Scheduled fMAX is highest for Intersection<2>
(e.g., 240 MHz).
The following table explains the command-line arguments that can be passed to the Shannonization
program.
Argument name | Description | Default |
---|---|---|
--A |
Set the size of array A | 128 for emulation, 16384 for FPGA |
--B |
Set the size of array B | 256 for emulation, 32768 for FPGA |
--help |
Print the help message | N/A |
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
./shannonization.fpga_emu
- Run the sample on the FPGA simulator device.
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./shannonization.fpga_sim
- Run the sample on the FPGA device (only if you ran
cmake
with-DFPGA_DEVICE=<board-support-package>:<board-variant>
)../shannonization.fpga
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
shannonization.fpga_emu.exe
- Run the sample on the FPGA simulator device.
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 shannonization.fpga_sim.exe set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
Note: Hardware runs are not supported on Windows.
Generating input data
Computing golden result
Running 1 iteration of kernel 0 with |A|=128 and |B|=256
Running 1 iteration of kernel 1 with |A|=128 and |B|=256
Running 1 iteration of kernel 2 with |A|=128 and |B|=256
PASSED
Generating input data
Computing golden result
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
Running 5 iterations of kernel 0 with |A|=16384 and |B|=32768
Kernel 0 average throughput: 182.521 MB/s
Running 5 iterations of kernel 1 with |A|=16384 and |B|=32768
Kernel 1 average throughput: 221.232 MB/s
Running 5 iterations of kernel 2 with |A|=16384 and |B|=32768
Kernel 2 average throughput: 226.728 MB/s
PASSED
Code samples are licensed under the MIT license. See License.txt for details.
Third party program Licenses can be found here: third-party-programs.txt.