This FPGA tutorial demonstrates how to parallelize host-side processing and buffer transfers between host and device with kernel execution, which can improve overall application performance.
Area | Description |
---|---|
What you will learn | How and when to implement the double buffering optimization technique |
Time to complete | 30 minutes |
Category | Code Optimization |
This sample demonstrates double buffering to overlap kernel execution with buffer transfers and host processing. In an application where the FPGA kernel is executed multiple times, the host must perform the following processing and buffer transfers before each kernel invocation.
- The output data from the previous invocation must be transferred from the device to the host and then processed by the host. Examples of this processing include copying the data to another location, rearranging the data, and verifying it in some way.
- The input data for the next invocation must be processed by the host and then transferred to the device. Examples of this processing include copying the data from another location, rearranging the data for kernel consumption, and generating the data in some way.
This sample is part of the FPGA code samples. It is categorized as a Tier 2 sample that demonstrates a design pattern.
flowchart LR
tier1("Tier 1: Get Started")
tier2("Tier 2: Explore the Fundamentals")
tier3("Tier 3: Explore the Advanced Techniques")
tier4("Tier 4: Explore the Reference Designs")
tier1 --> tier2 --> tier3 --> tier4
style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier2 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
style tier3 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. That README also covers troubleshooting build errors and links to selected documentation.
Optimized for | Description |
---|---|
OS | Ubuntu* 20.04; RHEL*/CentOS* 8; SUSE* 15; Windows* 10, 11; Windows Server* 2019 |
Hardware | Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs |
Software | Intel® oneAPI DPC++/C++ Compiler |
Note: Although the Intel® oneAPI DPC++/C++ Compiler is enough to compile for emulation, generate reports, and generate RTL, the simulation flow and FPGA hardware compiles have extra software requirements.
To use the simulator flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators must be installed and accessible through your PATH:
- Questa*-Intel® FPGA Edition
- Questa*-Intel® FPGA Starter Edition
- ModelSim® SE
When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.
⚠️ Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
The key concepts discussed in this sample are as follows:
- The double buffering optimization technique
- Determining when double buffering is beneficial
- How to measure the impact of double buffering
Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the `setvars` script located in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.
Linux*:
- For system wide installations:
. /opt/intel/oneapi/setvars.sh
- For private installations:
. ~/intel/oneapi/setvars.sh
- For non-POSIX shells, like csh, use the following command:
bash -c 'source <install-dir>/setvars.sh ; exec csh'
Windows*:
C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat
- For Windows PowerShell*, use the following command:
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.
Linux*:
- Change to the sample directory.
- Build the program for the Agilex® 7 device family, which is the default.
mkdir build
cd build
cmake ..
Note: You can change the default target by using the command:
cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the `aoc -list-boards` command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
- Compile the design. (The provided targets match the recommended development flow.)
  - Compile for emulation (fast compile time, targets emulated FPGA device).
    make fpga_emu
  - Compile for simulation (fast compile time, targets simulated FPGA device).
    make fpga_sim
  - Generate HTML performance report.
    make report
    The report resides at `double_buffering.report.prj/reports/report.html`. Note that because the optimization occurs at the runtime level, the FPGA compiler report will not show a difference between the optimized and unoptimized cases.
  - Compile for FPGA hardware (longer compile time, targets FPGA device).
    make fpga
Windows*:
- Change to the sample directory.
- Build the program for the Agilex® 7 device family, which is the default.
mkdir build
cd build
cmake -G "NMake Makefiles" ..
Note: You can change the default target by using the command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the `aoc -list-boards` command. The board list that is printed out will be of the form:
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.
- Compile the design. (The provided targets match the recommended development flow.)
  - Compile for emulation (fast compile time, targets emulated FPGA device).
    nmake fpga_emu
  - Compile for simulation (fast compile time, targets simulated FPGA device).
    nmake fpga_sim
  - Generate HTML performance report.
    nmake report
    The report resides at `double_buffering.report.prj.a/reports/report.html`. Note that because the optimization occurs at the runtime level, the FPGA compiler report will not show a difference between the optimized and unoptimized cases.
  - Compile for FPGA hardware (longer compile time, targets FPGA device).
    nmake fpga
Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your `build` directory in a shorter path, for example `C:\samples\build`. You can then build the sample in the new location, but you must specify the full path to the build files.
Linux*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
./double_buffering.fpga_emu
- Run the sample on the FPGA simulator device.
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./double_buffering.fpga_sim
- Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
./double_buffering.fpga
Windows*:
- Run the sample on the FPGA emulator (the kernel executes on the CPU).
double_buffering.fpga_emu.exe
- Run the sample on the FPGA simulator device.
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
double_buffering.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
Note: Hardware runs are not supported on Windows.
Platform name: Intel(R) FPGA SDK for OpenCL(TM)
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
Executing kernel 100 times in each round.
*** Beginning execution, without double buffering
Launching kernel #0
Launching kernel #10
Launching kernel #20
Launching kernel #30
Launching kernel #40
Launching kernel #50
Launching kernel #60
Launching kernel #70
Launching kernel #80
Launching kernel #90
Overall execution time without double buffering = 13343 ms
Total kernel-only execution time without double buffering = 12878 ms
Throughput = 78.583199 MB/s
*** Beginning execution, with double buffering.
Launching kernel #0
Launching kernel #10
Launching kernel #20
Launching kernel #30
Launching kernel #40
Launching kernel #50
Launching kernel #60
Launching kernel #70
Launching kernel #80
Launching kernel #90
Overall execution time with double buffering = 12929 ms
Total kernel-only execution time with double buffering = 12924 ms
Throughput = 81.101379 MB/s
Verification PASSED
Emulator output does not demonstrate true hardware performance. The design may need to run on actual hardware to observe the performance benefit of the optimization exemplified in this tutorial.
Platform name: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device name: Intel(R) FPGA Emulation Device
Executing kernel 20 times in each round.
*** Beginning execution, without double buffering
Launching kernel #0
Launching kernel #10
Overall execution time without double buffering = 56 ms
Total kernel-only execution time without double buffering = 3 ms
Throughput = 5.7965984 MB/s
*** Beginning execution, with double buffering.
Launching kernel #0
Launching kernel #10
Overall execution time with double buffering = 6 ms
Total kernel-only execution time with double buffering = 2 ms
Throughput = 47.919624 MB/s
Verification PASSED
Without double buffering, host processing and buffer transfers occur between kernel executions. This leaves a gap in time between kernel executions, referred to here as kernel downtime (see the image below). If these operations overlap with kernel execution, the kernels can execute back-to-back with minimal downtime, which increases overall application performance.
Before discussing the concepts, we must first define the required variables.
Variable | Description |
---|---|
R | Time to transfer the kernel output buffer from device to host |
Op | Host-side processing time of kernel output data (output processing) |
Ip | Host-side processing time for kernel input data (input processing) |
W | Time to transfer the kernel input buffer from host to device |
K | Kernel execution time |
In general, the R, Op, Ip, and W operations must all complete before the next kernel is launched. To maximize performance, while one kernel is executing on the device, these operations should execute simultaneously on the host and operate on a second set of buffer locations. They should complete before the current kernel completes, allowing the next kernel to be launched immediately with no downtime. Put differently, the host must be ready to launch a new kernel every K.
This leads to the following constraint to minimize kernel downtime: R + Op + Ip + W <= K.
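The constraint can be checked numerically once the five times have been measured or estimated. The following is a minimal sketch, assuming all values are in the same unit (for example, milliseconds); the `StageTimes` struct and the function names are hypothetical and are not part of the sample code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the variables defined in the table above.
// All fields must use the same time unit.
struct StageTimes {
  double R;   // device-to-host transfer of the output buffer
  double Op;  // host-side output processing
  double Ip;  // host-side input processing
  double W;   // host-to-device transfer of the input buffer
  double K;   // kernel execution
};

// Total host-side work that must fit inside one kernel execution.
inline double HostWorkPerKernel(const StageTimes& t) {
  assert(t.R >= 0 && t.Op >= 0 && t.Ip >= 0 && t.W >= 0 && t.K >= 0);
  return t.R + t.Op + t.Ip + t.W;
}

// True when R + Op + Ip + W <= K, so host work can be fully hidden.
inline bool MeetsOverlapConstraint(const StageTimes& t) {
  return HostWorkPerKernel(t) <= t.K;
}

// Host work that cannot be hidden behind one kernel execution
// (zero when the constraint is satisfied).
inline double UnhiddenHostWork(const StageTimes& t) {
  return std::max(0.0, HostWorkPerKernel(t) - t.K);
}
```

For example, with R = 1, Op = 2, Ip = 2, W = 1, and K = 10, the host work totals 6, so the constraint holds and nothing is left unhidden. With R = 3, Op = 4, Ip = 4, W = 3, and K = 10, the host work totals 14, leaving 4 time units that the kernel cannot cover.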
If the above constraint is not satisfied, a performance improvement may still be observed because some overlap (perhaps not complete overlap) is still possible. Further improvement is possible by extending the double buffering concept to N-way buffering (see the corresponding tutorial).
To identify how much this technique can improve performance, first estimate the kernel downtime.
This can be done by querying the total kernel execution time from the runtime and comparing it to the overall application execution time. In an application where kernels execute with minimal downtime, these two numbers will be close. However, if kernels have a significant downtime, the overall execution time will notably exceed kernel execution time. The tutorial code exemplifies how to do this.
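The comparison itself is simple arithmetic. The sketch below assumes that per-kernel start and end timestamps (in nanoseconds) and the overall wall-clock time (in milliseconds) have already been collected; in SYCL these timestamps would typically come from event profiling information on a queue created with profiling enabled, but that collection step is not shown here. The `KernelSpan` struct and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record of one kernel execution, in nanoseconds.
struct KernelSpan {
  std::uint64_t start_ns;
  std::uint64_t end_ns;
};

// Sum of kernel-only execution time, converted to milliseconds.
inline double TotalKernelMs(const std::vector<KernelSpan>& spans) {
  double total_ms = 0.0;
  for (const KernelSpan& s : spans) {
    assert(s.end_ns >= s.start_ns);
    total_ms += static_cast<double>(s.end_ns - s.start_ns) / 1e6;
  }
  return total_ms;
}

// Kernel downtime: overall application time minus kernel-only time.
inline double DowntimeMs(double overall_ms, double kernel_ms) {
  return overall_ms - kernel_ms;
}

// Downtime as a fraction of the overall execution time.
inline double DowntimeFraction(double overall_ms, double kernel_ms) {
  assert(overall_ms > 0.0);
  return DowntimeMs(overall_ms, kernel_ms) / overall_ms;
}
```

Applied to the hardware run shown earlier, `DowntimeMs(13343, 12878)` gives 465 ms (about 3.5% of the overall time) without double buffering, and `DowntimeMs(12929, 12924)` gives 5 ms (about 0.04%) with double buffering.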
The basic implementation flow is as follows:
- Perform the input processing for the first two kernel executions and queue them both.
- Call the `process_output()` method immediately. The SYCL* runtime automatically blocks the call until the first kernel completes because of the implicit data dependency.
- When the first kernel completes, the second kernel begins executing immediately because it was already queued.
- While the second kernel runs, the host processes the output data from the first kernel and prepares the third kernel's input data.
- As long as the above operations complete before the second kernel completes, the third kernel is queued early enough to allow it to be launched immediately after the second kernel.
- Repeat the process.
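The steps above can be traced with a small timeline model. This is a sketch under simplifying assumptions, not the sample's SYCL implementation: each kernel takes a fixed time, input preparation (Ip + W) and output handling (R + Op) each take a fixed time on a single host thread, and the device runs queued kernels strictly in order. The `Costs` struct and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Simplified cost model (an assumption, not measured behavior).
struct Costs {
  double in_time;   // Ip + W: prepare and transfer one kernel's input
  double out_time;  // R + Op: read back and process one kernel's output
  double k_time;    // K: one kernel execution
};

// Baseline without double buffering: every step runs back-to-back.
inline double SerialTime(int n, const Costs& c) {
  assert(n >= 0);
  return n * (c.in_time + c.k_time + c.out_time);
}

// Double-buffered flow: two kernels are queued up front, then the host
// handles kernel i's output and prepares kernel i+2 while kernel i+1 runs.
inline double DoubleBufferedTime(int n, const Costs& c) {
  assert(n >= 0);
  double host = 0.0;           // current time on the host timeline
  double prev_end = 0.0;       // finish time of the last queued kernel
  double end[2] = {0.0, 0.0};  // finish time per buffer set

  // Prepares input for kernel i, then queues it at the current host time.
  auto prepare_and_queue = [&](int i) {
    host += c.in_time;
    const double start = std::max(host, prev_end);  // in-order device queue
    prev_end = start + c.k_time;
    end[i % 2] = prev_end;
  };

  for (int i = 0; i < 2 && i < n; ++i) prepare_and_queue(i);
  for (int i = 0; i < n; ++i) {
    host = std::max(host, end[i % 2]);       // block until kernel i completes
    host += c.out_time;                      // process kernel i's output
    if (i + 2 < n) prepare_and_queue(i + 2); // reuse the freed buffer set
  }
  return host;
}
```

With 4 kernels, `in_time` = 2, `out_time` = 3, and `k_time` = 10, the serial model takes 60 time units while the double-buffered model takes 45 (one input preparation, four back-to-back kernels, and one final output step). When the host work exceeds the kernel time, for example `k_time` = 2 with 3 kernels, the model still shows a reduction from 21 to 15, which matches the partial-overlap case described earlier.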
A test compile of this tutorial design achieved a maximum frequency (fMAX) of approximately 600 MHz on Intel® FPGA SmartNIC N6001-PL. The results with and without double buffering are shown in the following table:
Configuration | Overall Execution Time (ms) | Total Kernel Execution Time (ms) |
---|---|---|
Without double buffering | 13343 | 12878 |
With double buffering | 12929 | 12924 |
In both runs, the total kernel execution time is similar, as expected. However, without double buffering, the overall execution time exceeds the total kernel execution time by 465 ms, implying there is downtime between kernel executions. With double buffering, the overall execution time is within 5 ms of the total kernel execution time.
Code samples are licensed under the MIT license. See License.txt for details.
Third-party program licenses can be found here: third-party-programs.txt.