`Double Buffering` Sample

This FPGA tutorial demonstrates how to overlap host-side processing and host-device buffer transfers with kernel execution, which can improve overall application performance.

| Area | Description |
|:--- |:--- |
| What you will learn | How and when to implement the double buffering optimization technique |
| Time to complete | 30 minutes |
| Category | Code Optimization |

Purpose

This sample demonstrates double buffering to overlap kernel execution with buffer transfers and host processing. In an application where the FPGA kernel is executed multiple times, the host must perform the following processing and buffer transfers before each kernel invocation.

1. The output data from the previous invocation must be transferred from the device to the host and then processed by the host. Examples of this processing include copying the data to another location, rearranging the data, and verifying it in some way.
2. The input data for the next invocation must be processed by the host and then transferred to the device. Examples of this processing include copying the data from another location, rearranging the data for kernel consumption, and generating the data in some way.

Prerequisites

This sample is part of the FPGA code samples. It is categorized as a Tier 2 sample that demonstrates a design pattern.

Tier 1: Get Started -> Tier 2: Explore the Fundamentals (this sample) -> Tier 3: Explore the Advanced Techniques -> Tier 4: Explore the Reference Designs

Find more information about how to navigate this part of the code samples in the FPGA top-level README.md. That README also covers troubleshooting build errors and links to selected documentation.

| Optimized for | Description |
|:--- |:--- |
| OS | Ubuntu* 20.04; RHEL/CentOS 8; SUSE* 15; Windows* 10, 11; Windows Server* 2019 |
| Hardware | Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs |
| Software | Intel® oneAPI DPC++/C++ Compiler |

Note: Although the Intel® oneAPI DPC++/C++ Compiler is sufficient to compile for emulation, generate reports, and generate RTL, the simulation flow and FPGA hardware compiles have additional software requirements.

To use the simulator flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators must be installed and accessible through your PATH:

Questa*-Intel® FPGA Edition

Questa*-Intel® FPGA Starter Edition

ModelSim® SE

When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.

⚠️ Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.

Key Implementation Details

The key concepts discussed in this sample are as follows:

The double buffering optimization technique
Determining when double buffering is beneficial
How to measure the impact of double buffering

Build the `Double Buffering` Sample

Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script located in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.

Linux*:

For system wide installations: . /opt/intel/oneapi/setvars.sh

For private installations: . ~/intel/oneapi/setvars.sh

For non-POSIX shells, like csh, use the following command: bash -c 'source <install-dir>/setvars.sh ; exec csh'

Windows*:

C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat

For Windows PowerShell*, use the following command: cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'

For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.

On Linux*

Change to the sample directory.
Build the program for the Agilex® 7 device family, which is the default.
```
mkdir build
cd build
cmake ..
```
Note: You can change the default target by using the command:
```
cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
```
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
```
cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
```

Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.

Compile the design. (The provided targets match the recommended development flow.)
1. Compile for emulation (fast compile time, targets emulated FPGA device).
```
make fpga_emu
```
2. Compile for simulation (fast compile time, targets simulated FPGA device).
```
make fpga_sim
```
3. Generate HTML performance report.
```
make report
```
  The report resides at double_buffering.report.prj/reports/report.html. Note that because the optimization occurs at the runtime level, the FPGA compiler report will not show a difference between the optimized and unoptimized cases.
4. Compile for FPGA hardware (longer compile time, targets FPGA device).
```
make fpga
```

On Windows*

Change to the sample directory.
Build the program for the Agilex® 7 device family, which is the default.
```
mkdir build
cd build
cmake -G "NMake Makefiles" ..
```

Note: You can change the default target by using the command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
Note: You can poll your system for available BSPs using the aoc -list-boards command. The board list that is printed out will be of the form
$> aoc -list-boards
Board list:
  <board-variant>
     Board Package: <path/to/board/package>/board-support-package
  <board-variant2>
     Board Package: <path/to/board/package>/board-support-package
You will only be able to run an executable on the FPGA if you specified a BSP.

Compile the design. (The provided targets match the recommended development flow.)
1. Compile for emulation (fast compile time, targets emulated FPGA device).
```
nmake fpga_emu
```
2. Compile for simulation (fast compile time, targets simulated FPGA device).
```
nmake fpga_sim
```
3. Generate HTML performance report.
```
nmake report
```
  The report resides at double_buffering.report.prj.a/reports/report.html. Note that because the optimization occurs at the runtime level, the FPGA compiler report will not show a difference between the optimized and unoptimized cases.
4. Compile for FPGA hardware (longer compile time, targets FPGA device).
```
nmake fpga
```

Note: If you encounter any issues with long paths when compiling under Windows*, you may have to create your build directory in a shorter path, for example C:\samples\build. You can then build the sample in the new location, but you must specify the full path to the build files.

Run the `Double Buffering` Sample

On Linux

Run the sample on the FPGA emulator (the kernel executes on the CPU).
```
./double_buffering.fpga_emu
```

Run the sample on the FPGA simulator device.

CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./double_buffering.fpga_sim

Run the sample on the FPGA device (only if you ran cmake with -DFPGA_DEVICE=<board-support-package>:<board-variant>).
```
./double_buffering.fpga
```

On Windows

Run the sample on the FPGA emulator (the kernel executes on the CPU).
```
double_buffering.fpga_emu.exe
```

Run the sample on the FPGA simulator device.

set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
double_buffering.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=

Note: Hardware runs are not supported on Windows.

Example Output

Example Output for an FPGA Device

Platform name: Intel(R) FPGA SDK for OpenCL(TM)
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
Executing kernel 100 times in each round.

*** Beginning execution, without double buffering
Launching kernel #0
Launching kernel #10
Launching kernel #20
Launching kernel #30
Launching kernel #40
Launching kernel #50
Launching kernel #60
Launching kernel #70
Launching kernel #80
Launching kernel #90

Overall execution time without double buffering = 13343 ms
Total kernel-only execution time without double buffering = 12878 ms
Throughput = 78.583199 MB/s


*** Beginning execution, with double buffering.
Launching kernel #0
Launching kernel #10
Launching kernel #20
Launching kernel #30
Launching kernel #40
Launching kernel #50
Launching kernel #60
Launching kernel #70
Launching kernel #80
Launching kernel #90

Overall execution time with double buffering = 12929 ms
Total kernel-only execution time with double buffering = 12924 ms
Throughput = 81.101379 MB/s


Verification PASSED

Example Output for the FPGA Emulator

Emulator output does not demonstrate true hardware performance. The design may need to run on actual hardware to observe the performance benefit of the optimization exemplified in this tutorial.

Platform name: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device name: Intel(R) FPGA Emulation Device


Executing kernel 20 times in each round.

*** Beginning execution, without double buffering
Launching kernel #0
Launching kernel #10

Overall execution time without double buffering = 56 ms
Total kernel-only execution time without double buffering = 3 ms
Throughput = 5.7965984 MB/s


*** Beginning execution, with double buffering.
Launching kernel #0
Launching kernel #10

Overall execution time with double buffering = 6 ms
Total kernel-only execution time with double buffering = 2 ms
Throughput = 47.919624 MB/s


Verification PASSED

`Double Buffering` Guided Design Walkthrough

Determining When Double Buffering Is Possible

Without double buffering, host processing and buffer transfers occur between kernel executions. This leaves a gap in time between kernel executions, referred to as kernel downtime (see the image below). If these operations overlap with kernel execution, the kernels can execute back-to-back with minimal downtime, which increases overall application performance.

Before discussing the concepts, we must first define the required variables.

| Variable | Description |
|:--- |:--- |
| R | Time to transfer the kernel output buffer from device to host |
| Op | Host-side processing time of kernel output data (output processing) |
| Ip | Host-side processing time for kernel input data (input processing) |
| W | Time to transfer the kernel input buffer from host to device |
| K | Kernel execution time |

In general, the R, Op, Ip, and W operations must all complete before the next kernel is launched. To maximize performance, these operations should run on the host while one kernel is executing on the device, and they should operate on a second set of buffer locations. If they complete before the current kernel completes, the next kernel can be launched immediately with no downtime. In other words, the host must be able to launch a new kernel once every K units of time.

This leads to the following constraint to minimize kernel downtime: R + Op + Ip + W <= K.

If the above constraint is not satisfied, a performance improvement may still be observed because some overlap (perhaps not complete overlap) is still possible. Further improvement is possible by extending the double buffering concept to N-way buffering (see the corresponding tutorial).

Measuring the Impact of Double Buffering

To gauge how much this technique can improve performance, first estimate the kernel downtime.

You can do this by querying the total kernel execution time from the runtime and comparing it to the overall application execution time. In an application where kernels execute with minimal downtime, these two numbers will be close. However, if kernels have significant downtime, the overall execution time will notably exceed the kernel execution time. The tutorial code exemplifies how to do this.

Implementation Notes

The basic implementation flow is as follows:

1. Perform the input processing for the first two kernel executions and queue them both.
2. Call the process_output() method immediately for the first kernel's output. Because of the implicit data dependency, the SYCL* runtime automatically blocks this call until the first kernel completes.
3. When the first kernel completes, the second kernel begins executing immediately because it was already queued.
4. While the second kernel runs, the host processes the output data from the first kernel and prepares the third kernel's input data.
5. As long as the above operations complete before the second kernel completes, the third kernel is queued early enough to launch immediately after the second kernel.
6. Repeat the process.

Impact of Double Buffering

A test compile of this tutorial design achieved a maximum frequency (f_MAX) of approximately 600 MHz on Intel® FPGA SmartNIC N6001-PL. The results with and without double buffering are shown in the following table:

| Configuration | Overall Execution Time (ms) | Total Kernel Execution Time (ms) |
|:--- |:--- |:--- |
| Without double buffering | 13343 | 12878 |
| With double buffering | 12929 | 12924 |

In both runs, the total kernel execution time is similar, as expected. Without double buffering, the overall execution time exceeds the total kernel execution time by 465 ms, which implies downtime between kernel executions. With double buffering, the overall execution time is within 5 ms of the total kernel execution time.

License

Code samples are licensed under the MIT license. See License.txt for details.

Third-party program Licenses can be found here: third-party-programs.txt.
