This FPGA tutorial demonstrates how to use the Intercept Layer for OpenCL™ Applications to perform system-level profiling on a design and reveal areas for improvement.
Note: This tutorial does not work on Windows* because compiling to FPGA hardware is not yet supported on Windows.
The Intercept Layer for OpenCL™ Applications GitHub provides complete documentation on using this tool.
| Optimized for | Description |
|---|---|
| OS | Ubuntu* 20.04 <br> RHEL*/CentOS* 8 <br> SUSE* 15 |
| Hardware | Intel® Agilex® 7, Agilex® 5, Arria® 10, Stratix® 10, and Cyclone® V FPGAs |
| Software | Intel® oneAPI DPC++/C++ Compiler |
| What you will learn | Summary of profiling tools available for performance optimization <br> About the Intercept Layer for OpenCL™ Applications <br> How to set up and use this tool <br> A case study of using this tool to identify when the double buffering system-level optimization is beneficial |
| Time to complete | 30 minutes |
Note: Although the Intel® oneAPI DPC++/C++ Compiler is sufficient to compile for emulation, generate reports, and generate RTL, there are extra software requirements for the simulation flow and FPGA compiles.
To use the simulator flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) and one of the following simulators must be installed and accessible through your PATH:
- Questa*-Intel® FPGA Edition
- Questa*-Intel® FPGA Starter Edition
- ModelSim® SE
When using the hardware compile flow, Intel® Quartus® Prime Pro Edition (or Standard Edition when targeting Cyclone® V) must be installed and accessible through your PATH.
⚠️ Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
This sample is part of the FPGA code samples. It is categorized as a Tier 3 sample that demonstrates the usage of a tool.
```mermaid
flowchart LR
   tier1("Tier 1: Get Started")
   tier2("Tier 2: Explore the Fundamentals")
   tier3("Tier 3: Explore the Advanced Techniques")
   tier4("Tier 4: Explore the Reference Designs")

   tier1 --> tier2 --> tier3 --> tier4

   style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
   style tier2 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
   style tier3 fill:#f96,stroke:#333,stroke-width:1px,color:#fff
   style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff
```
For more information about how to navigate this part of the code samples, see the FPGA top-level README.md, where you can also find information about troubleshooting build errors, links to selected documentation, and more.
This FPGA tutorial demonstrates how to use the Intercept Layer for OpenCL™ Applications, an open-source tool, to perform system-level profiling on a design and reveal areas for improvement. Note that when targeting an FPGA family or part number, no FPGA executable is generated, so this sample is intended to be used when targeting a device with a Board Support Package (BSP), for which an FPGA executable can be produced.
The following code snippet uses standard SYCL* and C++ language features to extract profiling information from code.
```cpp
void profiling_example(const std::vector<float> &vec_in,
                       std::vector<float> &vec_out) {
  // Start the timer
  auto start = std::chrono::steady_clock::now();

  // Host performs pre-processing of the input data
  std::vector<float> vec_pp = PreProcess(vec_in);

  // FPGA device performs additional processing
  ext::intel::fpga_selector selector;
  queue q(selector, fpga_tools::exception_handler,
          property::queue::enable_profiling{});

  buffer buf_in(vec_pp);
  buffer buf_out(vec_out);

  event e = q.submit([&](handler &h) {
    accessor acc_in(buf_in, h, read_only);
    accessor acc_out(buf_out, h, write_only, no_init);

    h.single_task<class Kernel>([=]() [[intel::kernel_args_restrict]] {
      DeviceProcessing(acc_in, acc_out);
    });
  });

  // Query event e for kernel profiling information
  // (blocks until command groups associated with e complete)
  double kernel_time_ns =
      e.get_profiling_info<info::event_profiling::command_end>() -
      e.get_profiling_info<info::event_profiling::command_start>();

  // Stop the timer
  auto end = std::chrono::steady_clock::now();
  double total_time_s =
      std::chrono::duration_cast<std::chrono::duration<double>>(end - start)
          .count();

  // Report profiling info
  std::cout << "Kernel compute time: " << kernel_time_ns * 1e-6 << " ms\n";
  std::cout << "Total compute time:  " << total_time_s * 1e3 << " ms\n";
}
```
This tutorial introduces the Intercept Layer for OpenCL™ Applications, a profiling tool that extracts and visualizes system-level profiling information for SYCL-compliant programs. This tool can extract the same profiling data (and more) as the code snippet above, without requiring any code-level profiling directives.
The Intercept Layer for OpenCL™ Applications provides coarse-grained, system-level profiling information. A complementary tool, the Intel® FPGA Dynamic Profiler for DPC++, provides fine-grained profiling information for the kernels executing on the device. Together, these two tools can be used to optimize both host-side and device-side execution. However, do not use these tools simultaneously: the Intercept Layer for OpenCL™ Applications may slow down the runtime execution, rendering the Dynamic Profiler data less accurate.
The Intercept Layer for OpenCL™ Applications is an open-source tool that you can use to profile SYCL-conforming designs at a system-level. Although it is not part of the oneAPI Base Toolkit installation, it is freely available on GitHub.
This tool serves the following purposes:
- Intercept host calls before they reach the device to gather performance data and log host calls.
- Provide data to visualize the calls through time, separating them into queued, submitted, and execution sections to help you better understand the execution.
- Identify gaps (using visualization) in the runtime. Gaps often indicate inefficient execution and throughput loss.
The Intercept Layer for OpenCL™ Applications has several different options for capturing different aspects of the host run. These options are described in its documentation. This tutorial uses the call-logging and device timeline features that print information about the host's calls during execution.
You can visualize the data generated by the Intercept Layer for OpenCL™ Applications in the following ways:
- Google* Chrome* trace event profiling tool: JSON files generated by the Intercept Layer for OpenCL™ Applications contain device timeline information. You can open these JSON files in the Google* Chrome* trace event profiling tool to generate a visual representation of the profiling data.
- Microsoft* Excel*: The Intercept Layer for OpenCL™ Applications contains a Python script that parses the timeline information into a Microsoft* Excel* file, where it is presented both in a table format and in a bar graph.
This tutorial uses the Google* Chrome* trace event profiling tool for visualization.
Use the visualized data to identify gaps in the runtime where events are waiting for something else to finish executing. These gaps represent potential opportunities for system-level optimization. While it is not possible to eliminate all such gaps, you might be able to eliminate those caused by dependencies that can be avoided.
This tutorial is based on the double-buffering optimization. Double-buffering allows host data processing and host transfers to the device-side buffer to occur in parallel with the kernel execution on the FPGA device. This parallelization is useful when the host performs any combination of the following actions between consecutive kernel runs:
- Preprocessing
- Postprocessing
- Writes to the device buffer
Running host and device actions in parallel removes execution gaps between kernels because the kernels no longer have to wait for the host to finish its operations. The visualizations provided by the Intercept Layer output make the benefits of double buffering easy to see.
The Intercept Layer for OpenCL™ Applications is available on GitHub at the following URL: https://github.com/intel/opencl-intercept-layer
To set up the Intercept Layer for OpenCL™ Applications, perform the following steps:
1. Download the Intercept Layer for OpenCL™ Applications version 2.2.1 or later from GitHub.

2. Build the Intercept Layer according to the instructions provided in How to Build the Intercept Layer for OpenCL™ Applications.

   - Run `cmake`: Ensure that you set `ENABLE_CLILOADER=1` when running cmake (that is, `cmake -DENABLE_CLILOADER=1 ..`).
   - Run `make`: After the cmake step, run `make` in the build directory. This step builds the `cliloader` loader utility.
   - Add to your `PATH`: The `cliloader` executable should now exist in the `<path to opencl-intercept-layer-master download>/<build dir>/cliloader/` directory. Add this directory to your `PATH` environment variable if you wish to run multiple designs using `cliloader`.

   You can now pass your executables to `cliloader` to run them with the Intercept Layer. For details about the `cliloader` loader utility, see cliloader: A Intercept Layer for OpenCL™ Applications Loader.

3. Set `cliloader` and other Intercept Layer options.

   If you run multiple designs with the same options, set up a `clintercept.conf` file in your home directory. You can also set the options as environment variables by prefixing the option name with `CLI_`. For example, the `DllName` option can be set through the `CLI_DllName` environment variable. For a list of options, see Controls in How to Use the Intercept Layer for OpenCL™ Applications.

   For this tutorial, set the following options:
| Options/Variables | Description |
|---|---|
| `DllName=$CMPLR_ROOT/linux/lib/libOpenCL.so` | The Intercept Layer must know where the `libOpenCL.so` file from the original oneAPI build is. If using this option in a `.conf` file, expand `$CMPLR_ROOT` and pass in the absolute path. |
| `DevicePerformanceTiming=1` and `DevicePerformanceTimelineLogging=1` | These options print runtime timeline information in the output of the executable run. |
| `ChromePerformanceTiming=1`, `ChromeCallLogging=1`, `ChromePerformanceTimingInStages=1` | These options set up the Chrome tracer output and ensure that the output has Queued, Submitted, and Execution stages. |
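For example, a `clintercept.conf` file in your home directory that sets all of the options above could look like the following sketch. The `libOpenCL.so` path shown is an assumed, illustrative location; substitute the expanded, absolute `$CMPLR_ROOT` path from your own oneAPI installation:

```
DllName=/opt/intel/oneapi/compiler/latest/linux/lib/libOpenCL.so
DevicePerformanceTiming=1
DevicePerformanceTimelineLogging=1
ChromePerformanceTiming=1
ChromeCallLogging=1
ChromePerformanceTimingInStages=1
```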
These instructions set up the `cliloader` executable, which gives you flexibility by providing control over when the Intercept Layer is and is not used. If you prefer a local installation (for a single design) or a global installation (always on for all designs), follow the instructions in How to Install the Intercept Layer for OpenCL™ Applications.
To run a compiled program using the Intercept Layer for OpenCL™ Applications, use the command:
```
cliloader <executable> [executable args]
```
To run the tutorial example, refer to the "Running the Sample" section.
When you run the host executable with the `cliloader` command, the `stderr` output contains lines like the following example:

```
Device Timeline for clEnqueueMapBuffer (enqueue 1) = 145810154237969 ns (queued), 145810156546459 ns (submit), 145810156552270 ns (start), 145810159109587 ns (end)
```
These lines give the timeline information about a variety of oneAPI runtime calls. After the host executable finishes running, there is also a summary of the run's performance information.
After the executable runs, the collected data is placed in the `CLIntercept_Dump` directory, which is in your home directory by default. You can change this location with the `DumpDir=<directory where you want the output files>` `cliloader` option (when setting the option as an environment variable, remember to use `CLI_DumpDir`). `CLIntercept_Dump` contains a folder named after the executable, with a file called `clintercept_trace.json`. You can load this JSON file in the Google* Chrome* trace event profiling tool (navigate to `chrome://tracing` and click Load) to visualize the timeline data collected during the run.
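For example, to collect a trace with the dump redirected to a scratch directory (the directory path here is arbitrary):

```
CLI_DumpDir=/tmp/cli_dump cliloader <executable>
```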
For this tutorial, this visualization appears as shown in the following example:
This visualization shows different calls executed through time. The X-axis is time, with the scale shown near the top of the page. The Y-axis shows different calls that are split up in several ways.
The left side (Y-axis) has two different types of numbers:
- Numbers that contain a decimal point.
- The part of the number before the decimal point orders the calls approximately by start time.
- The part of the number after the decimal point represents the queue number the call was made from.
- Numbers that do not contain a decimal point. These numbers represent the thread ID of the thread being run on in the operating system.
The colors in the trace represent different stages of execution:
- Blue during the queued stage
- Yellow during the submitted stage
- Orange for the execution stage
Look for gaps between consecutive execution stages and kernel runs to identify possible areas for optimization.
The double-buffering optimization can minimize or remove the gaps that appear between consecutive kernels while they wait for host processing to finish. It achieves this by having the host perform its processing operations on a second set of buffers while the kernel executes. With this execution order, the host processing is done by the time the next kernel can run, so kernel execution is not held up waiting for the host.
For a more detailed explanation of the optimization, refer to the FPGA tutorial "Double Buffering to Overlap Kernel Execution with Buffer Transfers and Host Processing".
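The core of the pattern is sketched below. This is a minimal illustration, not the tutorial's actual code: `ProcessInput`, the kernel body, and the sizes are hypothetical stand-ins, and a generic device selector is used where the real design selects an FPGA. See the `double_buffering` sample for the real implementation.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

constexpr size_t kSize = 1 << 16;
constexpr int kNumRuns = 4;

// Hypothetical host-side preprocessing step.
void ProcessInput(std::vector<float> &v) {
  for (auto &x : v) x = x * 2.0f + 1.0f;
}

int main() {
  sycl::queue q;  // the real tutorial targets an FPGA device here

  // Two independent buffer sets let the host and device work in parallel.
  std::vector<float> in[2] = {std::vector<float>(kSize, 1.0f),
                              std::vector<float>(kSize, 1.0f)};
  sycl::buffer<float> buf_in[2] = {sycl::buffer<float>{sycl::range{kSize}},
                                   sycl::buffer<float>{sycl::range{kSize}}};
  sycl::buffer<float> buf_out[2] = {sycl::buffer<float>{sycl::range{kSize}},
                                    sycl::buffer<float>{sycl::range{kSize}}};
  sycl::event ev[2];

  ProcessInput(in[0]);  // prepare the first input before any kernel launch

  for (int i = 0; i < kNumRuns; i++) {
    int cur = i % 2;

    // Copy the prepared input into buffer set 'cur' and launch kernel i
    // on it; both commands run asynchronously.
    q.submit([&](sycl::handler &h) {
      sycl::accessor dst{buf_in[cur], h, sycl::write_only, sycl::no_init};
      h.copy(in[cur].data(), dst);
    });
    ev[cur] = q.submit([&](sycl::handler &h) {
      sycl::accessor a{buf_in[cur], h, sycl::read_only};
      sycl::accessor b{buf_out[cur], h, sycl::write_only, sycl::no_init};
      h.single_task<class DbSketch>([=] {
        for (size_t j = 0; j < kSize; j++) b[j] = a[j] * a[j];
      });
    });

    int next = (i + 1) % 2;
    if (i + 1 < kNumRuns) {
      // Buffer set 'next' was last used by kernel i-1; wait for it before
      // overwriting the host vector, then do the host-side work for run
      // i+1 while kernel i executes on the device.
      if (i >= 1) ev[next].wait();
      ProcessInput(in[next]);
    }
  }

  // Block until the last result is available, then use it.
  sycl::host_accessor result{buf_out[(kNumRuns - 1) % 2], sycl::read_only};
  std::cout << "result[0] = " << result[0] << "\n";
  return 0;
}
```

Note how kernel launches never wait on host preprocessing: by the time kernel i finishes, input i+1 has already been prepared into the other buffer set.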
In this tutorial, the first three kernels are run without the double-buffer optimization, and the next three are run with it. The kernels were run on an Arria® 10 FPGA when the intercept layer data was collected. The result of this optimization can be clearly seen in the Intercept Layer for OpenCL™ Applications trace:
Here, the kernel runs named `_ZTS10SimpleVpow` can be recognized as the bars with the largest execution time (the large orange bars). Double buffering removes the gaps between kernel executions that are visible in the top trace image. This optimization improves the throughput of the design, as explained in the `double_buffering` tutorial.
The Intercept Layer for OpenCL™ Applications makes it clear why the double buffering optimization will benefit this design and shows the performance improvement it achieves. Use the Intercept Layer tool on your designs to identify scenarios where you can apply double buffering and other system-level optimizations.
This tutorial covered the following topics:

- A summary of the key profiling tools available for performance optimization
- Understanding the Intercept Layer for OpenCL™ Applications tool
- How to set up and use the Intercept Layer for OpenCL™ Applications tool
- How to use the resulting information to identify opportunities for system-level optimizations such as double buffering
Note: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the `setvars` script located in the root of your oneAPI installation every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.

Linux*:
- For system-wide installations: `. /opt/intel/oneapi/setvars.sh`
- For private installations: `. ~/intel/oneapi/setvars.sh`
- For non-POSIX shells, like csh, use the following command: `bash -c 'source <install-dir>/setvars.sh ; exec csh'`

Windows*:
- `C:\"Program Files (x86)"\Intel\oneAPI\setvars.bat`
- For Windows PowerShell*, use the following command: `cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'`
For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS* or Use the setvars Script with Windows*.
1. Generate the `Makefile` by running `cmake`.

   ```
   mkdir build
   cd build
   ```

   To compile for the default target (the Agilex® 7 device family), run `cmake` using the command:

   ```
   cmake ..
   ```

   Note: You can change the default target by using the command:

   ```
   cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
   ```

   Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:

   ```
   cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
   ```

   Note: You can poll your system for available BSPs using the `aoc -list-boards` command. The printed board list has the form:

   ```
   $> aoc -list-boards
   Board list:
     <board-variant>
        Board Package: <path/to/board/package>/board-support-package
     <board-variant2>
        Board Package: <path/to/board/package>/board-support-package
   ```

   You will only be able to run an executable on the FPGA if you specified a BSP.
2. Compile the design using the generated `Makefile`. The following build targets are provided:

   - Compile for emulation (fast compile time, targets emulated FPGA device):

     ```
     make fpga_emu
     ```

   - Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size):

     ```
     make fpga_sim
     ```

   - Compile for FPGA hardware (longer compile time, targets FPGA device):

     ```
     make fpga
     ```
1. Run the sample on the FPGA emulator (the kernel executes on the CPU):

   ```
   ./double_buffering.fpga_emu    (Linux)
   ```

2. Run the sample on the FPGA simulator device:

   ```
   CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./double_buffering.fpga_sim    (Linux)
   ```

   Note: For the following steps, you need access to an FPGA device, and the code must have been compiled targeting its BSP.

3. Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`):

   ```
   ./double_buffering.fpga    (Linux)
   ```

4. Follow the instructions in the "Setting up the Intercept Layer for OpenCL™ Applications" section to install and configure the `cliloader` tool.

5. Run the sample using the Intercept Layer for OpenCL™ Applications to obtain system-level profiling information:

   ```
   cliloader ./double_buffering.fpga    (Linux)
   ```

6. Follow the instructions in the "Viewing the Performance Data" section to visualize the results.
Intercept Layer for OpenCL™ Applications results: Your visualization results should resemble the screenshots in sections "Viewing the Performance Data" and "Applying Double-Buffering Using the Intercept Layer for OpenCL™ Applications".
Command line `stdout`: When run without `cliloader`, the tutorial output should resemble the following result.
```
Platform name: Intel(R) FPGA SDK for OpenCL(TM)
Running on device: ofs_n6001 : Intel OFS Platform (ofs_ee00000)
Executing kernel 3 times in each round.

*** Beginning execution, without double buffering
Launching kernel #0

Overall execution time without double buffering = 48 ms
Total kernel-only execution time without double buffering = 38 ms
Throughput = 64.228554 MB/s

*** Beginning execution, with double buffering.
Launching kernel #0

Overall execution time with double buffering = 40 ms
Total kernel-only execution time with double buffering = 38 ms
Throughput = 77.011658 MB/s

Verification PASSED
```
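Note that the throughput improvement (77.011658 / 64.228554 ≈ 1.20×) matches the reduction in overall execution time (48 ms to 40 ms, also a factor of 1.2). Because the kernel-only time is unchanged at 38 ms, the entire gain comes from overlapping host work with kernel execution.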
Code samples are licensed under the MIT license. See License.txt for details.
Third-party program Licenses can be found here: third-party-programs.txt.