Developing and optimizing a time-critical application is a challenging task. One needs to run the application with different configurations and evaluate its performance to determine the desired production environment. Holoscan is a real-time AI sensor processing platform that enables developers to build real-time AI applications with intuitive interfaces. Until now, however, measuring the performance of a Holoscan application required significant effort from application developers. In this article, we introduce Holoscan flow benchmarking on HoloHub, which aims to make performance measurement and analysis of Holoscan applications easy and scalable. Holoscan flow benchmarking can help evaluate different configurations before deploying a Holoscan application to production.
A key performance indicator (KPI) for real-time sensor processing applications is end-to-end latency. Several metrics characterize end-to-end latency, including the average and the maximum (worst-case) latency. Developers need to know these metrics to understand how well their application performs in terms of responsiveness and efficiency. The Holoscan SDK provides data flow tracking to precisely measure the end-to-end latency of Holoscan applications, and Holoscan flow benchmarking in HoloHub builds on this capability.
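For reference, the data flow tracking feature can also be enabled manually through the SDK's API. Below is a minimal sketch using the Python Tracker interface of the Holoscan SDK; the application class and its operators are placeholders, and the benchmarking workflow described next enables equivalent tracking in the supported C++ applications automatically:

# Minimal sketch of enabling data flow tracking in a Holoscan application.
# MyApp and its (omitted) operators are placeholders for a real application.
from holoscan.core import Application, Tracker

class MyApp(Application):
    def compose(self):
        # create operators and connect them with add_flow() here
        pass

app = MyApp()
with Tracker(app) as tracker:  # turns on data flow tracking
    app.run()
    tracker.print()  # prints end-to-end latency statistics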
Holoscan flow benchmarking automates the benchmarking process and generates the log files that are analyzed afterward. In this article, we demonstrate how to use it to evaluate different configurations, applications, and schedulers. At a high level, it involves four steps:
- Patching an application: In the first step, we patch a Holoscan application to turn on data flow tracking.
- Building an application: In the second step, we compile the patched application with the necessary header files.
- Running the benchmark: In the third step, we run the benchmark.py script with the necessary parameters to evaluate the performance of the application in different configurations. This generates a number of log files that capture the raw results of the performance measurements.
- Analyzing the results: In the fourth step, we process the generated log files with the analyze.py script to retrieve the benchmarking results. We can also export the results to a CSV file or generate graphs to visualize them.
Next, we demonstrate a few capabilities of Holoscan flow benchmarking.
In many cases, users may want to run multiple Holoscan applications in parallel on their systems, and they want to know the performance implications of such a configuration before building a full-fledged system with more than one Holoscan application. With Holoscan flow benchmarking, they can easily evaluate the performance of running multiple instances of an application in parallel for their specific use-cases. In particular, they can determine how many parallel instances of a representative application can run before the configuration no longer meets their expected end-to-end latency requirements.
Suppose we want to evaluate the end-to-end latency of running multiple instances of the endoscopy tool tracking application in parallel. After completing the first two steps mentioned above, we can run the following command to evaluate the performance of up to 5 concurrent instances:
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a endoscopy_tool_tracking
--sched greedy -i <Number of Instances> -r 10 -m 1000 -d
endoscopy_<Number of Instances>
<Number of Instances> needs to be set to measure the performance of running multiple instances of the endoscopy tool tracking application concurrently. The above command tests the application against 1000 data frames (-m 1000) and repeats every experiment 10 times (-r 10) for the greedy scheduler (--sched greedy). The log files generated by the benchmarking process are stored in the endoscopy_<Number of Instances> directory, where <Number of Instances> varies from 1 to 5. The filename format for the data flow tracking log files is logger_<scheduler>_<run_number>_<instance-id>.log.
After the benchmarking process is complete, we can analyze the results with the help of the analyze.py script. If we want to know the average end-to-end latencies of the above experiment, the following command can be used:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py --avg -g
endoscopy_1/logger* Instance-1 -g endoscopy_2/logger* Instance-2 -g
endoscopy_3/logger* Instance-3 -g endoscopy_4/logger* Instance-4 -g
endoscopy_5/logger* Instance-5
In the above command, --avg shows the average end-to-end latency of each experiment. Each -g parameter specifies a group of log files: endoscopy_1/logger* selects the data flow tracking log files for the experiment in the endoscopy_1 directory, followed by the group name Instance-1. The same applies to the other -g parameters. Executing the command shows a result like the following:
In addition to viewing the analysis results, it is often helpful to plot them in a graph to better understand trends in an evaluation. Holoscan flow benchmarking provides options to export the results to a CSV file so that the data can be plotted later with your favorite plotting tool.
Appending the --save-csv parameter to the analyze.py command saves the results in an avg_values.csv file in CSV format:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py --avg -g
endoscopy_1/logger* Instance-1 -g endoscopy_2/logger* Instance-2 -g
endoscopy_3/logger* Instance-3 -g endoscopy_4/logger* Instance-4 -g
endoscopy_5/logger* Instance-5 --save-csv
Then, the average values can be plotted with your preferred plotting tools. For example, the matplotlib Python library can be used to plot the average values in a bar graph with the following code:
import matplotlib.pyplot as plt

# Read the single row of comma-separated average latencies exported by analyze.py
with open('avg_values.csv', 'r') as f:
    lines = f.readlines()
avg_values = [float(x) for x in lines[0].strip().split(',') if x]

# One bar per instance count (1 to 5)
plt.bar(range(1, 6), avg_values)
plt.xlabel('Number of Instances')
plt.ylabel('Average End-to-end Latency (ms)')
plt.savefig('avg_bar_graph.png')
The graph looks like the following:
Similar results can be obtained for the maximum end-to-end latencies by supplying the --max parameter to the above analyze.py command. The bar graph of maximum end-to-end latency looks like the following; the worst-case latency is affected by interference among the multiple instances:
Application developers can decide on the acceptable maximum or average end-to-end latency for their own applications and then determine (as sketched in the example below):
- how many parallel instances of an application they want to run on their system
- whether they need to further optimize the pipeline
- whether they need more compute resources
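As a simple illustration of this decision, the values exported with --save-csv can be checked against a latency budget. The following sketch assumes the single-row avg_values.csv layout shown above and a hypothetical 50 ms budget that is not taken from this article:

# Hypothetical check: find the largest instance count whose average
# end-to-end latency stays within a latency budget. The 50 ms budget is an
# assumption; avg_values.csv is the single-row CSV exported by analyze.py,
# with one value per group (Instance-1 through Instance-5).
LATENCY_BUDGET_MS = 50.0  # example requirement, not from the article

with open('avg_values.csv', 'r') as f:
    avg_values = [float(x) for x in f.readline().strip().split(',') if x]

feasible = [i + 1 for i, latency in enumerate(avg_values)
            if latency <= LATENCY_BUDGET_MS]
if feasible:
    print(f'Up to {max(feasible)} parallel instance(s) meet the latency budget')
else:
    print('No tested configuration meets the latency budget')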
Understanding the impact of running multiple instances of an application on GPU utilization is crucial for optimizing performance. Holoscan flow benchmarking can also measure the GPU utilization for every experiment. By appending -u to the above benchmark.py command, GPU utilization is monitored and logged in a CSV file named gpu_utilization_<scheduler>_<run_number>.csv. For example, the final command for GPU utilization monitoring looks like the following:
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking --sched greedy -i <Number of Instances>
-r 10 -m 1000 -d endoscopy_<Number of Instances> -u
In the analysis phase, the average GPU utilization results can be saved in a CSV file (avg_gpu_utilization_values.csv) using the -u and --save-csv parameters:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py -g
endoscopy_1/logger* Instance-1 -u endoscopy_1/gpu*
Instance-1-GPUUtil -u endoscopy_2/gpu* Instance-2-GPUUtil -u
endoscopy_3/gpu* Instance-3-GPUUtil -u endoscopy_4/gpu*
Instance-4-GPUUtil -u endoscopy_5/gpu* Instance-5-GPUUtil --save-csv
Each -u parameter specifies a group of GPU utilization log files (endoscopy_1/gpu*), followed by the group name (Instance-1-GPUUtil). The -g endoscopy_1/logger* Instance-1 group is included to satisfy the requirement of a mandatory -g parameter.
Suppose we also want to plot the average GPU utilization on the above bar graph of maximum latencies. We can achieve that with the following Python code:
import matplotlib.pyplot as plt

# Read the single-row CSV files exported by analyze.py --save-csv
with open('max_values.csv', 'r') as f:
    lines = f.readlines()
max_values = [float(x) for x in lines[0].strip().split(',') if x]

with open('avg_gpu_utilization_values.csv', 'r') as f:
    lines = f.readlines()
avg_gpu_utilizations = [float(x) for x in lines[0].strip().split(',') if x]

# Bar chart of maximum latencies, with GPU utilization on a secondary y-axis
fig, ax1 = plt.subplots()
ax1.bar(range(1, len(max_values) + 1), max_values)
ax1.set_ylabel('Maximum End-to-end Latency (ms)')
ax1.set_xlabel('Number of Instances')
ax2 = ax1.twinx()
ax2.plot(range(1, len(avg_gpu_utilizations) + 1), avg_gpu_utilizations, color='r', linewidth=2.0)
ax2.set_ylabel('Average GPU Utilization (%)')
plt.savefig('max_bar_graph.png')
The generated graph looks like the following:
Holoscan flow benchmarking can also generate a graph of the Cumulative Distribution Function (CDF) of the observed end-to-end latencies. The CDF curve is a useful tool for understanding how the latencies of all the data frames passing through an application graph are distributed. The following command generates a CDF curve of the end-to-end latencies of the first path for the experiment with 1 instance:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py --draw-cdf
single_path_cdf.png -g endoscopy_1/logger* MyCustomGroup
--no-display-graphs
By default, the command also displays the graph in a separate window; the --no-display-graphs option suppresses this behavior. The graph is saved in the single_path_cdf.png file and looks like the following:
In addition to plotting the CDF curve of the first path, it is also possible to plot the CDF curves of all the paths of an application graph with the --draw-cdf-paths parameter.
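For intuition, a latency CDF like the ones analyze.py draws can be reproduced from raw end-to-end latency samples. The following sketch uses made-up latency values and is only an illustration of the concept, not the analyze.py implementation:

# Illustration of a latency CDF; the latency values below are made up.
import numpy as np
import matplotlib.pyplot as plt

latencies_ms = np.array([21.0, 22.5, 23.1, 24.8, 26.0, 27.3, 29.9, 35.2])
sorted_latencies = np.sort(latencies_ms)
cumulative_fraction = np.arange(1, len(sorted_latencies) + 1) / len(sorted_latencies)

plt.plot(sorted_latencies, cumulative_fraction)
plt.xlabel('End-to-end Latency (ms)')
plt.ylabel('Cumulative Probability')
plt.savefig('example_cdf.png')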
The benchmark.py script can also be used to evaluate the performance of two different applications, and the results can be analyzed and compared using the analyze.py script. For example, we can run the following two commands to benchmark two different applications under the same settings:
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking -r 10 -i 1 -m 1000 --sched greedy -d
endoscopy_outputs
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
ultrasound_segmentation -r 10 -i 1 -m 1000 --sched greedy -d
ultrasound_outputs
The above commands run two experiments and save the corresponding log files in two different directories, endoscopy_outputs and ultrasound_outputs. We can analyze and compare their maximum end-to-end latencies with the following analyze.py command:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py --max -g
endoscopy_outputs/logger* "Endoscopy Tool Tracking" -g
ultrasound_outputs/logger* "Ultrasound Segmentation"
This command shows the maximum end-to-end latency of the two applications side by side, helping developers compare their performance.
We can also compare different configurations of an application with Holoscan flow benchmarking, such as specifying the GPU on which the application is executed. For example, if we have two GPUs (an RTX A6000 and an RTX A4000) both available for visualization, we may want to know the performance of running the application on a specific GPU only. We can run the following command:
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking -r 10 -i 1 -m 1000 --sched greedy -d
endoscopy_outputs_<GPU Name> -g <Corresponding GPU UUID>
In the above command, users can look up the GPU UUIDs with nvidia-smi -L and change <GPU Name> to save the outputs in different directories. The -g <Corresponding GPU UUID> parameter sets the CUDA_VISIBLE_DEVICES environment variable to the corresponding GPU UUID. It is important to note that this option does not override application-specific GPU settings in the code, for example, the GPU assignment in the Inference operator (device_map option).
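As a convenience, the GPU UUIDs reported by nvidia-smi -L can also be collected programmatically. The helper below is an illustrative assumption and is not part of the benchmarking scripts:

# Hypothetical helper: list GPU UUIDs as reported by `nvidia-smi -L`,
# whose lines look like "GPU 0: NVIDIA RTX A6000 (UUID: GPU-xxxxxxxx-...)".
import subprocess

def list_gpu_uuids():
    output = subprocess.run(['nvidia-smi', '-L'],
                            capture_output=True, text=True, check=True).stdout
    uuids = []
    for line in output.splitlines():
        if 'UUID:' in line:
            uuids.append(line.split('UUID:')[1].strip(' )'))
    return uuids

print(list_gpu_uuids())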
The results of the above experiment can be compared using the following command:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py <insert
your metrics options> -g endoscopy_outputs_<GPU Name 1>/logger* "GPU1" -g endoscopy_outputs_<GPU Name 2>/logger* "GPU2"
It is also possible to compare different schedulers using Holoscan flow benchmarking. The following commands can be used to evaluate the performance of the endoscopy tool tracking sample application under the Greedy scheduler, the Multithread scheduler, and the Event-based scheduler:
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking -r 10 -i 1 -m 1000 --sched greedy -d endoscopy_greedy_outputs
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking -r 10 -i 1 -m 1000 --sched multithread -w 5 -d endoscopy_multithread_outputs
$ python benchmarks/holoscan_flow_benchmarking/benchmark.py -a
endoscopy_tool_tracking -r 10 -i 1 -m 1000 --sched eventbased -w 5 -d endoscopy_eventbased_outputs
Then, the results can be analyzed and compared using the following command:
$ python benchmarks/holoscan_flow_benchmarking/analyze.py <insert
your metrics options> -g endoscopy_greedy_outputs/logger* "Endoscopy (Greedy)" -g endoscopy_multithread_outputs/logger* "Endoscopy (Multithread)" -g endoscopy_eventbased_outputs/logger* "Endoscopy (Event-based)"
The analyze.py script provides options for calculating the tail and the flatness of the CDF curve of the observed latencies using the --tail and --flatness parameters, respectively. The tail of the CDF curve is the difference between the 95th percentile and 100th percentile values. It helps to understand how stretched out the worst-case latencies of an application are; the smaller the tail, the better the application performs at the far end of the latency distribution. The flatness of the CDF curve is the difference between the 10th percentile and 90th percentile values. It indicates how spread out the latencies are in the range where most of them are observed; a smaller flatness value means the latencies are more concentrated and the application behaves more deterministically. Overall, we want both the tail and the flatness of the CDF curve to be as small as possible.
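As a concrete illustration of these two definitions (not the analyze.py implementation), the tail and flatness can be computed from a set of latency samples as follows; the latency values are made up:

# Illustration of the CDF tail and flatness definitions; values are made up.
import numpy as np

latencies_ms = np.array([20.1, 20.4, 20.9, 21.3, 22.0, 22.8, 24.5, 30.2])
tail = np.percentile(latencies_ms, 100) - np.percentile(latencies_ms, 95)     # p100 - p95
flatness = np.percentile(latencies_ms, 90) - np.percentile(latencies_ms, 10)  # p90 - p10
print(f'CDF tail: {tail:.2f} ms, CDF flatness: {flatness:.2f} ms')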
Holoscan flow benchmarking is currently available for four HoloHub C++ applications:
- endoscopy_tool_tracking
- multiai_endoscopy
- multiai_ultrasound
- ultrasound_segmentation
We plan to add Holoscan flow benchmarking support for other C++ and Python sample applications on HoloHub in the future.
Holoscan flow benchmarking allows developers to evaluate the performance of Holoscan applications at scale. This article demonstrates a few common use-cases. Please let us know how you are leveraging Holoscan flow benchmarking and whether you are experiencing any gaps or blockers for your applications. Feel free to file issues on GitHub or contribute directly to HoloHub.