diff --git a/src/content/docs/knowledge-base/performance-profiling.mdx b/src/content/docs/knowledge-base/performance-profiling.mdx
new file mode 100644
index 0000000..ab855fb
--- /dev/null
+++ b/src/content/docs/knowledge-base/performance-profiling.mdx
@@ -0,0 +1,128 @@
---
title: "Performance profiling"
description: "How to profile embedded applications"
order: 50
---

After benchmarking your application and concluding that its performance is insufficient, the next step is [profiling](https://en.wikipedia.org/wiki/Profiling_(computer_programming)). Profiling measures where execution time is spent, so that we can focus our optimization effort on the parts of the code where it will have the most impact. This scientific approach is typically much more effective than guessing.

## Sampling vs tracing

This article discusses sampling (statistical) profiling: the program is stopped repeatedly to record the callstack. If this is done often enough, we get a picture of where the CPU spends its time. This is different from tracing, where an entry is recorded every time an event occurs, such as on entry to or return from a function. Sampling profiling typically has lower overhead and is more useful for analyzing throughput or bandwidth, whereas tracing is more useful for analyzing latency or IO problems.

## Recording a profile

We can record a profile with:

```sh
probe-rs profile --duration 200 <ELF> callstack --cores 0 --rate 1 naive-dwarf
```

Replace `<ELF>` with a path to the ELF file executing on the device. This samples core 0 at 1 Hz for 200 seconds, using DWARF debug information to unwind callstacks. The output is written to `probe-rs-profile.json.gz`; visualizing it is discussed in [a later section](#displaying-a-profile).

### Overhead and number of samples

The CPU is halted to collect each sample, so high sampling rates can cause a lot of overhead, with the CPU spending a large fraction of its time halted.
This overhead can be a particular problem for programs that interact with the outside world[^tcp-window]. The overhead likely varies with the speed of your microcontroller and debug probe. It can be useful to benchmark your application while profiling, to check that the overhead is not too high.

The standard error on each measurement is the square root of its sample count. If we measure 5 samples in a function, the expected relative error is roughly 45%; if we measure 100 samples, it is 10%.

Adjust the duration and rate to get reasonable statistics while keeping the overhead low.

### Callstack method

As of writing, two methods are available to recover callstacks for recording:

- `naive-dwarf`
- `naive-frame-pointer`

Details on both methods can be found in [this blog post](https://maskray.me/blog/2020-11-08-stack-unwinding).

Both methods require debug information to be enabled, for example by modifying the release profile in `Cargo.toml`:

```toml
[profile.release]
debug = 2
```

`naive-frame-pointer` additionally requires frame pointers to be enabled when compiling the target binary. This can be done by setting `RUSTFLAGS`:

```sh
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

or by adding lines under the `[build]` or `[target.'xyz']` section of `.cargo/config.toml`:

```toml
rustflags = [
    # enable frame pointers for profiling
    "-C", "force-frame-pointers=yes",
]
```

## Displaying a profile

We use [`samply`](https://github.com/mstange/samply) to display the generated profile. It can be installed with:

```sh
cargo install --locked samply
```

Then, to display the generated profile:

```sh
samply load probe-rs-profile.json.gz
```

This opens the profile in Firefox Profiler in your browser. `samply` keeps running in order to convert program addresses to function names and to display code listings when queried.
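As an aside, the 1/√n rule of thumb from the sampling-statistics discussion above can be sketched in a few lines. This is illustrative only; the helper names below are not part of probe-rs or samply:

```rust
/// Relative standard error of a Poisson-distributed count:
/// stddev of a count n is sqrt(n), so the relative error is 1/sqrt(n).
fn relative_standard_error(samples: f64) -> f64 {
    1.0 / samples.sqrt()
}

/// Smallest sample count whose relative error is below the target.
fn samples_needed(target_error: f64) -> f64 {
    (1.0 / (target_error * target_error)).ceil()
}

fn main() {
    // Matches the figures in the text: 5 samples -> ~45%, 100 samples -> 10%.
    println!("{:.0}%", relative_standard_error(5.0) * 100.0); // 45%
    println!("{:.0}%", relative_standard_error(100.0) * 100.0); // 10%
    // Halving the error costs four times the samples:
    println!("{}", samples_needed(0.05)); // 400
}
```

In other words, halving the expected error requires four times as many samples, which is why long, low-rate recordings are often preferable to short, high-rate ones.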
Switching to the "Flame Graph" tab, double-clicking a bar, and clicking the "asm" button yields a view like the following:

![flamegraph open in Firefox Profiler with source listing and assembly panes open](./../../../images/callstack-profile.png)

### Flame graphs

The width of each bar in the flame graph is proportional to its number of samples, and hence to the time spent in that function. The x-axis is sorted alphabetically, not by time. The y-axis is stack depth, from caller-most at the bottom to callee-most at the top. See [this blog post](https://www.brendangregg.com/flamegraphs.html) for an introduction to flame graphs.

### Saving and sharing profiles

To correctly display the profile with `samply`, the sampled binary must still be present at the ELF path it was recorded from. To embed the function names into a profile that you can share with others, use the "Upload Local Profile" option in Firefox Profiler and then click "Download".

### Firefox Profiler

Check the [Firefox Profiler docs](https://profiler.firefox.com/docs/#/) for more information on navigating the interface. The [transforms section](https://profiler.firefox.com/docs/#/./guide-filtering-call-trees?id=transforms), detailing how to focus and merge call trees, is particularly useful.

[^tcp-window]: For example, TCP has a [window scale option](https://en.wikipedia.org/wiki/TCP_window_scale_option) that throttles bandwidth when the receiver appears to be overwhelmed. Halting the CPU for sampling can make the receiver look overwhelmed, amplifying this side effect.

diff --git a/src/images/callstack-profile.png b/src/images/callstack-profile.png
new file mode 100644
index 0000000..ea0ebc1
Binary files a/src/images/callstack-profile.png and b/src/images/callstack-profile.png differ