# Benchmark Characteristics

<!--*
# Document freshness: For more information, see go/fresh-source.
freshness: { owner: 'liyuying' reviewed: '2025-08-10' }
*-->

This page summarizes the key performance characteristics of the benchmarks.

[TOC]

## Definitions

We first define the metrics used throughout this page. The following sections
then give a detailed breakdown of each benchmark.

### Primary Bound

This category determines whether a benchmark's performance is limited by the
processor's computational speed or by the rate at which data can be moved into
and out of main memory.

*   **Compute-Bound**: The workload performs a large number of calculations
    relative to the amount of data it needs. Performance is limited by the
    CPU's ability to execute instructions.

*   **Cache-Bound**: The workload frequently accesses data that is not in the
    fastest L1 cache but is found in a lower-level cache. While much faster
    than going to main memory, fetching from these caches still incurs
    latency, causing the CPU to wait.

*   **Memory-Bound**: The workload's performance is bottlenecked by the speed
    of memory accesses. It spends more time waiting for data from RAM than it
    does computing.

### Sensitivity

These categories measure how strongly a benchmark's performance is affected by
hardware characteristics.

*   **High**: The benchmark's performance is a primary indicator of this
    characteristic. A change in this factor would cause a major performance
    shift.

*   **Medium**: The benchmark is moderately influenced by this factor, but it
    is not the main bottleneck.

*   **Low**: The benchmark's performance is largely independent of this
    factor.

We apply this scale to the following areas:

*   **Memory Bandwidth Sensitivity**: The benchmark's runtime is directly
    impacted by the available memory bandwidth. High sensitivity indicates
    that an increase in bandwidth results in a noticeable performance gain;
    low sensitivity indicates that performance is largely independent of
    memory bandwidth.

*   **Memory Latency Sensitivity**: High latency sensitivity indicates that
    the benchmark's runtime is severely impacted by the time it takes to
    retrieve a single piece of data from memory.

*   **Cache Sensitivity**: A highly cache-sensitive benchmark is almost
    entirely dependent on data being in the CPU's cache; a high cache miss
    rate would cause a massive performance drop. Conversely, a benchmark with
    low cache sensitivity performs largely independently of cache behavior,
    either because it works with very little data or because it performs
    cache-bypassing operations.

*   **Branch Misprediction Sensitivity**: A highly sensitive benchmark
    contains unpredictable conditional branches that often cause the CPU to
    guess incorrectly, leading to pipeline stalls. A benchmark with low
    sensitivity has highly predictable control flow, with few conditional
    branches or easy-to-predict patterns.

### Instruction Type

Besides these sensitivities, we include an extra column describing the type
and proportion of instructions executed:

*   **Scalar/SIMD**: Whether the workload is dominated by single-data
    instructions or by instructions that operate on multiple data elements
    simultaneously (vector instructions).

*   **Pointer Chasing**: A workload dominated by following a chain of
    pointers, which is highly sensitive to memory latency.

*   **Memory Intensive**: The benchmark moves a large amount of data.

*   **Bit Manipulation**: A workload focused on logic and bitwise operations.

*   **Integer/Floating-Point**: Whether the focus is on integer operations or
    floating-point operations.

## Proto

Benchmark | Primary Bound                 | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :---------------------------- | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Proto     | Hybrid Compute & Memory-Bound | High                         | Medium                     | High              | Medium               | Scalar, Integer, Pointer Chasing, Memory Intensive

*   Hybrid Compute & Memory-Bound: The benchmark contains memory-intensive
    `Serialize` and `Deserialize` operations, and the code is filled with a
    dense sequence of calls to functions like `Merge`, `Descriptor`,
    `Reflection`, `ByteSize`, `Swap`, and various `_Set_` and `_Get_`
    functions. Many of these are computationally heavy and tax the CPU's
    execution units.
*   Hybrid Instruction Mix: The code involves numerous scalar operations and a
    significant amount of pointer chasing to access message fields. The
    benchmark is highly read/write intensive, with frequent operations like
    `(De)Serialize`, `Copy`, etc.

## Swissmap

Benchmark     | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:------------ | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Hot_Swissmap  | Compute-Bound | Low                          | Low                        | High              | Medium               | Scalar, Integer, Pointer Chasing, Compute Intensive
Cold_Swissmap | Memory-Bound  | High                         | Very High                  | Very High         | Medium               | Scalar, Integer, Pointer Chasing, Memory Intensive

## TCMalloc

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
TCMalloc  | Compute-Bound | Medium                       | High                       | High              | High                 | Integer, Pointer Chasing, Memory Intensive, Bit Manipulation

*   The benchmark is a classic example of a *compute-bound workload* in which
    the computation is not math-heavy but instead centers on intricate data
    structure management. TCMalloc maintains sophisticated data structures to
    manage memory efficiently. The work of finding an appropriately sized free
    block, updating metadata, and linking/unlinking pointers is a complex
    sequence of CPU instructions.
*   The benchmark is highly sensitive to memory latency. The internal workings
    of the memory allocator are an example of a pointer-chasing workload.
    Traversing freelists and other metadata structures involves many
    unpredictable memory accesses, and the time it takes to retrieve a single
    piece of data (memory latency) is a major performance factor.
*   This is also a highly cache-sensitive benchmark because TCMalloc's
    performance depends heavily on keeping its critical metadata (e.g., small
    object freelists) resident in the CPU's caches.

## Mem Libc

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
L1/L2/LLC | Cache-Bound   | Very High                    | Low                        | Very High         | Hybrid               | Scalar, Integer, Memory Intensive
Cold      | Memory-Bound  | Very High                    | Very High                  | Low               | Low                  | Scalar, Integer, Memory Intensive

*   L1/L2/LLC are *cache-bound*. Performance is limited by how quickly the CPU
    can move data to and from the targeted cache level. This is a measure of
    the cache's throughput, which is significantly higher than that of main
    memory.
*   The cold benchmarks are *memory-bound by main memory bandwidth*. The
    execution time is dominated by the time spent waiting for data to be
    transferred from the system's main memory to the CPU's caches.
*   Read/write intensity and branch misprediction:
    *   `memcpy` and `memmove`: These are *read-write* operations. They read
        data from a source buffer and write it to a destination buffer. Their
        performance is constrained by the maximum throughput of both the read
        and write channels of the memory subsystem. They have low branch
        misprediction sensitivity.
    *   `memset`: This is a *write-only* operation. It writes a single,
        repeating byte value to a destination buffer, so it is limited only by
        write bandwidth. It can often be implemented more efficiently at the
        instruction level, sometimes leveraging specialized instructions like
        [STNP](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/STNP--Store-Pair-of-Registers--with-non-temporal-hint-)
        on Arm. It has low branch misprediction sensitivity.
    *   `memcmp` and `bcmp`: These are *read-only*, *branch-intensive*
        comparison operations. They read data from two separate memory
        locations and compare them, so performance is limited by the speed of
        reading from both source buffers. They also have *high branch
        misprediction sensitivity*, as the loop terminates on the first
        mismatch.

## Compression

Benchmark     | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:------------ | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Compression   | Compute-Bound | High                         | Low                        | Low               | Medium               | Scalar, Integer, Memory Intensive, Bit Manipulation
Decompression | Compute-Bound | High                         | Low                        | Low               | High                 | Scalar, Integer, Memory Intensive, Bit Manipulation

*   Compression is *read-intensive*: the operation reads a large, uncompressed
    corpus and writes a smaller, compressed output. Performance is limited by
    the algorithm's internal processing speed, but it is also highly sensitive
    to the memory bandwidth available for ingesting the input data. These
    benchmarks have medium branch misprediction sensitivity: the algorithms
    use complex state machines and conditional branches for pattern matching
    and encoding, which can be difficult for the CPU to predict.
*   Decompression is both *read- and write-intensive*. Performance is highly
    sensitive to both read and write memory bandwidth.
*   Both have high memory bandwidth sensitivity and low memory latency
    sensitivity. The benchmark must read either the entire uncompressed corpus
    or the compressed data in a streaming fashion, and the overall throughput
    is directly impacted by how quickly this data can be fed into the CPU.
    Because of this sequential, streaming access pattern, the benchmark is not
    particularly sensitive to memory latency.
*   Branch misprediction sensitivity is high for decompression because its
    state machines are driven by a stream of bits: parsing the bitstream
    involves numerous conditional branches that can be unpredictable, leading
    to frequent pipeline stalls. Compression algorithms also contain a variety
    of conditional branches for state machines and pattern matching. While
    some branches are predictable, others, especially in the core matching
    loops, can lead to mispredictions that impact performance.

## Hashing

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Hot       | Compute-Bound | High                         | Low                        | High              | Low                  | Integer, Memory Intensive, Bit Manipulation
Cold      | Memory-Bound  | High                         | Low                        | Low               | Low                  | Integer, Memory Intensive, Bit Manipulation

*   Hot benchmarks are compute-bound: when the data fits in the cache, the
    limiting factor is the CPU's speed in performing the hashing calculations.
    Cold benchmarks are memory-bound by main memory bandwidth: since the input
    data is much larger than the cache, the CPU spends most of its time
    waiting for data to be streamed from main memory.
*   Both hot and cold benchmarks are *highly sensitive to memory bandwidth*.
    The hot benchmarks are limited by cache bandwidth, while the cold
    benchmarks are limited by main memory bandwidth.
*   `BM_HASHING_Extendcrc32cinternal` is a streaming workload for situations
    where data arrives in chunks and we want to maintain a running hash
    without re-processing all the previous data. `BM_HASHING_Computecrc32c` is
    a one-shot workload that measures the raw performance of hashing a single,
    complete buffer. `BM_HASHING_Combine_contiguous` is a composite workload:
    it does not just hash the data, it hashes the data combined with a small,
    fixed-size value (the length of the string) to prevent hash collisions.

## Cord

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Cord      | Memory-Bound  | High                         | High                       | High              | Medium               | Integer, Pointer Chasing

*   The benchmark's performance is primarily determined by how efficiently the
    CPU can move and access data from memory. Unlike a contiguous
    `std::string`, an `absl::Cord` stores its data in a series of
    non-contiguous chunks, often linked together as a tree. Operations that
    require access to the entire data, such as copying the Cord to a
    `std::string` or comparing it to another string, involve traversing this
    fragmented data structure, making the benchmark heavily reliant on *memory
    bandwidth and latency*.