Commit 847c8d2

liyuying0000 authored and copybara-github committed
Add benchmark categories documentation
PiperOrigin-RevId: 795641772 Change-Id: I11fb0c588e2b7e807d3331b943f5d4d2d7255d7e
1 parent 1ba3aec commit 847c8d2

fleetbench/README.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
# Benchmark Characteristics

<!--*
# Document freshness: For more information, see go/fresh-source.
freshness: { owner: 'liyuying' reviewed: '2025-08-10' }
*-->

This page summarizes some key performance characteristics of the benchmarks.

[TOC]

## Definitions

This section defines the metrics used on this page. The sections that follow
break down each benchmark against these metrics.

### Primary Bound

This category identifies what limits a benchmark's performance: the processor's
computational speed, the cache hierarchy, or the rate at which data can be
moved into and out of main memory.

* **Compute-Bound**: The workload performs a large number of calculations
  relative to the amount of data it needs. Performance is limited by the CPU's
  ability to execute instructions.

* **Cache-Bound**: The workload frequently accesses data that isn't in the
  fastest L1 cache but is found in a lower-level cache. While much faster than
  going to main memory, fetching from these caches still incurs latency,
  causing the CPU to wait.

* **Memory-Bound**: The workload's performance is bottlenecked by the speed of
  memory accesses. It spends more time waiting for data from RAM than it does
  computing.
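
One common way to reason about the compute-bound versus memory-bound split is
the roofline model, in which attainable throughput is the smaller of peak
compute and memory bandwidth multiplied by arithmetic intensity. The sketch
below is illustrative only: the function names and the machine figures are
assumptions made up for this example, not measurements of any Fleetbench
target, and the cache-bound case is not modeled.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline estimate: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


def primary_bound(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> str:
    """Classify a workload by which roof it hits first."""
    ridge = peak_gops / bandwidth_gbs  # intensity at which the two roofs meet
    return "Compute-Bound" if ops_per_byte >= ridge else "Memory-Bound"


# Hypothetical machine: 100 Gop/s peak, 50 GB/s bandwidth (ridge at 2 ops/byte).
PEAK, BW = 100.0, 50.0
print(primary_bound(PEAK, BW, ops_per_byte=8.0))   # many operations per byte moved
print(primary_bound(PEAK, BW, ops_per_byte=0.25))  # streaming, copy-like workload
```

A workload far above the ridge point gains little from extra bandwidth, while
one far below it gains little from a faster core.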

### Sensitivity

These categories measure how strongly a benchmark's performance is affected by
specific hardware characteristics.

* **High**: The benchmark's performance is a primary indicator of this
  characteristic. A change in this factor would cause a major performance
  shift.

* **Medium**: The benchmark is moderately influenced by this factor, but it is
  not the main bottleneck.

* **Low**: The benchmark's performance is largely independent of this factor.

We apply this scale to the following areas:

* **Memory Bandwidth Sensitivity**: The benchmark's runtime is directly
  affected by the available memory bandwidth. High sensitivity means that an
  increase in bandwidth yields a noticeable performance gain. Low sensitivity
  means that performance is largely independent of memory bandwidth.

* **Memory Latency Sensitivity**: High sensitivity means that the benchmark's
  runtime is severely affected by the time it takes to retrieve a single piece
  of data from memory.

* **Cache Sensitivity**: A benchmark with high cache sensitivity depends almost
  entirely on its data being in the CPU's caches, so a high cache miss rate
  causes a large performance drop. With low cache sensitivity, performance is
  largely independent of cache behavior, either because the benchmark works
  with very little data or because it performs cache-bypassing operations.

* **Branch Misprediction**: A benchmark with high sensitivity contains
  unpredictable conditional branches that often cause the CPU to guess
  incorrectly, leading to pipeline stalls. A benchmark with low sensitivity has
  highly predictable control flow, with few conditional branches or
  easy-to-predict patterns.
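
The High/Medium/Low scale above is qualitative. As a rough illustration of how
such a label could be derived from two measurements, the sketch below buckets
the relative runtime change per relative change in a hardware factor. The
function name and both cut-off values are hypothetical and chosen only for this
example; this page does not define numeric thresholds.

```python
def sensitivity(runtime_before: float, runtime_after: float,
                factor_before: float, factor_after: float) -> str:
    """Bucket the relative runtime change per relative change in a hardware factor."""
    runtime_change = abs(runtime_after - runtime_before) / runtime_before
    factor_change = abs(factor_after - factor_before) / factor_before
    elasticity = runtime_change / factor_change
    # Hypothetical cut-offs, chosen only for illustration.
    if elasticity >= 0.5:
        return "High"
    if elasticity >= 0.1:
        return "Medium"
    return "Low"


# Made-up measurements: runtime in seconds before and after doubling bandwidth.
print(sensitivity(10.0, 5.0, factor_before=100.0, factor_after=200.0))
print(sensitivity(10.0, 9.8, factor_before=100.0, factor_after=200.0))
```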

### Instruction Type

In addition to these sensitivities, each table includes an extra column,
*Instruction Mix Notes*, describing the type and proportion of instructions
executed. It uses the following terms:

* **Scalar/SIMD**: Whether the workload is dominated by single-data
  instructions (scalar) or works on multiple data elements simultaneously
  (vector instructions).

* **Pointer Chasing**: A workload dominated by following a chain of pointers,
  which is highly sensitive to memory latency.

* **Memory Intensive**: Applies when the benchmark moves a lot of data.

* **Bit Manipulation**: A workload focused on logic and bitwise operations.

* **Integer/Floating-Point**: Whether the focus is on integer operations or
  floating-point operations.
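
To make the pointer-chasing pattern concrete, the sketch below contrasts a
dependent chain of loads with a sequential scan. It only illustrates the access
patterns: Python will not reproduce the hardware timing difference, and the
helper names are made up for this example.

```python
import random


def make_chain(n: int, seed: int = 0) -> list[int]:
    """Return next-index links forming one random cycle over n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list[int], start: int = 0) -> int:
    """Follow links until returning to start; each load depends on the previous one."""
    steps, i = 0, start
    while True:
        i = nxt[i]  # the next index is unknown until this load completes
        steps += 1
        if i == start:
            return steps


def scan(values: list[int]) -> int:
    """Sequential pass: indices are known up front, so hardware can prefetch."""
    return sum(values)
```

In `chase`, no load can be issued before the previous one returns, which is why
this pattern tracks memory latency rather than bandwidth.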

## Proto

Benchmark | Primary Bound                 | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :---------------------------- | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Proto     | Hybrid Compute & Memory-Bound | High                         | Medium                     | High              | Medium               | Scalar, Integer, Pointer Chasing, Memory Intensive

* Hybrid Compute & Memory-Bound: The benchmark contains memory-intensive
  `Serialize` and `Deserialize` operations. The code is also filled with a
  dense sequence of calls to functions like `Merge`, `Descriptor`,
  `Reflection`, `ByteSize`, `Swap`, and various `_Set_` and `_Get_` functions.
  Many of these are computationally heavy and tax the CPU's execution units.
* Hybrid Instruction Mix: The code involves numerous scalar operations and a
  significant amount of pointer chasing to access message fields. This
  benchmark is highly read/write intensive, with frequent operations like
  `(De)Serialize`, `Copy`, etc.

## Swissmap

Benchmark     | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:------------ | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Hot_Swissmap  | Compute-Bound | Low                          | Low                        | High              | Medium               | Scalar, Integer, Pointer Chasing, Compute Intensive
Cold_Swissmap | Memory-Bound  | High                         | Very High                  | Very High         | Medium               | Scalar, Integer, Pointer Chasing, Memory Intensive

## TCMalloc

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
TCMalloc  | Compute-Bound | Medium                       | High                       | High              | High                 | Integer, Pointer Chasing, Memory Intensive, Bit Manipulation

* The benchmark is a classic example of a *compute-bound workload* where the
  computation is not math-heavy, but rather focused on intricate data structure
  management. TCMalloc maintains sophisticated data structures to manage memory
  efficiently. The work of finding an appropriately sized free block, updating
  metadata, and linking/unlinking pointers is a complex sequence of CPU
  instructions.
* The benchmark is highly sensitive to memory latency. The internal workings of
  the memory allocator are an example of a pointer-chasing workload. Traversing
  freelists and other metadata structures involves many unpredictable memory
  accesses, and the time it takes to retrieve a single piece of data (memory
  latency) is a major performance factor.
* The benchmark is also highly cache sensitive because the performance of
  TCMalloc depends heavily on keeping its critical metadata (e.g., small object
  freelists) resident in the CPU's caches.
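
The kind of bookkeeping described above can be sketched with a toy size-class
allocator. This is not TCMalloc's design or API: the class name, the size
classes, and the use of Python lists in place of linked freelists are all
assumptions made to keep the example short.

```python
class ToyAllocator:
    """Toy size-class allocator; not TCMalloc's actual design or API."""

    SIZE_CLASSES = (8, 16, 32, 64)  # hypothetical size classes in bytes

    def __init__(self) -> None:
        self.free_lists = {size: [] for size in self.SIZE_CLASSES}
        self.next_address = 0  # bump pointer standing in for fresh memory

    def size_class(self, nbytes: int) -> int:
        """Round a request up to the smallest class that fits."""
        for size in self.SIZE_CLASSES:
            if nbytes <= size:
                return size
        raise ValueError("request too large for this toy allocator")

    def allocate(self, nbytes: int) -> int:
        size = self.size_class(nbytes)
        free = self.free_lists[size]
        if free:  # fast path: reuse a previously freed block
            return free.pop()
        address = self.next_address  # slow path: carve out new memory
        self.next_address += size
        return address

    def deallocate(self, address: int, nbytes: int) -> None:
        """Return a block to the freelist of its size class."""
        self.free_lists[self.size_class(nbytes)].append(address)
```

Even in this toy form, every call is a short sequence of integer comparisons,
branches, and metadata updates rather than arithmetic on the payload.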

## Mem Libc

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
L1/L2/LLC | Cache-Bound   | Very High                    | Low                        | Very High         | Hybrid               | Scalar, Integer, Memory Intensive
Cold      | Memory-Bound  | Very High                    | Very High                  | Low               | Low                  | Scalar, Integer, Memory Intensive

* L1/L2/LLC are *cache-bound*. Performance is limited by how quickly the CPU
  can move data to and from the targeted cache level. This is a measure of the
  cache's throughput, which is significantly higher than that of main memory.
* The cold benchmarks are *memory-bound by main memory bandwidth*. The
  execution time is dominated by the time spent waiting for data to be
  transferred from the system's main memory to the CPU's caches.
* Read/write intensity and branch misprediction:
    * `memcpy` and `memmove`: These are *read-write* operations. They read
      data from a source buffer and write it to a destination buffer. Their
      performance is constrained by the maximum throughput of both the read
      and write channels of the memory subsystem. They have low branch
      misprediction sensitivity.
    * `memset`: This is a *write-only* operation. It writes a single,
      repeating byte value to a destination buffer, so it is limited only by
      write bandwidth. It can often be implemented more efficiently at the
      instruction level, sometimes leveraging specialized instructions like
      [STNP](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/STNP--Store-Pair-of-Registers--with-non-temporal-hint-)
      on ARM. It has low branch misprediction sensitivity.
    * `memcmp` and `bcmp`: These are *read-only*, *branch-intensive*
      comparison operations. They read data from two separate memory
      locations and compare them. Performance is limited by the speed of
      reading from both source buffers. They also have *high branch
      misprediction sensitivity*, as the loop terminates on the first
      mismatch.
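
The early exit that makes the comparison functions branch-sensitive can be
sketched as follows. This is a simplified byte-at-a-time loop with a made-up
function name; real `memcmp` implementations typically compare wider words and
differ by platform.

```python
def first_mismatch(a: bytes, b: bytes) -> int:
    """Return the index of the first differing byte, or -1 if the common prefix matches."""
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:  # data-dependent exit: the position depends on the input
            return i
    return -1


print(first_mismatch(b"abcdef", b"abcxef"))  # prints 3
print(first_mismatch(b"abcdef", b"abcdef"))  # prints -1
```

Because the exit position depends on the data, the branch predictor cannot know
in advance which iteration will be the last.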

## Compression

Benchmark     | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:------------ | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Compression   | Compute-Bound | High                         | Low                        | Low               | Medium               | Scalar, Integer, Memory Intensive, Bit Manipulation
Decompression | Compute-Bound | High                         | Low                        | Low               | High                 | Scalar, Integer, Memory Intensive, Bit Manipulation

* Compression is *read-intensive*, as this operation reads a large,
  uncompressed corpus and writes a smaller, compressed output. Performance is
  limited by the algorithm's internal processing speed, but is also highly
  sensitive to the memory bandwidth available for ingesting the input data.
  Compression has medium branch misprediction sensitivity: the algorithms use
  complex state machines and conditional branches for pattern matching and
  encoding, which can sometimes be difficult for the CPU to predict.
* Decompression is both *read and write intensive*. Performance is highly
  sensitive to both read and write memory bandwidth.
* Both have high memory bandwidth sensitivity and low memory latency
  sensitivity. The benchmark must read either the entire uncompressed corpus
  or the compressed data in a streaming fashion, so overall throughput is
  directly affected by how quickly this data can be fed into the CPU. Because
  of this sequential, streaming access pattern, the benchmark is not
  particularly sensitive to memory latency.
* Branch misprediction sensitivity is high for decompression because its state
  machines are driven by a stream of bits. Parsing the bitstream involves
  numerous conditional branches that can be unpredictable, leading to frequent
  pipeline stalls from branch mispredictions. Compression algorithms also
  contain a variety of conditional branches for state machines and pattern
  matching. While some of these branches are predictable, others, especially
  in the core matching loops, can lead to mispredictions that impact
  performance.
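
To illustrate why a bitstream-driven decoder is hard on the branch predictor,
the sketch below decodes a tiny prefix code one bit at a time. The code table
and function name are hypothetical and far simpler than any real compression
format; the point is only that each branch outcome is decided by the incoming
bits.

```python
CODE = {"0": "a", "10": "b", "11": "c"}  # hypothetical prefix code


def decode(bits: str) -> str:
    """Decode a bit string one bit at a time using the prefix code above."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in CODE:  # branch outcome depends on the incoming bits
            out.append(CODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)


print(decode("010110"))  # prints "abca"
```

The better the input is compressed, the closer its bits are to random, and the
less predictable the `if current in CODE` branch becomes.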

## Hashing

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Hot       | Compute-Bound | High                         | Low                        | High              | Low                  | Integer, Memory Intensive, Bit Manipulation
Cold      | Memory-Bound  | High                         | Low                        | Low               | Low                  | Integer, Memory Intensive, Bit Manipulation

* Hot benchmarks are compute-bound. When the data fits in the cache, the
  limiting factor is the CPU's speed in performing the hashing calculations.
  Cold benchmarks are memory-bound by main memory bandwidth. Since the input
  data is much larger than the cache, the CPU spends most of its time waiting
  for data to be streamed from main memory.
* Both hot and cold benchmarks are *highly sensitive to memory bandwidth*. The
  hot benchmarks are limited by cache bandwidth, while the cold benchmarks are
  limited by main memory bandwidth.
* `BM_HASHING_Extendcrc32cinternal` is a streaming workload for situations
  where data arrives in chunks and we want to calculate a running hash without
  re-processing all the previous data. `BM_HASHING_Computecrc32c` is a
  one-shot workload that measures the raw performance of hashing a single,
  complete buffer. `BM_HASHING_Combine_contiguous` is a composite workload: it
  does not just hash the data, but hashes the data combined with a small,
  fixed-size value (the length of the string) to prevent hash collisions.
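
The difference between the one-shot and streaming workloads can be sketched
with Python's `zlib.crc32`, which accepts a running value as its second
argument. Note the assumptions: `zlib.crc32` computes the standard CRC-32, not
CRC32C, so it is only a stand-in for the functions the benchmarks exercise, and
`hash_with_length` is a made-up helper that loosely mirrors the idea of mixing
in the length rather than reproducing any library's scheme.

```python
import zlib


def one_shot(data: bytes) -> int:
    """Hash a complete buffer in a single call."""
    return zlib.crc32(data)


def streaming(chunks) -> int:
    """Extend a running checksum chunk by chunk without re-reading earlier data."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc


def hash_with_length(data: bytes) -> int:
    """Fold the buffer length into the checksum after the data itself."""
    return zlib.crc32(len(data).to_bytes(8, "little"), zlib.crc32(data))


payload = b"fleetbench hashing example"
# Both paths agree because the running value carries all prior state.
print(one_shot(payload) == streaming([payload[:10], payload[10:]]))  # prints True
```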

## Cord

Benchmark | Primary Bound | Memory Bandwidth Sensitivity | Memory Latency Sensitivity | Cache Sensitivity | Branch Misprediction | Instruction Mix Notes
:-------- | :------------ | :--------------------------- | :------------------------- | :---------------- | :------------------- | :--------------------
Cord      | Memory-Bound  | High                         | High                       | High              | Medium               | Integer, Pointer Chasing

* The benchmark's performance is primarily determined by how efficiently the
  CPU can move and access data from memory. Unlike a contiguous `std::string`,
  an `absl::Cord` stores its data in a series of non-contiguous chunks, often
  linked together like a tree. Operations that require access to the entire
  data, such as copying the Cord to a `std::string` or comparing it to another
  string, involve traversing this fragmented data structure, making the
  benchmark heavily reliant on *memory bandwidth and latency*.
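
The traversal cost described above can be sketched with a minimal rope-like
tree. The `Leaf`, `Concat`, and `flatten` names are hypothetical and this is
not `absl::Cord`'s internal representation; it only shows that producing a
contiguous copy means walking the tree and touching every chunk.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    data: str  # one contiguous chunk


@dataclass
class Concat:
    left: "Node"
    right: "Node"


Node = Union[Leaf, Concat]


def flatten(node: Node) -> str:
    """Copy every chunk into one contiguous string, visiting leaves left to right."""
    parts, stack = [], [node]
    while stack:
        current = stack.pop()
        if isinstance(current, Leaf):
            parts.append(current.data)
        else:
            stack.append(current.right)  # pushed first so the left side is visited first
            stack.append(current.left)
    return "".join(parts)


rope = Concat(Concat(Leaf("fleet"), Leaf("ben")), Leaf("ch"))
print(flatten(rope))  # prints "fleetbench"
```

Following the child links is the pointer-chasing part (latency), and copying
the chunk contents is the data-movement part (bandwidth).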
