Commit bcca760

rjogrady authored and copybara-github committed
Update fleetbench readme to account for removing the default args on the benchmark BUILD rules.

Add more details about how to override TCMalloc.

PiperOrigin-RevId: 542589689
Change-Id: Iebb8db769b3682642e4d8091b025dc5721f7ec8b
1 parent 7df9d57 commit bcca760

README.md

Lines changed: 37 additions & 49 deletions
@@ -37,8 +37,6 @@ are 3 levels of fidelity that we consider:
 1. The suite's performance counters match production.
 1. An optimization's impact on the suite matches the impact on production.

-The goal of the suite for Y22 is to achieve the first level of fidelity.
-
 ## Versioning

 Fleetbench uses [semantic versioning](http://semver.org) for its releases, where
@@ -67,14 +65,15 @@ optimizations.

 ### TCMalloc per-CPU Mode

-TCMalloc is the underlying memory allocator in this benchmark suite. The
-supposed default operation mode should be
-[per-CPU mode](https://google.github.io/tcmalloc/overview.html). RSEQ is
-required for this mode, however, glibc took control of it since version 2.35,
-and TCMalloc reverts to using per-thread caching instead
-([more info](https://github.com/google/tcmalloc/issues/144)). We **strongly
-recommend** adding environment variable: `GLIBC_TUNABLES=glibc.pthread.rseq=0`
-to ensure per-CPU mode is being applied when running the benchmark. For example:
+TCMalloc is the underlying memory allocator in this benchmark suite. By default
+it operates in [per-CPU mode](https://google.github.io/tcmalloc/overview.html).
+
+However, [RSEQ](https://lwn.net/Articles/883104/) is required for this to work.
+
+To avoid [conflicts](https://github.com/google/tcmalloc/issues/144) with glibc's
+use of RSEQ, we **strongly recommend** setting the environment variable:
+`GLIBC_TUNABLES=glibc.pthread.rseq=0` to ensure per-CPU mode is being applied
+when running the benchmark. For example:

 ```
 GLIBC_TUNABLES=glibc.pthread.rseq=0 bazel run --config=opt fleetbench/swissmap:hot_swissmap_benchmark
@@ -109,57 +108,38 @@ Use `--config=westmere` for Westmere-era processors.

 ### Running Benchmarks

-Swissmap benchmark for cold access setup takes much longer to run to completion,
-so by default it has a `--benchmark_filter` flag set to narrow down to smaller
-set sizes of `16` and `64` elements:
+Swissmap benchmark for cold access setup takes a long time to run to completion.
+We suggest using the `--benchmark_filter` flag to narrow down to smaller set
+sizes of e.g. `16` and `64` elements:

 ```
-bazel run --config=opt fleetbench/swissmap:cold_swissmap_benchmark
+bazel run --config=opt fleetbench/swissmap:cold_swissmap_benchmark -- \
+--benchmark_filter=".*set_size:(16|64).*"
 ```

 To change this filter, you can specify a regex in `--benchmark_filter` flag
 ([more info](https://github.com/google/benchmark/blob/main/docs/user_guide.md#running-a-subset-of-benchmarks)).
 Example to run for only sets of `16` and `512` elements:

 ```
-bazel run --config=opt fleetbench/swissmap:cold_swissmap_benchmark -- --benchmark_filter=".*set_size:(16|512).*"
+bazel run --config=opt fleetbench/swissmap:cold_swissmap_benchmark -- \
+--benchmark_filter=".*set_size:(16|512).*"
 ```

-The protocol buffer benchmark is set to run for at least 3s by default:
-
-```
-bazel run --config=opt fleetbench/proto:proto_benchmark
-```
-
-To change the duration to 30s, run the following:
+To extend the runtime of a benchmark, e.g. to collect more profile samples, use
+--benchmark_min_time.

 ```
 bazel run --config=opt fleetbench/proto:proto_benchmark -- --benchmark_min_time=30s
 ```

-The TCMalloc Empirical Driver benchmark can take ~1hr to run all benchmarks:
+The TCMalloc Empirical Driver benchmark can take ~1hr to run all benchmarks, so
+running a subset may be advised.

 ```
 bazel run --config=opt fleetbench/tcmalloc:empirical_driver -- --benchmark_counters_tabular=true
 ```

-To build and execute the benchmark in separate steps, run the commands below.
-
-NOTE: you'll need to specify the flags `--benchmark_filter` and
-`--benchmark_min_time` explicitly when build and execution are split into two
-separate steps.
-
-```
-bazel build --config=opt fleetbench/swissmap:hot_swissmap_benchmark
-bazel-bin/fleetbench/swissmap/hot_swissmap_benchmark --benchmark_filter=all
-```
-
-NOTE: the suite will be expanded with the ability to execute all benchmarks with
-one target.
-
-WARNING: MacOS and Windows have not been tested, and are not currently supported
-by Fleetbench.
-
 ### Reducing run-to-run variance

 It is expected that there will be some variance in the reported CPU times across
@@ -175,15 +155,15 @@ list of techniques that help with reducing run-to-run latency variance:
 `--benchmark_repetitions`.
 * Recommended by the benchmarking framework
 [here](https://github.com/google/benchmark/blob/main/docs/reducing_variance.md#reducing-variance-in-benchmarks):
-* Disable frequently scaling,
-* Bind the process to a core by setting its affinity,
-* Disable processor boosting,
+* Disable frequency scaling
+* Bind the process to a core by setting its affinity
+* Disable processor boosting
 * Disable Hyperthreading/SMT (should not affect single-threaded
-benchmarks).
+benchmarks)
 * NOTE: We do not recommend reducing the working set of the benchmark to
 fit into L1 cache, contrary to the recommendations in the link, as it
 would significantly reduce this benchmarking suite's representativeness.
-* Disable memory randomization (ASLR).
+* Disable memory randomization (ASLR)

 ## Future Work

@@ -243,10 +223,18 @@ bazel run --config=clang --config=opt --features=thin_lto fleetbench/proto:proto

 1. Q: Can I run Fleetbench without TCMalloc?

-A: Fleetbench is built with Bazel, which supports --custom_malloc option
-([bazel docs](https://bazel.build/docs/user-manual#custom-malloc)). This
-should allow you to override the malloc attributed configured to take
-tcmalloc as the default.
+A: Yes. Specify `--custom_malloc="@bazel_tools//tools/cpp:malloc"` on the
+bazel command line to override with the system allocator.
+
+1. Q: Can I run with Address Sanitizer?
+
+A: Yes. Note that you need to override TCMalloc as well for ASAN to work.
+
+Example:
+
+```
+bazel build --custom_malloc="@bazel_tools//tools/cpp:malloc" -c opt fleetbench/proto:proto_benchmark --copt=-fsanitize=address --linkopt=-fsanitize=address
+```

 1. Q: Are the benchmarks fixed in nature?

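Note on the `--benchmark_filter` patterns in the README text above: the sketch below illustrates how patterns such as `.*set_size:(16|64).*` select benchmarks, assuming the filter is applied as an unanchored regex search over benchmark names. The benchmark names are hypothetical and only mimic the `set_size:N` fragment that the patterns refer to; the names printed by the real swissmap benchmarks may differ. Python's `re` module stands in for the regex engine the benchmark binary uses.

```python
import re

# Hypothetical benchmark names, invented for illustration only.
NAMES = [
    "BM_Example/set_size:16/density:0",
    "BM_Example/set_size:64/density:0",
    "BM_Example/set_size:512/density:0",
    "BM_Example/set_size:1600/density:0",
]


def select(pattern, names):
    """Return the names in which `pattern` matches anywhere (regex search)."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


# The two patterns quoted in the README text.
small = select(r".*set_size:(16|64).*", NAMES)
mixed = select(r".*set_size:(16|512).*", NAMES)

# A stricter variant (an assumption, not taken from the README): require the
# number to end at a "/" or at the end of the name, so 1600 is not picked up
# by the "16" alternative.
strict = select(r"set_size:(16|64)(/|$)", NAMES)

for label, chosen in (("small", small), ("mixed", mixed), ("strict", strict)):
    print(label, chosen)
```

Because the README patterns end in `.*`, the `16` alternative also matches any larger size that merely starts with `16`. Whether that matters depends on which set sizes the benchmark actually registers.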