Ideally, TaoBench needs three machines: one as the server and two as clients. The server machine is the one that will be stressed and whose performance will be measured; the clients can be less powerful machines or machines with a different hardware configuration from the server.
To avoid a bottleneck in network I/O, we highly recommend placing the server and client machines on the same network. The ping latency between the machines should be in the range of 0.1ms to 0.15ms, and the network bandwidth should be at least 10Gbps for small to medium core count CPUs (<= 72 logical cores) or 20Gbps for high core count ones (>72 logical cores). The maximum NIC bandwidth we recommend is 50Gbps. If client nodes are not available, we offer an option to run the server and the clients on the same machine.
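As a quick pre-flight check, you can verify the link to a client with `ping` and `iperf3` (assuming `iperf3` is installed on both ends; `CLIENT_HOST` is a placeholder for one of your client machines). The helper below merely encodes the bandwidth recommendation from this section:

```shell
# Recommended minimum NIC bandwidth (Gbps) from this guide:
# 10 Gbps for <= 72 logical cores, 20 Gbps above that.
recommended_bw_gbps() {
  if [ "$1" -le 72 ]; then echo 10; else echo 20; fi
}

# Pre-flight checks against one client (placeholder hostname):
#   ping -c 10 "$CLIENT_HOST"   # average latency should be ~0.1-0.15 ms
#   iperf3 -c "$CLIENT_HOST"    # throughput should meet the value below
echo "need at least $(recommended_bw_gbps "$(nproc)") Gbps"
```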
Currently the TaoBench server can run on CentOS Stream 8 or 9; however, we recommend CentOS Stream 8 with Linux kernel 5.19 for best performance. In addition, if IOMMU is enabled on your system, please make sure IOMMU passthrough is set in the kernel boot cmdline (`iommu=pt` for x86_64 and `iommu.passthrough=1` for ARM); otherwise the system will soft lock up in network I/O when running TaoBench and end up with very low performance.
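A quick way to verify this setting is to grep the running kernel's cmdline; a minimal sketch (the helper name is illustrative, not part of the benchmark):

```shell
# Check a kernel cmdline string for the IOMMU passthrough flags
# named above: iommu=pt (x86_64) or iommu.passthrough=1 (ARM).
has_iommu_pt() {
  echo "$1" | grep -qE 'iommu=pt|iommu\.passthrough=1'
}

if has_iommu_pt "$(cat /proc/cmdline)"; then
  echo "IOMMU passthrough is set"
else
  echo "WARNING: if IOMMU is enabled, add iommu=pt (x86_64) or iommu.passthrough=1 (ARM)"
fi
```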
Run the following command under the DCPerf OSS repo (assuming you are logged in as root and have internet access):
./benchpress_cli.py install tao_bench_autoscale
This should automatically build and install all dependencies as well as TaoBench server and client binaries.
When installing TaoBench, the installer script will download and build OpenSSL 3.3.2 and link the TaoBench server and client binaries against it. If you wish to use a different version, you can prefix the installation command with the `OPENSSL_BRANCH` variable. The value of the variable should be a valid tag name in OpenSSL's GitHub repo. For example, to use OpenSSL 1.1.1w, run the following:
OPENSSL_BRANCH=OpenSSL_1_1_1w ./benchpress_cli.py install tao_bench_autoscale
NOTE: TaoBench requires gflags-devel-2.2.2. If you have installed an older version of gflags to support other software or benchmarks (e.g. Mediawiki) after installing TaoBench, please reinstall gflags-devel-2.2.2 by running `dnf install -y gflags-devel-2.2.2` before running TaoBench.
`tao_bench_autoscale` can automatically scale the TaoBench server based on your system's CPU core count. The number of TaoBench server instances to spawn equals the number of logical CPU cores divided by 72, rounded up. This job will also distribute CPU cores and memory evenly among all the TaoBench servers.
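The scaling rule above can be sketched with integer ceiling division (`num_instances` is an illustrative name, not a benchmark flag):

```shell
# Autoscale rule: instances = ceil(logical_cores / 72),
# computed with integer arithmetic.
num_instances() {
  echo $(( ($1 + 71) / 72 ))
}

num_instances "$(nproc)"
```

For example, a 380-core machine gets `num_instances 380` = 6 server instances.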
This job provides the following optional parameters that you may be interested in:
- `num_servers`: Number of TaoBench servers to spawn. By default, this benchmark will spawn `ceil(NPROC / 72)` TaoBench server instances; you can override the default automatic scaling by specifying this parameter.
- `memsize`: Total amount of memory to use by the benchmark, in GB. The default value is the total amount of available system memory (`MemTotal` in `/proc/meminfo`), but you can override it by specifying this parameter if you don't want this benchmark to use all system memory.
- `interface_name`: The name of the NIC that connects to the clients. Optional, default is `eth0`. You will need to specify this parameter if the name of your NIC is not `eth0`.
- `warmup_time`: Time to warm up the TaoBench server to let it reach a steady state, in seconds. The default is `10 * memsize` or 1200, whichever is greater.
- `test_time`: Time to stress the TaoBench server and measure performance, in seconds. Optional, default is 720.
- `bind_cpu`: When running on machines with multiple NUMA nodes, setting this to 1 will bind each server instance to the CPU node(s) that its assigned CPU cores belong to. This can minimize cross-socket traffic. By default this is set to 1; you can set it to 0 to disable.
- `bind_mem`: When running on machines with multiple NUMA nodes, setting this to 1 will bind each server instance to the memory node that is local to the CPU cores assigned to run that instance. This can also minimize cross-socket traffic. By default it's set to 1. You may want to set this to 0 if you would like to test heterogeneous memory systems such as CXL systems; otherwise the benchmark will not be able to use the CXL memory.
- `sanity`: Sanity check that measures the network bandwidth and latency between the server and the client. The default is 0, which skips the sanity check. Please set this to 1 if you would like to run the sanity check and measure network bandwidth and latency.
- `port_number_start`: The starting port number for the TaoBench servers to listen on. Optional, default is 11211.
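The default `warmup_time` rule above can be sketched as follows (`default_warmup_time` is an illustrative name, not a benchmark flag):

```shell
# Default warmup_time: max(10 * memsize_in_GB, 1200) seconds.
default_warmup_time() {
  local t=$(( 10 * $1 ))
  if [ "$t" -gt 1200 ]; then echo "$t"; else echo 1200; fi
}

default_warmup_time 384   # a 384GB machine warms up for 3840 s
```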
The following parameters are used for generating client-side instructions and will not have substantial effects on the server:
- `server_hostname`: Hostname or IP address of the server machine to which the client machines can connect. The default value is the return value of the Python function `socket.gethostname()` or the output of the `hostname` command. If the current hostname of your server machine is not resolvable by other machines (e.g. `localhost`), please set this parameter to a resolvable hostname or an IP address that others can use to connect to the server.
- `num_clients`: The number of client machines. Default is 2, but you can change this if you plan to use a different number of client machines.
You can simply run this job without setting any parameters if the following conditions are met:
- You are satisfied with TaoBench's automatic scaling (if not, please set `num_servers`)
- You intend to use all available system memory (if not, please set `memsize`)
- The NIC on the server machine is called `eth0` in `ifconfig` (if not, please set `interface_name`)
- You are using 2 client machines (if not, please set `num_clients`)
- The hostname of the server machine is resolvable by others (if not, please set `server_hostname`)
Example usage on a 380-core machine with 384GB of memory:
[root@<server-hostname> ~/external]# ./benchpress_cli.py run tao_bench_autoscale
Will run 1 job(s)
Running "tao_bench_autoscale": Spawns multiple Tao benchmark servers depending on the CPU core count. After executing this job, please see the tail of benchpress.log for instructions
on starting clients. MAKE SURE to start clients within 1 minute.
Job execution command: ['./packages/tao_bench/run_autoscale.py', '--num-servers=0', '--memsize=384', '--interface-name=eth0', '--port-number-start=11211', '--test-time=720', '--real']
Please run the following commands **simultaneously** on all the client machines.
Client 1:
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11211}'
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11213}'
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11215}'
Client 2:
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11212}'
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11214}'
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 64, "warmup_time": 3840, "test_time": 720, "server_port_number": 11216}'
......
stderr:
Results Report:
{
......
}
Finished running "tao_bench_autoscale": Spawns multiple Tao benchmark servers depending on the CPU core count. After executing this job, please see the tail of benchpress.log for instructions on starting clients. MAKE SURE to start clients within 1 minute. with uuid: 1db2d5f6
As you can see, after executing `./benchpress_cli.py run tao_bench_autoscale`, it will show instructions on what commands to run on each of the client machines. Please copy the commands to the terminals of the client machines and try your best to run them simultaneously. We recommend using tmux to achieve this.
If you need to specify one or more of the parameters, please pass them to the `-i` argument of the `./benchpress_cli.py run` command in JSON format. For example, if you would like to run four TaoBench server instances, use a total of 512GB of memory, and your NIC is named `enP8s6`, then the command would be:
./benchpress_cli.py run tao_bench_autoscale -i '{"num_servers": 4, "memsize": 512, "interface_name": "enP8s6"}'
To start the clients and the server with autoscale on the same benchmarking machine:
./benchpress_cli.py run tao_bench_standalone
All the parameters used with autoscale can also be used with standalone, except `interface_name`.
Once the tao_bench benchmark finishes, benchpress will report the results in JSON format on the server machine, like the following:
{
"benchmark_args": [
"--num-servers={num_servers}",
"--memsize={memsize}",
"--interface-name={interface_name}",
"--port-number-start=11211",
"--test-time={test_time}",
"--real"
],
"benchmark_desc": "Spawns multiple Tao benchmark servers depending on the CPU core count. After executing this job, please see the tail of benchpress.log for instructions on starting clients. MAKE SURE to start clients within 1 minute.",
"benchmark_hooks": [
"tao_instruction: None",
"copymove: {'is_move': True, 'after': ['benchmarks/tao_bench/server*.csv', 'tao-bench-server-*.log']}"
],
"benchmark_name": "tao_bench_autoscale",
"machines": [
{
"cpu_architecture": "x86_64",
"cpu_model": "<CPU name>",
"hostname": "<server-hostname>",
"kernel_version": "5.19.0-0_xxxx",
"mem_total_kib": "393216 KiB",
"num_logical_cpus": "380",
"os_distro": "centos",
"os_release_name": "CentOS Stream 9"
}
],
"metadata": {
"L1d cache": "6 MiB (192 instances)",
"L1i cache": "6 MiB (192 instances)",
"L2 cache": "192 MiB (192 instances)",
"L3 cache": "768 MiB (24 instances)"
},
"metrics": {
"fast_qps": 3235784.8206896544,
"hit_ratio": 0.8832408870106071,
"num_data_points": 348,
"role": "server",
"slow_qps": 427937.3827586206,
"spawned_instances": 6,
"successful_instances": 6,
"total_qps": 3663722.2034482756,
"score": 5.364
},
"run_id": "1db2d5f6",
"timestamp": 1702658374
}
The metric `total_qps` measures TaoBench performance and is the one we care about the most. `hit_ratio` is expected to be between 0.88 and 0.9, and there should be `58 * num_servers` valid data points. We expect the CPU utilization of this benchmark to be roughly between 70% and 80%, of which the user utilization ranges between 15% and 30%.
(The result above is only an example that shows the format of the result report. It is not a reference score of any particular machine.)
On the server machine, benchpress will generate time-series QPS metrics CSV tables at `benchmark_metrics_<run_id>/server_N.csv` for each TaoBench server instance under benchpress's folder. There will be individual logs at `benchmark_metrics_<run_id>/tao-bench-server--<yymmdd_HHMMSS>.log` for each instance.
This is a newly introduced TaoBench workload that aims to provide better and more stable scalability on servers equipped with ultra-high core count CPUs (>=200 logical cores) and to reduce overhead from timer IRQs. If you find your systems run into a bottleneck and get lower-than-expected scores with very low user CPU utilization (less than 15%), you can try running this job. However, the results obtained from this job cannot be directly compared with results from the regular `tao_bench_autoscale` job, because `tao_bench_autoscale_v2_beta` makes some substantial changes to the workload logic.
This workload job is marked beta because we are still analyzing how its hot-function profile changes in comparison to the original workload, as well as validating its representativeness of the actual production workload.
In addition to the parameters supported by `tao_bench_autoscale`, this job has the following additional parameters:
- `slow_threads_use_semaphore`: Whether to use a semaphore, instead of `nanosleep()`, to wait for incoming requests in the slow threads. Set to 1 to enable and 0 to disable. Default is 1. Using a semaphore greatly reduces the overhead from hardware timer IRQs and thus improves overall performance and scalability.
- `pin_threads`: Pin each thread in the TaoBench server to a dedicated logical CPU core. This can reduce thread-scheduling overhead, especially as the number of CPU cores grows. Set to 1 to enable and 0 to disable. Default is 0.
- `conns_per_server_core`: TCP connections from clients per server CPU core. Default is 85. Increasing this will put more TCP connection pressure on the server and result in lower performance results.
Usage and reporting are the same as for the `tao_bench_autoscale` job; please refer to the guide for that job and just replace the job name. A good run of this workload should have about 75-80% CPU utilization in the steady state, of which 25-30% should be in userspace.
NOTE: We recommend using `tao_bench_autoscale` to start the server, as this job only supports a single server instance.
We also provide another job, `tao_bench_custom`, which exposes more parameters for you to customize TaoBench runs. It has the following roles and parameters:
- role `server` - used on the server machine
  - `interface_name` - The name of the NIC that connects to the clients. Optional, default is `eth0`.
  - `memsize` - Memory capacity in GB. Optional, default is 64.
  - `port_number` - Port to listen on. Optional, default is 11211.
  - `warmup_time` - Time to warm up the TaoBench server, in seconds. Optional, default is 1200.
  - `test_time` - Time to stress the TaoBench server and measure performance, in seconds. Optional, default is 360.
- role `client` - used on the client machines
  - `server_hostname` - The IP address or hostname of the server machine. Mandatory.
  - `server_memsize` - The memory capacity of the server, in GB. Optional, default is 64.
  - `clients_per_thread` - Number of client workers to launch per logical CPU core. Optional, default is 380. NOTE: If `clients_per_thread * (nproc - 6)` is greater than 32K, please adjust this parameter to bring down the total client connections. For example, if the client machine has 176 logical cores, the maximum `clients_per_thread` will be 32768 / (176 - 6) = 192.
  - `server_port_number` - The port to connect to on the server. Optional, default is 11211.
  - `warmup_time` - Time to warm up the TaoBench server, in seconds. Optional, default is 1200.
  - `test_time` - Time to stress the TaoBench server and measure performance, in seconds. Please make sure `warmup_time` and `test_time` are consistent with what has been set on the server side.
For example, if your server machine has 128GB of memory, the NIC name is "enp0s10", and you want to set the warmup time to 1800s and test time to 540s, the commands will be:
Server:
./benchpress_cli.py run tao_bench_custom -r server -i '{"interface_name": "enp0s10", "memsize": 128, "warmup_time": 1800, "test_time": 540}'
Clients:
./benchpress_cli.py run tao_bench_custom -r client -i '{"server_hostname": "<server-hostname>", "server_memsize": 128, "warmup_time": 1800, "test_time": 540}'
This job is deprecated and no longer recommended. `tao_bench_autoscale` is the preferred way to start the TaoBench server, and it can give instructions on what commands to run on the client machines.
If the server machine has 64GB of RAM and the CPU has no more than 72 logical cores, you can just run the default job `tao_bench_64g`. This job has the following roles and parameters:
- role `server` - used on the server machine
  - `interface_name`
- role `client` - used on the client machine
  - `server_hostname` - The IP address or hostname of the server machine. Mandatory.

Example usage:
On the server machine:
./benchpress_cli.py run tao_bench_64g -r server
On client machines:
./benchpress_cli.py run tao_bench_64g -r client -i '{"server_hostname": "<server-hostname>"}'
NOTE: If the number of logical cores multiplied by 380 is greater than 32K on the client machines, you will need to adjust the `clients_per_thread` parameter on the client side such that `clients_per_thread * (NPROC - 6)` is less than 32768 (NPROC is the number of logical cores). In this case you will need to use the `tao_bench_custom` job on the clients. Please refer to the `tao_bench_custom` section above.
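The cap described in the note can be sketched as follows (`max_clients_per_thread` is an illustrative helper name, not a benchmark flag):

```shell
# clients_per_thread cap: the default is 380, but
# clients_per_thread * (NPROC - 6) must stay under 32768.
max_clients_per_thread() {
  local nproc=$1
  local cap=$(( 32768 / (nproc - 6) ))
  if [ "$cap" -lt 380 ]; then echo "$cap"; else echo 380; fi
}

max_clients_per_thread 176   # a 176-core client machine -> 192
```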
Once the tao_bench benchmark finishes, benchpress will report the results in JSON format on the server machine, like the following:
{
  "benchmark_args": [],
  "benchmark_desc": "Tao benchmark using 64GB memory. MAKE SURE to start clients within 1 minute.",
  "benchmark_name": "tao_bench_64g",
  "machines": [
    {
      "cpu_architecture": "<arch: x86_64 or aarch64>",
      "cpu_model": "<CPU model name>",
      "hostname": "<hostname>",
      "kernel_version": "5.6.13",
      "mem_total_kib": "65385740 KiB",
      "num_logical_cpus": "50",
      "os_distro": "centos",
      "os_release_name": "CentOS Stream 8"
    }
  ],
  "metadata": {
    "L1d cache": "32K",
    "L1i cache": "32K",
    "L2 cache": "1024K",
    "L3 cache": "36608K"
  },
  "metrics": {
    "fast_qps": 681564.6862068963,
    "hit_ratio": 0.8996873753935061,
    "num_data_points": 58,
    "role": "server",
    "slow_qps": 75992.55517241379,
    "total_qps": 757557.2413793101
  },
  "run_id": "8168af24",
  "timestamp": 1649914173
}
If TaoBench failed to install and you find the following build error near the end of the output:
libs/python/src/numpy/dtype.cpp: In member function ‘int boost::python::numpy::dtype::get_itemsize() const’:
libs/python/src/numpy/dtype.cpp:101:83: error: ‘PyArray_Descr’ {aka ‘struct _PyArray_Descr’} has no member named ‘elsize’
101 | int dtype::get_itemsize() const { return reinterpret_cast<PyArray_Descr*>(ptr())->elsize;}
| ^~~~~~
...failed gcc.compile.c++ bin.v2/libs/python/build/gcc-11/release/debug-symbols-on/link-static/pch-off/python-3.9/threading-multi/visibility-global/numpy/dtype.o...
gcc.compile.c++ bin.v2/libs/python/build/gcc-11/release/debug-symbols-on/link-static/pch-off/python-3.9/threading-multi/visibility-global/numpy/numpy.o
This is because the Boost library that comes with Folly is not compatible with NumPy 2.0. Please resolve the build error by downgrading NumPy to 1.26 (by running `pip3 install 'numpy<2'`).
First, please make sure the open files limit (`ulimit -n`) on your system is large enough. We recommend at least 64K for this limit.
If the error still exists after raising the open files limit, please reduce the value of the `clients_per_thread` parameter in your client commands. By default this parameter is set to the lower of 380 or `floor(32768 / (NPROC - 6))` (`NPROC` is the number of logical cores on your client node). On some systems you might need to reduce this value even further (e.g. to half of the default value) to avoid the "cannot assign requested address" error.
If (one of) the clients exits early with the error message "TLS connection error: (null)", it is likely because a Memcached service is already running and occupying port 11211 on the server machine. In this case, please try stopping the Memcached service and relaunching the TaoBench benchmark.