Description
Hello friends,
I’m a CPU architect at Ampere Computing, where I do performance analysis and workload characterization. I also serve on the SPEC CPU committee, working on benchmarks for the next version of SPEC CPU, CPUv8. We look for computationally intensive workloads in diverse fields to help measure performance across a wide variety of behaviors and application domains. Based on the longevity of nest, its large active community in biology, and its use in education, I have proposed that the nest neural network model be included in the next set of marquee benchmarks in SPEC CPU.
As part of the effort, we have ported and integrated the nest mainline code into the SPEC CPU harness so that it can be tested on a wide variety of systems in a controlled environment and produce reproducible results. We have even built it on native Windows using MSVC and the Intel compiler for Windows. We are happy to share the changes if someone is interested in testing them and integrating them back into the upstream mainline for the benefit of the community.
The piece we need help with is understanding the multithreaded workloads. Right now, we have single-threaded nest command lines which run and produce verifiable output across many compilers (llvm, gcc, icc, aocc, nvhpc, cray), ISAs (aarch64, x86, power) and operating systems (linux, windows, android). We verify each run by checking the .dat files produced by the simulation to make sure there are no differences in the output. A problem arises when we run with multiple threads: a different number of files is produced, and I am unfamiliar with how to coalesce them for verification.
First, some fundamental questions: Does a nest invocation with 8 threads perform the same amount of work as a run with 16 threads, or is the problem being solved larger? If it is the same, how can we verify that? Does the answer change based on the .sli script used?
In the example below, I run examples/nest/brunel-2000_newconnect_dc.sli (with a small edit to make it run longer) with 8 threads and with 16. It looks like both runs simulate the same number of neurons and synapses. The 8-thread version outputs 16 files and the 16-thread version outputs 24 files. The total line counts in the files are close (2811 vs. 2909). Do the files fundamentally contain the same information, just at different sample points?
$ ./nest_s_base.O3-64 --userargs=threads=8 brunel-2000_newconnect_dc_LONG.sli
NEST 3.5.0-post0.dev0 (C) 2004 The NEST Initiative
Configuring neuron parameters.
Creating the network.
Configuring neuron parameters.
Creating excitatory spike recorder.
Creating inhibitory spike recorder.
Connecting excitatory neurons.
Connecting inhibitory population.
Connecting spike recorders.
Starting simulation.
Brunel Network Simulation
Number of Threads : 8
Number of Neurons : 95000
Number of Synapses: 902500100
Excitatory : 722000000
Inhibitory : 180500000
Excitatory rate : 6.99 Hz
Inhibitory rate : 6.825 Hz
Building time : 23.76 s
Simulation time : 448.07 s
$ ls -l brunel-2-in-threaded-95002-* | wc -l
16
$ wc -l brunel-2-in-threaded-95002-* | tail -1
2811 total
$ ./nest_s_base.O3-64 --userargs=threads=16 brunel-2000_newconnect_dc_LONG.sli
NEST 3.5.0-post0.dev0 (C) 2004 The NEST Initiative
Configuring neuron parameters.
Creating the network.
Configuring neuron parameters.
Creating excitatory spike recorder.
Creating inhibitory spike recorder.
Connecting excitatory neurons.
Connecting inhibitory population.
Connecting spike recorders.
Starting simulation.
Brunel Network Simulation
Number of Threads : 16
Number of Neurons : 95000
Number of Synapses: 902500100
Excitatory : 722000000
Inhibitory : 180500000
Excitatory rate : 6.985 Hz
Inhibitory rate : 7.36 Hz
Building time : 35.39 s
Simulation time : 332.74 s
$ ls -l brunel-2-in-threaded-95002-* | wc -l
24
$ wc -l brunel-2-in-threaded-95002-* | tail -1
2909 total
Overall, the goal is to verify that these two command lines completed the same amount of work and calculated the same result. This allows a benchmark to run on systems with varying numbers of hardware cores, so that we can compare CPU performance across them. We are allowed to apply some tolerance to account for floating point rounding error.
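To make the kind of check we are after concrete, here is a minimal Python sketch, not what the harness does today. It rests on assumptions I have not confirmed: that each recorder file holds one event per line with a sender id and a time in ms as the first two whitespace-separated columns, and that header lines either start with "#" or are non-numeric. The function names and the tolerance value are made up for illustration.

```python
import glob
import math


def load_events(pattern):
    """Merge every file matching `pattern` into one sorted list of (sender, time_ms).

    Assumed layout (not confirmed): two leading columns, sender id then time in ms.
    """
    events = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            for line in fh:
                fields = line.split()
                if line.startswith("#") or len(fields) < 2:
                    continue  # comment header or blank line
                try:
                    sender, time_ms = int(fields[0]), float(fields[1])
                except ValueError:
                    continue  # non-numeric line, e.g. a column-name header
                events.append((sender, time_ms))
    # Sorting by sender first keeps the order stable under tiny time differences.
    events.sort()
    return events


def same_events(a, b, tol_ms=1e-6):
    """True if both runs hold the same senders at the same times, within tol_ms."""
    if len(a) != len(b):
        return False
    return all(
        sa == sb and math.isclose(ta, tb, abs_tol=tol_ms)
        for (sa, ta), (sb, tb) in zip(a, b)
    )
```

The idea would be to call load_events on the glob for each run's output (for example "brunel-2-in-threaded-95002-*", with each run's files kept in a separate directory) and pass both lists to same_events. Since the reported rates already differ slightly between the 8-thread and 16-thread runs above, an exact event-by-event match may not hold across thread counts, in which case we would need guidance on which aggregate quantities are meaningful to compare instead.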
For the multithreaded benchmark, I am exercising the scripts below. The goal is to showcase scalable threading performance, as well as cover a variety of behaviors in the nest simulator.
brunel-2000_newconnect_dc_LONG.sli
brunel_ps_LONG.sli
hpc_benchmark.sli
microcircuit.sli
If you have feedback on which are more or less useful as multithreaded benchmarks, please share your thoughts!
Thank you!