Commit a7afcba

ax3l and WeiqunZhang authored
amrex.omp_threads: Can Avoid SMT (AMReX-Codes#3607)
## Summary

In all our applications in BLAST, the OpenMP default to use all [logical cores on modern CPUs](https://en.wikipedia.org/wiki/Simultaneous_multithreading) results in significantly slower performance than just using the physical cores with AMReX.

Thus, we introduce a new option `amrex.omp_threads` that enables control over the OpenMP threads at startup and has - for most popular systems - an implementation to find out the actual number of physical cores and default to it.

For codes and users that change the default to `amrex.omp_threads = nosmt`, the `OMP_NUM_THREADS` variable will still take precedence. This is a bit unusual (because CLI options usually have higher precedence than env vars - and they do if the user provides a number here), but done intentionally: this way, codes like WarpX can set the `nosmt` default and HPC job scripts will set the exact, preferably benchmarked number of threads as usual without surprises.

- [x] document

## Tests Performed for AMReX OMP Backend

Tests were performed with very small examples, the WarpX 3D LWFA test as checked in, or the AMReX AMRCore 3d test.

- [x] Ubuntu 22.04 Laptop w/ 12th Gen Intel i9-12900H: @ax3l
  - 20 logical cores; the first 12 logical cores use 2x SMT/HT
  - 20 virtual (default) -> 14 physical (`amrex.omp_threads = nosmt`)
  - faster runtime!
- [x] Perlmutter (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - [CPU node](https://docs.nersc.gov/systems/perlmutter/architecture/) with 2x [AMD EPYC 7763](https://www.amd.com/en/products/cpu/amd-epyc-7763)
  - 2x SMT
  - 256 default, 128 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] Frontier (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - 1x AMD EPYC 7763 64-Core Processor (w/ 2x SMT enabled)
  - 2x SMT
  - 128 default
  - 64 with `amrex.omp_threads = nosmt`
  - faster runtime!
  - The ideal result might also be lower, due to first cores used by OS and [low-noise cores](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#low-noise-mode-layout) after that. But that is an orthogonal question and should be set in job scripts: `#SBATCH --ntasks-per-node=8` `#SBATCH --cpus-per-task=7` `#SBATCH --gpus-per-task=1`
- [x] Summit (RHEL 8.2, kernel 4.18.0)
  - 2x IBM Power9 (each 22 physical cores, each 6 disabled/hidden for OS?, 4x SMT enabled; cpuinfo says 128 total)
  - 4x SMT
  - 128 default, 32 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] [Lassen](https://hpc.llnl.gov/hardware/compute-platforms/lassen) (RHEL 7.9, kernel 4.14.0)
  - 2x IBM Power9 (each 22 physical cores, each 2 reserved for OS?, 4x SMT enabled)
  - 4x SMT
  - 160 default, 44 with `amrex.omp_threads = nosmt`
  - faster runtime!
  - The ideal result might be even down to 40, but that is an orthogonal question and should be set in job scripts.
- [x] macOS M1 (arm64/aarch64) mini:
  - no SMT/HT
  - 8 default, 8 with `amrex.omp_threads = nosmt`
- [x] macOS (OSX Ventura 13.5.2, 2.8 GHz Quad-Core Intel Core i7-8569U) Intel x86_64 @n01r
  - 2x SMT
  - 8 default, 4 with `amrex.omp_threads = nosmt`
  - faster runtime!
- [x] macOS (OSX Ventura 13.5.2) M1 Max on mac studio @RTSandberg
  - no SMT/HT
  - 10 default, 10 with `amrex.omp_threads = nosmt`
- [ ] some BSD/FreeBSD system?
  - no user requests
  - low priority, we just keep the default for now
- [ ] Windows... looking for a system

## Additional background

## Checklist

The proposed changes:

- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

---------

Co-authored-by: Weiqun Zhang <[email protected]>
1 parent 606a94c commit a7afcba

7 files changed: +224 -7 lines changed

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+.. _Chap:InputsComputeBackends:
+
+Compute Backends
+================
+
+The following inputs must be preceded by ``amrex.`` and determine runtime options of CPU or GPU compute implementations.
+
++------------------------+-----------------------------------------------------------------------+-------------+------------+
+| Parameter              | Description                                                           | Type        | Default    |
++========================+=======================================================================+=============+============+
+| ``omp_threads``        | If OpenMP is enabled, this can be used to set the default number of   | String      | ``system`` |
+|                        | threads. The special value ``nosmt`` can be used to avoid using       | or Int      |            |
+|                        | threads for virtual cores (aka Hyperthreading or SMT), as is default  |             |            |
+|                        | in OpenMP, and instead only spawns threads equal to the number of     |             |            |
+|                        | physical cores in the system.                                         |             |            |
+|                        | For the values ``system`` and ``nosmt``, the environment variable     |             |            |
+|                        | ``OMP_NUM_THREADS`` takes precedence. For Integer values,             |             |            |
+|                        | ``OMP_NUM_THREADS`` is ignored.                                       |             |            |
++------------------------+-----------------------------------------------------------------------+-------------+------------+
+
+For GPU-specific parameters, see also the :ref:`GPU chapter <sec:gpu:parameters>`.
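
The precedence rules described in the table can be sketched as a small standalone function. This is only an illustration under stated assumptions: `resolve_omp_threads` and its parameters are hypothetical names and not part of AMReX, and, unlike the committed code, the sketch rejects trailing characters and non-positive integers.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper (not AMReX API). Returns the thread count that should
// be set explicitly, or std::nullopt to leave the OpenMP runtime's own
// choice (which already honors OMP_NUM_THREADS) untouched.
inline std::optional<int> resolve_omp_threads (std::string const& setting,
                                               bool env_is_set,
                                               int physical_cores)
{
    assert(physical_cores > 0);

    // "system": keep the runtime default or OMP_NUM_THREADS
    if (setting == "system") { return std::nullopt; }

    // "nosmt": OMP_NUM_THREADS wins if present, else one thread per physical core
    if (setting == "nosmt") {
        if (env_is_set) { return std::nullopt; }
        return physical_cores;
    }

    // Integer value: overrides OMP_NUM_THREADS
    try {
        std::size_t pos = 0;
        int const n = std::stoi(setting, &pos);
        if (pos == setting.size() && n > 0) { return n; }
    } catch (...) {
        // not a number: fall through and change nothing
    }
    return std::nullopt;
}
```

For example, `resolve_omp_threads("nosmt", true, 64)` yields no value, so the runtime's handling of `OMP_NUM_THREADS` stays in effect, while `resolve_omp_threads("4", true, 64)` yields 4.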

Docs/sphinx_documentation/source/Inputs_Chapter.rst

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Run-time Inputs
    InputsProblemDefinition
    InputsTimeStepping
    InputsLoadBalancing
+   InputsComputeBackends
    InputsPlotFiles
    InputsCheckpoint

Src/Base/AMReX.cpp

Lines changed: 11 additions & 4 deletions
@@ -52,6 +52,7 @@
 #endif

 #ifdef AMREX_USE_OMP
+#include <AMReX_OpenMP.H>
 #include <omp.h>
 #endif

@@ -72,7 +73,9 @@
 #include <iostream>
 #include <iomanip>
 #include <new>
+#include <optional>
 #include <stack>
+#include <string>
 #include <thread>
 #include <limits>
 #include <vector>
@@ -459,15 +462,17 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
 #endif

 #ifdef AMREX_USE_OMP
+    amrex::OpenMP::init_threads();
+
+    // status output
     if (system::verbose > 0) {
         // static_assert(_OPENMP >= 201107, "OpenMP >= 3.1 is required.");
         amrex::Print() << "OMP initialized with "
                        << omp_get_max_threads()
                        << " OMP threads\n";
     }
-#endif

-#if defined(AMREX_USE_MPI) && defined(AMREX_USE_OMP)
+    // warn if over-subscription is detected
     if (system::verbose > 0) {
         auto ncores = int(std::thread::hardware_concurrency());
         if (ncores != 0 && // It might be zero according to the C++ standard.
@@ -476,8 +481,10 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
             amrex::Print(amrex::ErrorStream())
                 << "AMReX Warning: You might be oversubscribing CPU cores with OMP threads.\n"
                 << " There are " << ncores << " cores per node.\n"
-                << " There are " << ParallelDescriptor::NProcsPerNode() << " MPI ranks per node.\n"
-                << " But OMP is initialized with " << omp_get_max_threads() << " threads per rank.\n"
+#if defined(AMREX_USE_MPI)
+                << " There are " << ParallelDescriptor::NProcsPerNode() << " MPI ranks (processes) per node.\n"
+#endif
+                << " But OMP is initialized with " << omp_get_max_threads() << " threads per process.\n"
                 << " You should consider setting OMP_NUM_THREADS="
                 << ncores/ParallelDescriptor::NProcsPerNode() << " or less in the environment.\n";
         }
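
The over-subscription warning above can be illustrated with a standalone sketch. The exact comparison sits in lines hidden between the two hunks, so the condition used here (logical cores smaller than processes per node times threads per process) is an assumption, and `suggest_threads` is a hypothetical name that does not exist in AMReX. The sketch also clamps the suggestion to at least 1, which the committed message does not do.

```cpp
#include <algorithm>
#include <optional>

// Hypothetical helper (not AMReX API). Returns a suggested number of threads
// per process if the node looks oversubscribed, or std::nullopt otherwise.
inline std::optional<int> suggest_threads (int ncores,
                                           int procs_per_node,
                                           int threads_per_proc)
{
    // std::thread::hardware_concurrency() may report 0 if the value is unknown
    if (ncores <= 0 || procs_per_node <= 0) { return std::nullopt; }

    // Assumed check: total requested threads fit on the logical cores
    if (ncores >= procs_per_node * threads_per_proc) { return std::nullopt; }

    // Same quotient as in the warning text, clamped to at least one thread
    return std::max(1, ncores / procs_per_node);
}
```

With 128 logical cores, 8 processes per node, and 32 threads per process, the sketch suggests 16 threads per process; with 16 threads per process it suggests nothing.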

Src/Base/AMReX_OpenMP.H

Lines changed: 12 additions & 3 deletions
@@ -11,20 +11,29 @@ namespace amrex::OpenMP {
     inline int get_max_threads () { return omp_get_max_threads(); }
     inline int get_thread_num () { return omp_get_thread_num(); }
     inline int in_parallel () { return omp_in_parallel(); }
+    inline void set_num_threads (int num) { omp_set_num_threads(num); }

+    void init_threads ();
 }

-#else
+#else // AMREX_USE_OMP

 namespace amrex::OpenMP {

     constexpr int get_num_threads () { return 1; }
     constexpr int get_max_threads () { return 1; }
     constexpr int get_thread_num () { return 0; }
     constexpr int in_parallel () { return false; }
-
+    constexpr void set_num_threads (int) { /* nothing */ }
+    constexpr void init_threads () { /* nothing */ }
 }

-#endif
+#endif // AMREX_USE_OMP
+
+namespace amrex {
+    /** ... */
+    int
+    numUniquePhysicalCores();
+}

 #endif
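
The header pairs each OpenMP wrapper with a no-op fallback so that calling code needs no preprocessor guards. A minimal sketch of that pattern follows; it assumes the compiler-provided `_OPENMP` macro in place of `AMREX_USE_OMP`, and the `sketch` namespace and `request_threads` are hypothetical names used only for illustration.

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
namespace sketch {
    // Thin wrappers around the OpenMP runtime
    inline int get_max_threads () { return omp_get_max_threads(); }
    inline void set_num_threads (int num) { omp_set_num_threads(num); }
}
#else
namespace sketch {
    // Serial fallbacks: always one thread, setting is a no-op
    constexpr int get_max_threads () { return 1; }
    constexpr void set_num_threads (int) { /* nothing to do without OpenMP */ }
}
#endif

// Caller code compiles unchanged with and without OpenMP.
inline int request_threads (int num)
{
    assert(num > 0);
    sketch::set_num_threads(num);
    return sketch::get_max_threads();
}
```

Built without OpenMP, `request_threads(2)` returns 1; built with OpenMP it is expected to return the requested count, subject to any limits of the runtime.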

Src/Base/AMReX_OpenMP.cpp

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
+#include <AMReX_OpenMP.H>
+#include <AMReX.H>
+#include <AMReX_ParmParse.H>
+#include <AMReX_Print.H>
+
+#if defined(__APPLE__)
+#include <sys/types.h>
+#include <sys/sysctl.h>
+#endif
+
+#if defined(_WIN32)
+#include <windows.h>
+#endif
+
+#include <cstdlib>
+#include <fstream>
+#include <iostream>
+#include <optional>
+#include <set>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+
+namespace amrex
+{
+    int
+    numUniquePhysicalCores ()
+    {
+        int ncores;
+
+#if defined(__APPLE__)
+        size_t len = sizeof(ncores);
+        // See hw.physicalcpu and hw.physicalcpu_max
+        // https://developer.apple.com/documentation/kernel/1387446-sysctlbyname/determining_system_capabilities/
+        // https://developer.apple.com/documentation/kernel/1387446-sysctlbyname
+        if (sysctlbyname("hw.physicalcpu", &ncores, &len, NULL, 0) == -1) {
+            if (system::verbose > 0) {
+                amrex::Print() << "numUniquePhysicalCores(): Error receiving hw.physicalcpu! "
+                               << "Defaulting to visible cores.\n";
+            }
+            ncores = int(std::thread::hardware_concurrency());
+        }
#elif defined(__linux__)
46+
std::set<std::vector<int>> uniqueThreadSets;
47+
int cpuIndex = 0;
48+
49+
while (true) {
50+
// for each logical CPU in cpuIndex from 0...N-1
51+
std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpuIndex) + "/topology/thread_siblings_list";
52+
std::ifstream file(path);
53+
if (!file.is_open()) {
54+
break; // no further CPUs to check
55+
}
56+
57+
// find its siblings
58+
std::vector<int> siblings;
59+
std::string line;
60+
if (std::getline(file, line)) {
61+
std::stringstream ss(line);
62+
std::string token;
63+
64+
// Possible syntax: 0-3, 8-11, 14,17
65+
// https://github.com/torvalds/linux/blob/v6.5/Documentation/ABI/stable/sysfs-devices-system-cpu#L68-L72
66+
while (std::getline(ss, token, ',')) {
67+
size_t dashPos = token.find('-');
68+
if (dashPos != std::string::npos) {
69+
// Range detected
70+
int start = std::stoi(token.substr(0, dashPos));
71+
int end = std::stoi(token.substr(dashPos + 1));
72+
for (int i = start; i <= end; ++i) {
73+
siblings.push_back(i);
74+
}
75+
} else {
76+
siblings.push_back(std::stoi(token));
77+
}
78+
}
79+
}
80+
81+
// and record the siblings group
82+
// (assumes: ascending and unique sets per cpuIndex)
83+
uniqueThreadSets.insert(siblings);
84+
cpuIndex++;
85+
}
86+
87+
if (cpuIndex == 0) {
88+
if (system::verbose > 0) {
89+
amrex::Print() << "numUniquePhysicalCores(): Error reading CPU info.\n";
90+
}
91+
ncores = int(std::thread::hardware_concurrency());
92+
} else {
93+
ncores = int(uniqueThreadSets.size());
94+
}
+#elif defined(_WIN32)
+        DWORD length = 0;
+        bool result = GetLogicalProcessorInformation(NULL, &length);
+        // the sizing call is expected to fail with ERROR_INSUFFICIENT_BUFFER and to set length
+        if (result || GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
+            if (system::verbose > 0) {
+                amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
+                               << "Defaulting to visible cores.\n";
+            }
+            ncores = int(std::thread::hardware_concurrency());
+        }
+        else {
+            std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> buffer(length / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
+            if (!GetLogicalProcessorInformation(&buffer[0], &length)) {
+                if (system::verbose > 0) {
+                    amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
+                                   << "Defaulting to visible cores.\n";
+                }
+                ncores = int(std::thread::hardware_concurrency());
+            } else {
+                ncores = 0;
+                for (const auto& info : buffer) {
+                    if (info.Relationship == RelationProcessorCore) {
+                        ncores++;
+                    }
+                }
+            }
+        }
+#else
+        // TODO:
+        // BSD
+        if (system::verbose > 0) {
+            amrex::Print() << "numUniquePhysicalCores(): Unknown system. Defaulting to visible cores.\n";
+        }
+        ncores = int(std::thread::hardware_concurrency());
+#endif
+        return ncores;
+    }
+} // namespace amrex
+
+#ifdef AMREX_USE_OMP
+namespace amrex::OpenMP
+{
+    void init_threads ()
+    {
+        amrex::ParmParse pp("amrex");
+        std::string omp_threads = "system";
+        pp.queryAdd("omp_threads", omp_threads);
+
+        auto to_int = [](std::string const & str_omp_threads) {
+            std::optional<int> num;
+            try { num = std::stoi(str_omp_threads); }
+            catch (...) { /* nothing */ }
+            return num;
+        };
+
+        if (omp_threads == "system") {
+            // default or OMP_NUM_THREADS environment variable
+        } else if (omp_threads == "nosmt") {
+            char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
+            if (env_omp_num_threads == nullptr) {
+                omp_set_num_threads(numUniquePhysicalCores());
+            } else if (amrex::system::verbose > 1) {
+                amrex::Print() << "amrex.omp_threads was set to nosmt, "
+                               << "but OMP_NUM_THREADS was set. Will keep "
+                               << "OMP_NUM_THREADS=" << env_omp_num_threads << ".\n";
+            }
+        } else {
+            std::optional<int> num_omp_threads = to_int(omp_threads);
+            if (num_omp_threads.has_value()) {
+                omp_set_num_threads(num_omp_threads.value());
+            }
+            else {
+                if (amrex::system::verbose > 0) {
+                    amrex::Print() << "amrex.omp_threads has an unknown value: "
+                                   << omp_threads
+                                   << " (try system, nosmt, or a positive integer)\n";
+                }
+            }
+        }
+    }
+} // namespace amrex::OpenMP
+#endif // AMREX_USE_OMP
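
The Linux branch of `numUniquePhysicalCores()` counts physical cores by collecting the distinct sibling groups reported per logical CPU. The sketch below isolates that idea so it can be tried without reading `/sys`. The helper names `parse_cpu_list` and `count_physical_cores` are hypothetical, and the input strings stand in for the contents of the `thread_siblings_list` files.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expand a CPU list such as "0-3,8-11,14,17"
// into the individual CPU indices.
inline std::vector<int> parse_cpu_list (std::string const& line)
{
    std::vector<int> cpus;
    std::stringstream ss(line);
    std::string token;
    while (std::getline(ss, token, ',')) {
        auto const dash = token.find('-');
        if (dash != std::string::npos) {
            // range "a-b", both ends inclusive
            int const first = std::stoi(token.substr(0, dash));
            int const last = std::stoi(token.substr(dash + 1));
            for (int i = first; i <= last; ++i) { cpus.push_back(i); }
        } else {
            cpus.push_back(std::stoi(token));
        }
    }
    return cpus;
}

// Hypothetical helper: one entry per logical CPU; logical CPUs that share
// a physical core report the same sibling list, so the number of distinct
// lists is taken as the number of physical cores.
inline int count_physical_cores (std::vector<std::string> const& sibling_lists)
{
    std::set<std::vector<int>> unique_sets;
    for (auto const& line : sibling_lists) {
        unique_sets.insert(parse_cpu_list(line));
    }
    return int(unique_sets.size());
}
```

For a 2x SMT machine whose eight logical CPUs report `0,4`, `1,5`, `2,6`, `3,7` and then the same four lists again, `count_physical_cores` returns 4.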

Src/Base/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ foreach(D IN LISTS AMReX_SPACEDIM)
       AMReX_ParallelDescriptor.H
       AMReX_ParallelDescriptor.cpp
       AMReX_OpenMP.H
+      AMReX_OpenMP.cpp
       AMReX_ParallelReduce.H
       AMReX_ForkJoin.H
       AMReX_ForkJoin.cpp

Src/Base/Make.package

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ C$(AMREX_BASE)_headers += AMReX_REAL.H AMReX_INT.H AMReX_CONSTANTS.H AMReX_SPACE
 C$(AMREX_BASE)_sources += AMReX_DistributionMapping.cpp AMReX_ParallelDescriptor.cpp
 C$(AMREX_BASE)_headers += AMReX_DistributionMapping.H AMReX_ParallelDescriptor.H
 C$(AMREX_BASE)_headers += AMReX_OpenMP.H
+C$(AMREX_BASE)_sources += AMReX_OpenMP.cpp

 C$(AMREX_BASE)_headers += AMReX_ParallelReduce.H

 (0)