Performance
The scalability of the fluid solver in LUMA v1.3 and LUMA v1.7 has been documented through a series of tests run on the UK super-computer ARCHER. These tests established the strong and weak scaling of the fluid solver part of the software for different problem sizes. A description of the tests and their results is presented below. At the time of writing, the scalability of other parts of the software, such as the immersed boundary module and the grid refinement algorithms, is also being added to the test plan. Results will be presented at the end of the page when they are available.
Timing data is recorded by LUMA by enabling the L_LOG_TIMINGS flag in the definitions.h input file and recompiling the application. A breakdown of the time taken to conduct a time step (i.e. an update of all the cells in the lattice) is recorded and averaged on-the-fly for each process. The time taken to complete the MPI communication part of the loop is also recorded. The total time taken to perform a loop is then the sum of the time step time and the communication time. These times are recorded separately for each process, so an error bar may be generated indicating the spread across the processes in the test case.
The performance of the software can then be assessed in terms of how many lattice sites (cells) can be updated per second. This measure is termed Million Lattice Updates Per Second (MLUPS). A higher value indicates an increased rate of throughput.
A periodic, 3D flow in a square channel of length 6 units and width and height 1 unit is used as a test case. The flow is driven by a body force and hence the forcing capability is enabled. Solid wall boundary conditions are also applied to the channel walls but the half-way bounce-back formulation is used which is the least computationally expensive.
LUMA v1.3 required at least 2 ranks in every dimension of the MPI topology, so it was not possible to simulate a suitable 1-process baseline case without switching to a serial build. The serial build removes the MPI parts of the application, so comparing against it would not be a fair test of scaling as it is essentially a different application. Instead, a single-node case (24 processes) is run as the baseline, with subsequent cases determined as binary multiples of nodes.
The total number of cells that can be initialised on a single node with 64GB of RAM corresponds to a resolution of 224 cells / unit length of the problem. This gives a total problem size of 67,436,544 cells on a single node. Starting from a single node, the software is run on up to 512 nodes (12,288 processes). A uniform decomposition strategy is used to ensure an equal load for each process. Strong scaling of LUMA v1.3 for the 67M-cell case up to 12,288 cores is shown in Figure 1.

Figure 1: Strong Scaling Efficiency (relative to single-node case) for LUMA v1.3 fluid solver
The 67M-cell case from the strong scaling tests was also used to assess the weak scalability of the LUMA v1.3 fluid solver. The single-node (24-process) baseline case used for the strong scaling gives approximately 2.9M cells per process. We use the same uniform decomposition strategy as for the strong scaling but this time adjust the total problem size to maintain the number of cells / process within 2% of the baseline case. Figure 2 illustrates the results of the weak scaling tests up to 12,288 cores with respect to the baseline.

Figure 2: Weak Scaling Efficiency (relative to single-node case) for LUMA v1.3 fluid solver
New strong and weak scaling tests of the LUMA fluid solver have been performed on the ARCHER super-computer; these are presented below. The v1.7 test plan also includes testing the scalability of parallel FSI problems (now that these capabilities are available). There are also plans to extend testing to include the scalability of the grid refinement algorithm. However, all these configurations are entirely problem dependent, which makes it difficult to present a definitive picture. Tests are being designed carefully to demonstrate extremes of performance given knowledge of the implementation.
LUMA v1.7 has a number of improvements over LUMA v1.3, including OpenMP support and the ability to run the software either in serial or in an MPI environment with the topology set to 1 in each direction. This allows a fair comparison of performance data right back to a single process, rather than a single node (24 processes) as in the earlier tests on v1.3.
A resolution of 400 cells / unit length is used for the baseline cases, which is close to the memory limits of a single process on ARCHER. This gives a total problem size of 64,000,000 cells. Unlike the above tests on v1.3, we enable the new OpenMP features of v1.7, which gives a performance improvement of 27% in v1.7 over v1.3. Starting from the single-process case, the software is run on up to 512 nodes (12,288 processes). A uniform decomposition strategy is used to ensure an equal load for each process. Strong scaling of LUMA v1.7 for the 64M-cell case up to 12,288 cores is shown in Figure 3. We note that using OpenMP when the number of cells per process is small is always expected to have a detrimental effect on performance, hence the plunging efficiency as the core count gets very large.

Figure 3: Strong Scaling Efficiency (relative to single-process case) for LUMA v1.7 fluid solver with OpenMP threads set to 12.
As a single node on ARCHER can hold approximately 67M cells before running out of memory, we compute that a single process can hold approximately 2.74M cells. This is used as the baseline for the weak scaling tests of the LUMA v1.7 fluid solver. We use the same uniform decomposition strategy as for the strong scaling but this time adjust the total problem size to maintain the number of cells / process within 0.1% of the baseline case. Figure 4 illustrates the results of the weak scaling tests up to 12,288 cores with respect to the baseline. Note that the MLUPS is lower than the results of the v1.3 tests because the baseline case is set to a single process rather than 24 processes. In absolute terms, the MLUPS performance of v1.7 at 24 processes is 36.1 MLUPS versus the 28.5 MLUPS of v1.3 (hence v1.7 is actually 27% faster than v1.3).

Figure 4: Weak Scaling Efficiency (relative to single-process case) for LUMA v1.7 fluid solver with OMP threads set to 12.
Further results will be added here in due course.
Lattice-Boltzmann @ The University of Manchester (LUMA) -- School of Mech., Aero. & Civil Engineering