
Setting up diagonalization engine

Gabriel Wlazłowski edited this page May 20, 2021 · 21 revisions

[[TOC]]

ELPA Library

For static calculations, we recommend the ELPA library, which performs better than ScaLAPACK. In particular, ELPA can use GPUs, which significantly speeds up the calculations. To activate ELPA, set the following in predefines.h:

// select diagonalization routine
#define DIAGONALIZATION_ROUTINE ELPA

In addition, carefully review the following part of the file:

// ---------------------- ELPA SETTINGS ---------------------------
// Fill this part only if ELPA library is used for diagonalization

// uncomment it if you want to activate GPU for diagonalizations 
#define ELPA_USE_GPU

// Select ELPA kernels
#define ELPA_USE_SOLVER ELPA_SOLVER_1STAGE
#define ELPA_USE_COMPLEX_KERNEL ELPA_2STAGE_COMPLEX_DEFAULT
#define ELPA_USE_REAL_KERNEL ELPA_2STAGE_REAL_DEFAULT

// Fraction of eigenvectors to be extracted in each cycle.
// 1.0 corresponds to extraction of all eigenvectors (USE IT IF YOU ARE NOT SURE)
// NOTE: the value of this parameter should ensure that all eigenstates below the requested Ec are extracted.
// NOTE: For the 3D case this value can typically be set to 0.78
#define ELPA_NEV_FRACTION 1.0

Documentation

  1. Eigenvalue SoLvers for Petaflop-Applications (ELPA)
  2. Wiki: Eigenvalue SoLvers for Petaflop-Applications (ELPA)
  3. [ELPA installation guide](ELPA installation guide)

Publications about ELPA performance

  1. GPU-Acceleration of the ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems

ScaLAPACK library

If the target system does not provide the ELPA library, you can use the standard diagonalization library, ScaLAPACK. W-SLDA Toolkit can use the following ScaLAPACK diagonalization engines:

#define DIAGONALIZATION_ROUTINE PZHEEVR

or

#define DIAGONALIZATION_ROUTINE PZHEEVD

It is recommended to use PZHEEVR. This engine takes advantage of the fact that typically only a fraction of the eigenstates is extracted. However, in some rare, system-dependent cases this routine does not work correctly. In such cases, use PZHEEVD.

Benchmarks & Scalings

All tests correspond to the extraction of all eigenvectors.

Table

| matrix size | p | q | mb | nb | prec. | routine | system | time [sec] | cost |
|---|---|---|---|---|---|---|---|---|---|
| 32,768 = 2x128^2 | 6 | 8 | 16 | 16 | real | ELPA (2-GPU) | Cygnus | 93 | 0.052 nh |
| 45,000 = 2x150^2 | 6 | 8 | 16 | 16 | real | ELPA (2-GPU) | Cygnus | 217 | 0.12 nh |
| 45,000 = 2x150^2 | 6 | 8 | 16 | 16 | complex | ELPA (2-GPU) | Cygnus | 860 | 0.478 nh |
| 65,536 = 2x32^3 | 24 | 28 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 115 | 0.52 nh |
| 65,536 = 2x32^3 | 24 | 28 | 8 | 8 | complex | ELPA (1-GPU) | Summit | 118 | 0.52 nh |
| 128,000 = 2x40^3 | 24 | 28 | 8 | 8 | complex | ELPA (1-GPU) | Summit | 435 | 1.93 nh |
| 128,000 | 24 | 28 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 511 | 2.27 nh |
| 128,000 = 2x40^3 | 20 | 20 | 32 | 32 | complex | ELPA (1-GPU) | Daint | 220 | 24.4 nh |
| 128,000 | 54 | 64 | 32 | 32 | complex | ELPA (2-CPU) | Daint | 677 | 54.1 nh |
| 128,000 | 54 | 64 | 32 | 32 | complex | PZHEEVR | Daint | 945 | 75.6 nh |
| 147,456 = 4x64x24^2 | 24 | 25 | 32 | 32 | complex | ELPA (1-GPU) | Daint | 375 | 62.5 nh |
| 147,456 = 2x768x96 | 18 | 18 | 16 | 16 | double | ELPA (1-GPU) | Daint | 395 | 35.6 nh |
| 221,184 = 2x48^3 | 46 | 84 | 32 | 32 | complex | ELPA (2-GPU) | Summit | 603 | 15.4 nh |
| 221,184 | 46 | 84 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 736 | 18.8 nh |
| 221,184 | 46 | 84 | 16 | 16 | complex | ELPA (2-GPU) | Summit | 3098 | 79.2 nh |
| 221,184 | 46 | 84 | 16 | 16 | complex | PZHEEVD | Summit | 5995 | 153.2 nh |
| 500,000 = 2x50^2x100 | 96 | 112 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 2,109 | 150.0 nh |
| 524,288 = 2x64^3 | 96 | 112 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 2,217 | 157.7 nh |
| 746,496 = 2x72^3 | 112 | 192 | 16 | 16 | complex | ELPA (1-GPU) | Summit | 3,436 | 488.7 nh |
| 746,496 | 112 | 192 | 64 | 64 | complex | ELPA (2-GPU) | Summit | 3,628 | 516.0 nh |
| 1,769,472 = 2x96^3 | 300 | 560 | 32 | 32 | complex | ELPA (1-GPU) | Summit | 52,024 | 57,804 nh |

(1-GPU): ELPA_SOLVER_1STAGE, ELPA_2STAGE_COMPLEX_GPU or ELPA_2STAGE_REAL_GPU
(1-CPU): ELPA_SOLVER_1STAGE, ELPA_2STAGE_COMPLEX_DEFAULT or ELPA_2STAGE_REAL_DEFAULT
(2-GPU): ELPA_SOLVER_2STAGE, ELPA_2STAGE_COMPLEX_GPU or ELPA_2STAGE_REAL_GPU
(2-CPU): ELPA_SOLVER_2STAGE, ELPA_2STAGE_COMPLEX_DEFAULT or ELPA_2STAGE_REAL_DEFAULT

Plots

These scalings are derived empirically: points correspond to actual measurements on the target system, while the line shows a fit of the ideal scaling for level-3 routines ($\sim N^3$).

The scaling was derived within the ALCC grant Quantum Turbulence in Fermi Superfluids.
