Performance
For the purposes of this document, performance is defined as time to solution. Since RMG uses iterative methods, time to solution depends on both the efficiency of the iterations and the speed with which they execute. These in turn may depend on both the hardware platform (e.g. cluster or workstation) and the problem type and size. There are many input file options and environment variables that may affect convergence rates and execution speed, including those discussed below.
RMG is critically dependent on the performance of several double precision level 3 BLAS routines, in particular dgemm and dsyrk, plus their complex equivalents zgemm and zsyrk for non-gamma calculations. Level 1 and 2 BLAS routines have an insignificant impact on RMG performance since they are rarely used. The matrix sizes passed to these routines depend on the size of the problem (the number of electronic wavefunctions and the real space basis size) as well as the number of MPI tasks used, since RMG uses domain decomposition for the real space basis. As an example, the calls made during the initial electronic quench for a 512 atom NiO supercell, run at an equivalent cutoff of 154 Rydbergs with 196 total MPI tasks, are shown below.
BLAS function | m | n | k | Times called |
---|---|---|---|---|
DGEMM | 3060 | 3570 | 31104 | 329 |
DGEMM | 3060 | 3584 | 31104 | 4 |
DGEMM | 3060 | 64 | 31104 | 288 |
DGEMM | 31104 | 3570 | 3060 | 306 |
DGEMM | 31104 | 3570 | 3570 | 76 |
DGEMM | 31104 | 3584 | 3060 | 2 |
DGEMM | 31104 | 3584 | 3584 | 2 |
DGEMM | 3570 | 3570 | 31104 | 4 |
DSYRK | -- | 3570 | 31104 | 151 |
DSYRK | -- | 3584 | 31104 | 4 |
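As a rough way to check how the local BLAS library handles matrices of this shape, the short Python sketch below times a single matrix product with the dimensions taken from the first table row. It uses NumPy, which dispatches the product to the underlying dgemm; the script is not part of RMG, and these sizes need roughly 1.7 GB of memory.

```python
# Rough timing of a dgemm-shaped product (m=3060, n=3570, k=31104), the
# shape from the first table row. Not part of RMG; it only probes the
# BLAS library that NumPy is linked against. Needs about 1.7 GB of RAM.
import time
import numpy as np

m, n, k = 3060, 3570, 31104
A = np.random.rand(m, k)          # double precision by default
B = np.random.rand(k, n)

t0 = time.perf_counter()
C = A @ B                         # dispatches to dgemm in the underlying BLAS
elapsed = time.perf_counter() - t0

gflops = 2.0 * m * n * k / elapsed / 1e9
print(f"dgemm {m}x{n}x{k}: {elapsed:.2f} s, {gflops:.1f} GFLOP/s")
```

The measured rate depends on how many threads the BLAS library is allowed to use, so it is worth repeating the test with the same thread settings RMG will run with.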
Pseudocode for the SCF cycle, which includes a charge density mixing function MIX, is shown below:
```
Initial charge density ρin and initial orbitals ψi,j with i = 0 and 1 ≤ j ≤ N
do
    compute Veff(ρin)
    for (j = 1 to N)
        apply multigrid preconditioner to ψi,j using Veff(ρin)
    end for
    diagonalize or orthogonalize ψi,j => ψi+1,j
    generate ρout from ψi+1,j
    update ρin = MIX(ρin, ρout)
    Δρi = ABS(ρin - ρout)
    i = i + 1
while (Δρi > tolerance)
```
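For illustration only, the following minimal Python sketch mirrors the control flow of the loop above. The electronic structure machinery (building Veff, preconditioning and orthogonalizing the orbitals) is replaced by a toy nonlinear map, so only the mixing loop structure corresponds to RMG.

```python
# Minimal sketch of the SCF mixing loop above. The real work (building
# Veff, preconditioning and orthogonalizing the orbitals) is replaced by
# a toy nonlinear map rho_out = F(rho_in); only the control flow and the
# linear MIX step mirror the pseudocode.
import numpy as np

grid = np.linspace(0.0, 1.0, 64, endpoint=False)

def compute_rho_out(rho_in):
    # Toy stand-in for one pass of the inner loop over orbitals.
    return np.tanh(2.0 * rho_in) + 0.1 * (1.0 + grid)

def mix_linear(rho_in, rho_out, alpha):
    # Linear mixing: rho_in <- alpha*rho_out + (1 - alpha)*rho_in
    return alpha * rho_out + (1.0 - alpha) * rho_in

rho_in = np.zeros_like(grid)      # initial charge density guess
alpha, tolerance = 0.5, 1e-10

for step in range(200):
    rho_out = compute_rho_out(rho_in)
    delta_rho = np.max(np.abs(rho_out - rho_in))
    print(f"SCF step {step:3d}  residual {delta_rho:.3e}")
    if delta_rho < tolerance:
        break
    rho_in = mix_linear(rho_in, rho_out, alpha)
```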
Three types of charge density mixing functions MIX(ρin, ρout) are available in RMG and are selected using the charge_mixing_type input option.
- Linear - This is the default and requires specifying a mixing constant α. The new charge density is then calculated as ρin = α ρout + (1.0 - α) ρin. Proper choice of α is crucial: for small values reaching convergence may take many steps, while large values may lead to instability due to overshooting.
- Pulay - A more sophisticated scheme that uses the charge densities from several previous steps to determine the optimal charge density for the current step; it may fail in some hard-to-converge cases. See Pulay, Chem. Phys. Lett. 73, 393 (1980) for more information. A minimal sketch of the idea is given after this list.
- Broyden - Another multi-step scheme which is the preferred option when the Davidson Kohn-Sham solver is used.
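The sketch below, referenced from the Pulay item above, illustrates the basic idea behind Pulay (DIIS) mixing using the same toy map as the linear mixing sketch: keep a short history of input densities and residuals, solve a small constrained least squares problem for the mixing weights, and build the new input density from the weighted history. The history depth and damping factor are illustrative choices, not RMG defaults.

```python
# Minimal sketch of Pulay (DIIS) charge density mixing with the same toy
# map used in the linear-mixing sketch. Not RMG's implementation; the
# history depth and damping factor below are illustrative choices.
import numpy as np

grid = np.linspace(0.0, 1.0, 64, endpoint=False)

def compute_rho_out(rho_in):
    # Toy stand-in for one SCF evaluation of the output density.
    return np.tanh(2.0 * rho_in) + 0.1 * (1.0 + grid)

def pulay_mix(rho_hist, res_hist, alpha=0.5):
    # Solve for weights c minimizing |sum_i c_i R_i| subject to sum_i c_i = 1
    # (Lagrange multiplier form), then combine the stored densities with a
    # damped residual correction.
    m = len(res_hist)
    B = np.zeros((m + 1, m + 1))
    for i in range(m):
        for j in range(m):
            B[i, j] = np.dot(res_hist[i], res_hist[j])
    B[m, :m] = -1.0
    B[:m, m] = -1.0
    rhs = np.zeros(m + 1)
    rhs[m] = -1.0
    c = np.linalg.lstsq(B, rhs, rcond=None)[0][:m]
    return sum(ci * (r + alpha * R) for ci, r, R in zip(c, rho_hist, res_hist))

rho_in = np.zeros_like(grid)
rho_hist, res_hist = [], []
history_depth, tolerance = 5, 1e-8

for step in range(100):
    rho_out = compute_rho_out(rho_in)
    residual = rho_out - rho_in
    print(f"step {step:3d}  residual {np.max(np.abs(residual)):.3e}")
    if np.max(np.abs(residual)) < tolerance:
        break
    rho_hist = (rho_hist + [rho_in.copy()])[-history_depth:]
    res_hist = (res_hist + [residual])[-history_depth:]
    rho_in = pulay_mix(rho_hist, res_hist)
```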
The SCF cycle outlined above recomputes Veff(ρin) once per SCF step (the outer loop). It is possible to modify the cycle by updating Veff(ρin) in the inner loop over j. Specifically, for each j we have Δψj = (ψi+1,j - ψi,j). Since the Hartree potential Vh has a linear dependence on <ψj|ψj>, we can compute an approximate update to Vh after computing each ψi+1,j and then apply it to the preconditioning steps for the remaining orbitals. This has the effect of stabilizing the calculation and allows the use of a larger value of α when using linear charge density mixing (it is not compatible with Pulay or Broyden mixing). The input option potential_acceleration_constant_step controls this method and its use is illustrated in the C60 examples.
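The following toy sketch shows the structure of this modification on a 1D periodic grid: after each orbital is updated, an approximate Hartree correction computed from that orbital's density change is folded back into Veff before the next orbital is preconditioned. All of the functions are stand-ins (the "preconditioner" is a single damped step and the Poisson solve is a 1D FFT); none of this is RMG code, and the step constant merely plays a role loosely analogous to potential_acceleration_constant_step.

```python
# Structural sketch of potential acceleration: after each orbital update
# in the inner loop, an approximate Hartree correction from that orbital's
# density change is added to Veff before the next orbital is preconditioned.
# Everything here is a toy stand-in (1D periodic grid, FFT Poisson solve).
import numpy as np

n_grid, n_orbitals = 128, 4
k2 = (2.0 * np.pi * np.fft.fftfreq(n_grid, d=1.0 / n_grid)) ** 2
k2[0] = 1.0                        # avoid divide-by-zero for the G = 0 term

def hartree(rho):
    # Periodic 1D Poisson solve: V_h(G) = 4*pi*rho(G)/G^2 for G != 0.
    vg = 4.0 * np.pi * np.fft.fft(rho) / k2
    vg[0] = 0.0
    return np.real(np.fft.ifft(vg))

def precondition(psi_j, veff):
    # Toy stand-in for RMG's multigrid preconditioner: one damped
    # gradient-like step that nudges the orbital using the current Veff.
    return psi_j - 0.1 * veff * psi_j

psi = np.random.rand(n_orbitals, n_grid)
rho_in = np.sum(psi ** 2, axis=0)
veff = hartree(rho_in)             # toy Veff containing only the Hartree part
step_constant = 1.0                # loosely analogous to potential_acceleration_constant_step

for j in range(n_orbitals):
    psi_old = psi[j].copy()
    psi[j] = precondition(psi[j], veff)
    # Approximate Hartree update from this orbital's density change,
    # applied before the remaining orbitals are preconditioned.
    delta_rho = psi[j] ** 2 - psi_old ** 2
    veff += step_constant * hartree(delta_rho)
```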
Depending on hardware resources and build configuration, RMG users can select between several different diagonalizers. The optimal choice depends on the available hardware and the problem size. Available options include:
- lapack - Standard matrix algebra package, required to build RMG. It is not parallel across MPI processes but can use multiple threads within a process via the BLAS libraries.
- scalapack - Parallel version of lapack that decomposes matrices over MPI processes. A good choice for very large problems where the number of wavefunctions N > 3000, but it may not be available on every hardware/software platform.
- cusolver - GPU accelerated diagonalization routines for Nvidia hardware. When suitable hardware is available this is often the best choice for problems where N < 5000, but it may not be available on every hardware/software platform.
- rocsolver - GPU accelerated diagonalization routines for AMD hardware. This is available but should still be considered experimental as of RMG version 5.0 and ROCM v3.4.
For small problems diagonalization normally comprises only a small part of the total execution time and the choice of driver is not critical. The computational work for diagonalization scales as N³ though, and as N increases it becomes a larger and larger portion of the calculation. With this in mind, driver choice depends greatly on the hardware platform.
- Workstation - Lapack is only a good choice for workstations with a parallel BLAS library. If a parallel library is not available then Scalapack is preferred. Finally, if a GPU is installed on the workstation, Cusolver will usually be a better choice once N is more than a few hundred orbitals.
- Cluster - The situation with clusters is considerably more complicated. The computational power provided by a single node has to be balanced against the communications speed available between nodes. Neither Lapack nor Cusolver is parallel across nodes, so their speed is limited by the speed of an individual node. Scalapack is parallel across nodes but is highly dependent on communications speed. Currently we have observed that for N < 3000, given both a high end communications fabric (Cray XK/XE series hardware) and high end GPU support (Nvidia Fermi/Kepler class hardware), Cusolver is usually the best choice. If GPU hardware is not available Scalapack is preferred. For N > 3000 Scalapack will usually be the best choice on systems with fast communications.
- Folded spectrum - A folded spectrum implementation is available that works in conjunction with the lapack or cusolver drivers. When enabled it can dramatically decrease the time required for subspace diagonalization on large systems. A paper describing the method may be downloaded from http://arxiv.org/abs/1502.07806. This option is intended for expert use only.