Commit c5d32742 authored by Axel Kohlmeyer's avatar Axel Kohlmeyer
Browse files

restore files

parent 4f36f077
This diff is collapsed.
"Higher level section"_Speed.html - "LAMMPS WWW Site"_lws - "LAMMPS
Documentation"_ld - "LAMMPS Commands"_lc :c
:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Commands_all.html)
:line
KOKKOS package :h3
Kokkos is a templated C++ library that provides abstractions to allow
a single implementation of an application kernel (e.g. a pair style)
to run efficiently on different kinds of hardware, such as GPUs, Intel
Xeon Phis, or many-core CPUs. Kokkos maps the C++ kernel onto
different backend languages such as CUDA, OpenMP, or Pthreads. The
Kokkos library also provides data abstractions to adjust (at compile
time) the memory layout of data structures like 2d and 3d arrays to
optimize performance on different hardware. For more information on
Kokkos, see "Github"_https://github.com/kokkos/kokkos. Kokkos is part
of "Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos
library was written primarily by Carter Edwards, Christian Trott, and
Dan Sunderland (all Sandia).
The LAMMPS KOKKOS package contains versions of pair, fix, and atom
styles that use data structures and macros provided by the Kokkos
library, which is included with LAMMPS in /lib/kokkos. The KOKKOS
package was developed primarily by Christian Trott (Sandia) and Stan
Moore (Sandia) with contributions of various styles by others,
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
(Sandia). For more information on developing using Kokkos abstractions
see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.
Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
(threading for many-core CPUs and Intel Phi), and CUDA (for NVIDIA
GPUs). You choose the mode at build time to produce an executable
compatible with specific hardware.
NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.
NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA
software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.
NOTE: Kokkos with CUDA currently implicitly assumes, that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
only a single GPU and a single MPI rank on a desktop. When running
with multiple MPI ranks, you may see segmentation faults without
GPU-direct support. These can be avoided by adding the flags "-pk
kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
by using the command "package kokkos gpu/direct off"_package.html in
the input file.
[Building LAMMPS with the KOKKOS package:]
See the "Build extras"_Build_extras.html#kokkos doc page for instructions.
[Running LAMMPS with the KOKKOS package:]
All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
-npernode switches. Ditto for MPICH via -np and -ppn.
[Running on a multi-core CPU:]
Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no multi-threading
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk
kokkos" "command-line switches"_Run_options.html in your mpirun
command. You must use the "-k on" "command-line
switch"_Run_options.html to enable the KOKKOS package. It takes
additional arguments for hardware settings appropriate to your system.
For OpenMP use:
-k on t Nt :pre
The "t Nt" option specifies how many OpenMP threads per MPI task to
use with a node. The default is Nt = 1, which is MPI-only mode. Note
that the product of MPI tasks * OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer. If hyperthreading is enabled, then the product of MPI tasks *
OpenMP threads/task should not exceed the physical number of cores *
hardware threads. The "-k on" switch also issues a "package kokkos"
command (with no additional arguments) which sets various KOKKOS
options to default values, as discussed on the "package"_package.html
command doc page.
The "-sf kk" "command-line switch"_Run_options.html will automatically
append the "/kk" suffix to styles that support it. In this manner no
modification to the input script is needed. Alternatively, one can run
with the KOKKOS package by editing the input script as described
below.
NOTE: The default for the "package kokkos"_package.html command is to
use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. However, when running on CPUs, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. It can also be faster to use non-threaded communication. Use
the "-pk kokkos" "command-line switch"_Run_options.html to change the
default "package kokkos"_package.html options. See its doc page for
details and default settings. Experimenting with its options can
provide a speed-up for specific calculations. For example:
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre
If the "newton"_newton.html command is used in the input
script, it can also override the Newton flag defaults.
[Core and Thread Affinity:]
When using multi-threading, it is important for performance to bind
both MPI tasks to physical cores, and threads to physical cores, so
they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre
For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best
performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and
OMP_PLACES=threads. For binding threads with the KOKKOS pthreads
option, compile LAMMPS the KOKKOS HWLOC=yes option as described below.
[Running on Knight's Landing (KNL) Intel Xeon Phi:]
Here is a quick overview of how to use the KOKKOS package for the
Intel Knight's Landing (KNL) Xeon Phi:
KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores are
reserved for the OS, and only 64 or 66 cores are used. Each core has 4
hyperthreads,so there are effectively N = 256 (4*64) or N = 264 (4*66)
cores to run on. The product of MPI tasks * OpenMP threads/task should
not exceed this limit, otherwise performance will suffer. Note that
with the KOKKOS package you do not need to specify how many KNLs there
are per node; each KNL is simply treated as running some number of MPI
tasks.
Examples of mpirun commands that follow these rules are shown below.
Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 64 MPI tasks/node, 4 threads/task
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 66 MPI tasks/node, 4 threads/task
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 1 node, 32 MPI tasks/node, 8 threads/task
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre
The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these two values should be N, i.e.
256 or 264.
NOTE: The default for the "package kokkos"_package.html command is to
use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. When running on KNL, this will
typically be best for pair-wise potentials. For manybody potentials,
using "half" neighbor lists and setting the Newton flag to "on" may be
faster. It can also be faster to use non-threaded communication. Use
the "-pk kokkos" "command-line switch"_Run_options.html to change the
default "package kokkos"_package.html options. See its doc page for
details and default settings. Experimenting with its options can
provide a speed-up for specific calculations. For example:
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre
NOTE: MPI tasks and threads should be bound to cores as described
above for CPUs.
NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors
such as Knight's Corner (KNC), your system must be configured to use
them in "native" mode, not "offload" mode like the USER-INTEL package
supports.
[Running on GPUs:]
Use the "-k" "command-line switch"_Run_options.html to
specify the number of GPUs per node. Typically the -np setting of the
mpirun command should set the number of MPI tasks/node to be equal to
the number of physical GPUs on the node. You can assign multiple MPI
tasks to the same GPU with the KOKKOS package, but this is usually
only faster if significant portions of the input script have not
been ported to use Kokkos. Using CUDA MPS is recommended in this
scenario. Using a CUDA-aware MPI library with support for GPU-direct
is highly recommended. GPU-direct use can be avoided by using
"-pk kokkos gpu/direct no"_package.html.
As above for multi-core CPUs (and no GPU), if N is the number of
physical cores/node, then the number of MPI tasks/node should not
exceed N.
-k on g Ng :pre
Here are examples of how to use the KOKKOS package for GPUs, assuming
one or more nodes, each with two GPUs:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
NOTE: The default for the "package kokkos"_package.html command is to
use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions, along with threaded communication.
When running on Maxwell or Kepler GPUs, this will typically be
best. For Pascal GPUs, using "half" neighbor lists and setting the
Newton flag to "on" may be faster. For many pair styles, setting the
neighbor binsize equal to the ghost atom cutoff will give speedup.
Use the "-pk kokkos" "command-line switch"_Run_options.html to change
the default "package kokkos"_package.html options. See its doc page
for details and default settings. Experimenting with its options can
provide a speed-up for specific calculations. For example:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj # Set binsize = neighbor ghost cutoff
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre
NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later). The Kokkos library exploits
texture cache options not supported by Telsa generation GPUs (or
older).
NOTE: When using a GPU, you will achieve the best performance if your
input script does not use fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU incurring a performance penalty.
NOTE: To get an accurate timing breakdown between time spend in pair,
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
[Run with the KOKKOS package by editing an input script:]
Alternatively the effect of the "-sf" or "-pk" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands to your input script.
The discussion above for building LAMMPS with the KOKKOS package, the
mpirun/mpiexec command, and setting appropriate thread are the same.
You must still use the "-k on" "command-line switch"_Run_options.html
to enable the KOKKOS package, and specify its additional arguments for
hardware options appropriate to your system, as documented above.
You can use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5 :pre
You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Run_options.html.
[Using OpenMP threading and CUDA together (experimental):]
With the KOKKOS package, both OpenMP multi-threading and GPUs can be
used together in a few special cases. In the Makefile, the
KOKKOS_DEVICES variable must include both "Cuda" and "OpenMP", as is
the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
KOKKOS_DEVICES=Cuda,OpenMP :pre
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead. Set the number of OpenMP threads
as "t Nt" and the number of GPUs as "g Ng"
-k on t Nt g Ng :pre
For example, the command to run with 1 GPU and 8 OpenMP threads is then:
mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
Conversely, if the "-sf kk/host" is used in the command line and then
the "/kk" or "/kk/device" suffix is added to a specific style in your
input script, then only that specific style will run on the GPU while
everything else will run on the CPU in OpenMP mode. Note that the
execution of the CPU and GPU styles will NOT overlap, except for a
special case:
A kspace style and/or molecular topology (bonds, angles, etc.) running
on the host CPU can overlap with a pair style running on the
GPU. First compile with "--default-stream per-thread" added to CCFLAGS
in the Kokkos CUDA Makefile. Then explicitly use the "/kk/host"
suffix for kspace and bonds, angles, etc. in the input file and the
"kk" suffix (equal to "kk/device") on the command line. Also make
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
[Speed-ups to expect:]
The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enable styles are used, and the problem
size.
Generally speaking, the following rules of thumb apply:
When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However the difference between all 3 is small (less
than 20%). :ulb,l
When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l
When running large number of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l
When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l
:ule
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
[Advanced Kokkos options:]
There are other allowed options when building with the KOKKOS package.
As explained on the "Build extras"_Build_extras.html#kokkos doc page,
they can be set either as variables on the make command line or in
Makefile.machine, or they can be specified as CMake variables. Each
takes a value shown below. The default value is listed, which is set
in the lib/kokkos/Makefile.kokkos file.
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given on the "Speed omp"_Speed_omp.html doc page.
KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.
KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.
KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when
building LAMMPS.
KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS
package must be compiled with the {enable_lambda} option when using
GPUs.
[Restrictions:]
Currently, there are no precision options with the KOKKOS package. All
compilation and computation is performed in double precision.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment