Commit 9bb89cd1 authored by Sudheer Chunduri

Applications and tools

parent 4f6200ee
Adrian Pope
apope@anl.gov
2017-10-13
====
src:
====
This directory contains a pre-built HACC binary for Theta:
src/cpu/theta/hacc_tpm
which is also symbolically linked from the top-level directory. After
instrumenting the driver you will need to rebuild:
$ cd src
$ source env/bashrc.theta
$ cd cpu
$ make
The build system for the benchmark version of HACC is not very smart, and I do
not recommend using make -j for parallel compilation, as it can fail with race
conditions. Build dependencies are also not tracked very well, so if you want
to make sure that everything is rebuilt after modifying the source, I
recommend:
$ cd ..
$ make clean
$ cd cpu
$ make
The driver source file is:
src/simulation/driver_hires.cxx
During the build a symbolic link to that file is created in the src/cpu
directory, but it is likely better to edit the original copy in src/simulation.
The driver file is mostly just a long main() function; MPI_Init_thread()
is called on line 129. The code starts by initializing particle positions, which
essentially consists of computing four FFTs using the same FFT code that the
Poisson solver uses during every time step. However, the time reported for
the benchmark excludes the initialization time. I believe the timing for
the benchmark starts with the MPI_Wtime() call on line 300, and the actual
time stepping is the for(int step=step0; ...) loop starting on line 306.
The timing for the benchmark ends with the MPI_Wtime() call on line 481.
There's a little more code between that and the call to MPI_Finalize() on line
528, and that probably involves a few calls to MPI_Allreduce() with a very
small data volume.
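If you want to locate those spots yourself before instrumenting (the line
numbers above may drift if the source changes), a simple grep over the driver
works; the pattern below is just an illustration:
$ cd src/simulation
$ grep -n 'MPI_Wtime\|MPI_Init_thread\|MPI_Finalize\|for(int step' driver_hires.cxx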
========
testing:
========
I made small and large tests for 1, 8, 64, 128, 256, and 512 Theta nodes. The
small tests are the same size per node as I was using for the BGQ tests, and
each should take ~3 minutes of wall-clock time. The large tests are closer to
the per-node size of the likely A21 system benchmark, and each should take
~15 minutes of wall-clock time. The small and large tests for the same number
of nodes should have the same number of messages, but different message sizes.
The only input files that actually differ between the test directories are
"submit.theta.sh" and "indat". Right now everything is set up for 16 MPI
ranks per node, but if you want to test fewer (power-of-2) ranks per node
you should be able to just change "RANKS_PER_NODE" in "run.theta.sh", and
I think it should work. More (power-of-2) ranks per node might work in some
cases, but I'm not sure about all cases, and I'm pretty sure it won't work for
more than 64 ranks per node since I hard-coded the runs to use 1 hardware
thread per core. The tests do very little I/O, so I don't think it matters
which filesystem is used.
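For example, to try 32 ranks per node I would expect editing the exported value
in "run.theta.sh" to be enough, since the other settings are derived from it
(a guess based on the submit script below; the exact variable layout in
"run.theta.sh" is not shown here):
export RANKS_PER_NODE=32
export THREADS_PER_CORE=1
export CORES_PER_RANK=$((64/RANKS_PER_NODE))                # 2 cores per rank on a 64-core Theta node
export OMP_NUM_THREADS=$((CORES_PER_RANK*THREADS_PER_CORE)) # 2 OpenMP threads per rank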
#Link: https://trac.alcf.anl.gov/projects/DarkUniverse
#!/bin/bash -evx
#COBALT -A Performance
#COBALT -n 256
#COBALT -t 360
#COBALT -q default
#COBALT --attrs mcdram=flat:numa=quad
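# 64 MPI ranks per node (one per core on Theta's 64-core KNL nodes), 1 hardware thread per core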
export RANKS_PER_NODE=64
export THREADS_PER_CORE=1
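# One rank per node: presumably records each node's placement on the Aries network (aries-topo helper binaries)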
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/test
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/main
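# Autoperf output settings; see the autoperf link at the end of this document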
export AP_OUTPUT_SYS=0
export AP_OUTPUT_LOCAL=1
export NODES=$COBALT_JOBSIZE
export MPI_RANKS=$((NODES*RANKS_PER_NODE))
export CORES_PER_RANK=$((64/RANKS_PER_NODE))
export OMP_NUM_THREADS=$((CORES_PER_RANK*THREADS_PER_CORE))
ulimit -s unlimited
timestamp=$(date +%Y-%m-%d-%T)
mkdir HACC_${NODES}_FQ_numa_$((COBALT_JOBID))_${timestamp}
cd HACC_${NODES}_FQ_numa_$((COBALT_JOBID))_${timestamp}
cp ../counters.txt .
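# Five runs with the default GNI routing mode (no MPICH_GNI_ROUTING_MODE override)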
for I in {1..5}
do
aprun -n ${MPI_RANKS} -N ${RANKS_PER_NODE} \
-cc depth -d ${OMP_NUM_THREADS} -j ${THREADS_PER_CORE} \
--env OMP_NUM_THREADS=${OMP_NUM_THREADS} \
numactl -p 1 /projects/networkbench/HACC_Adrian/contention/src/cpu/theta/hacc_tpm ../indat ../cmbM000.tf m000 INIT ALL_TO_ALL -w -R -N 256 2>&1 | tee HACC_MPI_${MPI_RANKS}_TH_${OMP_NUM_THREADS}_${I}.output
mv networktiles.1.yaml networktiles${I}.1.yaml
mv proctiles.1.yaml proctiles${I}.1.yaml
done
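# Repeat the five runs with ADAPTIVE_3 routing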
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
for I in {1..5}
do
aprun -n ${MPI_RANKS} -N ${RANKS_PER_NODE} \
-cc depth -d ${OMP_NUM_THREADS} -j ${THREADS_PER_CORE} \
--env OMP_NUM_THREADS=${OMP_NUM_THREADS} \
numactl -p 1 /projects/networkbench/HACC_Adrian/contention/src/cpu/theta/hacc_tpm ../indat ../cmbM000.tf m000 INIT ALL_TO_ALL -w -R -N 256 2>&1 | tee ad3_HACC_MPI_${MPI_RANKS}_TH_${OMP_NUM_THREADS}_${I}.output
mv networktiles.1.yaml ad3_networktiles${I}.1.yaml
mv proctiles.1.yaml ad3_proctiles${I}.1.yaml
done
git clone https://github.com/milc-qcd/milc_qcd.git
#specific version 201510211317
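The directory name milc_qcd-git-201510211317 used in the script below suggests a
snapshot of the master branch from around 2015-10-21 13:17. A rough sketch of
pinning the clone and building the su3_rhmd_hisq target follows; the timestamp
interpretation and the Theta compiler/Makefile settings are assumptions, not
recorded here:
git clone https://github.com/milc-qcd/milc_qcd.git milc_qcd-git-201510211317
cd milc_qcd-git-201510211317
git checkout "$(git rev-list -n 1 --before='2015-10-21 13:17' master)"
# MILC convention: copy the top-level Makefile into the application directory,
# adjust the compiler/MPI settings for the Cray environment, then build the target
cd ks_imp_rhmc
cp ../Makefile .
make su3_rhmd_hisq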
#!/bin/bash -x
#COBALT -A Performance
#COBALT -n 256
#COBALT -t 260
#COBALT --jobname MILC_small_pat_fq
#COBALT -O 256nodes-pat-fq-$COBALT_JOBID
#COBALT --attrs mcdram=flat:numa=quad
timestamp=$(date +%Y-%m-%d-%T)
mkdir MILC_256_$((COBALT_JOBID))_${timestamp}
cd MILC_256_$((COBALT_JOBID))_${timestamp}
cp ../counters.txt .
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/test
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/main
echo $COBALT_JOBID
echo $COBALT_JOBSIZE
echo $COBALT_PARTNAME
export AP_OUTPUT_SYS=0
export AP_OUTPUT_LOCAL=1
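# Lattice dimensions (96^3 x 48) and ranks per node for the l9648 case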
nx=96
ny=96
nz=96
nt=48
rpn=64
source_directory=${source_directory:-${PWD}}
# for getresults.sh script processing
echo "$rpn $(($COBALT_JOBSIZE * $rpn)) $nx $ny $nz $ny ./su3_rhmd_hisq l9648_4steps.in"
ln -s ${source_directory}/../runs/rationals.sample.su3_rhmc_hisq
ulimit -c unlimited
# MCDRAM
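# Three repetitions; each iteration runs the case four ways: ADAPTIVE_0 vs ADAPTIVE_3 routing, without and then with 4-D rank reordering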
for I in {1..3}
do
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
numactl -m 1 ${source_directory}/../milc_qcd-git-201510211317/ks_imp_rhmc/su3_rhmd_hisq ${source_directory}/../runs/l9648_4steps.in > run${I}.out
mv networktiles.1.yaml networktiles${I}.1.yaml
mv proctiles.1.yaml proctiles${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
numactl -m 1 ${source_directory}/../milc_qcd-git-201510211317/ks_imp_rhmc/su3_rhmd_hisq ${source_directory}/../runs/l9648_4steps.in > run${I}_ad3.out
mv networktiles.1.yaml networktiles${I}_ad3.1.yaml
mv proctiles.1.yaml proctiles${I}_ad3.1.yaml
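# Repeat both routing modes with a custom 4-D grid rank reordering (--dims mapped in --node_block blocks)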
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0
export MPICH_RANK_REORDER_METHOD=4
export MPICH_RANK_REORDER_OPTS="--ndims=4 --node_block=4,4,2,2 --dims=16,16,8,8"
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
numactl -m 1 ${source_directory}/../milc_qcd-git-201510211317/ks_imp_rhmc/su3_rhmd_hisq ${source_directory}/../runs/l9648_4steps.in > pmi_run_${I}.out
mv networktiles.1.yaml pmi_networktiles${I}.1.yaml
mv proctiles.1.yaml pmi_proctiles${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
numactl -m 1 ${source_directory}/../milc_qcd-git-201510211317/ks_imp_rhmc/su3_rhmd_hisq ${source_directory}/../runs/l9648_4steps.in > pmi_run_${I}_ad3.out
mv networktiles.1.yaml pmi_networktiles${I}_ad3.1.yaml
mv proctiles.1.yaml pmi_proctiles${I}_ad3.1.yaml
export MPICH_RANK_REORDER_METHOD=1
# 1 Specifies SMP-style placement. This is the default
# aprun placement. For a multi-core node, sequential
# MPI ranks are placed on the same node.
done
rc=$?
exit $rc
#!/bin/bash -x
#COBALT -A Performance
#COBALT -n 256
#COBALT -t 150
#COBALT --attrs mcdram=flat:numa=quad
#COBALT -q default
#COBALT -O Nek5000-fq-256nodes-$COBALT_JOBID
time_stamp=$(date +%Y-%m-%d-%T)
mkdir 256_FQ_$((COBALT_JOBID))_${time_stamp}
cd 256_FQ_$((COBALT_JOBID))_${time_stamp}
#qstat -l -f > joblist-$((COBALT_JOBID))_${time_stamp}
cp ../counters.txt .
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/test
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/main
export AP_OUTPUT_SYS=0
export AP_OUTPUT_LOCAL=1
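# Copy the tcc case files (mesh, parameters, restart fields, SIZE) and the pre-built nek5000 binary into the run directory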
cp ../tcc.rea .
cp ../tcc.par .
cp ../tcc.re2 .
cp ../tcc0.f00021 .
cp ../tcc.usr .
cp ../tcc.map .
cp ../pretcc0.f00001 .
cp ../nek5000 .
cp ../SIZE .
cp ../SESSION.NAME .
cp ../ExhLift.dat .
cp ../filename.gfldr .
cp ../IntkLift.dat .
rpn=64
thr=1
case=tcc
#echo $case > SESSION.NAME
#echo `pwd`'/' >> SESSION.NAME
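# Standard Nek5000 pre-run housekeeping: refresh the .rea timestamp and rotate old history/schedule files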
touch $case.rea
rm -f ioinfo
mv -f $case.his $case.his1
mv -f $case.sch $case.sch1
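# Three runs with the default routing mode; numactl -p 1 prefers MCDRAM in flat mode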
for I in {1..3}
do
aprun -n $((COBALT_JOBSIZE*rpn)) \
-N $rpn \
-d $thr \
-cc depth \
-j 1 \
numactl -p 1 ./nek5000 > run256_$((COBALT_JOBID))_${I}.out
mv networktiles.1.yaml networktiles_$((COBALT_JOBID))_${I}.1.yaml
mv proctiles.1.yaml proctiles_$((COBALT_JOBID))_${I}.1.yaml
done
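# Three more runs with ADAPTIVE_3 routing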
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
for I in {1..3}
do
aprun -n $((COBALT_JOBSIZE*rpn)) \
-N $rpn \
-d $thr \
-cc depth \
-j 1 \
numactl -p 1 ./nek5000 > ad3_run256_$((COBALT_JOBID))_${I}.out
mv networktiles.1.yaml ad3_networktiles_$((COBALT_JOBID))_${I}.1.yaml
mv proctiles.1.yaml ad3_proctiles_$((COBALT_JOBID))_${I}.1.yaml
done
exit $?
wget https://github.com/Nek5000/Nek5000/releases/download/v19.0/Nek5000-19.0.tar.gz
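A hedged sketch of unpacking the release and rebuilding the nek5000 binary for
the tcc case (the case directory name and the compiler settings inside makenek
are assumptions; the script above simply copies a pre-built binary):
tar -xzf Nek5000-19.0.tar.gz
export PATH=$PWD/Nek5000/bin:$PATH
cd tcc_case_dir          # hypothetical directory holding tcc.usr, SIZE, tcc.rea/.re2/.map
makenek tcc              # builds ./nek5000 with the compilers configured in makenek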
Refer to https://www.alcf.anl.gov/support-center/theta/qbox-theta for Qbox on Theta.
#!/bin/bash -x
#COBALT -A Performance
#COBALT -n 256
#COBALT -t 150
#COBALT --jobname QB_fq
#COBALT -O 256nodes-new-pat-fq-$COBALT_JOBID
#COBALT --attrs mcdram=flat:numa=quad
timestamp=$(date +%Y-%m-%d-%T)
mkdir QB_256_$((COBALT_JOBID))_${timestamp}
cd QB_256_$((COBALT_JOBID))_${timestamp}
cp ../counters.txt .
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/test
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/main
export AP_OUTPUT_SYS=0
export AP_OUTPUT_LOCAL=1
echo $COBALT_JOBID
echo $COBALT_JOBSIZE
echo $COBALT_PARTNAME
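# Copy the Qbox input decks and XML files into the run directory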
cp ../gs2.i .
cp ../gs1.i .
cp ../gs_opt1.i .
cp ../gs_opt2.i .
cp ../sic512.i .
cp ../*.xml .
rpn=64
# MCDRAM
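# Four repetitions alternating ADAPTIVE_0 and ADAPTIVE_3 routing; 64 single-threaded ranks per node, memory preferring MCDRAM (numactl -p 1)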
for I in {1..4}
do
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
-d 1 -j 1 -e OMP_NUM_THREADS=1 \
numactl -p 1 ../theta_libsci_opt_ap ../gs1.i > run${I}.out
mv networktiles.1.yaml networktiles_$COBALT_JOBID.${I}.1.yaml
mv proctiles.1.yaml proctiles_$COBALT_JOBID.${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
aprun -n $((COBALT_JOBSIZE * $rpn)) \
-N $rpn \
-d 1 -j 1 -e OMP_NUM_THREADS=1 \
numactl -p 1 ../theta_libsci_opt_ap ../gs1.i > ad3_run${I}.out
mv networktiles.1.yaml ad3_networktiles_$COBALT_JOBID.${I}.1.yaml
mv proctiles.1.yaml ad3_proctiles_$COBALT_JOBID.${I}.1.yaml
done
rc=$?
exit $rc
#!/bin/bash -x
#COBALT -A Performance
#COBALT -n 256
#COBALT -t 90
#COBALT -q default
#COBALT --attrs mcdram=flat:numa=quad
#COBALT -O Rayleigh.jid$COBALT_JOBID
######COBALT --attrs mcdram=flat:numa=quad
time_stamp=$(date +%Y-%m-%d-%T)
mkdir 256_FQ_R_$((COBALT_JOBID))_${time_stamp}
cd 256_FQ_R_$((COBALT_JOBID))_${time_stamp}
qstat -l -f > joblist-$((COBALT_JOBID))_${time_stamp}
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/test
aprun -n $((COBALT_JOBSIZE)) -N 1 -d 1 -cc depth /projects/networkbench/aries-topo/src/main
export AP_OUTPUT_SYS=0
export AP_OUTPUT_LOCAL=1
echo $COBALT_JOBID
echo $COBALT_JOBSIZE
echo $COBALT_PARTNAME
thr=1
cp ../counters.txt ./
cp ../main_input ./
cp ../make_dirs ./
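# Sweep combinations of point-to-point (MPICH_GNI_ROUTING_MODE) and all-to-all (MPICH_GNI_A2A_ROUTING_MODE) adaptive routing; 16384 ranks = 256 nodes x 64 ranks/node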
for I in {1..1}
do
#aprun -n 16384 -N 64 numactl -p 1 ../../bin/rayleigh.avx512 -niter 100 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > Rayleigh.jid$COBALT_JOBID.run_${I}.out
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_0
aprun -n 16384 -N 64 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad0_ad0.out
#aprun -n 16384 -N 64 numactl -p 1 ../../bin/rayleigh.avx512 -niter 100 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad0_ad0.out
mv networktiles.1.yaml ad0_ad0_networktiles.${I}.1.yaml
mv proctiles.1.yaml ad0_ad0_proctiles.${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_1
#aprun -n 16384 -N 64 numactl -p 1 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad0_ad1.out
aprun -n 16384 -N 64 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad0_ad1.out
mv networktiles.1.yaml ad0_ad1_networktiles.${I}.1.yaml
mv proctiles.1.yaml ad0_ad1_proctiles.${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_3
#aprun -n 16384 -N 64 numactl -p 1 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad3_ad3.out
aprun -n 16384 -N 64 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad3_ad3.out
mv networktiles.1.yaml ad3_ad3_networktiles.${I}.1.yaml
mv proctiles.1.yaml ad3_ad3_proctiles.${I}.1.yaml
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_1
#aprun -n 16384 -N 64 numactl -p 1 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad3_ad1.out
aprun -n 16384 -N 64 ../../bin/rayleigh.avx512 -niter 50 -nprow 128 -npcol 128 -nr 512 -ntheta 3072 > run_${I}_ad3_ad1.out
mv networktiles.1.yaml ad3_ad1_networktiles.${I}.1.yaml
mv proctiles.1.yaml ad3_ad1_proctiles.${I}.1.yaml
done
Autoperf documentation: https://www.alcf.anl.gov/user-guides/automatic-performance-collection-autoperf