Gromacs 5.x on TITAN Cray machine
For compilation instructions, check out https://groups.google.com/d/msg/plumed-users/Tx29XNNRq8o/xeAu7RNaBAAJ
For a while we've been preparing to run some simulations on this machine, hosted by Oak Ridge National Lab. Every cluster is a bit different, and that's definitely true for TITAN: each node ("box") has 16 CPU cores (arranged in 2 NUMA domains) and 1 NVIDIA K20 GPU. There is no usual mpirun or InfiniBand; there is a Cray-specific launcher (aprun) and interconnect instead.
Here is an example submission script for a non-replica exchange simulation:
#!/bin/bash
module add gromacs/5.0.2
# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
# important - allow the GPU to be shared between MPI ranks
export CRAY_CUDA_MPS=1
# on the Cray, aprun plays the role of mpirun
mpirun=$(which aprun)
application=$(which mdrun_mpi)
options="-v -maxh 0.2 -s tpr/topol0.tpr"
gpu_id=000000000000 # only 12 digits, discard the last '0000'
# 32 MPI ranks in total (-n), 16 per node (-N)
$mpirun -n 32 -N 16 $application -gpu_id $gpu_id $options
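Typing the -gpu_id string by hand is easy to get wrong, so here is a minimal sketch of building it instead. The helper name make_gpu_id is made up for this post and is not part of GROMACS; the sketch only repeats one GPU id a given number of times. The count of 12 is simply the value used in the script above.

```shell
# Hypothetical helper (not a GROMACS tool): repeat a GPU id "count" times.
make_gpu_id() {
    gpu=$1
    count=$2
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        out="$out$gpu"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Same string as in the submission script: GPU 0 repeated 12 times.
gpu_id=$(make_gpu_id 0 12)
echo "$gpu_id"
```

The result can be passed to mdrun_mpi through -gpu_id exactly as in the script.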
Submit with
$ qsub -l walltime=1:00:00 -l nodes=2 submit.sh
This requests 2 nodes (boxes) and starts 32 MPI ranks, 16 per node. Implicitly, this results in 1 OpenMP thread per MPI rank. Single-node benchmarks explain this choice:
#nodes,n,N,performance
1,16,2,5.346 ns/day
1,16,4,10.590 ns/day
1,16,8,17.402 ns/day
1,16,16,33.137 ns/day
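The first column is the number of nodes (1), the second column is the 16 MPI ranks requested with -n, and the third column is -N, the number of MPI ranks per node. At -N 2 we get 8 OpenMP threads per MPI rank, which is known to be non-ideal. Throughput climbs steadily with -N, from 5.346 ns/day at -N 2 to 33.137 ns/day at -N 16, so 16 ranks per node with 1 OpenMP thread each is the best setting here.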
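The relation between -N and the OpenMP thread count can be sketched as simple integer division. This assumes all 16 cores of a node are in use and each thread gets its own core; the function name threads_per_rank is made up for illustration.

```shell
# Hypothetical helper: OpenMP threads per MPI rank when the cores of one
# node are split evenly between the ranks placed on that node.
threads_per_rank() {
    cores_per_node=$1
    ranks_per_node=$2
    echo $((cores_per_node / ranks_per_node))
}

# The four -N values benchmarked above, on a 16-core node.
for n_per_node in 2 4 8 16; do
    echo "-N $n_per_node -> $(threads_per_rank 16 "$n_per_node") OpenMP thread(s) per rank"
done
```

For -N 2 this gives 8 threads per rank, and for -N 16 it gives 1, matching the two cases described in the text.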
Keeping -N 16 constant, let's add more nodes:
#setup,performance (ns/day)
nodes-1__n-16_N-16,33.137
nodes-2__n-32_N-16,49.975
nodes-4__n-64_N-16,78.244
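To judge how well the multi-node runs scale, the numbers above can be turned into speedup and parallel efficiency relative to the single-node run. The following is a rough sketch using awk. The function name scaling and the "nodes,performance" input format are choices made for this example, not anything GROMACS produces.

```shell
# Hypothetical helper: read "nodes,ns_per_day" lines on stdin and print
# speedup and parallel efficiency relative to a single-node baseline ($1).
scaling() {
    awk -F, -v base="$1" '{
        speedup = $2 / base
        efficiency = 100 * speedup / $1
        printf "%d nodes: speedup %.2f, efficiency %.1f%%\n", $1, speedup, efficiency
    }'
}

# Node counts and ns/day taken from the table above.
printf '%s\n' "1,33.137" "2,49.975" "4,78.244" | scaling 33.137
```

With these numbers the efficiency comes out at about 75% on 2 nodes and 59% on 4 nodes, so adding nodes still helps, but each extra node contributes less.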