For a while we've been preparing to run some simulations on this machine, hosted by Oak Ridge National Lab. Every cluster is a bit different, and that's definitely true for TITAN: each box has 16 CPU cores (arranged in 2 NUMA nodes) and 1 NVIDIA K20 GPU. There is no usual MPI or InfiniBand; instead there is some other Cray-specific beast.

Here is an example submission script for a non-replica exchange simulation:


module add gromacs/5.0.2

# important - allow the GPU to be shared between MPI ranks
export CRAY_CUDA_MPS=1

mpirun=$(which aprun)
application=$(which mdrun_mpi)

options="-v -maxh 0.2 -s tpr/topol0.tpr "

gpu_id=000000000000 # only 12, discard last '0000'

$mpirun -n 32 -N 16 $application -gpu_id $gpu_id  $options
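
The hand-typed gpu_id string is easy to get wrong by one digit. Here is a minimal sketch of building it instead, assuming (as in the script above) that every digit is GPU 0 and that 12 digits are wanted. The helper name make_gpu_id is hypothetical and not part of GROMACS or the Cray tools.

```shell
# Hypothetical helper: build a -gpu_id string made of GPU id "0" repeated.
make_gpu_id() {
    # $1 = number of digits wanted
    n=$1
    id=""
    while [ "$n" -gt 0 ]; do
        id="${id}0"
        n=$((n - 1))
    done
    echo "$id"
}

# 12 digits, matching the string used in the script above
gpu_id=$(make_gpu_id 12)
echo "$gpu_id"   # prints 000000000000
```

The result can be passed to mdrun_mpi through -gpu_id exactly as before.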

Submit with:

$ qsub -l walltime=1:00:00 -l nodes=2

This requests 2 boxes and starts 32 MPI ranks, 16 per box. Implicitly, this results in 1 OpenMP thread per MPI rank. Let me explain this choice using single-node benchmarks:

1,16,2,5.346 ns/day
1,16,4,10.590 ns/day
1,16,8,17.402 ns/day
1,16,16,33.137 ns/day

The first column is the number of nodes (1), the second is the number of MPI ranks requested with -n (16), the third is the number of MPI ranks per node (-N), and the last is the resulting performance. At -N 2, each MPI rank gets 8 OpenMP threads, which is known to be non-ideal.
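
The thread counts follow from simple division. Here is a rough sketch, assuming the 16 cores of a box are split evenly among the MPI ranks placed on it; the function name threads_per_rank is made up for illustration.

```shell
# Sketch: OpenMP threads available to each MPI rank on one node.
# Assumes 16 cores per node (from the text above) and an even split
# of cores among the ranks placed on that node.
cores_per_node=16

threads_per_rank() {
    # $1 = cores per node, $2 = MPI ranks per node (aprun -N)
    echo $(( $1 / $2 ))
}

for ranks_per_node in 2 4 8 16; do
    t=$(threads_per_rank "$cores_per_node" "$ranks_per_node")
    echo "-N $ranks_per_node: $t OpenMP thread(s) per MPI rank"
done
```

With -N 16 the split comes out at 1 thread per rank, which is the setup used in the submission script.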

Keeping -N 16 constant, let's add more nodes: