For compilation instructions checkout this link.

For a while, we’ve been preparing to run some simulations on this machine, hosted by Oak Ridge National Lab. Every cluster is a bit different, and that’s definitely true for TITAN: each box has 16 CPUs (arranged in 2 “numas”) and 1 K20 NVIDIA GPU. There is no usual MPI or Infiniband; instead, there’s some Cray-specific system.

Here is an example submission script for a non-replica exchange simulation:

#!/bin/bash
module add gromacs/5.0.2

cd $PBS_O_WORKDIR

# important - allow the GPU to be shared between MPI ranks
export CRAY_CUDA_MPS=1

mpirun=`which aprun`
application=`which mdrun_mpi`
options="-v -maxh 0.2 -s tpr/topol0.tpr"

gpu_id=000000000000 # only 12, discard last '0000'

$mpirun -n 32 -N 16 $application -gpu_id $gpu_id $options

Submit with:

$ qsub -l walltime=1:00:00 -l nodes=2 submit.sh

Requesting 2 boxes, start 32 MPI processes/ranks, 16 per box. Implicitly, this will result in 1 OpenMP thread per 1 MPI process. From single-node simulations, let me explain this choice:

1,16,2,5.346 ns/day
1,16,4,10.590 ns/day
1,16,8,17.402 ns/day
1,16,16,33.137 ns/day

The first column is nodes=1, the second column is -n 16 MPI ranks requested, and the third column is -N MPI ranks per node. At -N 2, we get 8 OpenMP threads per 1 MPI process, which is known to be non-ideal.

Keeping -N 16 constant, let’s add more nodes:

#setup,performance
nodes-1__n-16_N-16,33.137
nodes-2__n-32_N-16,49.975
nodes-4__n-64_N-16,78.244