Tardis Cluster User Guide
Panruo Wu pwu011@cs.ucr.edu
TARDIS (Time And Relative Dimension In Space); the name comes from the BBC TV series Doctor Who.
FAQ
Q: How to request only GPU nodes or Non-GPU nodes?
Each node is tagged with a property: compute nodes are tagged with "nogpu" and GPU nodes are tagged with "gpu". When requesting resources in the job file, use something like the following to restrict your job to GPU or non-GPU nodes:
# request only compute nodes
#PBS -l nodes=8:nogpu:ppn=32
# request only gpu nodes
#PBS -l nodes=1:gpu:ppn=1
Q: How to specify the nodes I want?
The node names in the resource manager are node[01-12].cluster and gpu[01-04].cluster. To request a specific node for a job:
#PBS -l nodes=gpu01.cluster:ppn=32
Cluster Overview
TARDIS is a cluster of 16 computing nodes and a head node. The head node is used for writing/compiling/debugging programs, file serving, and job management. Please try NOT to run heavy programs on the head node for extended periods, as that can degrade other users' login experience.
Among the 16 computing nodes, 12 are "pure" computing nodes named node01-node12 without GPUs, and 4 are GPU nodes named gpu01-gpu04, each with 2 NVIDIA Tesla M2075 GPUs. Except for the GPUs, all nodes are almost identical in CPUs, memory and interconnect. Each node has 2 AMD Opteron 6272 CPUs totaling 32 cores, 64GB ECC memory and a 40Gb/s high-bandwidth, low-latency InfiniBand link. All nodes are connected to an 18-port Mellanox InfiniBand switch.
The head node has a RAID 6 disk array with 8TB of usable storage for users. Each node has 1TB of storage for local use. The /home and /act directories are exported to all nodes via NFS. All nodes run the CentOS 6.3 operating system.
Account/Login
TARDIS sits behind the department firewall, which means that only computers within the UCR CS department network can access it. If your computer is outside the CS network, you have to log in to bell/bass first.
To request an account, send an email to Panruo (pwu011@cs.ucr.edu) describing your affiliation with the program.
Hardware
512 computing cores on 16 compute nodes, integrated by Advanced Clustering Technology. All machines are 2-way SMPs with AMD Opteron 6272 processors, 16 cores per socket, totaling 32 cores per node.
6272 Opteron Processor Specifications:
- 16 cores, 16 threads
- frequency: 2.1 GHz base; Turbo up to 2.4 GHz with more than 8 cores active, 3.0 GHz with fewer than 8 cores active
- bus speed: 3.2 GHz HyperTransport links (6.4 GT/s)
- L1 cache: 8x64 KB 2-way associative shared instruction caches, 16x16 KB 4-way associative data caches
- L2 cache: 8x2 MB 16-way associative shared exclusive caches
- L3 cache: 2x8 MB up to 64-way associative shared caches
- Average CPU power 80 Watt, thermal design power 115 Watt
Nodes:
- 64GB ECC protected DRAM
- on-board QDR InfiniBand adapter or Mellanox ConnectX-3 VPI adapter card: 40Gb/s QSFP, PCIe 3.0 x8 8GT/s and 40GbE
- Nvidia Tesla M2075 1.5GHz, 6GB GDDR5, 448 CUDA cores (2 GPUs per GPU node)
- 1TB temporary disk space each node
- 16 TB hard drive on head node configured in RAID 6; 8TB usable.
- 12 computing nodes: node01-node12
- 4 GPU nodes: gpu01-gpu04
- 1 login node: tardis.cs.ucr.edu
Network:
- Mellanox 18-port QDR InfiniBand switch with QSFP ports
- 24-port Gigabit Ethernet switch
Software
All software is 64-bit.
- CentOS 6.3 Linux x86_64 (compatible with Red Hat Enterprise 6.3)
- OpenFabrics 1.5.4.1 (comes with CentOS 6.3)
- gcc/gfortran 4.4/4.6.2/4.7.2 (C/Fortran compilers)
- MVAPICH2 (recommended InfiniBand MPI library)
- MPICH2
- OpenMPI
- Torque/PBS (batch job manager)
- OpenMP (via GCC)
- Nvidia CUDA 5
- ACML: Highly optimized BLAS, LAPACK, FFT
Modules
There's a convenient tool called Modules installed on the cluster for managing different versions of software (notably compilers and MPI implementations).
Without such a tool, switching between different compilers or MPI implementations means tediously rewriting $PATH, $LD_LIBRARY_PATH, etc. by hand, and it's even more tedious if we want to switch back and forth for testing. Modules provides a clean and graceful way to do exactly this.
For example, the system default (CentOS 6.3) GCC compiler is version 4.4; we might want to use a newer version like 4.7.2. First we want to see which compilers are available to us:
[pwu@head ~]$ module avail
------------------------------ /usr/share/Modules/modulefiles ------------------------------
dot module-cvs module-info modules null use.own
------------------------------------- /act/modulefiles -------------------------------------
gcc-4.6.2 mvapich2-1.9a2/gnu-4.6.2 openmpi-1.6/gnu
gcc-4.7.2 open64 openmpi-1.6/gnu-4.6.2
mpich2/gnu openmpi-1.4/gnu
mpich2/gnu-4.6.2 openmpi-1.4/gnu-4.6.2
-------------------------------- /home/pwu/pub/modulefiles ---------------------------------
bupc-2.16.0 mvapich2-1.9a2/gnu-4.6.2 openmpi-1.6.4/gnu-4.7.2
intel-13.2 mvapich2-1.9b/gnu-4.7.2
mpich-3.0.2/gnu-4.7.2 mvapich2-1.9b/intel-13.2
We see that gcc-4.6.2 and gcc-4.7.2 are available for us to use. The command
module load gcc-4.6.2
will set up all the environment variables for us to use the new GCC 4.6.2 instead of the system GCC 4.4. If we want to revert back to GCC 4.4, we just issue
module unload gcc-4.6.2
The command
module purge
will unload all modules, and
module list
shows your currently loaded modules.
Note that it's usually not a good idea to use a compiler different from the one that built the MPI implementation you are going to use. For example, it's recommended to load both gcc-4.6.2 and mvapich2-1.9a2/gnu-4.6.2 at the same time.
If you install an even newer version of GCC in your home directory, you can write a simple modulefile so that Modules can manage it just as above. Please consult the Modules website for more information.
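For example (a sketch only; $HOME/modulefiles is an illustrative location, not a directory that exists by default), if you keep your personal modulefiles under $HOME/modulefiles, you can make them visible to Modules with:
module use $HOME/modulefiles
module avail
After that, your own modulefiles are listed by module avail alongside the system ones, much like the /home/pwu/pub/modulefiles entries shown above.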
A Sample MPI program
LLNL provides very good MPI tutorials.
Here's a hello world MPI program, and how to compile and run it.
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    MPI_Finalize();
    return 0;
}
To compile it, we first load the appropriate MPI library via Modules, then use the MPI compiler wrapper to compile the code:
[pwu@head tmp]$ module purge
[pwu@head tmp]$ module load mvapich2-1.9a2/gnu-4.6.2
[pwu@head tmp]$ module load gcc-4.6.2
[pwu@head tmp]$ mpicc hello.c -o hello
To run the hello program, we need to submit a job via the Torque/PBS batch system. Here's an example job specification file hello.job:
#!/bin/sh
#PBS -l nodes=4:ppn=2
module purge
module load mvapich2-1.9a2/gnu-4.6.2
module load gcc-4.6.2
cd $PBS_O_WORKDIR
mpirun ./hello
The job file is basically a bash script. The line
#PBS -l nodes=4:ppn=2
says that we are requesting 4 nodes, with 2 processes/cores per node, to run our hello program. We then use module to load the appropriate MPI libraries, change to the directory from which the job was submitted (cd $PBS_O_WORKDIR), and run the program. Different MPI implementations have different mpirun or mpiexec formats; it happens that for MVAPICH2 1.9a2 it's simply mpirun ./hello.
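For comparison (a generic illustration only, not a command you need on this cluster), MPICH-style launchers are typically invoked as
mpiexec -n 8 ./hello
where -n gives the total number of processes; check your MPI library's documentation for the exact form.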
Now the job is in the queue waiting for resources. You can use qstat to check the current queue status. Once the job has finished, there will be two files in your working directory, named after your job file name plus the job number. In this particular case,
[pwu@head tmp]$ cat hello.job.o184
Number of tasks= 8 My rank= 0 Running on gpu02.cluster
Number of tasks= 8 My rank= 1 Running on gpu02.cluster
Number of tasks= 8 My rank= 6 Running on gpu01.cluster
Number of tasks= 8 My rank= 2 Running on gpu03.cluster
Number of tasks= 8 My rank= 4 Running on gpu04.cluster
Number of tasks= 8 My rank= 7 Running on gpu01.cluster
Number of tasks= 8 My rank= 3 Running on gpu03.cluster
Number of tasks= 8 My rank= 5 Running on gpu04.cluster
hello.job.o184 is the standard output of our program, and hello.job.e184 is the standard error output.
Compilers/Libraries
Here's a compiler options quick reference card for our AMD Opteron 6272 processors.
Currently GCC 4.4/4.6/4.7 are present on our system for immediate use. The Open64 compiler suite from AMD can also be used. For non-commercial personal use, you can obtain a copy of the Intel compilers from Intel for free.
As we can see from the module avail command, there are a number of MPI libraries available, including MPICH2, MVAPICH2 and OpenMPI, compiled with different compilers. It's recommended to use the mvapich2/gnu-4.6.2 suite, since MVAPICH2 is designed for our fast InfiniBand network. You can also compile your own MPI library and use Modules to manage it. You are welcome to share it with other users on this cluster by making your software and modulefiles public; see /home/pwu/pub/modulefiles for an example.
For math libraries, the AMD Core Math Library (ACML) is a good choice. It provides optimized BLAS/LAPACK libraries, FFTs and more, and works with many kinds of compilers. Use it if you can. If you install the Intel compilers, they come with the Intel Math Kernel Library (MKL), which provides more functionality and is quite performant, but it might not perform as well on AMD CPUs.
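As a rough sketch of how to link against ACML (the install prefix and version below are assumptions; check where ACML actually lives on the cluster and adjust the path), compiling a C program that calls BLAS/LAPACK might look like:
# hypothetical ACML location -- adjust to the real install prefix
ACML=/opt/acml5.3.1/gfortran64
# -march=bdver1 targets the Bulldozer-based Opteron 6272 (needs GCC 4.6 or newer);
# -lgfortran is needed because the gfortran64 build of ACML uses the Fortran runtime
gcc -O3 -march=bdver1 myprog.c -I$ACML/include -L$ACML/lib -lacml -lgfortran -lm -o myprog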
CUDA
To use CUDA 5 installed in directory /opt/cuda, add the following lines to your .bashrc
# setup for CUDA
export LD_LIBRARY_PATH="/opt/cuda/lib64:$LD_LIBRARY_PATH"
And use GCC 4.4 or GCC 4.6. This version of CUDA does not support GCC 4.7 yet.
If the linker complains about the missing library libcuda.so, run this command from your working directory:
cp /act/cloner/data/images/gpu/data/usr/lib64/libcuda.so* .
There's a bunch of standard sample CUDA programs that you might want to look at; they are located in /opt/cuda/samples. Copy them to your home directory to play with them.
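As a sketch of a typical compile (my_kernel.cu is a placeholder for your own source file, and the PATH line assumes nvcc is not already on your PATH), building for the Tesla M2075 might look like:
export PATH=/opt/cuda/bin:$PATH              # make nvcc visible, if it is not already
nvcc -arch=sm_20 my_kernel.cu -o my_kernel   # sm_20 matches the Tesla M2075 (Fermi, compute capability 2.0)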
If you want to use a single GPU node interactively (with exclusive access to a GPU), you might be interested in the interactive mode of TORQUE/PBS:
$ qsub -I -l nodes=1:ppn=16:gpus=1 -l walltime=12:00:00
Of course you can always use a job file for requesting one node and one GPU device:
#!/bin/sh
#PBS -l nodes=1:ppn=16:gpus=1,walltime=01:00:00
#PBS -N saving_the_universe
#PBS -M doctor.who@tar.dis
#PBS -m abe
module purge
module load gcc-4.6.2
cd $PBS_O_WORKDIR
./save_the_universe
Running Jobs
Batch Job System: This cluster uses a batch job system called TORQUE.
The workflow is like this: you write and compile your code, write a submission file, and submit the job. When the batch system finds enough resources, it grants your job the resources and starts executing it. It's important NOT to bypass the batch system, for example by running mpiexec -f hostfile -np 256 executable directly; doing so confuses the batch system, which then does not know which nodes are free to assign to other jobs. Your job and others' might end up on the same nodes, which degrades performance for both.
The job submission file is essentially a bash script describing how to run the job and what resources the job is requesting. Here's a sample submission file hpl.sub:
#!/bin/sh
#PBS -l nodes=15:ppn=32,walltime=03:00:00
#PBS -N saving_the_earth
#PBS -M doctor.who@tar.dis
#PBS -m abe
module purge
module load gcc-4.6.2
module load mvapich2-1.9a2/gnu-4.6.2
cd $PBS_O_WORKDIR
mpirun ./xhpl
The line
#PBS -l nodes=15:ppn=32,walltime=03:00:00
specifies that your job is requesting 15 nodes with 32 cores per node, and that it should finish within 3 hours of wall time. If it does not, the batch system will kill it. You can also request GPU nodes with something like:
#PBS -l nodes=2:ppn=32:gpus=2
The line
#PBS -N saving_the_earth
specifies the name of your job;
#PBS -M doctor.who@tar.dis
#PBS -m abe
tell the system to email you when your job aborts, begins or ends.
The module commands set up the proper environment on each node. cd $PBS_O_WORKDIR changes to the directory from which you submitted the job, and mpirun ./xhpl actually runs the MPI program on the nodes you requested.
Requesting Exclusive nodes: Sometimes you might want your job to run exclusively on some nodes, even if you only use some but not all of the processors. Let's say we want to run a 16-process job on 4 nodes, with each node running 4 processes, and we require exclusive use of those 4 nodes. To do that, we first request 4 nodes with ppn=32:
#PBS -l nodes=4:ppn=32,walltime=60
then change the mpirun command to
mpirun -n 16 -ppn 4 ./executable
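Putting the two pieces together, a job file for this scenario might look like the following sketch (the module choices and the executable name are placeholders; substitute your own):
#!/bin/sh
#PBS -l nodes=4:ppn=32,walltime=60
module purge
module load gcc-4.6.2
module load mvapich2-1.9a2/gnu-4.6.2
cd $PBS_O_WORKDIR
# 16 processes total, 4 per node, on the 4 exclusively allocated nodes
mpirun -n 16 -ppn 4 ./executable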
Requesting Specific nodes: If for some reason you want to request specific nodes, you can use the node names in your job script. For example, the following line requests 1 core and 1 GPU on node gpu04:
#PBS -l nodes=gpu04.cluster:gpus=1
Job submission/monitoring/cancellation: To submit a job to the batch system:
$ qsub jobfile
To display the jobs in the queue:
$ qstat
To stop and delete a job:
$ qdel jobid
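For example, a typical cycle with the hello job from earlier might look like this (184 is the job number from that example; use whatever ID qsub prints for your job):
$ qsub hello.job
$ qstat
$ qdel 184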
For more information about these commands, please consult their manual pages using the man command.