Tardis Cluster User Guide
Panruo Wu pwu011@cs.ucr.edu
TARDIS (Time And Relative Dimension In Space); the name comes from the BBC TV series Doctor Who.
FAQ
Q: How to request only GPU nodes or Non-GPU nodes?
Each node is tagged with a property: compute nodes are tagged with "nogpu" and GPU nodes are tagged with "gpu". When requesting resources in the job file, use something like the following to restrict your job to GPU or non-GPU nodes:
# request only compute nodes
#PBS -l nodes=8:nogpu:ppn=32
# request only gpu nodes
#PBS -l nodes=1:gpu:ppn=1
Q: How to specify the nodes I want?
The node names in the resource manager are node[01-12].cluster and gpu[01-04].cluster. To request a specific node for a job:
#PBS -l nodes=gpu01.cluster:ppn=32
Cluster Overview
TARDIS is a cluster of 16 computing nodes and a head node. The head node is used for writing/compiling/debugging programs, file serving, and job management. Please try NOT to run heavy programs on the head node for extended periods, as that can degrade other users' login experience.
Among the 16 computing nodes, 12 are "pure" computing nodes named node01-node12 without GPUs, and 4 are GPU nodes named gpu01-gpu04, each with 2 NVIDIA Tesla M2075 GPUs. Except for the GPUs, all nodes are almost identical in CPUs, memory and interconnect. Each node has 2 AMD Opteron 6272 CPUs totaling 32 cores, 64GB ECC memory and a 40Gb/s high-bandwidth, low-latency InfiniBand link. All nodes are connected to an 18-port Mellanox InfiniBand switch.
The head node has a RAID 6 disk array with 8TB of usable storage for users. Each node has 1TB of storage for local use. The /home and /act directories are exported to all nodes via NFS. All nodes run the CentOS 6.3 operating system.
Account/Login
TARDIS sits behind the department firewall, which means that only computers within the UCR CS department network can access it. If your computer is outside the CS network, you have to log in to bell/bass first.
To request an account, send an email to Panruo (pwu011@cs.ucr.edu) describing your affiliation with the program.
Hardware
512 computing cores on 16 compute nodes, integrated by Advanced Clustering Technology. All machines are 2-way SMPs with AMD Opteron 6272 processors, 16 cores per socket, totaling 32 cores per node.
6272 Opteron Processor Specifications:
- 16 cores, 16 threads
- frequency: 2.1 GHz base; Turbo up to 2.4 GHz with more than 8 cores active, 3.0 GHz with fewer than 8 cores active
- bus speed: 3.2 GHz HyperTransport links (6.4 GT/s)
- L1 cache: 8x64 KB 2-way associative shared instruction caches, 16x16 KB 4-way associative data caches
- L2 cache: 8x2 MB 16-way associative shared exclusive caches
- L3 cache: 2x8 MB up to 64-way associative shared caches
- Average CPU power 80 Watt, thermal design power 115 Watt
Nodes:
- 64GB ECC protected DRAM
- on-board QDR InfiniBand adapter or Mellanox ConnectX-3 VPI adapter card: 40Gb/s QSFP, PCIe 3.0 x8 8GT/s and 40GbE
- Nvidia Tesla M2075 1.5GHz, 6GB GDDR5, 448 CUDA cores (2 GPUs per GPU node)
- 1TB temporary disk space each node
- 16 TB hard drive on head node configured in RAID 6; 8TB usable.
- 12 computing nodes: node01-node12
- 4 GPU nodes: gpu01-gpu04
- 1 login node: tardis.cs.ucr.edu
Network:
- Mellanox 18-port QDR InfiniBand switch with QSFP ports
- 24-port Gigabit Ethernet switch
Software
All software is 64-bit.
- CentOS 6.3 Linux x86_64 (compatible with Red Hat Enterprise 6.3)
- OpenFabrics 1.5.4.1 (comes with CentOS 6.3)
- gcc/gfortran 4.4/4.6.2/4.7.2 (C/Fortran compilers)
- MVAPICH2 (recommended InfiniBand MPI library)
- MPICH2
- OpenMPI
- Torque/PBS (batch job manager)
- OpenMP (via GCC)
- Nvidia CUDA 5
- ACML: Highly optimized BLAS, LAPACK, FFT
Modules
There's a convenient tool called Modules installed on the cluster for managing different versions of software (notably compilers and MPI implementations).
Without such a tool, switching between different compilers or MPI implementations means tediously rewriting $PATH, $LD_LIBRARY_PATH, etc. by hand, and it's even more tedious if we want to switch back and forth for testing. Modules provides a clean and graceful way to do exactly this.
For example, the system default (CentOS 6.3) GCC compiler is version 4.4; we might want to use a newer version like 4.7.2. First we want to see which compilers are available to us:
[pwu@head ~]$ module avail
------------------------------ /usr/share/Modules/modulefiles ------------------------------
dot module-cvs module-info modules null use.own
------------------------------------- /act/modulefiles -------------------------------------
gcc-4.6.2 mvapich2-1.9a2/gnu-4.6.2 openmpi-1.6/gnu
gcc-4.7.2 open64 openmpi-1.6/gnu-4.6.2
mpich2/gnu openmpi-1.4/gnu
mpich2/gnu-4.6.2 openmpi-1.4/gnu-4.6.2
-------------------------------- /home/pwu/pub/modulefiles ---------------------------------
bupc-2.16.0 mvapich2-1.9a2/gnu-4.6.2 openmpi-1.6.4/gnu-4.7.2
intel-13.2 mvapich2-1.9b/gnu-4.7.2
mpich-3.0.2/gnu-4.7.2 mvapich2-1.9b/intel-13.2
We see that gcc-4.6.2 and gcc-4.7.2 are available for us to use. The command
module load gcc-4.6.2
will set up all the environment variables for us to use the new GCC 4.6.2 instead of the system GCC 4.4. If we want to revert back to GCC 4.4, we just issue
module unload gcc-4.6.2
The command
module purge
will unload all modules, and
module list
shows your currently loaded modules.
Note that it's usually not a good idea to use a compiler different from the one that built the MPI implementation you are going to use. For example, it's recommended to load both gcc-4.6.2 and mvapich2-1.9a2/gnu-4.6.2 at the same time.
If you install an even newer version of GCC in your home directory, you can write a simple modulefile so that Modules can manage it just as above. Please consult the Modules website for more information.
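For example (a sketch only; $HOME/modulefiles is an illustrative location, not a directory that exists by default), if you keep your personal modulefiles under $HOME/modulefiles, you can make them visible to Modules with:
module use $HOME/modulefiles
module avail
After that, your own modulefiles are listed by module avail alongside the system ones, much like the /home/pwu/pub/modulefiles entries shown above.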
A Sample MPI program
LLNL provides very good MPI tutorials.
Here's a hello world MPI program, and how to compile and run it.
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    MPI_Finalize();
    return 0;
}
To compile it, we first load the appropriate MPI library via Modules, then use the MPI compiler wrapper to compile the code:
[pwu@head tmp]$ module purge
[pwu@head tmp]$ module load mvapich2-1.9a2/gnu-4.6.2
[pwu@head tmp]$ module load gcc-4.6.2
[pwu@head tmp]$ mpicc hello.c -o hello
To run the hello program, we need to submit a job via the Torque/PBS batch system. Here's an example job specification file hello.job:
#!/bin/sh
#PBS -l nodes=4:ppn=2
module purge
module load mvapich2-1.9a2/gnu-4.6.2
module load gcc-4.6.2
cd $PBS_O_WORKDIR
mpirun ./hello
The job file is basically a bash script. The line
#PBS -l nodes=4:ppn=2
says that we are requesting 4 nodes, with 2 processes/cores per node, to run our hello program. We then use module to load the appropriate MPI libraries, change to the directory from which the job was submitted (cd $PBS_O_WORKDIR), and run the program. Different MPI implementations have different mpirun or mpiexec formats; it happens that for MVAPICH2 1.9a2 it's simply mpirun ./hello.
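For comparison (a generic illustration only, not a command you need on this cluster), MPICH-style launchers are typically invoked as
mpiexec -n 8 ./hello
where -n gives the total number of processes; check your MPI library's documentation for the exact form.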
Now the job is in the queue waiting for resources. You can use qstat to check the current queue status. Once the job has finished, there will be two files in your working directory, named after your job file name plus the job number. In this particular case,
[pwu@head tmp]$ cat hello.job.o184
Number of tasks= 8 My rank= 0 Running on gpu02.cluster
Number of tasks= 8 My rank= 1 Running on gpu02.cluster
Number of tasks= 8 My rank= 6 Running on gpu01.cluster
Number of tasks= 8 My rank= 2 Running on gpu03.cluster
Number of tasks= 8 My rank= 4 Running on gpu04.cluster
Number of tasks= 8 My rank= 7 Running on gpu01.cluster
Number of tasks= 8 My rank= 3 Running on gpu03.cluster
Number of tasks= 8 My rank= 5 Running on gpu04.cluster
hello.job.o184 is the standard output of our program, and hello.job.e184 is the standard error output.
Compilers/Libraries
Here's a compiler options quick reference card for our AMD Opteron 6272 processors.
Currently GCC 4.4/4.6/4.7 are present on our system for immediate use. The Open64 compiler suite from AMD can also be used. For non-commercial personal use, you can obtain a copy of the Intel compilers from Intel for free.
As we can see from the module avail command, there are a number of MPI libraries available, including MPICH2, MVAPICH2 and OpenMPI, compiled with different compilers. It's recommended to use the mvapich2/gnu-4.6.2 suite, since MVAPICH2 is designed for our fast InfiniBand network. You can also compile your own MPI library and use Modules to manage it. You are welcome to share it with other users on this cluster by making your software and modulefiles public; see /home/pwu/pub/modulefiles for an example.
For math libraries, the AMD Core Math Library (ACML) is a good choice. It provides optimized BLAS/LAPACK libraries, FFTs and more, and works with many kinds of compilers. Use it if you can. If you install the Intel compilers, they come with the Intel Math Kernel Library (MKL), which provides more functionality and is quite performant, but it might not perform as well on AMD CPUs.
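As a rough sketch of how to link against ACML (the install prefix and version below are assumptions; check where ACML actually lives on the cluster and adjust the path), compiling a C program that calls BLAS/LAPACK might look like:
# hypothetical ACML location -- adjust to the real install prefix
ACML=/opt/acml5.3.1/gfortran64
# -march=bdver1 targets the Bulldozer-based Opteron 6272 (needs GCC 4.6 or newer);
# -lgfortran is needed because the gfortran64 build of ACML uses the Fortran runtime
gcc -O3 -march=bdver1 myprog.c -I$ACML/include -L$ACML/lib -lacml -lgfortran -lm -o myprog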
CUDA
To use CUDA 5 installed in directory /opt/cuda, add the following lines to your .bashrc
# setup for CUDA
export LD_LIBRARY_PATH="/opt/cuda/lib64:$LD_LIBRARY_PATH"
And use GCC 4.4 or GCC 4.6. This version of CUDA does not support GCC 4.7 yet.
If the linker complains about the missing library libcuda.so, run this command from your working directory:
cp /act/cloner/data/images/gpu/data/usr/lib64/libcuda.so* .
There's a bunch of standard sample CUDA programs that you might want to look at; they are located in /opt/cuda/samples. Copy them to your home directory to play with them.
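As a sketch of a typical compile (my_kernel.cu is a placeholder for your own source file, and the PATH line assumes nvcc is not already on your PATH), building for the Tesla M2075 might look like:
export PATH=/opt/cuda/bin:$PATH              # make nvcc visible, if it is not already
nvcc -arch=sm_20 my_kernel.cu -o my_kernel   # sm_20 matches the Tesla M2075 (Fermi, compute capability 2.0)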
If you want to use a single GPU node interactively (with exclusive access to a GPU), you might be interested in the interactive mode of TORQUE/PBS:
$ qsub -I -l nodes=1:ppn=16:gpus=1 -l walltime=12:00:00
Of course you can always use a job file for requesting one node and one GPU device:
#!/bin/sh
#PBS -l nodes=1:ppn=16:gpus=1,walltime=01:00:00
#PBS -N saving_the_universe
#PBS -M doctor.who@tar.dis
#PBS -m abe
module purge
module load gcc-4.6.2
cd $PBS_O_WORKDIR
./save_the_universe
Running Jobs
Batch Job System: This cluster uses a batch job system called TORQUE.
The workflow is like this: you write and compile your code, write a submission file, and submit the job. When the batch system finds enough resources, it grants your job the resources and starts executing it. It's important NOT to bypass the batch system, for example by running mpiexec -f hostfile -np 256 executable directly; doing so confuses the batch system, which then does not know which nodes are free to assign to other jobs. Your job and others' might end up on the same nodes, which degrades performance for both.
The job submission file is essentially a bash script describing how to run the job and what resources the job is requesting. Here's a sample submission file hpl.sub:
#!/bin/sh
#PBS -l nodes=15:ppn=32,walltime=03:00:00
#PBS -N saving_the_earth
#PBS -M doctor.who@tar.dis
#PBS -m abe
module purge
module load gcc-4.6.2
module load mvapich2-1.9a2/gnu-4.6.2
cd $PBS_O_WORKDIR
mpirun ./xhpl
The line
#PBS -l nodes=15:ppn=32,walltime=03:00:00
specifies that your job is requesting 15 nodes with 32 cores per node, and that it should finish within 3 hours of wall time. If it does not, the batch system will kill it. You can also request GPU nodes with something like:
#PBS -l nodes=2:ppn=32:gpus=2
The line
#PBS -N saving_the_earth
specifies the name of your job;
#PBS -M doctor.who@tar.dis
#PBS -m abe
tell the system to email you when your job aborts, begins or ends.
The module commands set up the proper environment on each node. cd $PBS_O_WORKDIR changes to the directory from which you submitted the job, and mpirun ./xhpl actually runs the MPI program on the nodes you requested.
Requesting Exclusive nodes: Sometimes you might want your job to run exclusively on some nodes, even if you only use some but not all of the processors. Let's say we want to run a 16-process job on 4 nodes, with each node running 4 processes, and we require exclusive use of those 4 nodes. To do that, we first request 4 nodes with ppn=32:
#PBS -l nodes=4:ppn=32,walltime=60
then change the mpirun command to
mpirun -n 16 -ppn 4 ./executable
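Putting the two pieces together, a job file for this scenario might look like the following sketch (the module choices and the executable name are placeholders; substitute your own):
#!/bin/sh
#PBS -l nodes=4:ppn=32,walltime=60
module purge
module load gcc-4.6.2
module load mvapich2-1.9a2/gnu-4.6.2
cd $PBS_O_WORKDIR
# 16 processes total, 4 per node, on the 4 exclusively allocated nodes
mpirun -n 16 -ppn 4 ./executable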
Requesting Specific nodes: If for some reason you want to request specific nodes, you can use the node names in your job script. For example, the following line requests 1 core and 1 GPU on node gpu04:
#PBS -l nodes=gpu04.cluster:gpus=1
Job submission/monitoring/cancellation: To submit a job to the batch system:
$ qsub jobfile
To display the jobs in the queue:
$ qstat
To stop and delete a job:
$ qdel jobid
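For example, a typical cycle with the hello job from earlier might look like this (184 is the job number from that example; use whatever ID qsub prints for your job):
$ qsub hello.job
$ qstat
$ qdel 184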
For more information about these commands, please consult their manual pages using the man command.