Getting Started
Logging in to the cluster
There are now two separate cluster environments running at WWU: the Computer Science (CSCI) cluster and the College of Science and Engineering (CSE) cluster. Both clusters are interconnected on the backend to allow for sharing of resources as they are available, but it is very important to log in to the correct cluster head node so your jobs run in the correct environment.
The CSE cluster’s head node is accessible at cse-head.cluster.cs.wwu.edu while the CSCI cluster’s head node is accessible at csci-head.cluster.cs.wwu.edu. Please use the appropriate one when connecting and setting up your config file mentioned below. The rest of the document will use the notation HEAD-NODE-NAME, but you will need to replace it with the appropriate address.
Both cluster head nodes are reachable only from on campus or through WWU’s VPN. ATUS has put together a page with information about the VPN, as well as a brief overview. The Computer Science Department’s support team has put together guides with step-by-step screenshots for installing the client and connecting to the VPN on Windows, macOS, and Linux.
Accessing the cluster is done using SSH through the terminal (in Windows you can use PowerShell or the Windows Terminal) and can be done in a number of different ways. The suggested way is to modify your SSH configuration file to define an easy method to access the cluster. Open (or create) your ~/.ssh/config (C:\Users\local_username\.ssh\config in Windows) in your favorite text editor (emacs, vi, nano, …) and paste in the following:
SSH Config
Unix / Linux / macOS – ~/.ssh/config
Windows 10+ – C:\Users\local_username\.ssh\config
Host cluster
    HostName HEAD-NODE-NAME
    Port 22
    User my_wwu_username
Make sure you replace HEAD-NODE-NAME with the correct node name from above, local_username with the username that you use to log in to the local computer, and my_wwu_username with your WWU username. Your WWU username does not include “@wwu.edu”, just the part before it.
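The same entry can also be written from a shell instead of a text editor. The following is a minimal sketch, assuming a POSIX shell; the function name add_cluster_host and its arguments are made up for illustration and are not part of any WWU tooling. It appends the Host cluster entry to whichever config file you name and refuses to add a second copy.

```shell
# Append a "Host cluster" entry to an SSH config file.
# Usage: add_cluster_host <config_file> <head_node> <wwu_username>
# (add_cluster_host is a hypothetical helper name used for illustration.)
add_cluster_host() (
    config_file=$1
    head_node=$2
    wwu_username=$3

    # Create the containing directory (normally ~/.ssh) if it is missing.
    mkdir -p "$(dirname "$config_file")"

    # Refuse to add a second "Host cluster" entry to the same file.
    if [ -f "$config_file" ] && grep -q '^Host cluster$' "$config_file"; then
        echo "Host cluster already defined in $config_file" >&2
        exit 1
    fi

    {
        printf 'Host cluster\n'
        printf '    HostName %s\n' "$head_node"
        printf '    Port 22\n'
        printf '    User %s\n' "$wwu_username"
    } >> "$config_file"

    # Keep the file private to your user account.
    chmod 600 "$config_file"
)
```

For example, running add_cluster_host ~/.ssh/config cse-head.cluster.cs.wwu.edu my_wwu_username would add an entry for the CSE head node.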
Windows SSH Config How-to
Browse to or create C:\Users\local_username\.ssh in a File Explorer window, then click the View tab in the top left and select the File name extensions checkbox so that file extensions are shown.
Right mouse-click in the large empty area and create a new text document. Name the file “config.txt”. Double left mouse-click the file to open it in your text editor and paste the above contents into it. Pressing Control-s will save the file. Close your editor, then right mouse-click on the “config.txt” file and select the rename option from the list. Remove the “.txt” extension, leaving only “config”. You will be prompted with a dialog box asking you to confirm the file rename and telling you it “might become unusable”; select Yes that you want to change it.
If you need to make changes to the SSH config file later, you can open it by dragging it into your text editor of choice, without needing to rename it again.
Important
The rest of this document assumes that you have created a cluster entry in your ~/.ssh/config file named cluster.
Once you have added this entry, you can connect to the cluster by connecting to the VPN and typing the following in your shell/terminal:
ssh cluster
Note
You can open a shell in a lot of different ways:
- Unix / Linux
  This varies greatly between environments, but usually you can find a terminal in the launcher menu for your environment under system utilities.
- Windows 10
  Click the Start menu, then type powershell. Click the PowerShell icon to launch it.
- macOS
  Press Command+Space to open the Spotlight Search, then type terminal.app in the text box and press Return.
Alternatively you can specify everything on the command line each time, which will look something like this:
ssh my_wwu_username@HEAD-NODE-NAME
Attention
The first time you connect to a host via SSH you will get a message that the host key can’t be verified, and a message telling you the fingerprint of the host key so you can manually verify it, such as the following:
USER@host:~$ ssh cluster
The authenticity of host 'cse-head.cluster.cs.wwu.edu (140.160.143.131)' can't be established.
ECDSA key fingerprint is SHA256:DiUEmE/yLuR3jsjkd1d1+xtqzXsb4mrYdlgomDgYis0.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
The following are the host key fingerprints of the CSE and CSCI head nodes:
cse-head:
256 SHA256:DiUEmE/yLuR3jsjkd1d1+xtqzXsb4mrYdlgomDgYis0 cse-head.cluster.cs.wwu.edu (ECDSA)
3072 SHA256:ND1qPMh1/DFRl+lDGsyrrfrzH2mvKARp2rGTAZf0T1c cse-head.cluster.cs.wwu.edu (RSA)
256 SHA256:HaOR/tPAc0UY9sfa+YAevvHOzR9FthYZPYGOLYp4GyA cse-head.cluster.cs.wwu.edu (ED25519)
csci-head:
256 SHA256:f/iM3OK0k7EDwMy3FGNt+ouZ64EYfsUZiIdagDgjlmo csci-head.cluster.cs.wwu.edu (ECDSA)
3072 SHA256:jAlOiSaFhW6KdYt7xoPnZjglPxP6KZ+4DCtCmQ+ISLI csci-head.cluster.cs.wwu.edu (RSA)
256 SHA256:Eh9fYR5KrGu945I3QXOLZfjC38n4Q13wWnnBmPMiwaQ csci-head.cluster.cs.wwu.edu (ED25519)
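Comparing long fingerprints by eye is error-prone, so a small check can help. The sketch below is one possible approach, not an official tool: it stores the six fingerprints listed above and tests whether a string you paste in matches one of them exactly. The name is_known_fingerprint is hypothetical.

```shell
# Fingerprints published in this guide for the two head nodes.
KNOWN_FINGERPRINTS='SHA256:DiUEmE/yLuR3jsjkd1d1+xtqzXsb4mrYdlgomDgYis0
SHA256:ND1qPMh1/DFRl+lDGsyrrfrzH2mvKARp2rGTAZf0T1c
SHA256:HaOR/tPAc0UY9sfa+YAevvHOzR9FthYZPYGOLYp4GyA
SHA256:f/iM3OK0k7EDwMy3FGNt+ouZ64EYfsUZiIdagDgjlmo
SHA256:jAlOiSaFhW6KdYt7xoPnZjglPxP6KZ+4DCtCmQ+ISLI
SHA256:Eh9fYR5KrGu945I3QXOLZfjC38n4Q13wWnnBmPMiwaQ'

# Succeeds only if the argument matches one of the lines above exactly.
# (is_known_fingerprint is a hypothetical helper name.)
is_known_fingerprint() {
    printf '%s\n' "$KNOWN_FINGERPRINTS" | grep -qxF -- "$1"
}
```

You would copy the fingerprint from the SSH prompt and run something like is_known_fingerprint 'SHA256:...' && echo "fingerprint matches" before answering yes.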
If you frequently log in to the cluster or other systems using SSH, it may be advantageous to use SSH keys and run an SSH-Agent, but that’s outside of the scope of this getting started guide.
Copying files to and from the cluster
Copy files to the cluster
Often your data is prepared somewhere else and you just want to load it on the cluster’s filesystem so it can be processed in a job you’re going to schedule. There’s a convenient tool that is part of the SSH package known as SCP, or Secure CoPy. This tool is typically run on your own system, and not on the research cluster itself, due to how the networking is configured. The basic layout of the command is as follows:
scp [options] <src> <dst>
Here the option we might want to set is -r. The -r option says we should “recursively copy entire directories”, which means we can specify a directory as our source (src) and have it and all of its contents copied to the destination (dst). For a complete list of options and more examples, please see the man page for scp(1).
Since we’re copying to the cluster filesystem, the src is a local path to a file or directory, and the dst is a special kind of path that says the username, the address of the server, and the remote machine’s path. Here’s a quick example to copy one file.
scp my_data.txt cluster:
This copies the file my_data.txt in your current directory (you can type pwd to see what your current directory is if your shell doesn’t already show you) to the cluster filesystem and puts it in your home directory. The dst can be broken down to who@where:path. Here the cluster entry from your SSH config supplies the who and where, and since we didn’t specify a path after the colon, it will use our home directory. Next let’s copy a directory of files:
scp -r my_data_dir/ cluster:my_project/
This will take the local directory my_data_dir and put it, and all of its contents, in ~/my_project/my_data_dir.
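If you are not using the cluster entry from your SSH config, the who@where:path form has to be spelled out in full. As a small illustration, the sketch below assembles that string from its three parts; remote_path is a hypothetical helper name, and the username and path in the usage note are placeholders.

```shell
# Build an scp remote path of the form who@where:path.
# Usage: remote_path <username> <host> [path]
# (remote_path is a hypothetical helper name; an empty path means the home directory.)
remote_path() {
    printf '%s@%s:%s\n' "$1" "$2" "${3:-}"
}
```

With this in place, scp -r my_data_dir/ "$(remote_path my_wwu_username HEAD-NODE-NAME my_project/)" would be the long-hand equivalent of the directory copy shown above.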
Copy files from the cluster
Copying files from the cluster is the same as what we did above, except the source and destination are switched. As an example, we can copy a “results” directory after the job has finished running:
scp -r cluster:my_project/results my_results
Here all of the files in “~/my_project/results” will be copied to your local machine into a directory named “my_results”. This can be useful for offline analysis of files after your jobs have completed.
Using the job scheduler
Scheduling background information
The WWU CSE cluster is leveraging the HTCondor job scheduler to send work out to all the nodes participating in the cluster. The “HT” in HTCondor stands for High Throughput, and as such our system is designed for high throughput rather than high performance. To quote the HTCondor docs:
For many research and engineering projects, the quality of the research or the product is heavily dependent upon the quantity of computing cycles available. It is not uncommon to find problems that require weeks or months of computation to solve. Scientists and engineers engaged in this sort of work need a computing environment that delivers large amounts of computational power over a long period of time. Such an environment is called a High-Throughput Computing (HTC) environment. In contrast, High Performance Computing (HPC) environments deliver a tremendous amount of compute power over a short period of time. HPC environments are often measured in terms of FLoating point Operations Per Second (FLOPS). A growing community is not concerned about operations per second, but operations per month or per year. Their problems are of a much larger scale. They are more interested in how many jobs they can complete over a long period of time instead of how fast an individual job can complete.
This falls in line with most of the research we do here at WWU, where we have access to a shared resource among researchers.
Viewing the job queue and status of your jobs
The first and easiest task to perform is to see the jobs in the queue. This will give you a good idea of how much work is being performed and how your work is progressing (if you have anything queued).
condor_q
The condor_q command also takes many options such as -allusers, which will print out the status of all jobs, not just your personal ones.
condor_q -allusers
Testing your software
Before you submit a long-running job it’s a good idea to do a quick test to ensure that your software will run correctly. The front end node is a shared resource where you can log in, test your software, submit jobs, and check their status.
Here are a few things to take into consideration:
- Where does your data live? The /cluster filesystem is mounted on all the nodes, so if your data lives there it can be accessed from the front end and compute nodes.
- Do you need a special environment to run your software?
  - Python can run, and typically does, inside a virtual environment. It’s a good idea to write a shell script that activates this environment so you can have HTCondor run that.
  - OpenMPI based software typically requires you to specify the number of processes to run and potentially multiple hosts. HTCondor can set this up for you automatically, but for now you can just test with a few processes (something small like -np 12) to see if your software runs.
- Be polite when testing. The front end node is a shared resource and should be used for short tests only. This is not where you will run your software long term. Do not start something and intend for it to run for multiple hours.
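The wrapper script suggested for Python virtual environments could look something like the sketch below. It is only an illustration under a few assumptions: the environment was created with the usual bin/activate script, that script can be sourced by the shell running the wrapper, and the name run_in_venv and any paths you pass to it are your own choices rather than anything provided by the cluster.

```shell
# Activate a Python virtual environment, then run a command inside it.
# Usage: run_in_venv <venv_dir> <command> [args...]
# (run_in_venv is a hypothetical helper name.)
run_in_venv() (
    venv_dir=$1
    shift

    if [ ! -f "$venv_dir/bin/activate" ]; then
        echo "no virtual environment found at $venv_dir" >&2
        exit 1
    fi

    # Sourcing the activate script puts the environment's python on PATH.
    . "$venv_dir/bin/activate"

    # Replace this subshell with the requested command so that the
    # command's own exit status is what the caller sees.
    exec "$@"
)
```

A quick test on the front end node might then be run_in_venv "$HOME/my_project/venv" python my_script.py, where both the environment location and the script name are placeholders.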
Submitting a job
Determining the correct universe to run in
HTCondor leverages what it refers to as universes to determine how and where to run your job when it dispatches the job to a compute node.
The most basic universe is known as the vanilla universe. In this universe HTCondor will do nothing more than copy over any files you specify, run the command you specify, and copy any specified results back. Your command will run on a single host and utilize whichever resources you have requested, e.g. 4 CPUs, 1 GPU, etc.
Another universe commonly used here at WWU is the parallel universe. This universe is used when you would like to schedule one job to run on multiple compute nodes. OpenMPI programs can take advantage of this universe and have their workload split across nodes to leverage additional resources.
If your program does not use multiple processes or threads, uses only minimal network connections, and you have access to its source code, then it can be recompiled and linked for the standard universe. Despite its name, standard is not the default universe. This universe provides checkpointing, allowing your job to be snapshotted, stopped, moved to a different node, and restarted, all without losing any work.
If you need a lot of tools in order to run your research it may be advantageous to leverage the Docker universe. Here you can bundle everything up into a container and run everything out of that container. For example, if your work was a mix of shell scripts, Perl scripts, a Python script, and some C++ binaries, you can bundle everything together into a container to ensure that the exact versions of the required tools are available to you when you run.
Writing a job submission script
In order to tell HTCondor how to run your job you’ll need to write a job description file. This tells HTCondor which program to run, and with which arguments, as well as which universe to run the program in, any additional files to copy to/from the compute nodes, and where to save the logs of the work that it performed.
Submitting to the vanilla universe
The most basic and plain universe available to HTCondor is a great starting point for scheduling jobs.
Super basic example
Let’s examine a very basic script to start running code.
1  Universe = vanilla
2  Executable = myprogram
3  Log = condor.log
4  Queue
Line 1 says which universe to use when running the program. Here we don’t need anything special, so vanilla is fine.
Line 2 says what program to run. By default it will run in the current directory where you submitted this script.
Line 3 says where to have condor store its information about what it did in order to schedule and run your program.
Line 4 says that you are ready to queue the job.
This very basic example is perhaps too basic, and will lose the program’s output to stdout and stderr, but we can add two more lines to save them too.
Note
You should not use this example. It’s too basic to be helpful except in explaining the submission file format.
Basic example
Universe = vanilla
Executable = myprogram
Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Queue
The two new lines (Output and Error) will now save the stdout and stderr of the program so that you can see what your program actually displayed on the console.
The other two new lines ask for the amount of resources you want to use. HTCondor will ensure that these resources are available to you before sending your job to the execute node.
Tip
If you don’t request cpus, you will get allocated 1 cpu.
If you don’t request memory, you will get allocated a very small amount of memory, calculated by taking the size of your executable in KB, adding 1023, and then dividing by 1024, which gives a value in MB. This is usually around 1MB.
Basic example with arguments
Universe = vanilla
Executable = myprogram
Arguments = -data myfile.dat
Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Queue
The only addition here is that we’re now passing arguments to our program. This can be very useful to give options to your program if it accepts them on the command line.
Note
This is a good starting point for a submission file.
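To go from this template to a queued job, the file has to be saved and handed to the scheduler with HTCondor’s condor_submit command. The sketch below shows one way to generate the file from a shell. It is an illustration only: write_submit_file and the file name myjob.sub are hypothetical, and the resource requests are simply the values from the example above.

```shell
# Write a vanilla-universe submit file matching the example above.
# Usage: write_submit_file <submit_file> <executable> [arguments...]
# (write_submit_file is a hypothetical helper name.)
write_submit_file() (
    submit_file=$1
    executable=$2
    shift 2

    # "$*" joins any remaining arguments with single spaces.
    cat > "$submit_file" <<EOF
Universe = vanilla
Executable = $executable
Arguments = $*
Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Queue
EOF
)
```

After write_submit_file myjob.sub myprogram -data myfile.dat, running condor_submit myjob.sub on the head node would queue the job, and condor_q would then show it.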
Queueing multiple jobs from the same file
Universe = vanilla
Executable = myprogram
Arguments = -data myfile.dat
Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Initialdir = data_file_directory
Queue

Initialdir = more_data_files_here
Queue
There are two changes here:
- Initialdir sets the working directory in which the command will run.
- There are now two Queue entries.
The first job will change to the directory data_file_directory and then run the command “myprogram -data myfile.dat”. When that program is finished OR there is additional space available, HTCondor will start a second job in the directory “more_data_files_here” and run the same program with the same arguments. This can get crazy fast if you have a lot of directories of files to run, so here’s an alternative.
Queueing multiple jobs from the same file with globbing
Sometimes you might have hundreds of input files that need to be run. If your data is structured in individual directories HTCondor can automate this process for you by “globbing” all possible results in a directory.
Imagine you have a directory named data and inside data you have the directories 1, 2, 3, …, 999. It would be a lot of work to create a submission script that listed all of these directories to run your program inside! Here we can use the matching keyword and a variable to find and represent each possible entry.
Universe = vanilla
Executable = myprogram
Arguments = -data myfile.dat
Output = out.$(Process)
Error = err.$(Process)
Log = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Initialdir = $(dirname)
Queue dirname matching dirs data/*
There are two changes here again:
- All directories under the data directory will be submitted as jobs.
- Each job will get its own output and error file so you know which job had which results.
For each directory HTCondor finds under the data directory it will submit a new job to the scheduler. This job will change to that directory, and run the equivalent of the following.
The first job:
cd ~/data/1 && myprogram -data myfile.dat >out.0 2>err.0
The second job:
cd ~/data/2 && myprogram -data myfile.dat >out.1 2>err.1
Note that we create the output and error log files under each directory, which can let us see the results easily.
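With hundreds of directories, opening every error file by hand is tedious. One way to narrow things down is to list only the err.* files that actually contain something, as in the sketch below. This assumes the layout described above (one err.* file inside each directory under data); list_nonempty_errors is a hypothetical helper name, and a non-empty file only means the program wrote to stderr, not necessarily that the job failed.

```shell
# Print every non-empty err.* file found one level below a data directory.
# Usage: list_nonempty_errors <data_dir>
# (list_nonempty_errors is a hypothetical helper name.)
list_nonempty_errors() {
    for err_file in "$1"/*/err.*; do
        # -s is true only for files that exist and are larger than zero bytes,
        # which also skips the unexpanded pattern when nothing matches.
        if [ -s "$err_file" ]; then
            printf '%s\n' "$err_file"
        fi
    done
}
```

Running list_nonempty_errors data from the directory where you submitted the jobs prints one path per job that produced error output.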
Submitting to the parallel universe
As jobs become more complex and require additional compute resources, the idea is to run your job on multiple machines. Utilizing the parallel universe in HTCondor will allow you to schedule your jobs across machines, though it is up to the application to decide how it should communicate. Luckily, the most common method is MPI, which is fully supported in our environment.
Basic OpenMPI example
HTCondor leverages a helper script to invoke your MPI programs and help set up the background communication. Depending on the software, this often means you simply need to adjust your submission script very slightly to reference this helper script.
Most of the work with OpenMPI isn’t done in the HTCondor submission file, but rather the helper script. HTCondor provides a great example of such a file with openmpiscript.
This openmpiscript will need to be modified to fit your needs, and will not work correctly as is. At the very least, you will need to modify the MPDIR variable to point to the version of MPI that your program is linked against. If you used the system default then it would be: MPDIR=/usr
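The MPDIR change can be made in a text editor, or scripted as in the following sketch. It assumes your copy of the helper script contains a line beginning with MPDIR= and that the new directory contains no | or & characters (they would confuse sed); set_mpdir is a hypothetical helper name.

```shell
# Point the MPDIR line of a copied openmpiscript at a given MPI installation.
# Usage: set_mpdir <script_file> <mpi_dir>
# (set_mpdir is a hypothetical helper; it assumes a line starting with "MPDIR=".)
set_mpdir() (
    script_file=$1
    mpi_dir=$2

    if ! grep -q '^MPDIR=' "$script_file"; then
        echo "no MPDIR= line found in $script_file" >&2
        exit 1
    fi

    # Rewrite into a temporary file, then copy the result back over the
    # original's contents so the script keeps its permissions.
    tmp_file=$(mktemp)
    sed "s|^MPDIR=.*|MPDIR=$mpi_dir|" "$script_file" > "$tmp_file"
    cat "$tmp_file" > "$script_file"
    rm -f "$tmp_file"
)
```

For the system default mentioned above, that would be set_mpdir openmpiscript /usr.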
The openmpiscript provides an example of a submission script that can be used, but it also specifies things that are not needed in the CSE environment, such as the file transfers. Instead, this is an even simpler one:
Universe = parallel
Executable = openmpiscript
Arguments = actual_mpi_job arg1 arg2 arg3
Getenv = false

Output = out.$(NODE)
Error = err.$(NODE)
Log = condor.log

Request_Cpus = 1
Request_Memory = 1GB
Machine_Count = 4

Queue
Removing jobs
Sometimes jobs don’t run as intended, or you realize you’ve made a mistake in your submission script and need to fix it. While there are ways to modify an existing job, it’s often easier just to remove the job, correct it, and resubmit. This also ensures your template is correct for future submissions.
To remove a job you can use the condor_rm command. It can be used in many forms, such as removing all jobs for yourself, removing all jobs in a batch, or removing a single job from a batch.
Perhaps the easiest is the command to remove all jobs for yourself:
condor_rm myclusterusername
Of course, you’ll need to replace myclusterusername with the actual username that you use to login to the cluster.
If you have multiple batches of jobs and just want to remove a certain batch, you can look up the batch number with the condor_q command mentioned above. The “JOB_IDS” column will list the batch ID, as well as the individual jobs in that batch. The batch ID is the number before the ‘.’ character, whereas the actual job number is the full number. To remove the batch, just replace myclusterusername with the batch ID.
condor_rm ###
And of course, replace ### with your batch ID.
Don’t worry about someone else removing your jobs; they don’t have permission to do so. The only people who can remove your jobs are you and the system administrators.
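The difference between a batch ID and a full job ID is easy to mix up, so here is a small sketch of how one relates to the other. It uses plain shell parameter expansion; batch_id_of is a hypothetical helper name and 1234.5 is a made-up job ID.

```shell
# Print the batch ID (the part before the first '.') of a job ID such as 1234.5.
# A value without a '.' is already a batch ID and is printed unchanged.
# (batch_id_of is a hypothetical helper name.)
batch_id_of() {
    printf '%s\n' "${1%%.*}"
}
```

So condor_rm 1234.5 would remove only that single job, while condor_rm "$(batch_id_of 1234.5)" would remove every job in batch 1234.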
Cluster node status
From time to time you may want to get an estimate of what kind of resources are available or in use in the cluster at any given time. You can use the condor_status command for this.
Additional resources
Man pages
In Unix and Unix-like environments the manual pages for nearly everything are available through the man command. Simply type man followed by the command you would like to read the manual page for. So for condor_q you would type:
man condor_q
Which will load the manual page for condor_q(1). You can use the space bar to scroll down a page at a time, as well as the up and down keys to scroll a line at a time.
HTCondor Manual
https://htcondor.readthedocs.io/en/24.0/
Sometimes it can be nice to directly reference the user’s manual for HTCondor. There are many excellent examples of submission scripts and detailed explanations of how things work with HTCondor.
It’s important to look at the correct version of the documentation. The CSaW cluster environment runs the stable release but may lag behind a release or two. Things described in the latest release may not apply to our environment.
Check the F.A.Q.
The Cluster F.A.Q. has solutions to many common questions that you may encounter.
Direct Support
You can always reach out to support to get direct help.
Please visit the Support page under the Contact section to the left.