Getting Started

Logging in to the cluster

It’s important to note that there are now two separate cluster environments running at WWU: the Computer Science (CSCI) cluster and the College of Science and Engineering (CSE) cluster. Both clusters are interconnected on the backend to allow resources to be shared as they become available, but it is very important to log in to the correct cluster head node so your jobs run in the correct environment.

The CSE cluster’s head node is accessible at cse-head.cluster.cs.wwu.edu, while the CSCI cluster’s head node is accessible at csci-head.cluster.cs.wwu.edu. Please use the appropriate one when connecting and when setting up the config file mentioned below. The rest of the document uses the notation HEAD-NODE-NAME, but you will need to replace it with the appropriate address.

Both cluster head nodes are only reachable from on campus or through WWU’s VPN. ATUS has put together a page with information about the VPN, as well as a brief overview. The Computer Science Department’s support team has put together guides with step-by-step screenshots for installing the client and connecting to the VPN on Windows, macOS, and Linux.

Accessing the cluster is done using SSH through the terminal (in Windows you can use PowerShell or the Windows Terminal) and can be done in a number of different ways. The suggested way is to modify your SSH configuration file to define an easy method of accessing the cluster. Open (or create) your ~/.ssh/config (C:\Users\local_username\.ssh\config in Windows) in your favorite text editor (emacs, vi, nano, …) and paste in the following:

SSH Config

  • Unix / Linux / macOS – ~/.ssh/config

  • Windows 10+ – C:\Users\local_username\.ssh\config

Host cluster
  HostName HEAD-NODE-NAME
  Port 22
  User my_wwu_username

Make sure you replace HEAD-NODE-NAME with the correct node name from above, local_username with the username you use to log in to your local computer, and my_wwu_username with your WWU username. Your WWU username does not include “@wwu.edu”, just the part before it.
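For example, a CSE user whose WWU username is jdoe (a made-up name used here only for illustration) would end up with an entry like this:

Host cluster
  HostName cse-head.cluster.cs.wwu.edu
  Port 22
  User jdoe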

Windows SSH Config How-to

Browse to or create C:\Users\local_username\.ssh in a File Explorer Window, then click the View tab in the top left, and select the File name extensions checkbox so that file extensions are shown.

Right-click in the large empty area and select New ‣ Text Document. Name the file “config.txt”. Double-click the file to open it in your text editor and paste the above contents into it. Press Control-S to save the file. Close your editor, then right-click the “config.txt” file and select Rename from the list. Remove the “.txt” extension, leaving only “config”. You will be prompted with a dialog box asking you to confirm the rename and warning that the file “might become unusable” – select Yes to confirm the change.

If you need to make changes to the SSH config file later, you can open it by dragging it onto your text editor of choice, without needing to rename it again.

Important

The rest of this document assumes that you have created a cluster entry in your ~/.ssh/config file named cluster.

Once you have created this entry you can connect to the cluster by connecting to the VPN and typing the following in your shell/terminal:

ssh cluster

Note

You can open a shell a lot of different ways:

  • Unix / Linux

    This varies greatly between environments, but usually you can find a terminal in the launcher menu for your environment under system utilities.

  • Windows 10

    Click the Start menu, then type powershell. Click the PowerShell icon to launch it.

  • macOS

    Press Command+Space to open the Spotlight Search, then type terminal.app in the text box and press Return.

Alternatively you can specify everything on the command line each time, which will look something like this:

ssh my_wwu_username@HEAD-NODE-NAME

Attention

The first time you connect to a host via SSH you will get a message that the host key can’t be verified, along with the fingerprint of the host key so you can verify it manually, such as the following:

USER@host:~$ ssh cluster
The authenticity of host 'cse-head.cluster.cs.wwu.edu (140.160.143.131)' can't be established.
ECDSA key fingerprint is SHA256:Hoqit68dvsh8HCN9XIhaiqE3jYP6ZF+7RgWvANsvqss.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

The following are the host key fingerprints of the CSE and CSCI head nodes:

  • cse-head:

3072 SHA256:nLWYmuVJLyZ3eH+/bpyvty8y5rE7G5OYQVSf68mkhS4 cse-head.cluster.cs.wwu.edu (RSA)
256 SHA256:Hoqit68dvsh8HCN9XIhaiqE3jYP6ZF+7RgWvANsvqss cse-head.cluster.cs.wwu.edu (ECDSA)
256 SHA256:tTQ9rwqjZd+GCSzxYan1uvSrvvF+VqHXlHCuHuTGSJ4 cse-head.cluster.cs.wwu.edu (ED25519)
  • csci-head:

3072 SHA256:3YtV2yFb+S7jpxIEAhQ1WGmRRD9q7Dyu8T+jGCyTIHI csci-head.cluster.cs.wwu.edu (RSA)
256 SHA256:2IzHLYPKhWR+VVY/8LYvStXITCE7cIhAjJB9lkUMY+U csci-head.cluster.cs.wwu.edu (ECDSA)
256 SHA256:MbvffMsjHnKlg2TaZZgpsAQlJnLlpikzJ3WlgZTaI28 csci-head.cluster.cs.wwu.edu (ED25519)

If you frequently log in to the cluster or other systems using SSH, it may be advantageous to use SSH keys and run an SSH agent, but that’s outside the scope of this getting started guide.
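If you do decide to set up keys later, a minimal sketch looks something like the following, run on your own machine rather than the cluster (accept the defaults when prompted):

ssh-keygen -t ed25519
ssh-copy-id cluster

The ssh-copy-id helper may not be available on Windows; in that case you can append the contents of your public key file to ~/.ssh/authorized_keys on the cluster by hand.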

Copying files to and from the cluster

Copy files to the cluster

Oftentimes your data is prepared somewhere else and you just want to load it onto the cluster’s filesystem so it can be processed in a job you’re going to schedule. There’s a convenient tool that is part of the SSH package known as SCP, or Secure CoPy. This tool is typically run on your system, not on the research cluster itself, due to how the networking is configured. The basic layout of the command is as follows:

scp [options] <src> <dst>

Here the option we might want to set is -r, which says we should “recursively copy entire directories”. This means we can specify a directory as our source (src) and have it and all of its contents copied to the destination (dst). For a complete list of options and more examples, please see the man page for scp(1).

Since we’re copying to the cluster filesystem, the src is a local path to a file or directory, and the dst is a special kind of path that specifies the username, the address of the server, and the path on the remote machine. Here’s a quick example to copy one file.

scp my_data.txt cluster:

This copies the file my_data.txt in your current directory (you can type pwd to see what your current directory is if your shell doesn’t already show you) to the cluster filesystem and puts it in your home directory. The dst can be broken down into who@where:path – in the above example we didn’t specify a path, so it will use our home directory. Next let’s copy a directory of files:

scp -r my_data_dir/ cluster:my_project/

This will take the local directory my_data_dir and put it, and all of its contents, in ~/my_project/my_data_dir.
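If you haven’t set up the SSH config entry, the same copies can be spelled out in the full who@where:path form. For example, the single-file copy above would look something like this (substituting your username and the correct head node):

scp my_data.txt my_wwu_username@HEAD-NODE-NAME: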

Copy files from the cluster

Copying files from the cluster is the same as what we did above, except the source and destination are switched. As an example, we can copy a “results” directory after the job has finished running:

scp -r cluster:my_project/results my_results

Here all of the files in “~/my_project/results” will be copied to your local machine into a directory named “my_results”. This can be useful for offline analysis of files after your jobs have completed.
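You can also pull back a single file rather than a whole directory. Here summary.txt is just a made-up file name, and the trailing . means “the current local directory”:

scp cluster:my_project/results/summary.txt .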

Using the job scheduler

Scheduling background information

The WWU CSE cluster leverages the HTCondor job scheduler to send work out to all the nodes participating in the cluster. The “HT” in HTCondor stands for High Throughput, and as such our system is designed for high throughput rather than high performance. To quote the HTCondor Manual:

For many research and engineering projects, the quality of the research or the product is heavily dependent upon the quantity of computing cycles available. It is not uncommon to find problems that require weeks or months of computation to solve. Scientists and engineers engaged in this sort of work need a computing environment that delivers large amounts of computational power over a long period of time. Such an environment is called a High-Throughput Computing (HTC) environment. In contrast, High Performance Computing (HPC) environments deliver a tremendous amount of compute power over a short period of time. HPC environments are often measured in terms of FLoating point Operations Per Second (FLOPS). A growing community is not concerned about operations per second, but operations per month or per year. Their problems are of a much larger scale. They are more interested in how many jobs they can complete over a long period of time instead of how fast an individual job can complete.

HTCondor Manual

This falls in line with most of the research we do here at WWU, where we have access to a resource shared among researchers.

Viewing the job queue and status of your jobs

The first and easiest task to perform is to see the jobs in the queue. This will give you a good idea of how much work is being performed and how your work is progressing (if you have anything queued).

condor_q

The condor_q command also takes many options, such as -allusers, which will print out the status of all jobs, not just your personal ones.

condor_q -allusers
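A couple of other commonly useful forms, shown here as a sketch (the exact output varies between HTCondor versions):

condor_q -nobatch
condor_q 1234

The first lists each job on its own line instead of grouping jobs into batches, and the second limits the output to a single batch, where 1234 is a made-up batch ID.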

Testing your software

Before you submit a long-running job it’s a good idea to do a quick test to ensure that your software will run correctly. The front end node is a shared resource where you can log in, test your software, submit jobs, and check their status.

Here are a few things to take into consideration:

  1. Where does your data live? The /cluster filesystem is mounted on all the nodes so if your data lives there it can be accessed from the front end and compute nodes.

  2. Do you need a special environment to run your software?

    • Python software typically runs inside a virtual environment. It’s a good idea to write a shell script that activates this environment and then runs your program, so you can have HTCondor run that script (see the sketch after this list).

    • OpenMPI-based software typically requires you to specify the number of processes to run and potentially multiple hosts. HTCondor can set this up for you automatically, but for now you can just test with a few processes (something small like -np 12) to see if your software runs.

  3. Be polite when testing. The front end node is a shared resource and should be used for short tests only. This is not where you will run your software long term. Do not start something that you intend to run for multiple hours.
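For the Python case mentioned above, a minimal wrapper script might look like the following sketch; the environment path and program name (~/venvs/myenv and my_analysis.py) are placeholders that you would replace with your own:

#!/bin/bash
# Activate the project's virtual environment, then run the program,
# passing along any arguments given to this script.
source ~/venvs/myenv/bin/activate
python my_analysis.py "$@"

Make the script executable (chmod +x) and point the Executable line of your submission file at it.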

Submitting a job

Determining the correct universe to run in

HTCondor leverages what it refers to as universes to determine how and where to run your job when it dispatches the job to a compute node.

The most basic universe is known as the vanilla universe. In this universe HTCondor will do nothing more than copy over any files you specify, run the command you specify, and copy any specified results back. Your command will run on a single host and utilize whichever resources you have requested, i.e. 4 CPUs, 1 GPU, etc.

Another universe commonly used here at WWU is the parallel universe. This universe is used when you would like to schedule one job to run on multiple compute nodes. OpenMPI programs can take advantage of this universe and have their workload split across nodes to leverage additional resources.

If your program does not use multiple processes or threads, makes only minimal network connections, and you have access to its source code, then it can be recompiled and linked into the standard universe. Despite its name, standard is not the default universe. This universe provides checkpointing, allowing your job to be snapshotted, stopped, moved to a different node, and restarted, all without losing any work.

If you need a lot of tools in order to run your research it may be advantageous to leverage the Docker universe. Here you can bundle everything up into a container and run everything out of that container. For example, if your work is a mix of shell scripts, Perl scripts, a Python script, and some C++ binaries, you can bundle everything together into a container to ensure that the exact versions of the required tools are available to you when you run.

Writing a job submission script

In order to tell HTCondor how to run your job you’ll need to write a job description file. This tells HTCondor which program to run, and with which arguments, as well as which universe to run the program in, any additional files to copy to/from the compute nodes, and where to save the logs of the work that it performed.
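Once you have written the description file (saved as, say, myjob.sub; the file name is up to you), you hand it to the scheduler with condor_submit and can then watch its progress with condor_q:

condor_submit myjob.sub
condor_q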

Submitting to the vanilla universe

The most basic and plain universe available to HTCondor is a great starting point for scheduling jobs.

Super basic example

Let’s examine a very basic script to start running code.

Super basic HTCondor submission example
1  Universe   = vanilla
2  Executable = myprogram
3  Log        = condor.log
4  Queue
  • Line 1 says which universe to use when running the program. Here we don’t need anything special, so vanilla is fine.

  • Line 2 says what program to run. By default it will run in the current directory where you submitted this script.

  • Line 3 says where to have condor store its information about what it did in order to schedule and run your program.

  • Line 4 says that you are ready to queue the job.

This very basic example is perhaps too basic, and will lose the program’s output to stdout and stderr, but we can add two more lines to save them too.

Note

You should not use this example. It’s too basic to be helpful except in explaining the submission file format.

Basic example
Basic HTCondor submission example
Universe   = vanilla
Executable = myprogram
Output     = out.log
Error      = err.log
Log        = condor.log

Request_Cpus = 2
Request_Memory = 1GB

Queue
  • The two new lines (Output and Error) will now save the stdout and stderr of the program so that you can see what your program actually displayed on the console.

  • The other two new lines ask for the amount of resources you want to use. HTCondor will ensure that these resources are available to you before sending your job to the execute node.

Tip

  • If you don’t request CPUs, you will be allocated 1 CPU.

  • If you don’t request memory, you will be allocated a very small amount, calculated by taking the size of your executable in KB, adding 1023, and dividing by 1024 (i.e. its size rounded up to the nearest MB). This is usually around 1MB; for example, a 100KB executable works out to (100 + 1023) / 1024, or about 1MB.

Basic example with arguments
Basic HTCondor submission example with arguments
Universe   = vanilla
Executable = myprogram
Arguments  = -data myfile.dat
Output     = out.log
Error      = err.log
Log        = condor.log

Request_Cpus   = 2
Request_Memory = 1GB

Queue
  • The only addition here is that we’re now passing arguments to our program. This can be very useful to give options to your program if it accepts them on the command line.

Note

This is a good starting point for a submission file.

Queueing multiple jobs from the same file
HTCondor submission example that queues multiple jobs
Universe   = vanilla
Executable = myprogram
Arguments  = -data myfile.dat
Output     = out.log
Error      = err.log
Log        = condor.log

Request_Cpus   = 2
Request_Memory = 1GB

Initialdir = data_file_directory
Queue

Initialdir = more_data_files_here
Queue
  • There are two changes here:

    1. Initialdir sets the working directory in which the command will run

    2. There are now two queue entries.

HTCondor will change to the directory data_file_directory and then run the command “myprogram -data myfile.dat”. When that program is finished OR there is additional capacity available, HTCondor will start a second job in the directory “more_data_files_here”, running the same program with the same arguments. This can get out of hand quickly if you have a lot of directories of files to run, so here’s an alternative.

Queueing multiple jobs from the same file with globbing

Sometimes you might have hundreds of input files that need to be run. If your data is structured in individual directories, HTCondor can automate this process for you by “globbing” all matching entries in a directory.

Imagine you have a directory named data, and inside data you have the directories 1, 2, 3, …, 999. It would be a lot of work to create a submission script that listed all of these directories to run your program in! Here we can use the matching keyword and a variable to find and represent each possible entry.

HTCondor submission example that queues multiple jobs via globbing
Universe   = vanilla
Executable = myprogram
Arguments  = -data myfile.dat
Output     = out.$(Process)
Error      = err.$(Process)
Log        = condor.log

Request_Cpus   = 2
Request_Memory = 1GB

Initialdir = $(dirname)
Queue dirname matching dirs data/*
  • There are two changes here again:

    1. All directories under the data directory will be submitted as jobs.

    2. Each job will get its own output and error file so you know which job had which results.

For each directory HTCondor finds under the data directory it will submit a new job to the scheduler. This job will change to that directory, and run the equivalent of the following.

The first job:

cd ~/data/1 && myprogram -data myfile.dat >out.0 2>err.0

The second job:

cd ~/data/2 && myprogram -data myfile.dat >out.1 2>err.1

Note that we create the output and error log files under each directory, which can let us see the results easily.
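Before submitting, it can be worth double checking which directories the data/* pattern will pick up. A quick look from the shell is enough:

ls -d data/*

Each directory listed corresponds to one job that will be queued.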

Submitting to the parallel universe

As jobs become more and more complex and require additional compute resources, the idea is to run your job on multiple machines. Utilizing the Parallel universe in HTCondor will allow you to schedule your jobs across machines, though it is up to the application to decide how it should communicate. Luckily, the most common method is MPI, which is fully supported in our environment.

Basic OpenMPI example

HTCondor leverages a helper script to invoke your MPI program and set up the background communication. Depending on the software, this often means you simply need to very slightly adjust your submission script to reference this helper script.

Most of the work with OpenMPI isn’t done in the HTCondor submission file, but rather in the helper script. HTCondor provides a great example of such a file with openmpiscript.

This openmpiscript will need to be modified to fit your needs, and will not work correctly as is. At the very least, you will need to modify the MPDIR variable to point to the version of MPI that your program is linked against. If you used the system default then it would be: MPDIR=/usr
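If you’re not sure which MPI installation your program was built against, a couple of quick checks from the shell can help narrow it down (actual_mpi_job stands in for your own binary, as in the example below):

ldd actual_mpi_job | grep -i mpi
which mpirun

The first shows which MPI libraries the binary is linked against, and the second shows where the mpirun in your PATH lives, which suggests a value for MPDIR.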

The openmpiscript provides an example of a submission script that can be used, but it also specifies things that are not needed in the CSE environment, such as the file transfers. Here is an even simpler one instead:

Basic HTCondor OpenMPI submission example
Universe   = parallel
Executable = openmpiscript
Arguments  = actual_mpi_job arg1 arg2 arg3
Getenv     = false

Output     = out.$(NODE)
Error      = err.$(NODE)
Log        = condor.log

Request_Cpus   = 1
Request_Memory = 1GB
Machine_Count = 4

queue

Removing jobs

Sometimes jobs don’t run as intended, or you realize you’ve made a mistake in your submission script and need to fix it. While there are ways to modify an existing job, it’s often easier just to remove the job, correct it, and resubmit. This also ensures your template is correct for future submissions.

To remove a job you can use the condor_rm command. It can be used in many forms, such as removing all jobs for yourself, removing all jobs in a batch, or removing a single job from a batch.

Perhaps the easiest is the command to remove all jobs for yourself:

condor_rm myclusterusername

Of course, you’ll need to replace myclusterusername with the actual username that you use to login to the cluster.

If you have multiple batches of jobs and just want to remove a certain batch, you can look up the batch number with the condor_q command mentioned above. The “JOB_IDS” column lists the batch ID as well as the individual jobs in that batch. The batch ID is the number before the ‘.’ character, while the full number identifies an individual job. To remove the batch, just replace myclusterusername with the batch ID.

condor_rm ###

And of course, replace ### with your batch ID.
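To remove a single job out of a batch instead, give the full job number, which is the batch ID, a dot, and the process number (the IDs here are made up):

condor_rm 1234.2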

Don’t worry about someone else removing your jobs; they don’t have permission to do so. The only people who can remove your jobs are you and the system administrators.

Cluster node status

From time to time you may want to get an estimate of what kind of resources are available or in use in the cluster at any given time. You can use the condor_status command for this.
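Running it plain lists every execution slot and its current state. The -total option (a standard condor_status flag, though the output formatting varies between versions) prints just the summary totals:

condor_status
condor_status -total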

Additional resources

Man pages

In Unix and Unix-like environments you can read the manual pages for nearly everything with the man command. Simply type man followed by the command you would like to read the manual page for. So for condor_q you would type:

man condor_q

This will load the manual page for condor_q(1). You can use the space bar to scroll down a page at a time, as well as the up and down arrow keys to scroll a line at a time.

HTCondor Manual

https://htcondor.readthedocs.io/en/23.0/

Sometimes it can be nice to directly reference the user’s manual for HTCondor. There are many excellent examples of submission scripts and detailed explanations of how things work with HTCondor.

It’s important to look at the correct version of the documentation. The CSaW cluster environment runs the stable release but may lag behind a release or two. Things described in the latest release may not apply to our environment.

Cluster F.A.Q.

The Cluster F.A.Q. has solutions to many common questions that you may encounter.

Direct Support

You can always reach out to support to get direct help.

Please visit the Support page under the Contact section to the left.