Cluster F.A.Q.
How
How do I see my job status?
The command condor_q will show you the status of your jobs in the queue.
$ condor_q
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:30341> @ 05/21/20 17:02:01
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
USER ID: 55 5/21 17:01 _ 4 96 100 55.0-99
Total for query: 100 jobs; 0 completed, 0 removed, 96 idle, 4 running, 0 held, 0 suspended
Total for USER: 100 jobs; 0 completed, 0 removed, 96 idle, 4 running, 0 held, 0 suspended
Total for all users: 100 jobs; 0 completed, 0 removed, 96 idle, 4 running, 0 held, 0 suspended
Starting with HTCondor 9.0, there is also the condor_watch_q command, which updates your display in real time with information about your jobs in the queue.
USER@cse-head:~/gromacs_example$ condor_watch_q
BATCH IDLE RUN DONE TOTAL JOB_IDS
ID: 1056 - 1 - 1 1056.0 [=============================================================================]
[=============================================================================]
Total: 1 jobs; 1 running
Updated at 2022-04-22 17:00:41
Input ^C to exit
The message at the bottom “Input ^C to exit” means that you press Ctrl+C to exit. The ^ character is a way to indicate Ctrl in terminal lingo.
How do I see the status of all jobs, not just my own?
The -allusers flag can be passed to condor_q to see information about all users’ jobs, not just your own. This may help you understand how busy the queue is.
$ condor_q -allusers
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:30341> @ 05/26/20 10:14:31
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
USER ID: 1773 1/27 14:55 _ 1 _ 1 1773.0
USER ID: 1776 1/27 15:09 _ 1 _ 1 1776.0
USER ID: 1778 1/27 15:15 _ 1 _ 1 1778.0
USER ID: 1932 2/11 16:40 1 2 _ 3 1932.1-2
USER ID: 1933 2/11 16:42 7 8 _ 15 1933.2-14
USER ID: 1936 2/11 17:23 _ 2 _ 2 1936.0-1
USER ID: 2015 2/24 14:06 9 1 _ 10 2015.9
USER ID: 2042 2/27 05:43 92 20 _ 112 2042.13-109
USER ID: 2049 2/28 13:50 _ 30 _ 30 2049.0-29
USER ID: 2287 5/2 04:28 49 11 _ 60 2287.0-59
USER ID: 2288 5/2 05:29 13 1 _ 14 2288.6
USER ID: 2328 5/4 12:58 10 4 _ 14 2328.1-6
USER ID: 2376 5/7 13:19 _ 1 _ 1 2376.0
USER ID: 2437 5/21 13:05 _ 1 _ 1 2437.0
USER ID: 2506 5/25 04:58 _ 1 _ 1 2506.0
USER ID: 2507 5/25 05:22 _ 22 _ 22 2507.0-21
USER ID: 2508 5/25 05:29 _ 1 _ 1 2508.0
USER ID: 2509 5/25 05:32 _ 2 _ 2 2509.0-1
USER ID: 2510 5/25 05:38 _ 1 _ 1 2510.0
Total for query: 111 jobs; 0 completed, 0 removed, 0 idle, 111 running, 0 held, 0 suspended
Total for all users: 111 jobs; 0 completed, 0 removed, 0 idle, 111 running, 0 held, 0 suspended
How do I remove my job?
If you would like to remove your job you can use the condor_rm command.
condor_rm ###
Where ### is the job number to be removed.
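As a minimal sketch, the helper below wraps condor_rm so that only arguments shaped like a cluster ID (55) or a cluster.process ID (55.3) are passed through; anything else is skipped with a warning. The function name remove_jobs is hypothetical, and the job IDs are borrowed from the condor_q example earlier on this page.

```shell
#!/bin/sh
# remove_jobs: hypothetical wrapper around condor_rm.
# Accepts IDs of the form "55" (a whole cluster) or "55.3" (a single job).
remove_jobs() {
    for id in "$@"; do
        case $id in
            # Reject empty strings, non-numeric characters, more than one dot,
            # and a leading or trailing dot.
            ''|*[!0-9.]*|*.*.*|.*|*.)
                echo "skipping invalid job ID: $id" >&2
                ;;
            *)
                condor_rm "$id"
                ;;
        esac
    done
}

# Example (run on the head node):
#   remove_jobs 55      # every job in cluster 55
#   remove_jobs 55.3    # only job 55.3
```

Passing a bare cluster ID such as 55 removes every job in that cluster, while 55.3 removes only that one job.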
How do I see the CPU resources in use and available?
Unfortunately there’s not a direct HTCondor command to see this. It can however be achieved in two parts:
-
Get the number of cores (resources) in use:
You can use the condor_userprio -all -allusers command to get this. The -all and -allusers flags are both required to see the full output; otherwise output from some users is hidden (even if they’re actively consuming resources).
$ condor_userprio -all -allusers
Last Priority Update: 6/5 14:38
Effective Real Priority Res Total Usage Usage Last Time Since
User Name Priority Priority Factor In Use (wghted-hrs) Start Time Usage Time Last Usage
---------------------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
<SNIPPED TO SAVE SPACE IN EXAMPLE>
USER 500.00 0.50 1000.00 0 0.00 18418+21:3
USER 500.00 0.50 1000.00 0 0.00 18418+21:3
USER 500.00 0.50 1000.00 0 0.00 18418+21:3
USER 500.00 0.50 1000.00 0 0.00 18418+21:3
USER 28349.49 28.35 1000.00 26 181207.45 10/29/2019 16:18 6/05/2020 14:38 <now>
USER 30633.99 30.63 1000.00 33 50297.62 10/17/2019 21:00 6/05/2020 14:38 <now>
USER 63582.67 63.58 1000.00 63 308335.69 10/14/2019 15:50 6/05/2020 14:38 <now>
USER 115569.41 115.57 1000.00 117 200068.20 10/21/2019 16:26 6/05/2020 14:38 <now>
---------------------------------------------- ------------ -------- --------- ------ ------------ ---------------- ---------------- ----------
Number of users: 63 240 940681.88
The number you’re looking for is in the “Res In Use” column – This shows 240 cores in use, as well as how much each user is consuming.
-
Next we’ll need to find the total number of cores across all the nodes. To do this, we will use condor_status with the options -af Cpus. condor_status prints one value per slot, so we pipe the output to AWK and let it do the math for us:
$ condor_status -af Cpus | awk '{cpus+=$1} END {print "Total CPUs: ", cpus}'
Total CPUs: 656
Note
If this seems complicated, the answer as of 03/16/2021 is 656 cores.
Putting the two results together, the example shows 240 of 656 cores in use.
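The arithmetic can be wrapped in a small helper. This is only a sketch: the function names sum_first_field and report_cpu_usage are hypothetical, and the number of cores in use is passed in by hand (for example, the “Res In Use” total read from condor_userprio) rather than parsed automatically.

```shell
#!/bin/sh
# Sum the first whitespace-separated field of every input line.
sum_first_field() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Print "<used> of <total> cores in use (<pct>%)".
#   $1 = cores in use, e.g. the "Res In Use" total from condor_userprio
report_cpu_usage() {
    used=$1
    # One Cpus value per slot; add them up to get the pool total.
    total=$(condor_status -af Cpus | sum_first_field)
    awk -v u="$used" -v t="$total" 'BEGIN {
        if (t == 0) { print "no CPUs reported"; exit 1 }
        printf "%d of %d cores in use (%.1f%%)\n", u, t, 100 * u / t
    }'
}

# Example (run on the head node):
#   report_cpu_usage 240
```

With the figures from the example above (240 in use, 656 total), this works out to roughly 36.6% of the cores.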
How do I see the GPU resources in use and available?
Much like counting CPUs, counting GPUs is a two-part operation.
To get the number of GPUs in use you can use this line:
user@csci-head:~$ condor_q -constraint 'jobstatus==2' -af requestgpus | awk '{gpus+=$1} END {print "GPUs in use: ", gpus}'
GPUs in use: 25
This will print the number of requested GPUs from all jobs that are currently running.
To get the total number of GPUs, ask condor_status for the GPUs attribute of every slot and sum the values with AWK:
user@csci-head:~$ condor_status -af GPUs | awk '{gpus+=$1} END {print "Total GPUs: " gpus}'
Total GPUs: 27
How do I schedule a job to use multiple CPUs?
You can set the Request_Cpus variable in your submission file. There may be additional settings required for your software, though: some research packages require you to tell them how many CPUs (cores) they should use, whereas this setting only tells HTCondor to reserve that many cores on a node for you. It will also set the environment variable OMP_NUM_THREADS for your program to reference. If your program leverages OpenMP, you’re all set! If it leverages something else, you can read that variable into your program and adjust accordingly.
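The sketch below shows one way this could fit together. It writes a small wrapper script that reads OMP_NUM_THREADS (falling back to 1) and a submit file that requests 4 cores. The file names run.sh and multicore.job, the program name my_program, and its --threads flag are all hypothetical; only request_cpus and OMP_NUM_THREADS come from the text above.

```shell
#!/bin/sh
# Write a hypothetical wrapper script that the job will execute.
cat > run.sh <<'EOF'
#!/bin/sh
# Fall back to a single thread if OMP_NUM_THREADS is not set.
threads="${OMP_NUM_THREADS:-1}"
echo "Running with $threads thread(s)"
# Replace the next line with your own program; --threads is a made-up flag.
# ./my_program --threads "$threads"
EOF
chmod +x run.sh

# Write a submit file that asks HTCondor to reserve 4 cores on one node.
cat > multicore.job <<'EOF'
executable   = run.sh
request_cpus = 4
output       = out
error        = err
log          = condor.log
queue
EOF

# Submit it from the head node with:
#   condor_submit multicore.job
```

Programs that do not use OpenMP can take the same value from the wrapper, as the commented-out line suggests.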
How do I schedule a job to use a GPU?
Much in the way you requested multiple CPUs, you can set the Request_Gpus variable to 1.
How do I get a shell on the node running my job so I can check on it?
HTCondor has a command for this: condor_ssh_to_job. You just need to find and specify the job ID.
$ condor_q
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:9618?... @ 07/27/20 11:56:26
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
USER ID: 2993 7/27 11:56 _ 1 _ 1 2993.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for USER: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 687 jobs; 0 completed, 0 removed, 450 idle, 237 running, 0 held, 0 suspended
$ condor_ssh_to_job 2993.0
Welcome to slot1_21@c-0-4.cluster.cs.wwu.edu!
Your condor job is running with pid(s) 32510.
$ ls
ckpt_dash_de9076167dcaa02-40000-5cbc98d069370a_files count.sh dmtcp.2993.0.log dmtcp-USER@c-0-4.cluster.cs.wwu.edu out
condor.log dmtcp.2958.0.log dmtcp-USER@c-0-2.cluster.cs.wwu.edu err test_cluster.job
$
How do I get an interactive shell on an execute node so I can test before submitting a job?
HTCondor’s condor_submit has an -interactive flag (or -i for short). Additionally, you can pass submission variables as options to ensure that they are passed to the scheduler when looking for a node for your interactive job. This can help you get a large chunk of resources quickly to compile or test code before submitting your job.
There are limitations on interactive sessions to ensure that resources are not claimed forever and remain available to others. A warning is displayed at the beginning of an interactive session to remind you of this.
$ condor_submit -interactive request_cpus=4 request_memory=2048
Submitting job(s).
1 job(s) submitted to cluster 1035.
Waiting for job to start...
Welcome to slot1_3@c-X-X.cluster.cs.wwu.edu!
You will be logged out after 3600 seconds of inactivity.
**********
WARNING!
**********
This HTCondor session will end after 24 hours, or 1 hour of idle time.
To run for more than 24 hours, submit the job without "-interactive".
user@c-0-2:~$
Note
Your environment will be reset when the interactive job starts, which matches what happens when a normal job runs via HTCondor. However, there are times when you want to preserve your environment because you need access to environment variables like $HOME, $USER, etc. A quick way to carry your environment over to the remote side is to use the getenv variable and set it either to True or to a list of the variables you want to keep (if you know them).
condor_submit -interactive request_cpus=4 request_memory=2048 getenv=true
This will keep your current environment variables set on the execute node so you can quickly do things like cd ~/.
Hint
If you have an ssh-agent loaded and set up in your environment, or if you have agent forwarding enabled for your session and have a key loaded, your connection may fail with an error similar to this:
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535
This can be quickly worked around by unsetting your SSH_AUTH_SOCK environment variable for the command:
$ SSH_AUTH_SOCK= condor_submit -interactive request_cpus=4 request_memory=2048
How do I build my software against CUDA?
CUDA is not installed on the submission nodes, but is installed on the GPU nodes. To get an interactive shell on a GPU node without requesting a GPU you can add an extra variable when submitting your interactive job.
Simply add a new variable CUDABuild and set it to True. If you don’t need to specify any extra settings, you can do this as a single line from your shell:
condor_submit -interactive MY.CUDABuild=True
Note
The MY. part is only needed when doing this from the command line. If you’re submitting this as part of a file, it’s not required.
Once you have your interactive shell, you will need to add CUDA and/or CUDNN to your PATH and LD_LIBRARY_PATH. To make this easier, you can source the provided activate.sh script in each CUDNN directory.
Look through the contents of /usr/local/cuda* to see which version of CUDA you need.
$ ls -1d /usr/local/cuda*
/usr/local/cuda
/usr/local/cuda-10.1
/usr/local/cuda-10.2
/usr/local/cuda-11.0
This shows that CUDA 10.1, 10.2, and 11.0 are available.
Then inside that CUDA-#.# directory look to see which version of CUDNN you need. If you don’t need CUDNN and just need CUDA, you can safely pick the newest CUDNN version and ignore the extra things that it provides.
$ ls -1d /usr/local/cuda*/cudnn-*
/usr/local/cuda-10.1/cudnn-7.6
/usr/local/cuda-10.1/cudnn-8.0
/usr/local/cuda-10.2/cudnn-7.6
/usr/local/cuda-10.2/cudnn-8.0
/usr/local/cuda-11.0/cudnn-8.0
/usr/local/cuda/cudnn-8.0
Here you can see that both CUDNN 7.6 and 8.0 are available for the older releases of CUDA, while the newest, CUDA 11.0, only has support for CUDNN 8.0.
To load the activate.sh script you can use the . command in your shell. So if you wanted to load CUDA-10.2 with CUDNN-7.6 you could use the following:
$ printenv PATH LD_LIBRARY_PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/bin:/usr/bin:/usr/central/bin:/usr/central/cluster/bin
$ . /usr/local/cuda-10.2/cudnn-7.6/activate.sh
$ printenv PATH LD_LIBRARY_PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/bin:/usr/bin:/usr/central/bin:/usr/central/cluster/bin:/usr/local/cuda-10.2/bin
:/usr/local/cuda-10.2/lib64:/usr/local/cuda-10.2/cudnn-7.6/lib64
Alternatively you can have HTCondor set your environment variables manually in the submission script. Take a look at the HTCondor documentation on condor_submit in the “environment” section for that.
Note
Be careful with this option. If you incorrectly set your PATH environment variable you may not be able to run your submitted job.
In the example above, we first print PATH and LD_LIBRARY_PATH to see their values before activation, then load the new additions with . /usr/local/cuda-10.2/cudnn-7.6/activate.sh, and finally print both variables again to check that they’re loaded. The second output differs from the first: /usr/local/cuda-10.2/bin has been added to PATH, and LD_LIBRARY_PATH now contains both the CUDA-10.2 and CUDNN-7.6 library directories, so you can use the libraries.
How do I run my software against a particular version of CUDA / CUDNN?
You will need to modify your environment to include the needed paths for CUDA / CUDNN libraries. See the question above about building against CUDA for an easy to use script to source in your startup script.
Alternatively you can have HTCondor set your environment variables manually in the submission script. Take a look at the HTCondor documentation on condor_submit in the “environment” section for that.
Note
Be careful with this option. If you incorrectly set your PATH environment variable you may not be able to run your submitted job.
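For the startup-script approach, a job wrapper might source activate.sh before launching the real program. The sketch below assumes the CUDA-10.2 / CUDNN-7.6 directory from the build example; the function name activate_cuda and the program name my_gpu_program are hypothetical.

```shell
#!/bin/sh
# activate_cuda: source the activate.sh script from a CUDNN directory.
#   $1 = CUDNN directory (defaults to the CUDA-10.2 / CUDNN-7.6 example path)
activate_cuda() {
    dir=${1:-/usr/local/cuda-10.2/cudnn-7.6}
    if [ ! -r "$dir/activate.sh" ]; then
        echo "activate.sh not found in $dir" >&2
        return 1
    fi
    # The "." command runs the script in the current shell, so the PATH and
    # LD_LIBRARY_PATH changes it makes persist for the rest of this script.
    . "$dir/activate.sh"
}

# Example use inside a job's startup script:
#   activate_cuda /usr/local/cuda-10.2/cudnn-7.6 || exit 1
#   ./my_gpu_program
```

Checking the return value before starting the program avoids running a job against the wrong libraries when the directory is missing on a node.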
How do I submit a job for running on the CSCI desktops?
The first thing to be aware of when planning to run your job on the idle desktops is their restrictions:
A maximum of 6 cores is made available on each machine, and memory is limited to less than the full amount (currently 2GB is reserved for non-HTCondor use). Only jobs in the vanilla universe can be run on the desktop machines, and the cluster filesystem is not available on the desktops.
The final restriction is that if the system sees any interactive use, it will suspend your job for up to 10 minutes while someone interacts with it. If your job remains suspended for the full 10 minutes and someone is still interacting with the system, the job will be evicted and put back into the queue.
To best deal with the final restriction you should use one of the following:
Small jobs that don’t take very long to run so if work is lost, it’s not days of work.
Checkpointing. Ideally you would let HTCondor checkpoint your work when evicting your job so it can be resumed later.
With these restrictions in mind, the desktops have proven incredibly useful for things like batch processing ~30,000 jobs that each run for about an hour.
To have a job run on an idle CSCI desktop, you need to send an extra variable with your job.
+CSCI_GrpDesktop = True
It doesn’t matter where you submit the job from: if that variable is not set to true, the desktop systems will not accept your job.
From CSE or CSCI head node
When submitting from either the CSE or CSCI head node you need to have your job “flocked” over to the desktop pool. This can be accomplished by adding another variable:
+WantFlocking = True
This will allow your job to leave your pool, and travel to a different pool. To ensure that your job only flocks to the desktop pool, you can add an additional requirement to it:
Requirements = (TARGET.PoolName == "CSCI Lab Cluster")
This will ensure that only devices in the “CSCI Lab Cluster” will run your job. If this is not set, and WantFlocking is set to true, your job will simply float to some other pool, possibly not the one you intended it to run in!
Finally, be aware that the CSCI desktop machines can not read from the cluster filesystem, so you will need to transfer all files manually. See https://htcondor.readthedocs.io/en/23.0/users-manual/file-transfer.html for additional information about file transfers.
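Putting the pieces above together, a submit file for flocking a job from a head node to the desktops might look like the sketch below. The three desktop-related lines come from this section; the file-transfer commands are standard condor_submit options, while the file names desktop.job, test.sh, and input.dat are placeholders for illustration.

```shell
#!/bin/sh
# Write a hypothetical submit file for running on the idle CSCI desktops.
cat > desktop.job <<'EOF'
executable              = test.sh

# Required for the desktops to accept the job.
+CSCI_GrpDesktop        = True
# Let the job leave the head node's pool, but only for the desktop pool.
+WantFlocking           = True
Requirements            = (TARGET.PoolName == "CSCI Lab Cluster")

# The desktops cannot read the cluster filesystem, so ship files explicitly.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat

output                  = out
error                   = err
log                     = condor.log
queue
EOF

# Submit it from the CSE or CSCI head node with:
#   condor_submit desktop.job
```

Because jobs on the desktops can be evicted, it is worth keeping each job short or checkpointed, as described earlier in this answer.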
From csci-lab-head as a CSCI student
There are no additional required submission variables (except for the usual CSCI_GrpDesktop) for submitting jobs to the desktops from the csci-lab-head node, as it is already in the same pool as the desktops.
The interesting thing when submitting from the csci-lab-head node is that as a CSCI student your jobs will run as your normal CSCI user, and have access to all the files that you normally have access to on the CSCI filesystem. This means that file transfers are not required when submitting this way. If you submit a job from your /research/ directory, it will run directly out of the /research/ directory as you.
From a CSCI lab system as a CSCI student
In addition to adding the CSCI_GrpDesktop variable, you will need to acquire your IDTOKEN for the csci-lab-head system. Connect to csci-lab-head via SSH and use the command condor_token_fetch to fetch your IDTOKEN. The easiest way to do this is to use it with the -token NAME option, which will store the token in ~/.condor/tokens.d/NAME, where NAME is the name you specified.
USER@cf420-07:~$ ssh -p 922 csci-lab-head
USER@csci-lab-head's password:
Welcome to the WWU Computer Science cluster environment
If you have questions or need help regarding the cluster systems or
software running on them, please contact CSaW Support at
csaw.support@wwu.edu or visit the following URL for assistance:
https://cluster.cs.wwu.edu/
Last login: Mon Sep 13 15:42:25 2021 from cf405-03.cs.wwu.edu
USER@csci-lab-head:~$ condor_token_fetch -token csci-lab-head
USER@csci-lab-head:~$ ls ~/.condor/tokens.d/
csci-lab-head
USER@csci-lab-head:~$ logout
Connection to csci-lab-head closed.
USER@cf420-07:~$ ls ~/.condor/tokens.d/
csci-lab-head
USER@cf420-07:~$ cd ~/example_submit_scripts/basic/
USER@cf420-07:~/example_submit_scripts/basic$ ls
logs test.job test.sh
USER@cf420-07:~/example_submit_scripts/basic$ condor_submit test.job
Submitting job(s).
1 job(s) submitted to cluster 2704.
How do I manually specify my accounting group?
When you submit a job to HTCondor it will automatically set your accounting group to one from a known list. However, if you’re doing research with two different research groups, you may need to specify which one you are doing the research for. This only requires adding an additional line to your HTCondor submit file:
accounting_group = research_lastname
You can copy/paste the above and adjust the lastname portion to your researcher’s last name.
Be aware that you cannot set it to an arbitrary group, only to one you belong to. This also means you need to spell your researcher’s name correctly, otherwise you will get an error when you submit your job(s).
How do I connect to the VPN?
ATUS has put together a page with information about the VPN, as well as a brief overview. The Computer Science Department’s support team has put together guides with step by step screenshots for installing the client and connecting to the VPN on Windows, macOS, and Linux.
How much disk storage space do I have?
The current quota is 5GB for your home directory (/cluster/home/$USER) and 5TB for your shared research space (/cluster/research-groups/$PI). It is suggested that you keep all of your research data and work in your shared research space to ensure that others in your group can access your work.
How do I use the scratch SSD space?
Some of the nodes in the cluster environments have scratch SSDs to help speed up processing of data. As of 03/31/2022 the following nodes have scratch SSDs installed.
- CSE Cluster Environment: g-0-3
- CSCI Cluster Environment (this is currently all of the GPU nodes): g-1-0, g-1-1, g-1-2, g-2-0, g-2-1
To make requesting a node with scratch SSD space easy, the HasScratchSSD variable was added to nodes that have them. This means you can set a new Requirement option, such as:
Requirements = HasScratchSSD
Setting the above will ensure that the node your job runs on has a scratch SSD. Please note: there may be additional requirements to run on nodes that have scratch SSDs, the above requirement only ensures that you pick from the list of nodes that have one.
Running on a node with a scratch SSD allows for two things:
1. File transfers done via HTCondor will be done to the scratch SSDs. All disk I/O from the SSD will be much faster than that of the NFS file server. The downside here is that not as much space is available compared to the file server. You should transfer only what you need to complete your work quickly.
2. Additional scratch space for researchers, outside of the job file transfers. This is a place intended for larger datasets that will be processed by multiple jobs, either simultaneously or serially, without having to transfer the data for each job. The research groups’ space is located in /scratch_ssd/research-groups/. Please see the question “What is the cleanup policy for the scratch disk space?” for information about data cleanup. Each research group needs to be manually configured for storage here so the appropriate cleanup policies can be put in place. If you are a P.I. and your research group would like space here, please reach out so that a directory can be set up for your group.
If you use method #1, then HTCondor will manage the files for you and you’ll get the speedup for free on these nodes.
However, if you have very large datasets that need to be shared amongst your jobs, then you will need to manage the data in your /scratch_ssd/research-groups/$RESEARCHER directory manually. This can be as simple as ensuring the data you need is available by copying it there in your job’s startup script before the main job runs, or it can be done manually via a shell on that node. Remember, you will need to copy your data to one exact node, then ensure your job runs on that exact node. To schedule a job to a specific host where your data has been previously staged, or to manually stage your data, please see the question “How do I run a job on a specific node?”.
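Staging data from a job’s startup script could look something like the sketch below. The function name stage_dataset and the dataset path in the usage comment are hypothetical. Note that the existence check is deliberately simple and does not protect against two jobs copying the same dataset at the same moment.

```shell
#!/bin/sh
# stage_dataset: copy a dataset into a scratch directory unless it is
# already there.
#   $1 = source file or directory (e.g. on the cluster filesystem)
#   $2 = destination directory (e.g. /scratch_ssd/research-groups/$RESEARCHER)
stage_dataset() {
    src=$1
    dest_dir=$2
    name=$(basename "$src")
    if [ -e "$dest_dir/$name" ]; then
        # A previous job already staged it; skip the copy.
        echo "already staged: $dest_dir/$name"
        return 0
    fi
    mkdir -p "$dest_dir" &&
        cp -R "$src" "$dest_dir/" &&
        echo "staged: $dest_dir/$name"
}

# Example use at the top of a job's startup script (paths are placeholders):
#   stage_dataset /cluster/research-groups/$PI/big_dataset \
#       /scratch_ssd/research-groups/$RESEARCHER || exit 1
```

Later jobs that land on the same node find the data already in place and skip straight to the main work.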
How do I run a job on a specific node?
While it is possible to ensure your job runs on one exact node, you almost never want to do this. That being said, in one of the rare circumstances you do actually need to run on one specific node, you can set the Requirements to match the Machine name.
Requirements = Machine == "c-X-X.cluster.cs.wwu.edu"
Or if you’re submitting this interactively:
condor_submit -interactive Requirements='Machine == "c-X-X.cluster.cs.wwu.edu"'
How do I use the module system?
The module system allows custom packages to be easily loaded and unloaded, without needing to understand or manually set your environment variables.
All interactions with modules use the module command, followed by some sub-command.
To list all available modules, you can use the module avail command. With the exception of the Nvidia modules, all modules are available on all execute nodes. The Nvidia modules are only installed on the nodes with Nvidia GPUs.
[USER@g-X-X ~]$ module avail
------------------------ /usr/share/Modules/modulefiles ------------------------
dot module-git module-info modules null use.own
---------------------------- /usr/share/modulefiles ----------------------------
mpi/openmpi-x86_64 pmi/pmix-x86_64
----------------------- /opt/nvidia/hpc_sdk/modulefiles ------------------------
nvhpc-byo-compiler/22.5 nvhpc-hpcx/23.1 nvhpc-nompi/23.1 nvhpc/23.1
nvhpc-byo-compiler/22.11 nvhpc-nompi/22.5 nvhpc/22.5
nvhpc-byo-compiler/23.1 nvhpc-nompi/22.11 nvhpc/22.11
Once you’ve found an available module that you would like to use, you load it with the module load sub-command. The example below first shows that no environment variables pertaining to MPI are set, then loads the OpenMPI module and checks again, verifying that the module has set all the relevant MPI environment variables.
[USER@g-X-X ~]$ printenv | grep MPI
[USER@g-X-X ~]$ module load mpi/openmpi-x86_64
[USER@g-X-X ~]$ printenv | grep MPI
MPI_PYTHON3_SITEARCH=/usr/lib64/python3.6/site-packages/openmpi
MPI_FORTRAN_MOD_DIR=/usr/lib64/gfortran/modules/openmpi
MPI_COMPILER=openmpi-x86_64
MPI_SUFFIX=_openmpi
MPI_INCLUDE=/usr/include/openmpi-x86_64
MPI_HOME=/usr/lib64/openmpi
MPI_SYSCONFIG=/etc/openmpi-x86_64
MPI_BIN=/usr/lib64/openmpi/bin
MPI_LIB=/usr/lib64/openmpi/lib
MPI_MAN=/usr/share/man/openmpi-x86_64
Finally, if you have loaded a module but no longer wish to use it, you can also unload the module as easily as it was loaded. This time the sub-command is module unload. Below the example shows that the OpenMPI module has been unloaded, and that all relevant environment variables have been unset.
[USER@g-X-X ~]$ module unload mpi/openmpi-x86_64
[USER@g-X-X ~]$ printenv | grep MPI
How do I use other versions of GCC?
If the default GCC compiler suite does not meet your needs, or if you’re just interested in trying the latest optimizations available in a newer compiler, there are additional versions available. To get the list of available versions, you can use the scl list-collections command, which lists the installed software collections.
[USER@c-X-X ~]$ scl list-collections
gcc-toolset-12
This shows that the only currently available software collection is GCC 12 and its related tools. To use these tools you can “run” or “enable” them for a given process.
To get access to a given software collection you will want to enable it for a new shell session. This can be done with the scl enable <collection> bash command, which starts a new Bash shell that uses the software collection. The example below shows the GCC version before enabling the collection, and the version inside the new shell session.
[USER@c-X-X ~]$ gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[USER@c-X-X ~]$ scl enable gcc-toolset-12 bash
bash-4.4$ gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
bash-4.4$
Note
The shell prompt changes to “bash” and not your normal prompt indicating that it has activated. If you’re finished using the tools, you can type exit to leave.
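If you only need the newer compiler for a single command (for example in a build script), scl enable can also run that command directly instead of starting a shell. The wrapper below is a sketch: the function name run_with_gcc12 is hypothetical, and hello.c is a placeholder source file.

```shell
#!/bin/sh
# run_with_gcc12: run one command with the gcc-toolset-12 collection enabled.
# The arguments are joined into a single command string, so arguments that
# themselves contain spaces would need extra quoting.
run_with_gcc12() {
    scl enable gcc-toolset-12 "$*"
}

# Examples (run on a node where the collection is installed):
#   run_with_gcc12 gcc --version
#   run_with_gcc12 gcc -O2 -o hello hello.c
```

The collection is only active for that one command; your current shell keeps using the default GCC afterwards.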
Why
Why is my job in a “hold” state?
A job is placed on hold either because something went wrong with your job or because an administrator put it there.
You can check the hold message with the following command:
condor_q -hold ###
Where ### is replaced with your job number.
If you modify your job, you can release it with the following command:
condor_release ###
Where ### is replaced with your job number. Be aware that many people prefer to simply edit their submission file and resubmit. If you do this please be sure to remove your held job in order to help clean up the queue.
If an administrator has put your job in a hold state, please contact the administrator about getting it released.
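When the hold reason can be fixed by changing a single job attribute, the check, edit, and release steps can be chained. The sketch below uses condor_qedit, which changes an attribute of a job already in the queue. The function name edit_and_release is hypothetical, and the job number, attribute, and value in the usage comment are only an illustration.

```shell
#!/bin/sh
# edit_and_release: show why a job is held, change one attribute, release it.
#   $1 = job ID, $2 = job attribute name, $3 = new value
# Each step only runs if the previous one succeeded.
# Note: string-valued attributes need embedded quotes, e.g. '"some text"'.
edit_and_release() {
    job=$1 attr=$2 value=$3
    condor_q -hold "$job" &&
        condor_qedit "$job" "$attr" "$value" &&
        condor_release "$job"
}

# Example (run on the head node), assuming the job was held for
# requesting too little memory:
#   edit_and_release 2995 RequestMemory 2048
```

If the fix involves more than one attribute or a change to the executable itself, editing the submission file and resubmitting is usually simpler.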
Why is my job still idle?
Jobs will remain idle until the scheduler and negotiator can find a place to run them. Sometimes this happens very quickly (almost immediately!), but other times a job can sit idle for minutes at a time even when resources are available. If you know that there are enough resources available to run your job, it can be worth analyzing it to determine why it’s not being selected to run.
Starting with the easiest option:
$ condor_q -analyze 2995
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:9618?...
The Requirements expression for job 2995.000 is
((TARGET.PoolName == "NotARealPool")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.HasFileTransfer)
No successful match recorded.
Last failed match: Mon Jul 27 15:03:29 2020
Reason for last match failure: no match found
2995.000: Run analysis summary ignoring user priority. Of 10 machines,
10 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
WARNING: Be advised:
No machines matched the jobs's constraints
condor_q -analyze will give you a quick overview of your job. In the example above it has warned us that we have no matches. This at least tells us that we would be waiting forever, since no machine can accept our job.
The next level of analysis is condor_q -better-analyze.
$ condor_q -better-analyze 2995
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:9618?...
The Requirements expression for job 2995.000 is
((TARGET.PoolName == "NotARealPool")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.HasFileTransfer)
Job 2995.000 defines the following attributes:
DiskUsage = 1
RequestDisk = DiskUsage
RequestMemory = 100
The Requirements expression for job 2995.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 0 TARGET.PoolName == "NotARealPool"
No successful match recorded.
Last failed match: Mon Jul 27 15:11:30 2020
Reason for last match failure: no match found
2995.000: Run analysis summary ignoring user priority. Of 10 machines,
10 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
WARNING: Be advised:
No machines matched the jobs's constraints
Here we can see what went wrong. The Requirements section calls out that we matched 0 slots because we set a requirement that the PoolName is “NotARealPool”. Of course, this pool cannot be found, and the scheduler can’t proceed.
But what if we want a better analysis, and want to see what each individual node thinks of our job?
condor_q -better-analyze:reverse will do a “reverse” analysis, which analyzes the job from the perspective of each node rather than from the perspective of the scheduler.
$ condor_q -better-analyze:reverse 2995
-- Schedd: cse-head.cluster.cs.wwu.edu : <140.160.143.131:9618?...
-- Slot: slot1@c-0-0.cluster.cs.wwu.edu : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
(START) && (IsValidCheckpointPlatform) &&
(WithinResourceLimits)
START is
(CUDABuild isnt true)
IsValidCheckpointPlatform is
(TARGET.JobUniverse isnt 1 ||
((MY.CheckpointPlatform isnt undefined) &&
((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) ||
(TARGET.NumCkpts == 0))))
WithinResourceLimits is
(ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
TARGET._condor_RequestCpus <= MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 &&
TARGET._condor_RequestMemory <= MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0 &&
TARGET.RequestMemory <= MY.Memory,false)) &&
ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
TARGET._condor_RequestDisk <= MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk,false)))
This slot defines the following attributes:
CheckpointPlatform = "LINUX X86_64 4.15.0-64-generic normal N/A avx ssse3 sse4_1 sse4_2"
Cpus = 24
Disk = 881227232
Memory = 64396
Job 2995.0 has the following attributes:
TARGET.JobUniverse = 5
TARGET.NumCkpts = 0
TARGET.RequestCpus = 1
TARGET.RequestDisk = 1
TARGET.RequestMemory = 100
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[0] 1 CUDABuild isnt true
[1] 1 IsValidCheckpointPlatform
[3] 1 WithinResourceLimits
slot1@c-0-0.cluster.cs.wwu.edu: Run analysis summary of 1 jobs.
0 (0.00 %) match both slot and job requirements.
1 match the requirements of this slot.
0 have job requirements that match this slot.
<SNIPPED TO SAVE SPACE IN EXAMPLE>
This is a lot of output, and you will get output for each individual slot. The output was trimmed here, but expect to see several screens of output. The output in the above example shows that the slot was fine accepting our job, but our job did not accept the slot.
Why does condor_userprio show my user name with a group prefix?
HTCondor is configured to automatically assign groups to all users. Groups are currently used to prioritize jobs for research groups that have generously purchased hardware for the cluster. These nodes will accept jobs from any user, but will prefer jobs from the research group that purchased them. condor_userprio shows the group prefix to confirm that your job(s) are running as both you and your group.
The current groups are research_*, academic, and unaffiliated. If you get an unaffiliated prefix it means that the mapping needs to be adjusted for your account. Please get in touch so it can be fixed.
If you belong to multiple research groups you can override the default setting; see the answer to the question “How do I manually specify my accounting group?” on this page.
Why is there a quota / limit on disk space?
The CSaW clusters are a shared environment, and we want to ensure that resources are available to all who need them. For this reason, a quota is in place so that there is enough storage for each PI and student. If you are a PI and need additional storage, please reach out about pricing for additional drives to expand your storage quota.
Why does VS Code Remote - SSH plugin keep disconnecting me?
The “VS Code Remote - SSH” plugin is not allowed to connect to the CSaW head nodes because it downloads a NodeJS binary and various other plugins onto the head node, then runs all of its processing there. That processing can occasionally run away and consume all the cores and all the memory until the process crashes. During this time the system becomes unresponsive, causing issues for others in the shared environment.
Because of the issues it causes, “VS Code Remote - SSH” is terminated as soon as a connection is detected from it. There are multiple editors available locally in the cluster, or you can use other editors to connect and edit remotely that don’t abuse the remote host.
What
What is the maximum amount of resources I can request?
CSE cluster
As of 03/16/2021 the largest node has 128 threads, but it will only accept jobs in the parallel universe. The next largest node has 32 threads, so a vanilla universe job can request up to 32. It will most likely take a very long time to get an entire node to yourself without speaking to an administrator first, as most nodes are largely consumed at all times. There are a total of 651 threads, so a parallel universe job can request 24 cpus on 18 different machines, 32 cpus on 4 machines, and 128 threads on 1 machine.
CSCI cluster
As of 10/27/2020 the largest CPU node has 64 threads, so a vanilla universe job can request up to 64. There are a total of 512 threads on the CPU-only nodes, so a parallel universe job can request 64 cpus on 8 different machines.
The GPU nodes in the CSCI cluster are a little different: you must request a GPU to get scheduled there. The newer 8-GPU nodes have 8 threads available per GPU. The older 4-GPU nodes are set up as dynamic slots, so you can carve up the 64 threads as needed, though please be respectful of others using the node. Requesting 1 GPU and 64 cpus is a good way to get a nastygram from an administrator (unless you previously got the go-ahead).
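As a rough sketch of a request that matches the per-GPU allotment on the newer 8-GPU nodes, the following writes a submit file asking for one GPU and eight CPU threads. The file name gpu_example.sub and the executable train.sh are hypothetical placeholders; request_gpus and request_cpus are standard submit commands, but check the values against your own job's needs.

```shell
# Write a hypothetical submit file; the file and executable names are placeholders.
cat > gpu_example.sub <<'EOF'
# train.sh stands in for your own program
universe     = vanilla
executable   = train.sh
# One GPU plus the 8 threads that accompany each GPU on the newer nodes
request_gpus = 1
request_cpus = 8
queue
EOF

# Show the resource requests that were written.
# To submit from a head node: condor_submit gpu_example.sub
grep '^request_' gpu_example.sub
```

Comments in a submit file go on their own lines, as shown, rather than after a value.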
CSCI desktops
The desktop systems allow up to 90% of the cpu threads and up to $MEMORY - 2GB of memory to be allocated to HTCondor. Currently (as of 08/30/2021) this translates to 7 cpu threads and about 14GB of memory.
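The figures above are consistent with a desktop that has 8 threads and 16 GB of memory. That hardware is an assumption made for illustration, not something stated above, and the rounding shown is simply integer division. A minimal sketch of the arithmetic:

```shell
# Assumed desktop hardware (hypothetical values chosen to match the figures above).
threads=8
memory_gb=16

# 90% of the threads; integer division rounds 7.2 down to 7.
condor_threads=$(( threads * 90 / 100 ))
# All memory except 2 GB.
condor_memory_gb=$(( memory_gb - 2 ))

echo "$condor_threads threads, $condor_memory_gb GB"   # 7 threads, 14 GB
```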
What kind of jobs should I be running on CSCI desktops?
To best utilize the desktop systems you should use one of the following:
Small jobs that don’t take very long to run so if work is lost, it’s not days of work.
Checkpointing. Ideally you would let HTCondor checkpoint your work when evicting your job so it can be resumed later.
With these restrictions in mind, the desktops have proven incredibly useful for things like batch processing ~30,000 jobs that each run for about an hour.
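To make the checkpointing option more concrete, here is a minimal sketch of a job script that records its progress and resumes from it when restarted. The file name progress.ckpt and the five-step loop are hypothetical, and how the checkpoint file is preserved between runs depends on your submit file, so treat this as an illustration of the idea rather than a complete recipe.

```shell
#!/bin/sh
# Hypothetical worker: counts to $total, saving progress so a restarted job resumes.
state_file="progress.ckpt"   # hypothetical checkpoint file name
total=5

# Resume from the saved step if a checkpoint exists, otherwise start at 0.
step=0
if [ -f "$state_file" ]; then
    step=$(cat "$state_file")
fi

while [ "$step" -lt "$total" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...
    # Write to a temporary file and rename it, so an eviction in the middle
    # of a write cannot leave a truncated checkpoint behind.
    echo "$step" > "$state_file.tmp"
    mv "$state_file.tmp" "$state_file"
done

echo "finished at step $step"
```

Running the script a second time does no further work, because the saved step already equals the total.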
What happens if I use more resources than I request?
Long story short: it depends. If you use more resources than you requested but no one else has been allocated them on that system, you should be OK. If, however, someone else has been allocated resources on the same system and you try to take them, your job will most likely be stopped. Sometimes the limiting system fails to stop the job, in which case an administrator will most likely contact you about correcting your submission script.
What does ‘Exec format error’ mean?
You may have seen a hold message similar to the following:
Error from slotX_X@c-X-X.cluster.cs.wwu.edu: Failed to execute 'my_script.py': (errno=8: 'Exec format error')
The name of the file is insignificant; it could be any kind of file that you’re trying to run. This error refers to the UNIX file permission system and how that determines whether a file can be executed. In order for HTCondor to be able to execute a file, that file must have its execute permission bit set. The easiest way to do this is to set it for all mode groups (user, group, other). This can be done with the command chmod to “change mode” the file. In the following session you can see the permission bits (rwx, etc.) on the file before and after setting the execute bit with chmod:
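A minimal sketch of such a session, assuming the file is named my_script.py as in the hold message. The stand-in file contents and the use of GNU stat to print only the mode string are illustrative choices; ls -l shows the same bits alongside the owner, size, and date.

```shell
# Create a stand-in script (hypothetical contents) so the example is self-contained.
printf '#!/usr/bin/env python3\nprint("hello")\n' > my_script.py
chmod 644 my_script.py          # start from a known, non-executable mode

stat -c '%A %n' my_script.py    # -rw-r--r-- my_script.py
chmod a+x my_script.py          # add execute for user, group, and other
stat -c '%A %n' my_script.py    # -rwxr-xr-x my_script.py
```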
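The added ‘x’ for user, group, and other means that you, anyone in your group, and anybody else can now execute that file. If the error persists for a script, also check that its first line names an interpreter (for example #!/usr/bin/env python3), since a script without an interpreter line produces the same error.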
What is the cleanup policy for the scratch disk space?
There are multiple scratch spaces that can be used by jobs. Not all of them are managed by HTCondor, and they can fill up if you’re not careful. To help keep them available for use, the following cleanup policies are in place and managed via systemd’s tmpfiles.d(5):
- /scratch
  Available space: Varies from node to node
  Cleanup policy: Any data that is put here and hasn’t been accessed for 24 hours will be deleted.
- /scratch_memory_backed
  Available space: 2 GB
  Cleanup policy: Any data that is put here and hasn’t been accessed for 24 hours will be deleted.
- /scratch_ssd/research-groups/
  Available space: Varies from node to node
  Cleanup policy: Any data that is put here and hasn’t been accessed for seven (7) days will be deleted.
Note
Not all systems have a scratch SSD. If you want a job scheduled to a node that does have one, please see the answer to the question “How do I use the Scratch SSD space?”.
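To get a rough idea of which files the 24-hour policy would affect, you can list files by last access time. The sketch below uses a throwaway demo directory in place of your own area under /scratch, and backdates one file with GNU touch. It only approximates the cleanup rule, because it looks at access time alone, so treat its output as a hint rather than a guarantee.

```shell
# Demo directory standing in for your own directory under /scratch (hypothetical).
scratch_dir=$(mktemp -d)
touch "$scratch_dir/new.dat"
touch -d '2 days ago' "$scratch_dir/old.dat"   # GNU touch: backdate atime and mtime

# List files whose last access was more than 24 hours (1440 minutes) ago.
at_risk=$(find "$scratch_dir" -type f -amin +1440)
echo "$at_risk"
```

For the seven-day policy, the same command with -amin +10080 applies.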
What is Preemption and how do I opt into it?
Sometimes researchers need dedicated resources to help perform their research. But sometimes they’re not using those resources and want to make them available to others, with the understanding that when they need them, they can quickly get them back.
Enter preemption. A job can be preempted (or stopped and put back in the queue) if certain conditions are met. In this case a researcher who does not have dedicated access to a compute node can submit a job (A) marked for allowing preemption and have the opportunity to run on additional research nodes, with the understanding that if the node owner’s group submits a job (B), their job (A) can be stopped and put back in the queue in favor of job (B) running.
If you’re running smaller jobs, this can be a great way to get access to even more cores, and if your job is interrupted you won’t lose too much compute time. For larger jobs you can still leverage these resources, but it is suggested that you implement checkpointing in your program. How this is done is up to the developer, but for information about how HTCondor manages checkpoints, please see the HTCondor documentation for Self-Checkpointing Applications.
To opt into preemption, you can add one additional line to your submission file:
+WantPreempt = True
By default the WantPreempt setting is undefined, which evaluates to False, so your job cannot be preempted. By setting it to True you allow your job to run on hardware dedicated to other researchers who are willing to share it when they’re not using it. This currently includes access to very large core count systems (128 cores), large memory systems (512GB memory), and very fast GPUs (Nvidia T4 and A100).
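For context, a minimal sketch of where that line sits in a complete submit file. The file name preempt_example.sub, the executable my_job.sh, and the single-CPU request are hypothetical placeholders; only the +WantPreempt line comes from the text above.

```shell
# Write a hypothetical submit file that opts into preemption.
cat > preempt_example.sub <<'EOF'
# my_job.sh stands in for your own program
universe     = vanilla
executable   = my_job.sh
request_cpus = 1
# Opt in: allow this job to be preempted by the node owner's group
+WantPreempt = True
queue
EOF

# Confirm the opt-in line is present.
# To submit from a head node: condor_submit preempt_example.sub
grep -Fx '+WantPreempt = True' preempt_example.sub
```

The +WantPreempt line needs to appear before the queue statement so that it applies to the jobs being queued.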