Scheduled Maintenance & Downtime

01/02/2025 - OS Upgrade & HTCondor Upgrade

On Thursday, January 2nd from 8AM to 5PM the CSaW clusters will be offline for an upgrade to Rocky Linux 9 and HTCondor 24.0 on all nodes.

Rocky 9 will offer us access to new and improved tools: GCC 11.5 by default (up from 8.5), system Python 3.9 (up from 3.6) plus an additional Python 3.12 (up from 3.11), and Java 21 (up from 17). In addition to GCC 11.5, there are Software Collections (SCLs) for GCC 12 (12.2), 13 (13.3), and even the recently released 14 (14.2).
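
If you want to try one of the newer compilers after the upgrade, something like the following should work from a shell on an execute node; the toolset names assume the standard Rocky packaging:

    # Default compiler after the upgrade
    gcc --version

    # Start a shell that uses GCC 13 from its Software Collection, then check it
    scl enable gcc-toolset-13 bash
    gcc --version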

HTCondor 24.0 brings some wonderful new features that better isolate and protect each job running on a shared compute node, and it needs the newer capabilities of Rocky 9 to accomplish this. As cluster usage continues to reach new peaks, this is critical to ensuring that research jobs run fairly and in a protected environment.

Because of the major OS upgrade, all running jobs will be stopped and removed. Most compiled software will need to be recompiled before those jobs can run again. Researchers using containers or Conda environments will likely not need to do any additional work.

See the archived maintenance and downtime page.

Updates

08/26/2024 - HTCondor Cookbook & Apptainer Example

A new HTCondor Cookbook section has been added. There you will find small snippets of HTCondor submit files and recipes for various common tasks. Current recipes include: submitting lots of jobs, using a GPU from the CSE cluster, requesting an H100, and ensuring access to InfiniBand nodes.
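
As a flavor of the recipe style, here is a minimal sketch of the "submitting lots of jobs" pattern; the file names and resource numbers are placeholders, so see the cookbook for the canonical version:

    # many_jobs.sub - illustrative sketch only
    executable     = analyze.sh
    arguments      = input_$(Process).dat
    output         = out/job_$(Process).out
    error          = out/job_$(Process).err
    log            = many_jobs.log
    request_cpus   = 1
    request_memory = 2GB
    # Queue 100 independent jobs, numbered 0-99 via $(Process)
    queue 100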

The Singularity documentation has been partially rewritten to reflect the changeover to Apptainer. Apptainer should function largely as a drop-in replacement for CSaW users, while also allowing you to build your own containers in our environment without requiring a third-party account or resources.
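
For example, building and then running a container locally might look something like the following sketch; the file names are placeholders, and unprivileged builds depend on how the nodes are configured:

    # Build an image from a definition file, without a remote build service
    apptainer build mytool.sif mytool.def

    # Run a command from inside the resulting image
    apptainer exec mytool.sif python3 --version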

07/29/2024 - Email Notifications

Email notifications are now available in the cluster environments. With notifications enabled in your HTCondor submit file, you can receive an email when your job finishes or errors out.
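
A minimal sketch of the relevant submit file lines, with the email address as a placeholder:

    # Ask HTCondor to email you when the job completes; other values
    # are Error, Always, and Never. The address below is a placeholder.
    notification = Complete
    notify_user  = yourusername@wwu.edu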

04/03/2024 - JupyterHub

JupyterHub is now available in the cluster environments, bringing an easy-to-use web interface to the cluster’s resources. Like the rest of the cluster environment, it is only accessible from the WWU campus network (wifi, labs, etc.) or via the VPN.

01/04/2024 - HTCondor 23.0

All cluster pools have been upgraded to HTCondor 23.0, the current LTS (Long Term Stable) release. Thanks again to backwards compatibility with the version we were previously running, the upgrade could be rolled out as jobs finished and before new ones started. There aren’t many new features between our previous version and this one, since we had been running the “feature branch”, but it does provide us with continued bug fixes and security updates for the coming year.

09/22/2023 - HTCondor 10.8

All cluster pools have been upgraded to HTCondor 10.8. Thanks to backwards compatibility with the version we were previously running, the upgrade could be rolled out as jobs finished and before new ones started. This upgrade should address the crashes we were seeing in two places:

  • The scheduler crashing when parallel jobs crashed. This prevented people from seeing the state of their jobs and from submitting new jobs.

  • The individual nodes crashing due to a kernel panic. HTCondor’s resource limiting system uses Linux’s cgroups subsystem, which unfortunately triggered a kernel panic in the disk layer when jobs were writing heavily to disk. The cgroup control of block I/O (disk access) has been removed to work around this.

There is one change that might affect a small number of power users who are trying to run their jobs on specific GPUs: the new require_gpus variable that can be set at submission time. Please see the HTCondor documentation on “Jobs that Require GPUs” for an example of using it, as well as for how to query nodes to find the attributes that can be requested.
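
As a rough sketch of the idea, something like the following can go in a submit file; the attribute names and thresholds here are examples only, and the HTCondor documentation has the authoritative list:

    # In a submit file: request one GPU, but only match GPUs with these properties
    request_gpus = 1
    require_gpus = (Capability >= 8.0) && (GlobalMemoryMb >= 40000)

    # From a shell: one way to see what a node advertises about its GPUs
    condor_status -long <node-name> | grep -i gpu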

01/12/2023 - CSCI Lab Scheduler Restored

The csci-lab-head node is able to send jobs to the desktops in the Computer Science labs again. The HTCondor team released version 10.0.1, which includes support for Ubuntu 22.04 (Jammy); that OS upgrade was deployed by CS Support over the Winter break and had caused the temporary issue.

The setup is the same as before and is detailed in the F.A.Q.

12/27/2022 - OS Updates, Account Changes

  • SSH is now set to use the default port, 22, instead of 922

    The Getting Started Guide has been updated to reflect this change. It includes an updated copy-paste example of a ~/.ssh/config that you can replace your current one with (a minimal sketch also appears after this list). Alternatively, you can change the line that says “Port 922” to “Port 22”, or simply remove the line, as port 22 is the default.

    If you do not update your config, you will get a “No route to host” error when trying to connect.

  • You now login with your WWU Universal Account credentials

    Your username is the same as before, but if you were accessing the cluster with your CSCI password, you will need to use your WWU password. SSH keys will continue to work exactly as they did before.

  • All files have had their owner and group IDs updated

    There were many old files on the cluster file server that belonged to people who no longer have a WWU account, so their IDs could not be corrected. If you find you can’t access a file that you previously had access to, please contact .

  • Some software may need to be recompiled or environments rebuilt

    The operating system has changed to Rocky Linux 8 from Ubuntu 18.04. Due to incompatible changes between them, some software may need to be recompiled. If you are using the system Python, you may find your virtual environments no longer work due to the Python upgrade; you will want to rebuild them from your requirements.txt files. If you are leveraging (ana/mini/bio)conda, your environments should continue to work.

  • Software compilation and testing happens only on execute nodes

    In order to ensure enough resources for other upcoming changes, the software development tools are now only installed on the execute nodes. You can get a remote shell on an execute node using condor_submit -interactive, as is detailed in the F.A.Q.
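
For the SSH change in the first bullet above, a minimal ~/.ssh/config sketch looks like the following; the host name and username are placeholders, and the Getting Started Guide has the authoritative copy-paste version:

    # ~/.ssh/config sketch; replace the placeholders with the values
    # from the Getting Started Guide
    Host csaw
        HostName <head-node-address>
        User <your-wwu-username>
        # Port 22 is the default, so this line can be omitted entirely
        Port 22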

05/10/2022 - GROMACS & Much More!

GROMACS is now available to everyone in the cluster environment, with documentation and example scripts written to help researchers leverage it for their work.
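
The documentation has the complete examples; as a rough sketch, a GROMACS job under HTCondor might look something like this, with the file names, wrapper script, and resource numbers as placeholders:

    # gromacs_md.sub - illustrative sketch only
    # run_md.sh is a small wrapper script that runs: gmx mdrun -deffnm md
    executable           = run_md.sh
    transfer_input_files = md.tpr
    request_cpus         = 8
    request_memory       = 16GB
    output               = md.out
    error                = md.err
    log                  = md.log
    queue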

Additionally, entries in the F.A.Q. have been updated for HTCondor 9.0. The Getting Started Guide now also includes a link to a Linux command line tutorial, to help those new to the command line and provide more examples of navigating with it. The Guide’s example jobs have also been updated to mention requesting resources, which helps ensure that the resources each job needs are actually available when it runs.
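
Requesting resources explicitly is just a few extra lines in a submit file; the numbers below are illustrative examples, not recommendations:

    request_cpus   = 2
    request_memory = 4GB
    request_disk   = 10GB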

Entries have also been added to the F.A.Q. about leveraging the SSDs in some of the GPU nodes to help improve throughput.

Finally, HTCondor has also been updated to the latest stable release, 9.0.12, which should resolve the scheduler crashes seen recently. These occurred when the scheduler tried to reconnect to already-running parallel universe jobs.

Older News

See the archived news page.