Notifications
Overview
Have you ever wanted to know when your job finished, completed successfully, or errored out? The HTCondor notification variable does just that! When enabled, it sends you an email when any of those events happens.
There are four possible settings: two are identical, one disables notifications, and one limits when emails are sent.
The HTCondor man page for condor_submit says…
notification = <Always | Complete | Error | Never>
Owners of HTCondor jobs are notified by e-mail when certain events occur. If defined by Always or Complete, the owner will be notified when the job terminates. If defined by Error, the owner will only be notified if the job terminates abnormally, (as defined by JobSuccessExitCode, if defined) or if the job is placed on hold because of a failure, and not by user request. If defined by Never (the default), the owner will not receive e-mail, regardless to what happens to the job. The HTCondor User’s manual documents statistics included in the e-mail.
There’s a lot there, even though it is stated very succinctly. To make it even shorter, here’s the TL;DR version:
- Always, Complete
  If a job finishes, send an email. It doesn’t matter whether the job finished with an error or not; the summary email is sent either way.
- Error
  If the job terminates abnormally, or is placed on hold because of a failure (and not at your request), send an email.
- Never
  Don’t send an email. This is the default if you don’t set the notification variable.
Example Submission Files
Executable = test.sh

Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 1
Request_Memory = 2GB

Notification = Always

Queue
The Notification = Always line is the only line that matters in this example. It simply turns notifications on. It could be set to Error instead if you only want to be notified when the job fails. Assuming the test.sh script ran correctly, this example would send a message like the example email below.
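For comparison, a variant of the same submission file with the setting changed to Error might look like the sketch below. It reuses the file names from the example above and changes only the Notification line. Based on the man page text quoted earlier, a run of test.sh that exits normally would not be expected to produce any email with this setting.

```
# Sketch only: same job as the example above, but with notifications
# limited to failures.
Executable = test.sh

Output = out.log
Error = err.log
Log = condor.log

Request_Cpus = 1
Request_Memory = 2GB

# Email only if the job terminates abnormally or is put on hold
# because of a failure.
Notification = Error

Queue
```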
Example Emails
From: csaw.support@wwu.edu
To: USER@wwu.edu
Subject: [Condor] Condor Job 17521.0
Body:
This is an automated email from the Condor system
on machine "cse-head.cluster.cs.wwu.edu". Do not reply.
Condor job 17521.0
/cluster/home/USER/example_submit_scripts/email/test.sh
submitted from directory /cluster/home/USER/example_submit_scripts/email
exited normally with status 0
Submitted at: Fri Jul 26 08:35:15 2024
Completed at: Fri Jul 26 08:35:22 2024
Real Time: 0 00:00:07
Virtual Image Size: 720 Kilobytes
Statistics from last run:
Allocation/Run time: 0 00:00:06
Remote User CPU Time: 0 00:00:00
Remote System CPU Time: 0 00:00:00
Total Remote CPU Time: 0 00:00:00
Statistics totaled from all runs:
Allocation/Run time: 0 00:00:06
Network:
0.0 B Run Bytes Received By Job
0.0 B Run Bytes Sent By Job
0.0 B Total Bytes Received By Job
0.0 B Total Bytes Sent By Job
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or HTCondor in general?
Email address of the local HTCondor administrator: csaw.support@wwu.edu
The Official HTCondor Homepage is http://www.cs.wisc.edu/htcondor
Limits
Currently, the HTCondor configuration on the backend prohibits you from submitting a batch of more than 5 jobs when notifications are enabled. This keeps you from submitting 10,000 jobs, having all 10,000 fail or complete in rapid succession, and flooding your inbox with mail.
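As a rough illustration of staying within that limit, the submission file below queues exactly 5 jobs with notifications enabled. This is a sketch under an assumption: it supposes that the limit applies to the number of jobs created by a single Queue statement, which is an interpretation rather than something the text above states. The $(Cluster) and $(Process) macros are standard submit-file macros, used here only so the five jobs do not overwrite each other's output files.

```
# Sketch only: a batch of 5 jobs, the largest size the limit
# described above is assumed to allow with notifications enabled.
Executable = test.sh

# One output and error file per job, named by cluster and process ID.
Output = out.$(Cluster).$(Process).log
Error = err.$(Cluster).$(Process).log
Log = condor.log

Request_Cpus = 1
Request_Memory = 2GB

Notification = Error

Queue 5
```

For larger batches, leaving Notification unset (or setting it to Never) and checking the Log file instead avoids the limit altogether.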