Condor works differently from many other batch systems, so it is advisable to have a look at the [[http://research.cs.wisc.edu/htcondor/manual/v8.0/2_Users_Manual.html][User Manual]]. We are currently only supporting the "Vanilla" universe.

---+++ Submitting a Job

To submit a job to the Condor batch system you first need to write a "submit description file" to describe the job to the system. A very simple file would look like this:

<pre>
####################
#
# Example 1
# Simple HTCondor submit description file
#
####################
Executable = myexe
Log        = myexe.log
Input      = inputfile
Output     = outputfile
Queue
</pre>

That runs =myexe= on the batch machine (after copying it and =inputfile= to a temporary directory on the machine) and copies the standard output of the job back to a file called =outputfile=.

A more complex submit description would look like:

<pre>
####################
#
# Example 2
# More Complex HTCondor submit description file
#
####################
Universe               = vanilla
Executable             = my_analysis.sh
Arguments              = input-$(Process).txt result/output-$(Process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(Process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(Process).txt
Transfer_output_remaps = "output-$(Process).txt = results/output-$(Process).txt"
Notification           = Complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20
</pre>

This submit description runs 20 copies (=Queue 20=) of =my_analysis.sh input-$(Process).txt result/output-$(Process).txt=, where =$(Process)= is replaced by a number from 0 to 19. It copies =my_analysis.sh= and =input-$(Process).txt= to each of the worker nodes (taking the input file from the local =input= directory). The standard output and error from each job are copied back to the local =output= directory at the end of the job, and the file =result/output-$(Process).txt= is copied back to the local =results= directory (via =Transfer_output_remaps=). It copies the local environment over to the worker node (=Getenv = True=) and requests 2GB of memory to run in on the worker node. Finally, it e-mails the user when each job completes.

---+++ Monitoring Your Jobs

The basic command for monitoring jobs is =condor_q=. By default this only shows jobs submitted to the "schedd" (essentially the submit node) you are using; to see all the jobs in the system run =condor_q -global=. If a job has been idle for a while you can use =condor_q -analyze <job_id>= to look at the resources requested by the job and how they match the available resources on the cluster. Failed jobs often go into a "Held" state rather than disappearing; =condor_q -held <job_id>= will often give some information on why the job failed. =condor_userprio= will give you an idea of the current usage and fair shares on the cluster.
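For example, a typical monitoring session might look like this (the job ID =1234.0= is purely illustrative):

<verbatim>
condor_q                   # your jobs on this schedd
condor_q -global           # all jobs in the system
condor_q -analyze 1234.0   # why is this job still idle?
condor_q -held 1234.0      # why was this job held?
condor_userprio            # current usage and fair shares
</verbatim>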
---+++ Local Commands

The PPD interactive machines have some local commands that make the Condor batch system a bit more convenient to use and more like the LSF system used on lxplus at CERN. You can use =bqsub= to submit jobs (similar to LSF's =bsub= command). The full command can be specified on the command line, so you don't need to create a "submit description file". Specify a larger memory job with, e.g. =bqsub -s 8GB= (the default is 3GB), e.g.

<verbatim>
bqsub -s 8GB Sherpa PTCUT:=20 EVENTS=1000
</verbatim>

You can also submit many jobs at once: e.g. =bqsub -n 10 ./myscript output_%%.root= will run 10 jobs with arguments =output_0.root=, =output_1.root=, etc. (=%%= is replaced by the job number in each sub-job).

=qjobs= (like LSF's =bjobs=) lists your running jobs with more helpful information. =qpeek= (like LSF's =bpeek=) shows a running job's logfile. Use =bqsub -h=, =qjobs -h=, or =qpeek -h= for help.

---+++ How to retrieve output files from a running job

Sometimes it is useful to retrieve files from a running job, either to check them or before killing a job that is no longer needed. Here is a short guide (an example session is shown after this list):

   * From the user interface, type =condor_ssh_to_job [jobid]= to connect to the scratch directory on the batch worker node (this may prompt for your password).
   * Type =ls= to list all the files used by your job.
   * Type =cp [filename] /home/ppd/yourname/yourpath= to copy =[filename]= to a folder of your choice in your home area.
   * =exit= disconnects your session from the scratch directory, bringing you back to the UI.

Now you can check whether your files were copied successfully, and kill the job if it is no longer needed.
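For example, to grab one output file from a running job (the job ID and file names here are purely illustrative):

<verbatim>
condor_ssh_to_job 1234.0                      # connect to the job's scratch directory
ls                                            # see what files the job has produced
cp output-3.txt /home/ppd/yourname/results/   # copy a file back to your home area
exit                                          # back to the UI
</verbatim>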
-- Main.ChrisBrew - 2014-03-26