Condor works differently from many other batch systems, so you are advised to read the [[http://research.cs.wisc.edu/htcondor/manual/v8.0/2_Users_Manual.html][User Manual]]. We currently support only the "Vanilla" universe.

---+++ Submitting a Job

To submit a job to the Condor batch system you first need to write a "submit description file" that describes the job to the system.

A very simple file would look like this:

<pre>
####################
#
# Example 1
# Simple HTCondor submit description file
#
####################

Executable   = myexe
Log          = myexe.log
Input        = inputfile
Output       = outputfile
Queue
</pre>

This runs =myexe= on the batch machine (after copying it and =inputfile= to a temporary directory on that machine) and copies the standard output of the job back to a file called =outputfile=.

A more complex submit description file would look like this:

<pre>
####################
#
# Example 2
# More Complex HTCondor submit description file
#
####################

Universe       = vanilla
Executable     = my_analysis.sh
Arguments      = input-$(process).txt result/output-$(process).txt
Log            = log/my_analysis-$(Process).log
Input          = input/input-$(process).txt
Output         = output/my_analysis-$(Process).out
Error          = output/my_analysis-$(Process).err
Request_memory = 2 GB
Transfer_output_files = result/output-$(process).txt
Transfer_output_remaps = "output-$(process).txt = results/output-$(process).txt"
Notification   = complete
Notify_user    = your.name@stfc.ac.uk
Getenv         = True
Queue 20
</pre>

This submit file runs 20 copies (=Queue 20=) of =my_analysis.sh input-$(process).txt result/output-$(process).txt=, where =$(process)= is replaced by a number from 0 to 19. It copies =my_analysis.sh= and =input-$(process).txt= to each of the worker nodes (taking the input file from the local =input= directory).
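
Example 2 refers to several local subdirectories (=input=, =output=, =log= and =results=). The following is a minimal sketch for preparing them before submission. It assumes that HTCondor does not create these directories itself, which is worth checking for the version you are using, and that it is run from the directory containing the submit description file.

```shell
#!/bin/sh
# Sketch: prepare the local directories that Example 2 refers to.
# Directory names are taken from the submit description file above.
# Assumption: HTCondor will not create them for you.
set -eu

# log/     <- Log
# output/  <- Output and Error
# results/ <- target of Transfer_output_remaps
mkdir -p log output results

# "Queue 20" gives $(process) = 0..19, so report any of the 20
# expected input files that are not yet present in input/.
missing=0
i=0
while [ "$i" -lt 20 ]; do
    if [ ! -f "input/input-$i.txt" ]; then
        echo "missing: input/input-$i.txt"
        missing=$((missing + 1))
    fi
    i=$((i + 1))
done
echo "$missing input file(s) missing"
```

If the script reports missing input files, create them before submitting.
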
The standard output and error from the job are copied back to the local =output= directory at the end of the job, and the file =result/output-$(process).txt= is copied back to the local =results= directory. The submit file also copies the local environment to the worker node (=Getenv = True=) and requests 2 GB of memory on the worker node. Finally, it e-mails the user when each job completes.

---+++ Monitoring Your Jobs

The basic command for monitoring jobs is =condor_q=. By default this shows only the jobs submitted to the "schedd" (essentially the submit node) you are using. To see all the jobs in the system, run =condor_q -global=.

If jobs have been idle for a while, you can use =condor_q -analyze <job_id>= to see the resources requested by the job and how they match the resources available on the cluster.

Failed jobs often go into a "Held" state rather than disappearing. =condor_q -held <job_id>= will often give some information on why the job failed.

=condor_userprio= will give you an idea of the current usage and fair shares on the cluster.

---+++ Local Commands

The PPD interactive machines have some local commands that make the Condor batch system a bit more convenient to use, and more like the LSF system used on lxplus at CERN.

You can use =bqsub= to submit jobs (similar to LSF's =bsub= command). The full command can be specified on the command line, so you don't need to create a "submit description file". Specify a time limit with =bqsub -c hh:mm=, e.g.

<verbatim>
bqsub -c 24:00 Sherpa PTCUT:=20 EVENTS=1000
</verbatim>

You can also submit many jobs at once, e.g. =bqsub -n 10 ./myscript output_%%.root= will run 10 jobs with arguments =output_0.root=, =output_1.root=, etc. (=%%= is replaced by the job number in each sub-job).

=qjobs= (like LSF's =bjobs=) lists your running jobs with more helpful information. =qpeek= (like LSF's =bpeek=) shows a running job's logfile.

Use =bqsub -h=, =qjobs -h=, or =qpeek -h= for help.
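
To make the =%%= substitution concrete, here is a small illustrative sketch that prints the argument each sub-job would receive. The =expand_args= function is hypothetical and exists only for this illustration; it is not how =bqsub= is implemented, and it assumes sub-job numbering starts at 0, as the =output_0.root= example suggests.

```shell
#!/bin/sh
# Illustration only: mimics the described "%%" -> sub-job number
# substitution. This is NOT bqsub's own code.
expand_args() {
    # $1 = number of sub-jobs, $2 = argument template containing "%%"
    n=$1
    template=$2
    job=0
    while [ "$job" -lt "$n" ]; do
        # Replace every "%%" in the template with the sub-job number.
        printf '%s\n' "$template" | sed "s/%%/$job/g"
        job=$((job + 1))
    done
}

# Print the arguments for three sub-jobs, one per line.
expand_args 3 'output_%%.root'
```

Each printed line corresponds to the argument one sub-job would be run with.
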

---+++ How to retrieve output files from a running job

Sometimes it is useful to retrieve files from a running job, either to check them or to save them before killing a job that is no longer needed. Here is a short guide:

   * From the user interface (UI), type =condor_ssh_to_job [jobid]= to connect to the job's scratch directory on the batch worker node (this may prompt for your password).
   * Type =ls= to list all the files used by your job.
   * Type =cp [filename] /home/ppd/yourname/yourpath= to copy =[filename]= to a folder of your choice in your home area.
   * Type =exit= to disconnect your session from the scratch directory and return to the UI.

You can now check that your files were copied successfully and kill your job if needed.

-- Main.ChrisBrew - 2014-03-26
Topic revision: r5 - 2016-02-04 - TimAdye