Condor works differently from many other batch systems, so it is advisable to have a look at the [[http://research.cs.wisc.edu/htcondor/manual/v8.0/2_Users_Manual.html][User Manual]]. We are currently only supporting the "Vanilla" universe.

---+++ Submitting a Job

To submit a job to the Condor batch system you first need to write a "submit description file" to describe the job to the system. A very simple file would look like this:

<pre>
####################
#
# Example 1
# Simple HTCondor submit description file
#
####################
Executable = myexe
Log        = myexe.log
Input      = inputfile
Output     = outputfile
Queue
</pre>

That runs =myexe= on the batch machine (after copying it and =inputfile= to a temporary directory on the machine) and copies the standard output of the job back to a file called =outputfile=.

A more complex submit description would look like:

<pre>
####################
#
# Example 2
# More Complex HTCondor submit description file
#
####################
Universe               = vanilla
Executable             = my_analysis.sh
Arguments              = input-$(Process).txt result/output-$(Process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(Process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(Process).txt
Transfer_output_remaps = "output-$(Process).txt = results/output-$(Process).txt"
Notification           = Complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20
</pre>

This submit description runs 20 copies (=Queue 20=) of =my_analysis.sh input-$(Process).txt result/output-$(Process).txt=, where =$(Process)= is replaced by a number from 0 to 19. It copies =my_analysis.sh= and =input-$(Process).txt= to each of the worker nodes (taking the input file from the local =input= directory). The standard output and error from each job are copied back to the local =output= directory at the end of the job, and the file =result/output-$(Process).txt= is copied back to the local =results= directory (via =Transfer_output_remaps=). It copies the local environment over to the worker node (=Getenv = True=) and requests 2GB of memory to run in on the worker node. Finally, it e-mails the user when each job completes.

---+++ Monitoring Your Jobs

The basic command for monitoring jobs is =condor_q=. By default this only shows jobs submitted to the "schedd" (essentially the submit node) you are using; to see all the jobs in the system run =condor_q -global=. If a job has been idle for a while you can use =condor_q -analyze <job_id>= to look at the resources requested by the job and how they match the available resources on the cluster. Failed jobs often go into a "Held" state rather than disappearing; =condor_q -held <job_id>= will often give some information on why the job failed. =condor_userprio= will give you an idea of the current usage and fair shares on the cluster.
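For example, a typical monitoring session might look like this (the job ID =1234.0= is purely illustrative):

<verbatim>
condor_q                   # your jobs on this schedd
condor_q -global           # all jobs in the system
condor_q -analyze 1234.0   # why is this job still idle?
condor_q -held 1234.0      # why was this job held?
condor_userprio            # current usage and fair shares
</verbatim>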
---+++ Local Commands

The PPD interactive machines have some local commands that make the Condor batch system a bit more convenient to use and more like the LSF system used on lxplus at CERN. You can use =bqsub= to submit jobs (similar to LSF's =bsub= command). The full command can be specified on the command line, so you don't need to create a "submit description file". Specify a larger memory job with, e.g. =bqsub -s 8GB= (the default is 3GB), e.g.

<verbatim>
bqsub -s 8GB Sherpa PTCUT:=20 EVENTS=1000
</verbatim>

You can also submit many jobs at once: e.g. =bqsub -n 10 ./myscript output_%%.root= will run 10 jobs with arguments =output_0.root=, =output_1.root=, etc. (=%%= is replaced by the job number in each sub-job).

=qjobs= (like LSF's =bjobs=) lists your running jobs with more helpful information. =qpeek= (like LSF's =bpeek=) shows a running job's logfile. Use =bqsub -h=, =qjobs -h=, or =qpeek -h= for help.

---+++ How to retrieve output files from a running job

Sometimes it is useful to retrieve files from a running job, either to check them or before killing a job that is no longer needed. Here is a short guide (an example session is shown after this list):

   * From the user interface, type =condor_ssh_to_job [jobid]= to connect to the scratch directory on the batch worker node (this may prompt for your password).
   * Type =ls= to list all the files used by your job.
   * Type =cp [filename] /home/ppd/yourname/yourpath= to copy =[filename]= to a folder of your choice in your home area.
   * =exit= disconnects your session from the scratch directory, bringing you back to the UI.

Now you can check whether your files were copied successfully, and kill the job if it is no longer needed.
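For example, to grab one output file from a running job (the job ID and file names here are purely illustrative):

<verbatim>
condor_ssh_to_job 1234.0                      # connect to the job's scratch directory
ls                                            # see what files the job has produced
cp output-3.txt /home/ppd/yourname/results/   # copy a file back to your home area
exit                                          # back to the UI
</verbatim>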
-- Main.ChrisBrew - 2014-03-26