---+ The PPD Batch System

The PPD batch systems are shared between Grid Tier 2 and local usage (often referred to as Tier 3). We now have two batch systems: the legacy Scientific Linux 5 system uses the Torque batch system (an evolution of PBS) with the Maui scheduler to allocate the job slots fairly, while the newer Scientific Linux 6 services use Condor.

---++ Condor

Condor works differently to many other batch systems, so it is advised that you have a look at the [[http://research.cs.wisc.edu/htcondor/manual/v8.0/2_Users_Manual.html][User Manual]]. We are currently only supporting the "Vanilla" universe.

---+++ Submitting a Job

To submit a job to the Condor batch system you first need to write a "submit description file" to describe the job to the system. A very simple file would look like this:

<pre>
####################
#
# Example 1
# Simple HTCondor submit description file
#
####################
Executable = myexe
Log = myexe.log
Input = inputfile
Output = outputfile
Queue
</pre>

That runs <code>myexe</code> on the batch machine (after copying it and <code>inputfile</code> to a temporary directory on the machine) and copies the standard output of the job back to a file called <code>outputfile</code>.

A more complex submit description would look like:

<pre>
####################
#
# Example 2
# More complex HTCondor submit description file
#
####################
Universe = vanilla
Executable = my_analysis.sh
Arguments = input-$(Process).txt result/output-$(Process).txt
Log = log/my_analysis-$(Process).log
Input = input/input-$(Process).txt
Output = output/my_analysis-$(Process).out
Error = output/my_analysis-$(Process).err
Request_memory = 2 GB
Transfer_output_files = result/output-$(Process).txt
Transfer_output_remaps = "output-$(Process).txt = results/output-$(Process).txt"
Notification = complete
Notify_user = your.name@stfc.ac.uk
Getenv = True
Queue 20
</pre>

This submit file runs 20 copies (<code>Queue 20</code>) of <code>my_analysis.sh input-$(Process).txt result/output-$(Process).txt</code>, where <code>$(Process)</code> is replaced by a number from 0 to 19. It copies <code>my_analysis.sh</code> and <code>input-$(Process).txt</code> to each of the worker nodes (taking the input file from the local <code>input</code> directory). The standard output and error from the job are copied back to the local <code>output</code> directory at the end of the job, and the file <code>result/output-$(Process).txt</code> is copied back to the local <code>results</code> directory. It copies the local environment over to the worker node (<code>Getenv = True</code>), requests 2 GB of memory on the worker node, and finally e-mails the user when each job completes.

---+++ Monitoring Your Jobs

The basic command for monitoring jobs is <code>condor_q</code>. By default this only shows jobs submitted to the "schedd" (essentially the submit node) you are using; to see all the jobs in the system run <code>condor_q -global</code>. Failed jobs often go into a "Held" state rather than disappearing, and <code>condor_q -held <jobid></code> will often give some information on why the job failed. <code>condor_userprio</code> will give you an idea of the current usage and fair shares on the cluster.
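Putting the above together, a typical session on the submit node might look like the sketch below. The submit file name <code>my_analysis.sub</code> and the cluster/job ids are illustrative placeholders; <code>condor_submit</code> is the standard HTCondor command for submitting a submit description file, and the monitoring commands are the ones described above.

<pre>
# Submit the 20 jobs described in Example 2 (file name is illustrative)
$ condor_submit my_analysis.sub
Submitting job(s)....................
20 job(s) submitted to cluster 1234.

# Check your own jobs on this schedd, then everything in the system
$ condor_q
$ condor_q -global

# If a job has gone into the "Held" state, ask why (job id is illustrative)
$ condor_q -held 1234.0

# See the current usage and fair shares on the cluster
$ condor_userprio
</pre>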
---++ Torque

[[BatchUsage][Submitting and monitoring jobs]]

---+++ Resources

The PPD batch cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs is measured using a benchmark called HEPSPEC06; the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest to 8.50 for the newest, and the nominal total HEPSPEC06 rating of the cluster is 12,918.4. While almost all of the CPUs (1544) run 64-bit Scientific Linux 5, there is a small 32-bit SL4 service of 40 CPUs for groups that still need it, though that will be phased out as the need goes away.

---+++ Allocations

The system uses the Maui scheduler to "fairly" share out job starts between different users. It uses a number of different factors to try to do this, among them whether it is a local or Grid job, the Grid VO, groups within the VOs, and individual user accounts; these are compared to the usage over the last 14 days.

The current highest level shares are:

| Local | 15% |
| Grid | 85% |

Then below that:

| CMS | 74.64% |
| Atlas | 20.39% |
| LHCb | 4.05% |
| Other | 1.00% |

(These actually add up to slightly over 100%: 74.64 + 20.39 + 4.05 + 1.00 = 100.08%, but the system sorts it out!)
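As a rough guide to what these shares mean across the whole cluster, and assuming the VO shares above are intended as fractions of the 85% Grid share rather than of the whole cluster (the tables above do not say so explicitly), the effective whole-cluster targets work out approximately as:

<pre>
Effective whole-cluster share = Grid share x VO share

  CMS   : 0.85 x 74.64% ~ 63.4%
  Atlas : 0.85 x 20.39% ~ 17.3%
  LHCb  : 0.85 x  4.05% ~  3.4%
  Other : 0.85 x  1.00% ~  0.9%
  Local :                 15.0%
</pre>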
-- Main.ChrisBrew - 2009-11-17