Difference: BatchSystem (1 vs. 8)

Revision 8 - 2015-01-13 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

Changed:
<
<
The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). We now have two batch system, the legacy Scientific Linux 5 system uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly. While the newer Scientific Linux 6 services use Condor.
>
>
The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). Since the switch to SL6, the batch system has been based on Condor.
 

Resources

Changed:
<
<
The PPD Batch Cluster currently has a nominal capacity of 2700 job slots (one for each of the 2700 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 8.18 for the oldest CPUs to 12.67 for the newest. The nominal total HEPSPEC06 rating of the cluster is 28,000
>
>
The PPD Batch Cluster currently has a nominal capacity of about 3500 job slots (one for each of the 3500 CPUs) on 268 nodes. The power of the CPUs is measured using a benchmark called HEPSPEC06; the individual CPUs range from a HEPSPEC06 rating of 8.18 for the oldest CPUs to 12.67 for the newest. The nominal total HEPSPEC06 rating of the cluster is about 35,000.
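As a rough cross-check of the figures above, the nominal totals imply an average rating per job slot that should fall inside the quoted 8.18 to 12.67 range. A minimal Python sketch of that arithmetic follows; the variable names are illustrative only, and the inputs are the approximate values quoted in the text, not measured data.

```python
# Approximate figures quoted in the text above (all "about" values).
job_slots = 3500
total_hepspec06 = 35_000
nodes = 268

# Average benchmark rating per job slot (one slot per CPU).
avg_hs06_per_slot = total_hepspec06 / job_slots

# Average number of job slots hosted on each node.
avg_slots_per_node = job_slots / nodes

print(f"Average HEPSPEC06 per slot: {avg_hs06_per_slot:.2f}")
print(f"Average slots per node: {avg_slots_per_node:.1f}")
```

Because the inputs are rounded, the results are only indicative of the mix of older and newer CPUs.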
 
Deleted:
<
<
At the moment approximately 1800 jobs slots are assigned to the SL6 Condor service with about 900 of the older CPU allocated to the SL5 Torque/Maui service.
 

Allocations

The system tries to "fairly" share out job starts between different users, using a number of different factors to do this. Among them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individual user accounts. These are compared to the usage over the last 14 days.

Grid submission to the SL5 Cluster is now switched off so 100% of the remaining resources are available for local Tier 3 jobs.

For the SL6 Condor service the current highest level Shares are:

Local 15%
Grid 85%

There are currently no group shares in the Local partition. For the grid jobs, the split below the Grid partition is as follows:

CMS 13%
Atlas 54%
LHCb 20%
Other 3%
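The group percentages listed above add up to 90 rather than 100. A minimal Python sketch follows, assuming the shares act as relative weights that are normalised by their sum (this is an assumption for illustration; the text does not say how the Condor configuration treats them), of the fraction of the whole cluster each grid group would be aiming for.

```python
# Shares as quoted in the text above.
top_level = {"Local": 15, "Grid": 85}
grid_groups = {"CMS": 13, "Atlas": 54, "LHCb": 20, "Other": 3}


def effective_fractions(parent_pct, groups):
    """Scale each group's weight by the parent share.

    Assumption: weights are normalised by their own sum, so they do not
    need to add up to exactly 100.
    """
    total = sum(groups.values())
    return {name: parent_pct / 100 * weight / total
            for name, weight in groups.items()}


fractions = effective_fractions(top_level["Grid"], grid_groups)
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of the cluster")
```

Under this assumption the group fractions always sum to the parent share (85% here), whatever the raw percentages add up to.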
Changed:
<
<

Using the PPD batch systems

>
>

Using the PPD Condor batch system

 
Deleted:
<
<

Condor

 Submitting and monitoring Condor jobs
Deleted:
<
<

Torque

Submitting and monitoring Torque jobs
 

-- ChrisBrew - 2009-11-17

Revision 7 - 2014-03-26 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). We now have two batch system, the legacy Scientific Linux 5 system uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly. While the newer Scientific Linux 6 services use Condor.

Changed:
<
<

Condor

>
>

Resources

 
Changed:
<
<
Condor works differently to many other batch systems so it is advised that you have a look at the User Manual. We are currently only supporting the "Vanilla" universe.
>
>
The PPD Batch Cluster currently has a nominal capacity of 2700 job slots (one for each of the 2700 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 8.18 for the oldest CPUs to 12.67 for the newest. The nominal total HEPSPEC06 rating of the cluster is 28,000
 
Changed:
<
<

Submitting a Job

>
>
At the moment approximately 1800 jobs slots are assigned to the SL6 Condor service with about 900 of the older CPU allocated to the SL5 Torque/Maui service.
 
Changed:
<
<
To submit a jobs to the condor batch system you first need to write a "submit description file" to describe the job to the system: A very simple file would look like this:
>
>

Allocations

Deleted:
<
<
####################                                                    
# 
# Example 1                                                            
# Simple HTCondor submit description file                                    
#                                                                       
####################                                                    
                                                                        
Executable   = myexe                                                    
Log          = myexe.log                                                    
Input        = inputfile
Output       = outputfile
Queue
 
Changed:
<
<
That runs myexe on the batch machine (after copying it and inputfile to a temporary directory on the machine) and copies back the standard output of the job to a file called outputfile
>
>
The system tries to "fairly" share out jobs starts between different users, it uses a number of different factors to try to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individual user accounts. These are compared to the usage over the last 14 days.
 
Changed:
<
<
A more complex example submit description would look like:
>
>
Grid submission to the SL5 Cluster is now switched off so 100% of the remaining resources are available for local Tier 3 jobs.
Deleted:
<
<
####################                                                    
# 
# Example 2                                                            
# More Complex HTCondor submit description file                                    
#                                                                       
####################                                                    

 
Changed:
<
<
Universe = vanilla
>
>
For the SL6 Condor service the current highest level Shares are:
Deleted:
<
<
Executable             = my_analysis.sh
Arguments              = input-$(process).txt result/output-$(process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(process).txt
Transfer_output_remaps = "output-$(process).txt = results/output-$(process).txt"
Notification           = complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20
 
Changed:
<
<
This submit runs 20 copies (Queue 20) of my_analysis.sh input-$(process).txt result/output-$(process).txt where $(process) is replaced by a number 0 to 19. It will copy my_analysis.sh and input-$(process).txt to each of the worker nodes (taking the input file from the local input directory). The standard output and error from the job are copied back to the local output directory at the end of the job and the file result/output-$(process).txt is copied back to the local results directory. It copies over the local environment to the worker node (Getenv = True) and requests 2GB of memory to run in on the worker node. Finally it e-mails the user when each job completes.
>
>
Local 15%
Added:
>
>
Grid 85%
 
Changed:
<
<

Monitoring Your Jobs

>
>
There are currently no group shares in the Local partition, for the grid jobs the split below the Grid partition is as follows:
 
Changed:
<
<
The basic command for monitoring jobs is condor_q by default this only shows jobs submitted to the "schedd" (essentially submit node) you are using, to see all the jobs in the system run condor_q -global
>
>
CMS 13%
Added:
>
>
Atlas 54%
LHCb 20%
Other 3%
 
Changed:
<
<
Failed jobs often go into a "Held" state rather than disappearing, condor_q -held <jobid> will often give some information on why the job failed.
>
>

Using the PPD batch systems

 
Changed:
<
<
condor_userprio will give you an idea of the current usage and failshares on the cluster.
>
>

Condor

 
Changed:
<
<

Torque

>
>
Submitting and monitoring Condor jobs
Deleted:
<
<
Submitting and monitoring jobs
 
Changed:
<
<

Resources

>
>

Torque

Added:
>
>
Submitting and monitoring Torque jobs
 
Deleted:
<
<
The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors to try to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.

The current highest level Shares are:

Local 15%
Grid 85%

Then below that:

CMS 74.64%
Atlas 20.39%
LHCb 4.05%
Other 1.00%

(Which actually makes slightly over 100%, but the system sorts it out!)

  -- ChrisBrew - 2009-11-17

Revision 6 - 2014-02-28 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). We now have two batch system, the legacy Scientific Linux 5 system uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly. While the newer Scientific Linux 6 services use Condor.

Condor

Condor works differently from many other batch systems, so you are advised to have a look at the User Manual. We currently support only the "Vanilla" universe.

Submitting a Job

To submit a job to the Condor batch system you first need to write a "submit description file" to describe the job to the system. A very simple file would look like this:

####################                                                    
# 
# Example 1                                                            
# Simple HTCondor submit description file                                    
#                                                                       
####################                                                    
                                                                        
Executable   = myexe                                                    
Log          = myexe.log                                                    
Input        = inputfile
Output       = outputfile
Queue

That runs myexe on the batch machine (after copying it and inputfile to a temporary directory on the machine) and copies the standard output of the job back to a file called outputfile.

A more complex example submit description would look like:

####################                                                    
# 
# Example 2                                                            
# More Complex HTCondor submit description file                                    
#                                                                       
####################                                                    

Universe               = vanilla
Executable             = my_analysis.sh
Arguments              = input-$(process).txt result/output-$(process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(process).txt
Transfer_output_remaps = "output-$(process).txt = results/output-$(process).txt"
Notification           = complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20

This submit runs 20 copies (Queue 20) of my_analysis.sh input-$(process).txt result/output-$(process).txt where $(process) is replaced by a number 0 to 19. It will copy my_analysis.sh and input-$(process).txt to each of the worker nodes (taking the input file from the local input directory). The standard output and error from the job are copied back to the local output directory at the end of the job and the file result/output-$(process).txt is copied back to the local results directory. It copies over the local environment to the worker node (Getenv = True) and requests 2GB of memory to run in on the worker node. Finally it e-mails the user when each job completes.
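To make the $(process) substitution described above concrete, here is a minimal Python sketch that mimics the expansion for the first and last of the 20 jobs. It is only an illustration of the resulting file names, not how HTCondor itself performs the substitution; the `expand` helper is a hypothetical name introduced for this example.

```python
def expand(template: str, process: int) -> str:
    """Replace the process macro with the job's index.

    The submit file above spells the macro both as $(process) and
    $(Process), so both spellings are handled here.
    """
    return (template.replace("$(process)", str(process))
                    .replace("$(Process)", str(process)))


# Templates copied from the submit description file above.
templates = {
    "Arguments": "input-$(process).txt result/output-$(process).txt",
    "Input": "input/input-$(process).txt",
    "Output": "output/my_analysis-$(Process).out",
    "Error": "output/my_analysis-$(Process).err",
    "Log": "log/my_analysis-$(Process).log",
}

n_jobs = 20  # from "Queue 20"; the index runs from 0 to 19
for proc in (0, n_jobs - 1):
    print(f"--- job {proc} ---")
    for key, template in templates.items():
        print(f"{key:10s} = {expand(template, proc)}")
```

Each job therefore reads its own input file and writes its own output, error and log files, so the 20 copies do not overwrite one another.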

Added:
>
>

Monitoring Your Jobs

The basic command for monitoring jobs is condor_q. By default this only shows jobs submitted to the "schedd" (essentially the submit node) you are using; to see all the jobs in the system, run condor_q -global.

Failed jobs often go into a "Held" state rather than disappearing; condor_q -held <jobid> will often give some information on why the job failed.

condor_userprio will give you an idea of the current usage and fairshares on the cluster.
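The monitoring commands above can also be driven from a script. The following is a minimal Python sketch, assuming the commands are on the PATH of the submit node; the helper names `condor_q_command` and `run_if_available` and the job id `1234.0` are hypothetical, and only the options mentioned in the text (-global and -held) are used.

```python
import shutil
import subprocess


def condor_q_command(all_schedds=False, held_job=None):
    """Build the condor_q command line described in the text."""
    cmd = ["condor_q"]
    if all_schedds:
        cmd.append("-global")               # show jobs from every schedd
    if held_job is not None:
        cmd += ["-held", str(held_job)]     # ask why a job is held
    return cmd


def run_if_available(cmd):
    """Run the command only if it exists on this machine; else return None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# "1234.0" is a made-up job id used purely for illustration.
for cmd in (condor_q_command(),
            condor_q_command(all_schedds=True),
            condor_q_command(held_job="1234.0"),
            ["condor_userprio"]):
    print("$", " ".join(cmd))
    output = run_if_available(cmd)
    if output is not None:
        print(output)
```

On a machine without HTCondor installed the sketch only prints the command lines, which makes it safe to try before logging in to a submit node.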

 

Torque

Submitting and monitoring jobs

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

Changed:
<
<
The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.
>
>
The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors to try to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.
  The current highest level Shares are:

Local 15%
Grid 85%

Then below that:

CMS 74.64%
Atlas 20.39%
LHCb 4.05%
Other 1.00%

(Which actually makes slightly over 100%, but the system sorts it out!)

-- ChrisBrew - 2009-11-17

Revision 5 - 2014-02-28 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). We now have two batch system, the legacy Scientific Linux 5 system uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly. While the newer Scientific Linux 6 services use Condor.

Condor

Condor works differently to many other batch systems so it is advised that you have a look at the User Manual. We are currently only supporting the "Vanilla" universe.

Submitting a Job

To submit a jobs to the condor batch system you first need to write a "submit description file" to describe the job to the system: A very simple file would look like this:

####################                                                    
# 
# Example 1                                                            
# Simple HTCondor submit description file                                    
#                                                                       
####################                                                    
                                                                        
Executable   = myexe                                                    
Log          = myexe.log                                                    
Input        = inputfile
Output       = outputfile
Queue

That runs myexe on the batch machine (after copying it and inputfile to a temporary directory on the machine) and copies back the standard output of the job to a file called outputfile

A more complex example submit description would look like:

####################                                                    
# 

Changed:
<
<
# Example 1
# Simple HTCondor submit description file
>
>
# Example 2
# More Complex HTCondor submit description file
#
####################

Universe               = vanilla
Executable             = my_analysis.sh
Arguments              = input-$(process).txt result/output-$(process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(process).txt
Transfer_output_remaps = "output-$(process).txt = results/output-$(process).txt"
Notification           = complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20

This submit runs 20 copies (Queue 20) of my_analysis.sh input-$(process).txt result/output-$(process).txt where $(process) is replaced by a number 0 to 19. It will copy my_analysis.sh and input-$(process).txt to each of the worker nodes (taking the input file from the local input directory). The standard output and error from the job are copied back to the local output directory at the end of the job and the file result/output-$(process).txt is copied back to the local results directory. It copies over the local environment to the worker node (Getenv = True) and requests 2GB of memory to run in on the worker node. Finally it e-mails the user when each job completes.

Torque

Submitting and monitoring jobs

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.

The current highest level Shares are:

Local 15%
Grid 85%

Then below that:

CMS 74.64%
Atlas 20.39%
LHCb 4.05%
Other 1.00%

(Which actually makes slightly over 100%, but the system sorts it out!)

-- ChrisBrew - 2009-11-17

Revision 4 - 2014-02-13 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

Changed:
<
<
The PPD Batch System is shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). It uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly.
>
>
The PPD Batch Systems are shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). We now have two batch system, the legacy Scientific Linux 5 system uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly. While the newer Scientific Linux 6 services use Condor.
 
Added:
>
>

Condor

Condor works differently to many other batch systems so it is advised that you have a look at the User Manual. We are currently only supporting the "Vanilla" universe.

Submitting a Job

To submit a jobs to the condor batch system you first need to write a "submit description file" to describe the job to the system: A very simple file would look like this:

####################                                                    
# 
# Example 1                                                            
# Simple HTCondor submit description file                                    
#                                                                       
####################                                                    
                                                                        
Executable   = myexe                                                    
Log          = myexe.log                                                    
Input        = inputfile
Output       = outputfile
Queue

That runs myexe on the batch machine (after copying it and inputfile to a temporary directory on the machine) and copies back the standard output of the job to a file called outputfile

A more complex example submit description would look like:

####################                                                    
# 
# Example 1                                                            
# Simple HTCondor submit description file                                    
#                                                                       
####################                                                    

Universe               = vanilla
Executable             = my_analysis.sh
Arguments              = input-$(process).txt result/output-$(process).txt
Log                    = log/my_analysis-$(Process).log
Input                  = input/input-$(process).txt
Output                 = output/my_analysis-$(Process).out
Error                  = output/my_analysis-$(Process).err
Request_memory         = 2 GB
Transfer_output_files  = result/output-$(process).txt
Transfer_output_remaps = "output-$(process).txt = results/output-$(process).txt"
Notification           = complete
Notify_user            = your.name@stfc.ac.uk
Getenv                 = True
Queue 20

This submit runs 20 copies (Queue 20) of my_analysis.sh input-$(process).txt result/output-$(process).txt where $(process) is replaced by a number 0 to 19. It will copy my_analysis.sh and input-$(process).txt to each of the worker nodes (taking the input file from the local input directory). The standard output and error from the job are copied back to the local output directory at the end of the job and the file result/output-$(process).txt is copied back to the local results directory. It copies over the local environment to the worker node (Getenv = True) and requests 2GB of memory to run in on the worker node. Finally it e-mails the user when each job completes.

Torque

 Submitting and monitoring jobs

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.

The current highest level Shares are:

Local 15%
Grid 85%

Then below that:

CMS 74.64%
Atlas 20.39%
LHCb 4.05%
Other 1.00%

(Which actually makes slightly over 100%, but the system sorts it out!)

Deleted:
<
<

Queues

  -- ChrisBrew - 2009-11-17

Revision 3 - 2012-04-05 - RobHarper

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch System is shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). It uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly.

Submitting and monitoring jobs

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

Changed:
<
<
The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last seven day.
>
>
The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last 14 days.
  The current highest level Shares are:
Changed:
<
<
Local 10%
Grid 90%
>
>
Local 15%
Grid 85%
  Then below that:
Changed:
<
<
Atlas 22.14%
BaBar 14.64%
CMS 36.20%
LHCb 25.98%
Other 1.04%
>
>
CMS 74.64%
Atlas 20.39%
LHCb 4.05%
Other 1.00%
Added:
>
>
(Which actually makes slightly over 100%, but the system sorts it out!)
 

Queues

-- ChrisBrew - 2009-11-17

Revision 2 - 2010-01-18 - TimAdye

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch System is shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). It uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly.

Added:
>
>
Submitting and monitoring jobs
 

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

Changed:
<
<
While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.
>
>
While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.
 

Allocations

The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last seven day.

The current highest level Shares are:

Local 10%
Grid 90%

Then below that:

Atlas 22.14%
BaBar 14.64%
CMS 36.20%
LHCb 25.98%
Changed:
<
<
Other 1.04%
>
>
Other 1.04%
 

Queues

-- ChrisBrew - 2009-11-17

Revision 1 - 2009-11-17 - ChrisBrew

 
META TOPICPARENT name="WebHome"

The PPD Batch System

The PPD Batch System is shared between the Grid Tier 2 and Local usage (often referred to as Tier 3). It uses the torque batch system (an evolution of PBS) and the Maui scheduler to allocate the job slots fairly.

Resources

The PPD Batch Cluster currently has a nominal capacity of 1584 job slots (one for each of the 1584 CPUs) on 240 nodes. The power of the CPUs are measured using a benchmark called HEPSPEC06, the individual CPUs range from a HEPSPEC06 rating of 6.80 for the oldest CPUs to 8.50 for the newest. The nominal total HEPSPEC06 rating of the cluster is 12,918.4

While almost all of the CPUs (1544) run 64bit Scientific Linux 5, there is a small 32bit SL4 service of 40 CPUs for groups that still need it - though that will be phased out as the need goes away.

Allocations

The system uses the Maui scheduler to "fairly" share out jobs starts between different users, it uses a number of different factors totry to do this. Amongst them are whether it is a "local" or grid job, the Grid VO, groups within the VOs and individualuser accounts. These are compared to theusage over the last seven day.

The current highest level Shares are:

Local 10%
Grid 90%

Then below that:

Atlas 22.14%
BaBar 14.64%
CMS 36.20%
LHCb 25.98%
Other 1.04%

Queues

-- ChrisBrew - 2009-11-17

 