Difference: LinuxFAQ (1 vs. 5)

Revision 5 2016-11-28 - ChrisBrew

 
META TOPICPARENT name="WebHome"

Why aren't my batch jobs starting?

The RALPP batch system uses the Maui scheduler to fairly allocate CPU resources between users, groups and Tier 2 (Grid) and Tier 3 (local) use. This means that which job starts next depends not on how long it has been in the queue but on recent usage of the batch system by the different user groups. So, although your job might head the list of queued jobs reported by qstat, that does not mean that it will necessarily start soon.

Luckily, Maui provides a few commands for looking at the current status, which allow you to work out what is going on.

The first, showq, is fairly straightforward in that it displays the current job queue in priority order:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

5768701            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
5768715            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
...
5792780            pltatl09    Running     1  4:00:00:00  Mon Oct 25 12:24:33
5792777             prdh107    Running     1  4:00:00:00  Mon Oct 25 12:24:33

  1500 Active Jobs    1502 of 1636 Processors Active (91.81%)
                       223 of  242 Nodes Active      (92.15%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831310            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5831311            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
...
5825489             prdh107       Idle     1  4:00:00:00  Wed Oct 27 23:21:21
5830811             prdh107       Idle     1  4:00:00:00  Thu Oct 28 14:04:08

596 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831115            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:46:15
5831178            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33
5831179            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33

Total Jobs: 2099   Active Jobs: 1500   Idle Jobs: 596   Blocked Jobs: 3

So in this case job number 5831310 for pltatl09 is likely to be the next job to start, though if a user with a higher priority submits a job before that happens the new job would start first.

The "Blocked" atlassgm jobs have special limits, which is why they are currently blocked. However, if your jobs appear in the blocked list it generally means that there was a problem either with one of the batch nodes or with the communication between the node and the batch server, and your job will not run.
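
If you want to check where your first queued job sits in the idle list, the showq output can be parsed with a few lines of Python. This is only a sketch: the function names idle_queue and jobs_ahead are made up for illustration, the sample text is trimmed from the output above, and it assumes the column layout shown there.

```python
def idle_queue(showq_text):
    """Return (job_id, username) pairs from the IDLE JOBS section, in listed order."""
    jobs = []
    in_idle = False
    for line in showq_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("IDLE JOBS"):
            in_idle = True
            continue
        if stripped.startswith("BLOCKED JOBS"):
            break
        fields = stripped.split()
        # Job rows start with a numeric job ID and have "Idle" in the STATE
        # column; this skips the column header, "..." and the summary line.
        if in_idle and len(fields) >= 3 and fields[0].isdigit() and fields[2] == "Idle":
            jobs.append((fields[0], fields[1]))
    return jobs


def jobs_ahead(showq_text, username):
    """Count idle jobs listed before the first idle job owned by username.

    Returns None if the user has no idle jobs in the listing.
    """
    for position, (_, owner) in enumerate(idle_queue(showq_text)):
        if owner == username:
            return position
    return None


# Trimmed sample based on the showq output above (illustrative only).
SAMPLE = """\
IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831310            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5831311            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5825489             prdh107       Idle     1  4:00:00:00  Wed Oct 27 23:21:21

3 Idle Jobs

BLOCKED JOBS----------------
"""

print(jobs_ahead(SAMPLE, "prdh107"))
```

Bear in mind that the position is only a snapshot: a job submitted later by a user with higher priority will be listed ahead of yours.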

The other useful command is diagnose -f.

Here is a cut-down sample of the output:

ACCT								      
-------------							      
atlas*           31.41  20.39    67.01   19.79   24.40   39.20   36.49
cms*             37.18  74.64    29.26   42.98    5.40   19.36   31.55
lhcb              0.07   4.05     0.10    0.13    0.06    0.05    0.06
other*            9.07   1.00     0.64   25.43   30.49    5.46    3.12
babar             0.01 -------    0.01    0.01    0.02    0.01    0.01
								      
QOS								      
-------------							      
grid*            77.72  85.00    97.02   88.33   60.33   64.07   71.23
local*           22.25  15.00     2.97   11.63   39.62   35.90   28.76
mon               0.03   1.00     0.01    0.04    0.05    0.03    0.01
								      
CLASS								      
-------------							      
prod             22.25 -------    2.97   11.63   39.62   35.90   28.76
grid             77.75 -------   97.03   88.37   60.38   64.10   71.24
There is more output above and to the right.

For local users the most interesting part is the QOS section which deals with the split between Tier 2 (grid) and Tier 3 (local) usage.

The first number is the "recent usage" (we'll get to how that's calculated later) and the second number is the fairshare target. In the example output the local users have had 22.25% of the recent use of the farm, which is over their target of 15% (the numbers aren't quite percentages because of the monitoring share, but that's a very small correction), whilst the grid has only had 77.72% against a target of 85%. So, assuming that there are grid jobs available, they will be started ahead of local jobs until the system either runs out of grid jobs or the actual usage hits 85:15, at which point it will start local jobs again.
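
The comparison of recent usage against target can be done mechanically. The following is a rough sketch, not a supported tool: parse_fairshare_row is a hypothetical helper, it assumes each row is laid out as name, recent usage, target, then daily history as in the sample above, and it simply strips the trailing "*" that some names carry.

```python
def parse_fairshare_row(line):
    """Split one diagnose -f row into (name, usage, target, history).

    target is None when the column shows only dashes (no target set).
    """
    fields = line.split()
    name = fields[0].rstrip("*")
    usage = float(fields[1])
    target = None if set(fields[2]) == {"-"} else float(fields[2])
    history = [float(value) for value in fields[3:]]
    return name, usage, target, history


# QOS rows copied from the sample output above.
QOS_ROWS = """\
grid*            77.72  85.00    97.02   88.33   60.33   64.07   71.23
local*           22.25  15.00     2.97   11.63   39.62   35.90   28.76
mon               0.03   1.00     0.01    0.04    0.05    0.03    0.01
"""

for row in QOS_ROWS.splitlines():
    name, usage, target, _ = parse_fairshare_row(row)
    if target is None:
        continue  # nothing to compare against
    status = "over" if usage > target else "under"
    print(f"{name}: {usage:.2f} used vs {target:.2f} target ({status})")
```

A group that is over its target should expect its queued jobs to wait while groups that are under target have jobs available.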

The next numbers are the historical daily usage by the different classes of jobs, so for today the split of grid to local CPU usage is 97.02:2.97, yesterday it was 88.33:11.63, and so on. The "recent usage" number is calculated as:

today's usage + (1 day ago)*0.9 + (2 days ago)*0.9^2 + ... + (13 days ago)*0.9^13
i.e. a two-week rolling average in which historical usage counts less and less the longer ago it was.
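
As a worked illustration of that weighting, here is a small Python sketch. The function names are invented for this example, and normalising the weighted sums so that the groups add up to 100 is an assumption about how the percentages are derived. The sample output above shows only five days of history, so this will not reproduce the 77.72 and 22.25 figures exactly.

```python
def decayed_usage(daily_usage, decay=0.9, window=14):
    """Weighted sum of daily usage, newest first.

    daily_usage[0] is today, daily_usage[1] is one day ago, and so on.
    Each day further back is multiplied by another factor of decay, and
    only the most recent window days are counted.
    """
    return sum(usage * decay ** age
               for age, usage in enumerate(daily_usage[:window]))


def recent_shares(daily_by_group, decay=0.9, window=14):
    """Turn per-group decayed usage into shares that add up to 100.

    The normalisation step is an assumption made for this sketch.
    """
    weighted = {group: decayed_usage(days, decay, window)
                for group, days in daily_by_group.items()}
    total = sum(weighted.values())
    return {group: 100.0 * value / total for group, value in weighted.items()}


# The five daily values visible in the QOS section of the sample output.
history = {
    "grid": [97.02, 88.33, 60.33, 64.07, 71.23],
    "local": [2.97, 11.63, 39.62, 35.90, 28.76],
}

shares = recent_shares(history)
for group, share in shares.items():
    print(f"{group}: {share:.2f}")
```

Because each day back is scaled by a further factor of 0.9, heavy use a week ago still counts, but at less than half the weight of the same use today (0.9^7 is about 0.48).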
Added:
>
>

How can I use Kerberos to log into lxplus without needing to enter a password each time?

Add the following section to your ~/.ssh/config file:

Host lxplus
        GSSAPIAuthentication yes
        GSSAPIDelegateCredentials yes
        GSSAPITrustDns yes
        PubkeyAuthentication no
        RSAAuthentication no
        StrictHostKeyChecking no
        Protocol 2
        ForwardX11 yes
        ForwardAgent yes
        # Set User if your RAL and CERN usernames are different
        User <your-CERN-username>
        HostName lxplus.cern.ch
  -- ChrisBrew - 2010-10-25

Revision 4 2012-01-19 - ChrisBrew


Why aren't my batch jobs starting?

The RALPP batch system uses the Maui scheduler to fairly allocate CPU resources between users, groups and Tier 2 (Grid) and Tier 3 (local) use. This means that which job starts next depends not on how long it has been in the queue but on recent usage of the batch system by the different user groups. So, although your job might head the list of queued jobs reported by qstat, that does not mean that it will necessarily start soon.

Luckily, Maui provides a few commands for looking at the current status, which allow you to work out what is going on.

The first, showq, is fairly straightforward in that it displays the current job queue in priority order:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

5768701            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
5768715            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
...
5792780            pltatl09    Running     1  4:00:00:00  Mon Oct 25 12:24:33
5792777             prdh107    Running     1  4:00:00:00  Mon Oct 25 12:24:33

  1500 Active Jobs    1502 of 1636 Processors Active (91.81%)
                       223 of  242 Nodes Active      (92.15%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831310            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5831311            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
...
5825489             prdh107       Idle     1  4:00:00:00  Wed Oct 27 23:21:21
5830811             prdh107       Idle     1  4:00:00:00  Thu Oct 28 14:04:08

596 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831115            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:46:15
5831178            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33
5831179            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33

Total Jobs: 2099   Active Jobs: 1500   Idle Jobs: 596   Blocked Jobs: 3

So in this case job number 5831310 for pltatl09 is likely to be the next job to start, though if a user with a higher priority submits a job before that happens the new job would start first.

The "Blocked" atlassgm jobs have special limits, which is why they are currently blocked. However, if your jobs appear in the blocked list it generally means that there was a problem either with one of the batch nodes or with the communication between the node and the batch server, and your job will not run.

Added:
>
>
The other useful command is diagnose -f.

Here is a cut-down sample of the output:

ACCT								      
-------------							      
atlas*           31.41  20.39    67.01   19.79   24.40   39.20   36.49
cms*             37.18  74.64    29.26   42.98    5.40   19.36   31.55
lhcb              0.07   4.05     0.10    0.13    0.06    0.05    0.06
other*            9.07   1.00     0.64   25.43   30.49    5.46    3.12
babar             0.01 -------    0.01    0.01    0.02    0.01    0.01
								      
QOS								      
-------------							      
grid*            77.72  85.00    97.02   88.33   60.33   64.07   71.23
local*           22.25  15.00     2.97   11.63   39.62   35.90   28.76
mon               0.03   1.00     0.01    0.04    0.05    0.03    0.01
								      
CLASS								      
-------------							      
prod             22.25 -------    2.97   11.63   39.62   35.90   28.76
grid             77.75 -------   97.03   88.37   60.38   64.10   71.24
There is more output above and to the right.

For local users the most interesting part is the QOS section which deals with the split between Tier 2 (grid) and Tier 3 (local) usage.

The first number is the "recent usage" (we'll get to how that's calculated later) and the second number is the fairshare target. In the example output the local users have had 22.25% of the recent use of the farm, which is over their target of 15% (the numbers aren't quite percentages because of the monitoring share, but that's a very small correction), whilst the grid has only had 77.72% against a target of 85%. So, assuming that there are grid jobs available, they will be started ahead of local jobs until the system either runs out of grid jobs or the actual usage hits 85:15, at which point it will start local jobs again.

The next numbers are the historical daily usage by the different classes of jobs, so for today the split of grid to local CPU usage is 97.02:2.97, yesterday it was 88.33:11.63, and so on. The "recent usage" number is calculated as:

today's usage + (1 day ago)*0.9 + (2 days ago)*0.9^2 + ... + (13 days ago)*0.9^13
i.e. a two-week rolling average in which historical usage counts less and less the longer ago it was.
  -- ChrisBrew - 2010-10-25

Revision 3 2010-10-28 - ChrisBrew


Why aren't my batch jobs starting?

The RALPP batch system uses the Maui scheduler to fairly allocate CPU resources between users, groups and Tier 2 (Grid) and Tier 3 (local) use. This means that which job starts next depends not on how long it has been in the queue but on recent usage of the batch system by the different user groups. So, although your job might head the list of queued jobs reported by qstat, that does not mean that it will necessarily start soon.

Luckily, Maui provides a few commands for looking at the current status, which allow you to work out what is going on.

The first, showq, is fairly straightforward in that it displays the current job queue in priority order:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

5768701            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
5768715            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
...
5792780            pltatl09    Running     1  4:00:00:00  Mon Oct 25 12:24:33
5792777             prdh107    Running     1  4:00:00:00  Mon Oct 25 12:24:33

  1500 Active Jobs    1502 of 1636 Processors Active (91.81%)
                       223 of  242 Nodes Active      (92.15%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831310            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5831311            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
...
5825489             prdh107       Idle     1  4:00:00:00  Wed Oct 27 23:21:21
5830811             prdh107       Idle     1  4:00:00:00  Thu Oct 28 14:04:08

596 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

5831115            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:46:15
5831178            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33
5831179            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33

Total Jobs: 2099   Active Jobs: 1500   Idle Jobs: 596   Blocked Jobs: 3
Added:
>
>
So in this case job number 5831310 for pltatl09 is likely to be the next job to start, though if a user with a higher priority submits a job before that happens the new job would start first.

The "Blocked" atlassgm jobs have special limits, which is why they are currently blocked. However, if your jobs appear in the blocked list it generally means that there was a problem either with one of the batch nodes or with the communication between the node and the batch server, and your job will not run.

 -- ChrisBrew - 2010-10-25

Revision 2 2010-10-28 - ChrisBrew


Why aren't my batch jobs starting?

The RALPP batch system uses the Maui scheduler to fairly allocate CPU resources between users, groups and Tier 2 (Grid) and Tier 3 (local) use. This means that which job starts next depends not on how long it has been in the queue but on recent usage of the batch system by the different user groups. So, although your job might head the list of queued jobs reported by qstat, that does not mean that it will necessarily start soon.

Luckily, Maui provides a few commands for looking at the current status, which allow you to work out what is going on.

The first, showq, is fairly straightforward in that it displays the current job queue in priority order:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

5768701            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
5768715            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
...
5792780            pltatl09    Running     1  4:00:00:00  Mon Oct 25 12:24:33
5792777             prdh107    Running     1  4:00:00:00  Mon Oct 25 12:24:33

Changed:
<
<
  1133 Active Jobs    1133 of 1640 Processors Active (69.09%)
                       225 of  243 Nodes Active      (92.59%)
>
>
  1500 Active Jobs    1502 of 1636 Processors Active (91.81%)
                       223 of  242 Nodes Active      (92.15%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
Added:
>
>
5831310            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
5831311            pltatl09       Idle     1  4:00:00:00  Thu Oct 28 15:04:06
...
5825489             prdh107       Idle     1  4:00:00:00  Wed Oct 27 23:21:21
5830811             prdh107       Idle     1  4:00:00:00  Thu Oct 28 14:04:08
 
Changed:
<
<
0 Idle Jobs
>
>
596 Idle Jobs
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
Added:
>
>
5831115            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:46:15
5831178            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33
5831179            atlassgm       Idle     1  4:00:00:00  Thu Oct 28 14:52:33
 
Changed:
<
<
Total Jobs: 1146 Active Jobs: 1146 Idle Jobs: 0 Blocked Jobs: 0
>
>
Total Jobs: 2099 Active Jobs: 1500 Idle Jobs: 596 Blocked Jobs: 3
  -- ChrisBrew - 2010-10-25

Revision 1 2010-10-25 - ChrisBrew


Why aren't my batch jobs starting?

The RALPP batch system uses the Maui scheduler to fairly allocate CPU resources between users, groups and Tier 2 (Grid) and Tier 3 (local) use. This means that which job starts next depends not on how long it has been in the queue but on recent usage of the batch system by the different user groups. So, although your job might head the list of queued jobs reported by qstat, that does not mean that it will necessarily start soon.

Luckily, Maui provides a few commands for looking at the current status, which allow you to work out what is going on.

The first, showq, is fairly straightforward in that it displays the current job queue in priority order:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

5768701            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
5768715            prdcms12    Running     1  2:09:06:18  Sat Oct 23 21:30:51
...
5792780            pltatl09    Running     1  4:00:00:00  Mon Oct 25 12:24:33
5792777             prdh107    Running     1  4:00:00:00  Mon Oct 25 12:24:33

  1133 Active Jobs    1133 of 1640 Processors Active (69.09%)
                       225 of  243 Nodes Active      (92.59%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1146   Active Jobs: 1146   Idle Jobs: 0   Blocked Jobs: 0
-- ChrisBrew - 2010-10-25