forked from jbarber/maui-admin-guide
-
Notifications
You must be signed in to change notification settings - Fork 0
/
11.1jobholds.html
103 lines (76 loc) · 8.95 KB
/
11.1jobholds.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<title></title>
</head>
<body>
<div class="sright">
<div class="sub-content-head">
Maui Scheduler
</div>
<div id="sub-content-rpt" class="sub-content-rpt">
<div class="tab-container docs" id="tab-container">
<div class="topNav">
<div class="docsSearch"></div>
<div class="navIcons topIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="11.0generaljobadmin.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="11.0generaljobadmin.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="11.2jobpriority.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
<h1>11.1 Job Holds</h1>
<h2>Holds and Deferred Jobs</h2>
<p>A job <i>hold</i> is a mechanism by which a job is placed in a state where it is not eligible to be run. Maui supports job holds applied by users, admins, and even resource managers. These holds can be seen in the output of the showq and checkjob commands. A job with a hold placed on it cannot be run until the hold is removed. If a hold is placed on a job via the resource manager, this hold must be released by the resource manager provided command (i.e., llhold for Loadleveler, or qhold for PBS).</p>
<p>Maui supports two other types of holds. The first is a temporary hold known as a 'defer'. A job is deferred if the scheduler determines that it cannot run. This can be because it asks for resources which do not currently exist, does not have allocations to run, is rejected by the resource manager, repeatedly fails after start up, etc. Each time a job gets deferred, it will stay that way, unable to run for a period of time specified by the <a href="a.fparameters.html#defertime">DEFERTIME</a> parameter. If a job appears with a state of deferred, it indicates one of the previously mentioned failures has occurred. Details regarding the failure are available by issuing the '<a href="commands/checkjob.html">checkjob</a> <JOBID>' command. Once the time specified by DEFERTIME has elapsed, the job is automatically released and the scheduler again attempts to schedule it. The 'defer' mechanism can be disabled by setting DEFERTIME to '0'. To release a job from the defer state, issue '<a href="commands/releasehold.html">releasehold</a> -a <JOBID>'.</p>
<p>The second 'Maui-specific' type of hold is known as a 'batch' hold. A batch hold is only applied by the scheduler and is only applied after a serious or repeated job failure. If a job has been deferred and released DEFERCOUNT times, Maui will place it in a batch hold. It will remain in this hold until a scheduler admin examines it and takes appropriate action. Like the defer state, the causes of a batch hold can be determined via <a href="commands/checkjob.html">checkjob</a> and the hold can be released via <a href="commands/releasehold.html">releasehold</a>.</p>
<p>Like most schedulers, Maui supports the concept of a job hold. Actually, Maui supports four distinct types of holds, <i>user</i> holds, <i>system</i> holds, <i>batch</i> holds, and <i>defer</i> holds. Each of these holds effectively block a job, preventing it from running, until the hold is removed.</p>
<h2>User Holds</h2>
<p>User holds are very straightforward. Many, if not most, resource managers provide interfaces by which users can place a hold on their own job which basically tells the scheduler not to run the job while the hold is in place. The user may utilize this capability because the job's data is not yet ready, or he wants to be present when the job runs so as to monitor results. Such user holds are created by, and under the control of a non-privileged and may be removed at any time by that user. As would be expected, users can only place holds on their jobs. Jobs with a user hold in place will have a Maui state of <b>Hold</b> or <b>UserHold</b> depending on the resource manager being used.</p>
<h2>System Holds</h2>
<p>The second category of hold is the system hold. This hold is put in place by a system administrator either manually or by way of an automated tool. As with all holds, the job is not allowed to run so long as this hold is in place. A batch administrator can place and release system holds on any job regardless of job ownership. However, unlike a user hold, a normal user cannot release a system hold even on his own jobs. System holds are often used during system maintenance and to prevent particular jobs from running in accordance with current system needs. Jobs with a system hold in place will have a Maui state of <b>Hold</b> or <b>SystemHold</b> depending on the resource manager being used.</p>
<h2>Batch Holds</h2>
<p>Batch holds constitute the third category of job holds. These holds are placed on a job by the scheduler itself when it determines that a job cannot run. The reasons for this vary but can be displayed by issuing the '<b><a href="commands/checkjob.html">checkjob</a> <JOBID></b>' command. Some of the possible reasons are listed below:</p>
<p>No Resources - the job requests resources of a type or amount that do not exist on the system<br>
System Limits - the job is larger or longer than what is allowed by the specified system policies<br>
Bank Failure - the allocations bank is experiencing failures<br>
No Allocations - the job requests use of an account which is out of allocations and no fallback account has been specified<br>
RM Reject - the resource manager refuses to start the job<br>
RM Failure - the resource manager is experiencing failures<br>
Policy Violation - the job violates certain throttling policies preventing it from running now and in the future<br>
No QOS Access - the job does not have access to the QOS level it requests</p>
<p>Jobs which are placed in a batch hold will show up within Maui in the state <b>BatchHold</b>.</p>
<h2>Job Defer</h2>
<p>In most cases, a job violating these policies will not be placed into a batch hold immediately. Rather, it will be deferred. The parameter <a href="a.fparameters.html#defertime">DEFERTIME</a> indicates how long it will be deferred. At this time, it will be allowed back into the idle queue and again considered for scheduling. If it again is unable to run at that time or at any time in the future, it is again deferred for the timeframe specified by DEFERTIME. A job will be released and deferred up to <a href="a.fparameters.html#defercount">DEFERCOUNT</a> times at which point the scheduler places a batch hold on the job and waits for a system administrator to determine the correct course of action. Deferred jobs will have a Maui state of <b>Deferred</b>. As with jobs in the BatchHold state, the reason the job was deferred can be determined by use of the <b>checkjob</b> command.</p>
<p>At any time, a job can be released from any hold or deferred state using the '<a href="commands/releasehold.html">releasehold</a>' command. The Maui logs should provide detailed information about the cause of any batch hold or job deferral.</p>
<table class="note">
<tbody>
<tr>
<td class="noteIMG"><img src="note.png" title="Note" alt="Note"></td>
<td class="noteDetail">As of Maui 3.0.7, the reason a job is deferred or placed in a batch hold is stored in memory but is not checkpointed. Thus this info is available only until Maui is recycled at which point the checkjob command will no longer display this 'reason' info.</td>
</tr>
</tbody>
</table>
<hr width="100%">
(under construction)
<p>Controlling Backfill Reservation Behavior<br>
Reservation Thresholds<br>
Reservation Depth<br>
Resource Allocation Method<br>
First Available<br>
Min Resource<br>
Last Available<br>
WallClock Limit<br>
Allowing jobs to exceed wallclock limit<br>
MAXJOBOVERRUN<br>
Using Machine Speed for WallClock limit scaling<br>
USEMACHINESPEED<br>
Controlling Node Access<br>
NODEACCESSPOLICY</p>
<div class="navIcons bottomIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="11.0generaljobadmin.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="11.0generaljobadmin.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="11.2jobpriority.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
</div>
</div>
</div>
<div class="sub-content-btm"></div>
</div>
</body>
</html>