Job Priority Plugin Design

Don Lipari edited this page Oct 19, 2017 · 3 revisions

The job priority plugin provides advanced pending job queue prioritization to a Flux scheduler. It is an optional component whose sole purpose is to modify the priority field of a pending job record, and thereby influence its scheduling.

Without the job priority plugin loaded, the job's default priority value remains as initially set. The default initial setting of a job's priority value is one unit less than the priority of the last submitted job. With this scheme, and assuming the very first job ever submitted starts with a very large priority value (e.g., 0xFFFFFFFF), the pending job queue will be prioritized in a first-come, first-served (FCFS) order.

This default solution for setting the job's initial priority will serve basic scheduling needs. For example, if a user loads a scheduler into their instance that will schedule uncertainty quantification ensembles, then a FCFS scheduling order could be all that is needed. For this application, no priority plugin would need to be loaded.
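
The default scheme can be sketched in a few lines. This is a minimal illustration, not Flux code: the class name `DefaultPriorityAssigner` and the in-memory job list are hypothetical, and only the starting value 0xFFFFFFFF comes from the description above.

```python
INITIAL_PRIORITY = 0xFFFFFFFF  # very large starting value, as in the text


class DefaultPriorityAssigner:
    """Hypothetical helper: each new job gets one unit less than the last."""

    def __init__(self, start=INITIAL_PRIORITY):
        self._next = start

    def assign(self):
        priority = self._next
        self._next -= 1
        return priority


assigner = DefaultPriorityAssigner()

# Jobs submitted in the order a, b, c.
queue = [(name, assigner.assign()) for name in ("a", "b", "c")]

# Sorting by descending priority reproduces the submission order (FCFS).
ordered = [name for name, _ in sorted(queue, key=lambda jp: jp[1], reverse=True)]
print(ordered)
```

Because the priority only ever decreases, no plugin is needed to keep the queue in submission order.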

Multi-Factor Plugin

A job priority plugin is needed to service scheduling needs that go beyond FCFS. Such a plugin provides a way to tailor the utilization of a cluster's resources to meet desired goals: different job usage profiles can be achieved by raising or lowering a job's position in the pending queue. For example, on clusters procured to service large parallel jobs, larger jobs would be conferred a higher priority. Similarly, urgent work that needs ready access to the cluster would receive a higher priority. Priority could also be used to provide equitable access to resources: a job from a user who has not submitted any jobs this month would be granted priority over a job from a user who has already run multiple jobs.

The basic inputs of the job priority plugin are the parameters which contribute components to the prioritization algorithm. The proposed algorithm calls for the weighted sum of a number of factors. There are currently six factors, but the design can accommodate more if and when necessary:

job priority = ( wait_time_weight * wait_time_factor ) +
               ( fair_share_weight * fair_share_factor ) +
               ( qos_weight * qos_factor ) +
               ( queue_weight * queue_factor ) +
               ( job_size_weight * job_size_factor ) +
               ( user_weight * user_factor )
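
As a worked example of the formula, the sketch below evaluates the weighted sum for one job. The weights and factor values are made up for illustration, and normalizing each factor to the range 0.0 to 1.0 is an assumption (it mirrors Slurm's multifactor plugin), not something this design specifies.

```python
# Hypothetical weights; a real deployment would read these from configuration.
WEIGHTS = {
    "wait_time": 1000,
    "fair_share": 10000,
    "qos": 5000,
    "queue": 0,
    "job_size": 2000,
    "user": 100,
}


def job_priority(factors, weights=WEIGHTS):
    """Return the weighted sum of the six factors for one job."""
    return sum(weights[name] * factors[name] for name in weights)


# Hypothetical factor values, assumed to be normalized to 0.0 .. 1.0.
factors = {
    "wait_time": 0.5,
    "fair_share": 0.25,
    "qos": 1.0,
    "queue": 0.0,
    "job_size": 0.5,
    "user": 0.0,
}

priority = job_priority(factors)
print(priority)  # 500 + 2500 + 5000 + 0 + 1000 + 0 = 9000.0
```

Setting a weight to zero, as with `queue` here, removes that factor from consideration without changing the code.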

The operation of the priority plugin is independent from the scheduling module. The scheduling module just needs to see a prioritized job queue at the start of every scheduling loop. The scheduler could, at the start of its scheduling loop, trigger the job priority plugin to update the priority fields of all the pending job records. Or, the priority plugin could independently update job priorities of pending jobs on a periodic basis. The scheduler would grab a snapshot of the latest pending job queue prior to the start of its scheduling loop.

Inputs to the Job Priority Plugin

The job priority plugin needs the following inputs. Inputs other than the job record can come from a database or a file. Adapters for pulling in these inputs will need to be written, whether that means reading a flat file, querying a database, or running a command.

  1. From the job record:
  • The user who submitted the job
  • The time the job was submitted
  • The size of the job (nominally number of cores or nodes)
  • The priority value the user specified for the job
  • The charge account the user specified for the job
  • The Quality of Service (QoS) the user specified for the job
  2. The assigned shares for each record in the charge account / user hierarchy.
  3. A record of computing resource usage charged to each account / user from that hierarchy.
  4. The factor assigned for each Quality of Service (QoS).
  5. The configured weights for each of the components in the formula above.

Outputs of the Job Priority Plugin

The primary output of the job priority plugin is the job priority calculated for each job using the above formula.

To provide transparency to the user and demonstrate how the job priority was calculated for every job, each component from the priority formula needs to be provided to a client tool the user can run on demand. This tool would provide output very similar to Slurm's sprio command:

  JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS        USER
6253241 billybob    1294572       7027     271920      15625          0    1000000           0
6266688 billybob    1277696        568     271920       5208          0    1000000           0
6266856   marsha    1485389        508     469256      15625          0    1000000           0
6266918   arnold    1276426        475     272697       3255          0    1000000           0
6267641    tammy    1297931        226     294451       3255          0    1000000           0
6267677      sue    1553224        205     532186      20833          0    1000000           0
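
In the sample rows above, the PRIORITY column appears to be the sum of the six component columns, which suggests that each column shows an already weighted component. The sketch below checks this against the sample data. The parsing helper `parse_row` is hypothetical, and the explanation that the occasional difference of one comes from rounding for display is an inference, not something stated here.

```python
SAMPLE = """\
6253241 billybob    1294572       7027     271920      15625          0    1000000           0
6266688 billybob    1277696        568     271920       5208          0    1000000           0
6266856   marsha    1485389        508     469256      15625          0    1000000           0
6266918   arnold    1276426        475     272697       3255          0    1000000           0
6267641    tammy    1297931        226     294451       3255          0    1000000           0
6267677      sue    1553224        205     532186      20833          0    1000000           0
"""


def parse_row(line):
    """Split one row into (jobid, user, priority, [components])."""
    fields = line.split()
    priority, *components = (int(f) for f in fields[2:])
    return fields[0], fields[1], priority, components


rows = [parse_row(line) for line in SAMPLE.splitlines()]

# Difference between the reported priority and the sum of its components.
diffs = {jobid: priority - sum(components) for jobid, _, priority, components in rows}

for jobid, diff in diffs.items():
    print(jobid, diff)
```

A client tool that prints every component alongside the total lets a user run this kind of check by eye.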

Making It Generic

While the above design demonstrates the basic workings of a job priority plugin, it could be made more generic. The job priority formula could be rewritten as:

job priority = ( weight[0] * factor[0] ) +
               ( weight[1] * factor[1] ) +
               ( weight[2] * factor[2] ) +
               ( weight[3] * factor[3] ) +
               ( weight[4] * factor[4] ) + ...

and a name[] array could supply the column headings to the job priority retrieval client (a generic version of sprio).

Then, the weights and names for each element of the above arrays would need to be input into the priority plugin. In addition, a mechanism would be needed to provide (or compute) every value in the factor array.
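
One possible mechanism for supplying the names, weights, and factor values together is a table of components, each carrying a function that computes its factor from the job record. The sketch below assumes this approach. The `Component` type, the job record fields (`wait_seconds`, `nodes`, `cluster_nodes`), the weights, and the two factor functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    name: str                          # column heading for the priority client
    weight: float                      # configured weight
    compute: Callable[[Dict], float]   # derives the factor from a job record


# Hypothetical components; more can be appended without touching the formula.
COMPONENTS: List[Component] = [
    Component("AGE", 1000, lambda job: min(job["wait_seconds"] / 86400, 1.0)),
    Component("JOBSIZE", 2000, lambda job: job["nodes"] / job["cluster_nodes"]),
]


def factor_array(job: Dict) -> List[float]:
    return [c.compute(job) for c in COMPONENTS]


def job_priority(job: Dict) -> float:
    return sum(c.weight * f for c, f in zip(COMPONENTS, factor_array(job)))


def headings() -> List[str]:
    return [c.name for c in COMPONENTS]


job = {"wait_seconds": 43200, "nodes": 16, "cluster_nodes": 64}
print(headings(), factor_array(job), job_priority(job))
```

Keeping the name, weight, and factor function in one record means the three arrays cannot drift out of step when a component is added or removed.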

Job Priority Plugin Design

The following pseudo-code illustrates the operation of the job priority plugin:

weight[] = load_weight_array()         // read each weight from config file, command-line arguments, etc.
loop()                                 // once per pending job
  j = read_job_record ()
  priority = 0
  factor[] = load_factor_array ( j )   // compute or populate each element of the factor array
                                       // based on info from the job record and external sources,
                                       // e.g., charge account / user hierarchy from Slurm db
  for each i in 0 .. num_factors - 1
    priority += weight[i] * factor[i]
  write_job_priority (j, priority)     // update priority field of job record
  save_factor_array (j, factor)        // for display by priority client
// end job loop()

A Note About Synchronicity

The scheduler loop is event driven. There is no need to initiate a scheduling cycle unless one of the following events has occurred:

  1. One or more jobs have been submitted
  2. One or more jobs have terminated
  3. The resources in the Flux instance have grown or shrunk
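
The event-driven behavior can be sketched as a function that runs a scheduling cycle only when at least one event is waiting, and that coalesces several waiting events into a single cycle. This is an illustration only: the `SchedEvent` names, the use of Python's `queue.Queue`, and the `run_pending_cycle` helper are assumptions, not part of Flux.

```python
import queue
from enum import Enum, auto


class SchedEvent(Enum):
    JOB_SUBMITTED = auto()
    JOB_TERMINATED = auto()
    RESOURCES_CHANGED = auto()


def run_pending_cycle(events, schedule):
    """Drain waiting events; run one scheduling cycle if any were found."""
    seen = []
    while True:
        try:
            seen.append(events.get_nowait())
        except queue.Empty:
            break
    if seen:
        schedule()  # one cycle covers every event collected above
    return seen


cycles = []
events = queue.Queue()

# No events: no scheduling cycle is started.
run_pending_cycle(events, lambda: cycles.append("cycle"))

# Two events arrive before the scheduler looks again: one cycle handles both.
events.put(SchedEvent.JOB_SUBMITTED)
events.put(SchedEvent.JOB_TERMINATED)
handled = run_pending_cycle(events, lambda: cycles.append("cycle"))
print(len(cycles), handled)
```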

Once the scheduling loop has begun, it must visit potentially thousands of pending jobs and find and select idle resources to allocate or reserve to each job. This leads to the question of whether to make the job prioritization loop described above an integral part of the scheduling loop.

Traditional schedulers do just that. At the start of every scheduling cycle, the job priority value is recalculated for every pending job. Once that activity completes, the scheduling loop is entered and resources are sought for every pending job in the queue.

Asynchronous Proposal

Given that the activities of the job priority plugin are entirely independent from the scheduling activities of the scheduler, we have an opportunity to increase overall performance by decoupling these activities. If the job prioritization activity runs in its own thread, then the scheduling activities do not have to wait for the pending queue to be re-prioritized. The scheduler can just grab the latest pending job queue, sort it by the priority value, and start scheduling resources for each job.
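
The decoupling can be sketched with a prioritization thread that updates a shared pending queue while the scheduler takes sorted snapshots of it. This is a minimal sketch under stated assumptions: the `PendingQueue` class, the lock-protected dictionary of job records, the update interval, and the stand-in priority function are all hypothetical, and a real Flux module would communicate through messages rather than shared memory.

```python
import threading


class PendingQueue:
    """Hypothetical shared store of pending job records."""

    def __init__(self, jobs):
        self._lock = threading.Lock()
        self._jobs = {job["id"]: dict(job) for job in jobs}

    def update_priorities(self, compute):
        with self._lock:
            for job in self._jobs.values():
                job["priority"] = compute(job)

    def snapshot(self):
        """Copy the latest records and sort them, highest priority first."""
        with self._lock:
            jobs = [dict(job) for job in self._jobs.values()]
        return sorted(jobs, key=lambda job: job["priority"], reverse=True)


def prioritizer(pending, compute, stop, interval=0.01):
    """Re-prioritize periodically, independent of any scheduling cycle."""
    while True:
        pending.update_priorities(compute)
        if stop.wait(interval):  # returns True once stop has been set
            break


pending = PendingQueue([
    {"id": 1, "nodes": 2, "priority": 0},
    {"id": 2, "nodes": 8, "priority": 0},
])
stop = threading.Event()
worker = threading.Thread(
    target=prioritizer,
    args=(pending, lambda job: job["nodes"] * 100, stop),  # stand-in formula
)
worker.start()
stop.set()
worker.join()

# The scheduler never computes priorities itself; it only reads a snapshot.
order = [job["id"] for job in pending.snapshot()]
print(order)
```

The scheduler holds the lock only long enough to copy the records, so a long prioritization pass delays it by at most one update rather than by a full recalculation at the start of every cycle.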

This argues for elevating the job priority plugin to a standard Flux module rather than keeping it as just a plugin to the sched module. One could object that running a job priority calculation multiple times between scheduling events is wasteful. What is the point of recalculating the priorities of all the jobs in the pending queue if none of the three events listed above has occurred?

The answer is that these potentially wasted cycles can be offset by a faster scheduling loop, since the potentially expensive delay of recalculating job priorities at the start of each scheduling cycle is eliminated. If the two activities are decoupled, multiple scheduling cycles could run faster than they do in traditional schedulers. And with the enhanced graph-aware scheduler that Flux offers, performance gains such as these could prove to be quite welcome.