Skip to content

Note About The Flux Fair Share Algorithm

Don Lipari edited this page Nov 18, 2017 · 3 revisions

The fair-share algorithm goes all the way back to LCRM. Danny Auble and Don Lipari studied that implementation and devised a variant which we added to Slurm. Lipari wrote up the original description of that algorithm which is now here: https://slurm.schedmd.com/priority_multifactor.html

This algorithm created a fair-share priority factor that was proportional to the difference between the shares of the cluster assigned to a user and charge account (called an association) and the historical job usage charged to that user/account (measured in allocated cpu-time, but later extended to include any computing resource - i.e., TRES).

While this algorithm worked fine for most cases, users discovered corner cases where things were not fair. There were multiple modifications that were added to the original algorithm over the years to address these issues.

The Fair-Tree enhancement devised by developers at BYU was the last and most successful effort to address the problems: https://slurm.schedmd.com/fair_tree.html . It enacted a very simple but effective rule: the fair share factor of every descendant of a charge account should be no greater than the parent’s value. Their algorithm visits every pending job and assigns a priority factor which not only reflects the difference between the shares and usage, but also insures that fair-share factors of child associations do not exceed the value of their parents.

The success of this algorithm obsoleted some of the earlier fixes, and these earlier fixes were removed from Slurm. LLNL saw the value of the fair-tree solution and activated it across all of our clusters where it remains to this day.

The only complaint about the use of the fair-tree algorithm was in user-understanding. Users require transparency in knowing how their job’s priority value was determined. The sprio utility was designed to display the breakdown of every component to the job’s priority calculation. This included a the fair-share factor.

The sshare utility was designed to display how the fair-share factor was calculated. This tool lists a line for each association, i.e, user under each charge account they have permission to charge (as entered into the slurmdbd). Each line shows the shares that user/account has been assigned and additional lines show all of the sibling and parent shares up the association hierarchy. sshare also displays the raw and normalized usage values which historical jobs have charged to each association.

The original sshare utility provided a straightforward presentation of all of the components and users could do the math and assure themselves that their job had indeed been assigned the correct fair-share factor. When the fair-tree algorithm is activated, the sshare output changes and the transparency into the algorithm is obfuscated.

This opens up a inevitable problem for the hotline consultants and managers who have to answer to user calls when they can’t figure out why there job does not have enough priority to place it higher in the queue.

My implementation for Flux is an enhancement to the fair-tree algorithm that adheres to the innovative fair-tree rule and thereby avoids the pitfalls of the original algorithm. In addition, it preserves the traditional components needed for an understandable (Flux equivalent of the) sshare output.

The methodology is (hopefully) more easy to understand. At the first level of associations in the hierarchy under root, each association is assigned a piece of the 0.0 to 1.0 fair-share factor range based on its shares and usage. If there were only three associations at that first level, you can imagine their fair-share factors could be: 0.3, 0.5, and 0.8. The higher the number, the more under-serviced the association is.

The children of each of these three associations must have fair-share factors that span the range assigned to their parent. Hence the fair-share factors of any and all descendants of the first association must have fair-share factors that range from 0.0 to 0.3. All descendants of the second association would fall between 0.3 and 0.5. The third, 0.5 to 0.8.

This behavior is repeated at each lower level in the hierarchy. Sibling children of the first level association will fall within the 0.0 to 0.3 range. This pattern repeats for each descendant all the way down to the user leaves of the tree.

As with the fair-tree algorithm, ties are allowed. There could be multiple sibling associations assigned the same fair-share factor. Such would commonly be the case, for example, for those associations with zero historical usage.

The Flux solution could indeed be retrofitted into the original Slurm implementation and eliminate the need for the fair-tree modifications. Perhaps someday someone will do this.