Support_complex_provider_selection_algorithms
This description is based on Angus’ email sent to the mailing list and the follow-up discussion:
We should give administrators the ability to configure the launch-time provider account selection policy for a specific pool, and to set a global default policy that applies to pools where no custom policy is defined.
The policy would be applied after a set of valid matches has been identified. For the purposes of this document, a “match” is a particular combination of a provider account, provider hardware profile, provider realm, and provider image. The valid matches are those that meet the minimum requirements for launch (maps are in place, images have been pushed, quota is not exceeded, etc.).
The selection policy will perform a two-level prioritization, allowing both strict preferences of matches and probabilistic distribution. The former will be needed to handle “ec2 spillover” sorts of use cases, and the latter will be useful to distribute load across providers, taking relative capacity into account. Policies can make use of one or both of these ranking styles.
The selection policy will first apply priority ranking, assigning a priority rank to each match. If two or more matches have the same rank, then a probability ranking is done within this rank. Another way of looking at this is that the matches will be arranged in a list of “priority groups”, with each match in the same group having the same priority. Within each priority group, the selection policy should define a probability distribution, stating how likely each provider account is to be selected to host the new deployment, expressed as a percentage. Once those percentages are calculated, Conductor should pick a random number between 1 and 100 and attempt to launch on the lucky provider account.
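The two-level selection described above could be sketched as follows. This is a minimal illustration rather than Conductor's implementation: the function names, the (match, priority, weight) tuple layout, and the convention that a lower priority number is preferred are all assumptions made for the example.

```python
import random
from collections import defaultdict


def top_priority_group(ranked):
    """Return the (match, weight) pairs that share the best priority rank.

    `ranked` holds (match, priority, weight) tuples.
    Assumption: a lower priority number means "more preferred".
    """
    groups = defaultdict(list)
    for match, priority, weight in ranked:
        groups[priority].append((match, weight))
    return groups[min(groups)]


def to_percentages(group):
    """Turn positive weights into percentages that sum to 100."""
    total = sum(weight for _, weight in group)
    return [(match, 100.0 * weight / total) for match, weight in group]


def pick(percentages, roll):
    """Map a roll between 1 and 100 onto the cumulative percentage ranges."""
    cumulative = 0.0
    for match, pct in percentages:
        cumulative += pct
        if roll <= cumulative:
            return match
    return percentages[-1][0]  # guard against floating-point rounding


def select_match(ranked, rng=random):
    """Priority ranking first, then a random roll within the top group."""
    percentages = to_percentages(top_priority_group(ranked))
    return pick(percentages, rng.randint(1, 100))
```

With three private accounts in priority group 1 (weights 2, 1, 1) and an ec2 account in group 2, rolls of 1 to 50 select the first private account, 51 to 75 the second, and 76 to 100 the third; the ec2 account is only considered once no group 1 match remains valid.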
Using a probability range and randomly selecting within that range might seem counter-intuitive: having done the maths and assigned a numerical probability to each provider account, based on its suitability to host the deployment, why not just launch on the “best” provider? The issue is one of scale. When considering a single launch, selecting the account which gathered the highest score makes sense, but once Conductor is managing a large volume of deployments, the downside of that approach becomes clear: if one provider account gathered more than 50% of the probability ranking, then without the randomness it would get 100% of the instances. With the hybrid two-level approach we can handle the “probability distribution” at launch for those providers that we intend to make regular use of, while still retaining the ability to use strict priority ordering for a true spillover case (i.e. a match that should only be used if the “better” options are no longer available).
Priority-only policies are possible by simply making every match its own priority group. The probability distribution phase then becomes a selection with only one option, which results in selection of the highest-priority match; when that match is filled up, the next will be used for subsequent launches. Note that this is not the same as the current static “priority” field in the provider account. This is part of the pluggable policy, so priority ranking can be just as dynamic in its definition as probability ranking.
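A rough calculation makes the scale argument concrete. The helper below is hypothetical and only computes expected instance counts; the account names and the 60/40 split are invented for illustration.

```python
def expected_instances(percentages, launches, always_best=False):
    """Expected number of instances per account after `launches` launches.

    always_best=True models "just launch on the highest-scoring account";
    otherwise each launch is drawn according to the percentages.
    """
    if always_best:
        best = max(percentages, key=percentages.get)
        return {a: (launches if a == best else 0) for a in percentages}
    return {a: launches * p / 100 for a, p in percentages.items()}
```

For a 60/40 split over 1000 launches, the always-best rule puts all 1000 instances on the first account, whereas the random draw is expected to place roughly 600 and 400.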
Probability-only policies are possible by placing all matches in a single priority group. In this case, all matches are considered under the probability match rules.
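Both special cases fall out of how matches are grouped. The sketch below uses hypothetical helper names and the same assumed (match, priority, weight) layout, with a lower priority number being preferred.

```python
def priority_only(matches_in_order):
    """Every match becomes its own priority group (rank 1 is tried first)."""
    return [(match, rank, 1) for rank, match in enumerate(matches_in_order, start=1)]


def probability_only(weights):
    """All matches share priority group 1; only the weights differ."""
    return [(match, 1, weight) for match, weight in weights.items()]


def top_group(ranked):
    """Matches that would enter the probability phase."""
    best = min(priority for _, priority, _ in ranked)
    return [match for match, priority, _ in ranked if priority == best]
```

With `priority_only(["private", "ec2"])` the probability phase sees only `"private"`; once that match drops out of the valid set, the same policy applied to the remaining matches yields `"ec2"`. With `probability_only`, every match reaches the probability phase.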
Whilst the various policies should be stackable, one of the two following policies should be the initial basis for the calculation:
With this probability-only policy (round robin), Conductor would use each of the available provider accounts equally, by assigning the same probability to each of them. Varying the probabilities, to assign a weighting, would be useful in instances where the private cloud providers associated with each provider account are of differing sizes, e.g. three vSphere clusters, one of which has double the capacity of the other two. In that circumstance, the Administrator could adjust the weighting ratios to more closely reflect the actual capacities of each cluster.
It is worth noting that this isn’t strictly round robin. The provider accounts wouldn’t be selected in strict rotation, though the overall result is the same.
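As an illustration of the weighting, here is a small sketch. The function name and the `weights` argument are assumptions; accounts without an explicit weight default to 1.

```python
def round_robin_percentages(accounts, weights=None):
    """Equal shares by default; optional per-account weights skew the split."""
    weights = weights or {}
    raw = {account: weights.get(account, 1) for account in accounts}
    total = sum(raw.values())
    return {account: 100.0 * w / total for account, w in raw.items()}
```

For three vSphere clusters where one has double the capacity of the other two, passing a weight of 2 for the large cluster gives it 50% and each of the others 25%.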
This probability-only policy would make most sense in scenarios where Conductor is the sole means by which instances are launched on private cloud providers. Conductor would seek to ensure that the usage of the providers was balanced, by giving a higher probability to whichever provider accounts are currently least used. As with round robin, the weightings could be adjusted to reflect differing capacities between providers.
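One possible way to express this, assuming Conductor knows how many instances each account is running and an administrator-supplied capacity figure, is to weight each account by its remaining headroom. The names and the headroom formula are assumptions, not part of the proposal.

```python
def least_used_percentages(running, capacity):
    """Weight each account by free headroom (capacity minus running instances)."""
    headroom = {a: max(capacity[a] - running[a], 0) for a in running}
    total = sum(headroom.values())
    if total == 0:
        # Nothing has headroom left: fall back to equal shares.
        return {a: 100.0 / len(running) for a in running}
    return {a: 100.0 * h / total for a, h in headroom.items()}
```

Two accounts with a capacity of 50 each, running 10 and 40 instances, would receive 80% and 20% respectively.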
Having used one of those two policies to acquire an initial set of probabilities, administrators could then elect to apply additional policies, including:
The probability assigned to each provider account would be adjusted according to the provider accounts’ priority, by increasing the probability ranking percentage of the higher priority provider accounts at the expense of the lower priority ones. This could be done either as a strict priority ranking or as a probabilistic one.
Once the audit history records past failures, for each occurrence of a launch failure within a configurable period (6 hours feels reasonable), a provider account would be fined 5% from its probability ranking. This would serve to reduce the attempts to launch on a provider which is running out of capacity, or experiencing hardware issues etc. This is another probability-only policy.
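A sketch of such a penalty module follows. It assumes the fine is 5 percentage points per failure, that a fined account cannot go below zero, and that the result is renormalised so the total is 100 again; the function name and the shape of the failure history are hypothetical.

```python
from datetime import datetime, timedelta


def apply_failure_penalty(percentages, failures, now,
                          window=timedelta(hours=6), fine=5.0):
    """Deduct `fine` points per launch failure seen within `window`.

    `failures` maps an account to a list of failure timestamps.
    """
    cutoff = now - window
    penalised = {}
    for account, pct in percentages.items():
        recent = sum(1 for t in failures.get(account, []) if t >= cutoff)
        penalised[account] = max(pct - fine * recent, 0.0)
    total = sum(penalised.values())
    if total == 0:
        return dict(percentages)  # everyone fined to zero: leave unchanged
    return {a: 100.0 * p / total for a, p in penalised.items()}
```

Starting from a 50/50 split, two recent failures on the first account reduce it to 40 points against 50, which renormalises to roughly 44% and 56%. A failure older than the window is ignored.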
There are three principal cloud uses which can incur costs: consumption of network bandwidth, consumption of storage and running a VM.
Happily, only one of these needs to be a factor when Conductor is selecting a provider account to launch: the cost of running the VM.
The amount of network bandwidth that a deployment will consume is pretty much unknowable at launch time. If it is known (because, for example, a deployment is for a streaming media server), then Administrators can minimize costs by only launching that deployment in a “low cost bandwidth” pool.
As long as we’re not supporting deployments which include the allocation of additional storage, the costs of storage consumption are an issue to consider at build & push time, rather than at launch time.
So, in order to allow cost to be another factor which affects the probability rankings, all we need is a cost per realm, per hour, for each provider hardware profile, for each provider account.
Admins are going to have to enter that data themselves. That’s not as onerous as it sounds, given that, for example, it will often be the case that costs will not vary across realms, so the UI can help by pre-filling.
Clearly, for private clouds, no alternative means for getting pricing data into Conductor exists. For public providers, it would be beneficial if their APIs exported list pricing, however:
- Few organizations which operate on a scale which justifies using Aeolus are likely to be paying list price.
- Organizations may wish to store and export the adjusted costs that they’ll be assigning to users, rather than the basic costs appearing on the provider’s monthly invoice.
Once the cost data is in Conductor, adjusting the probabilities of each provider account to favour whichever provider could more cheaply host the specific range of hardware profiles would be a relatively simple matter of increasing the selection probability percentages of cheaper provider accounts, by a configurable amount, at the expense of the more costly provider accounts.
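One way this adjustment could look is sketched below. The inverse-cost weighting and the `strength` knob (standing in for the "configurable amount") are assumptions chosen for illustration; costs are assumed to be positive hourly figures entered by the administrator.

```python
def apply_cost_bias(percentages, hourly_cost, strength=1.0):
    """Shift probability towards cheaper accounts.

    Each percentage is divided by cost ** strength and the result is
    renormalised to 100. strength=0 leaves the percentages unchanged.
    """
    weighted = {a: p / (hourly_cost[a] ** strength) for a, p in percentages.items()}
    total = sum(weighted.values())
    return {a: 100.0 * w / total for a, w in weighted.items()}
```

With an even 50/50 split and one account costing three times as much per hour as the other, `strength=1.0` moves the split to about 75/25 in favour of the cheaper account.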
Cost is another one of those areas which may lend itself to priority-only ranking. If your goal is to launch on the lowest-cost match, it makes more sense to fully utilize the lower-cost options before moving to higher-cost matches.
Having completed the stack of modules’ calculations, the result is a final set of probabilities. At this point, Conductor would roll the loaded dice and attempt to launch on the winner.
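The stacking itself can be pictured as a simple pipeline: a base policy produces the initial percentages, each module transforms them in turn, and a weighted random draw picks the winner. All names below are hypothetical, and renormalising after every module is an assumption.

```python
import random


def normalise(percentages):
    """Rescale so the values sum to 100."""
    total = sum(percentages.values())
    return {a: 100.0 * p / total for a, p in percentages.items()}


def equal_shares(accounts):
    """Base policy: every account gets the same share."""
    return {a: 100.0 / len(accounts) for a in accounts}


def run_stack(accounts, base_policy, modules):
    """Apply the base policy, then each module in order."""
    percentages = normalise(base_policy(accounts))
    for module in modules:
        percentages = normalise(module(percentages))
    return percentages


def roll(percentages, rng=random):
    """Weighted random draw over the final percentages."""
    accounts = list(percentages)
    weights = [percentages[a] for a in accounts]
    return rng.choices(accounts, weights=weights, k=1)[0]
```

Each module is just a function from percentages to percentages, so enabling, disabling, or reordering modules only changes the list passed to `run_stack`.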
The UI to allow Administrators to enable modules, and to tune the parameters associated with them, could give a real-time representation of the effect of the current settings for a specific deployable. A certain type of Administrator would be very happy, tuning options and seeing an immediate change in, for example, a pie chart, which showed the resultant probability ranking percentages.
In future, we could provide Administrators with an interface to implement their own selection modules. They might choose, for example, to vary the selection probability percentages according to time and date, to increase usage of private clouds at times when they would otherwise be relatively idle.
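As a purely hypothetical example of such a custom module, the sketch below boosts private cloud accounts during quiet hours. The quiet-hour range, the boost factor, and the function signature are all invented for illustration; no module interface has been defined yet.

```python
from datetime import datetime


def off_peak_boost(percentages, now, private_accounts,
                   factor=2.0, quiet_hours=range(0, 6)):
    """Multiply private accounts' share by `factor` during quiet hours."""
    if now.hour not in quiet_hours:
        return dict(percentages)
    weighted = {
        a: p * (factor if a in private_accounts else 1.0)
        for a, p in percentages.items()
    }
    total = sum(weighted.values())
    return {a: 100.0 * w / total for a, w in weighted.items()}
```

At 02:00 with `factor=3.0`, an even split between a vSphere account and an ec2 account would become 75/25; at 14:00 the percentages pass through unchanged.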
Thread on aeolus-devel: https://fedorahosted.org/pipermail/aeolus-devel/2012-April/009903.html