This software is pre-production and should not be deployed to production servers.
The resource allocation interface lets you provide a plugin with resource control logic. Such a component can enforce isolation based on platform and task resource usage metrics.
To enable the allocation feature, the agent has to be configured to use the AllocationRunner component.
The runner requires an Allocator to be provided. Allocation decisions are based on the results of the allocate method of the Allocator.
Example of a minimal configuration that uses AllocationRunner:
# Basic configuration to dump metrics on stderr with NOPAllocator
runner: !AllocationRunner
  measurement_runner: !MeasurementRunner
    node: !MesosNode
  allocator: !NOPAllocator
measurement_runner is responsible for discovering tasks running on node, provides this information to allocator and then reconfigures resources like CPU shares/quota, cache or memory bandwidth.
For more information about MeasurementRunner, please refer to Measurement API.
All information about existing allocations, detected anomalies and other metrics is stored in the corresponding storage classes.
Please refer to the API documentation of AllocationRunner for the full list of available parameters.
@dataclass
class AllocationConfiguration:
    # Default value for cpu.cfs_period_us [us] (used as denominator).
    cpu_quota_period: Numeric(1000, 1000000) = 1000

    # Multiplier of AllocationType.CPU_SHARES allocation value.
    # E.g. setting 'CPU_SHARES' to 2.0 will set 2000 shares effectively
    # in cgroup cpu controller.
    cpu_shares_unit: Numeric(1000, 1000000) = 1000

    # Default resource allocation for last level cache (L3) and memory bandwidth
    # for root RDT group.
    # Root RDT group is used as default group for all tasks, unless explicitly reconfigured by
    # allocator.
    # `None` (the default value) means no limit (effectively set to maximum available value).
    default_rdt_l3: Str = None
    default_rdt_mb: Str = None
An Allocator subclass must implement an allocate function with the following signature:
class Allocator(ABC):

    @abstractmethod
    def allocate(
            self,
            platform: Platform,
            tasks_data: TasksData
    ) -> (TasksAllocations, List[Anomaly], List[Metric]):
        ...
All argument types except TasksAllocations are documented in the detection document.
Both TaskAllocations and TasksAllocations structures are simple Python dict types defined as follows:
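from enum import Enum
from typing import Dict, Union

# The str mixin has to precede Enum in the list of base classes.
class AllocationType(str, Enum):
    QUOTA = 'cpu_quota'
    SHARES = 'cpu_shares'
    RDT = 'rdt'
    CPUSET_CPUS = 'cpuset_cpus'
    CPUSET_MEMS = 'cpuset_mems'
    CPUSET_MEMORY_MIGRATE = 'cpuset_memory_migrate'
    MIGRATE_PAGES = 'migrate_pages'

TaskId = str
# RDTAllocation is described together with the rdt allocation type.
TaskAllocations = Dict[AllocationType, Union[float, int, RDTAllocation]]
TasksAllocations = Dict[TaskId, TaskAllocations]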
# example
tasks_allocations = {
    'some-task-id': {
        'cpu_quota': 0.6,
        'cpu_shares': 0.8,
        'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff', mb='MB:0=20;1=5'),
    },
    'other-task-id': {
        'cpu_quota': 0.5,
        'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff', mb='MB:0=20;1=5'),
    },
    'one-another-task-id': {
        'cpu_quota': 0.7,
        'rdt': RDTAllocation(name='be_group', l3='L3:0=000ff;1=000ff', mb='MB:0=1;1=1'),
    },
    'another-task-with-own-rdtgroup': {
        'cpu_quota': 0.7,
        # "another-task-with-own-rdtgroup" will be used as `name`
        'rdt': RDTAllocation(l3='L3:0=000ff;1=000ff', mb='MB:0=1;1=1'),
    },
    # ...
}
Please refer to rdt allocation type for the definition of the RDTAllocation structure.
TasksAllocations is used as:
- an input representing the currently enforced configuration,
- an output representing the desired allocations that will be applied in the current AllocationRunner iteration.
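Because the same structure describes both the current and the desired state, the two can be compared directly. The following is a minimal sketch of such a comparison, not the agent's actual logic; the function name allocations_to_apply and the simplified type aliases are hypothetical and used only for illustration.

```python
from typing import Any, Dict

TaskId = str
# Simplified stand-ins for the types described above (allocation names as plain strings).
TaskAllocations = Dict[str, Any]
TasksAllocations = Dict[TaskId, TaskAllocations]


def allocations_to_apply(current: TasksAllocations,
                         desired: TasksAllocations) -> TasksAllocations:
    """Return only the desired values that differ from the current state."""
    changes: TasksAllocations = {}
    for task_id, task_desired in desired.items():
        task_current = current.get(task_id, {})
        changed = {
            name: value
            for name, value in task_desired.items()
            if task_current.get(name) != value
        }
        if changed:
            changes[task_id] = changed
    return changes


current = {'task-a': {'cpu_quota': 0.5, 'cpu_shares': 1.0}}
desired = {'task-a': {'cpu_quota': 0.5, 'cpu_shares': 0.8},
           'task-b': {'cpu_quota': 0.7}}
print(allocations_to_apply(current, desired))
# {'task-a': {'cpu_shares': 0.8}, 'task-b': {'cpu_quota': 0.7}}
```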
The allocate function does not need to return TaskAllocations for all tasks.
For omitted tasks, allocations will not be affected.
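As an illustration, here is a minimal sketch of an allocator that returns allocations only for a subset of tasks. It is self-contained, so Allocator is redefined as a simplified stand-in, and tasks_data is assumed to be a dict keyed by task id. The class BestEffortThrottler and its 'be-' prefix convention are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

TaskId = str
TasksAllocations = Dict[TaskId, Dict[str, Any]]


class Allocator(ABC):
    """Simplified stand-in for the interface described above."""

    @abstractmethod
    def allocate(self, platform: Any, tasks_data: Dict[TaskId, Any]
                 ) -> Tuple[TasksAllocations, List[Any], List[Any]]:
        ...


class BestEffortThrottler(Allocator):
    """Hypothetical allocator: limits CPU only for tasks with a given id prefix."""

    def __init__(self, prefix: str = 'be-', cpu_quota: float = 0.25):
        self.prefix = prefix
        self.cpu_quota = cpu_quota

    def allocate(self, platform, tasks_data):
        # Tasks that do not match the prefix are omitted, so they stay untouched.
        allocations = {
            task_id: {'cpu_quota': self.cpu_quota}
            for task_id in tasks_data
            if task_id.startswith(self.prefix)
        }
        # This sketch reports no anomalies and no additional metrics.
        return allocations, [], []


allocator = BestEffortThrottler()
allocations, anomalies, metrics = allocator.allocate(None, {'be-batch': {}, 'hp-web': {}})
print(allocations)  # {'be-batch': {'cpu_quota': 0.25}}
```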
AllocationRunner is stateless and relies on the operating system to store the state.
Note that if the agent is restarted, already applied allocations will not be reset (the current state of allocations on the system will be read and provided as input).
The following built-in allocation types are supported:
- cpu_quota (float) - CPU Bandwidth Control called quota (normalized),
- cpu_shares (float) - CPU shares for Linux CFS (normalized),
- rdt (RDTAllocation) - Intel RDT resources,
- cpuset_cpus (List[int]) - support for CPU pinning (requires a specific isolator for Mesos),
- cpuset_mems (List[int]) - support for memory pinning,
- cpuset_memory_migrate (bool) - cgroups-based memory migration to NUMA nodes,
- migrate_pages (int) - syscall-based memory migration to a NUMA node.
type: float
cpu_quota is normalized with respect to the whole system capacity (all logical processors) and is applied through the cgroups cpu subsystem using CFS bandwidth control.
Formula for calculating quota normalized to platform capacity:
effective_cpu_quota = cpu_quota * allocation_configuration.cpu_quota_period * platform.cpus
For example, with the default cpu_period set to 100ms on a machine with 16 logical processors, setting cpu_quota to 0.25 means a hard limit of a quarter of the available CPU resources, which effectively translates into a 400ms quota.
Note that setting cpu_quota:
- to or above 1.0 disables the hard limit entirely (effectively sets cpu.cfs_quota_us to -1),
- to 0.0 limits the allowed time to the minimum allowed value (1ms).
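The formula and the two boundary cases can be sketched in a few lines. This is an illustration rather than the agent's code: the function and constant names are hypothetical, and values are assumed to be expressed in microseconds, as in cpu.cfs_quota_us.

```python
MIN_QUOTA_US = 1000   # the 1 ms minimum mentioned above, in microseconds
QUOTA_UNLIMITED = -1  # written to cpu.cfs_quota_us to disable the hard limit


def effective_cpu_quota(cpu_quota: float, cpu_quota_period_us: int, cpus: int) -> int:
    """Translate a normalized cpu_quota into a CFS quota in microseconds."""
    if cpu_quota >= 1.0:
        return QUOTA_UNLIMITED
    quota_us = round(cpu_quota * cpu_quota_period_us * cpus)
    # Never go below the minimum allowed value.
    return max(quota_us, MIN_QUOTA_US)


# The worked example from the text: 100 ms period, 16 logical processors.
print(effective_cpu_quota(0.25, 100_000, 16))  # 400000 us, i.e. 400 ms
```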
CFS "period" is configured statically in AllocationConfiguration.
Refer to Kernel sched-bwc.txt document for further reference.
type: float
The cpu_shares value is normalized against the configured AllocationConfiguration.cpu_shares_unit.
effective_cpu_shares = cpu_shares * allocation_configuration.cpu_shares_unit
Note that setting cpu_shares:
- to 1.0 will be translated into AllocationConfiguration.cpu_shares_unit,
- to 0.0 will be translated into the minimum number of shares allowed by the system (effectively "2").
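A minimal sketch of this translation, assuming the default cpu_shares_unit of 1000 and the minimum of 2 shares mentioned above; the function and constant names are hypothetical.

```python
MIN_SHARES = 2  # minimum number of shares mentioned above


def effective_cpu_shares(cpu_shares: float, cpu_shares_unit: int = 1000) -> int:
    """Translate a normalized cpu_shares value into cgroup cpu shares."""
    return max(round(cpu_shares * cpu_shares_unit), MIN_SHARES)


print(effective_cpu_shares(1.0))  # 1000
print(effective_cpu_shares(2.0))  # 2000
print(effective_cpu_shares(0.0))  # 2
```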
Refer to Kernel sched-design document for further reference.
type: RDTAllocation
@dataclass
class RDTAllocation:
    name: str = None  # defaults to TaskId from TasksAllocations
    mb: str = None  # optional - when not provided does not change the existing allocation
    l3: str = None  # optional - when not provided does not change the existing allocation
You can use the RDTAllocation class to configure Intel RDT resources.
RDTAllocation wraps the resctrl schemata file. The name property lets you specify the name of the control group.
Sharing control groups among tasks saves limited CLOSid resources.
The name field is optional; if not provided, the TaskId from the parent TasksAllocations structure will be used.
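To see how sharing a name reduces the number of control groups, the distinct group names in a TasksAllocations dict can be counted. The sketch below redefines RDTAllocation locally to stay self-contained; the helper rdt_group_names is hypothetical and only mirrors the naming rule described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class RDTAllocation:
    name: Optional[str] = None
    mb: Optional[str] = None
    l3: Optional[str] = None


def rdt_group_names(tasks_allocations: Dict[str, dict]) -> Set[str]:
    """Collect control group names, falling back to the task id when name is unset."""
    names = set()
    for task_id, task_allocations in tasks_allocations.items():
        rdt = task_allocations.get('rdt')
        if rdt is not None:
            names.add(rdt.name if rdt.name is not None else task_id)
    return names


tasks_allocations = {
    'some-task-id': {'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff')},
    'other-task-id': {'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff')},
    'one-another-task-id': {'rdt': RDTAllocation(name='be_group', l3='L3:0=000ff;1=000ff')},
    'another-task-with-own-rdtgroup': {'rdt': RDTAllocation(l3='L3:0=000ff;1=000ff')},
}
print(len(rdt_group_names(tasks_allocations)))  # 3 groups for 4 tasks
```

The resulting count can then be compared with the number of CLOSids available on the platform.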
Allocation of available bandwidth in the mb field is given in the format:
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1
expressed in percentage as described in Kernel x86/intel_rdt_ui.txt.
For example:
MB:0=20;1=100
If Software Controller is available and enabled during mount, the format is:
MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1
where bw_MBps0 expresses bandwidth in MBps.
Allocation of the cache bit mask in the l3 field is given in the format:
L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
For example:
L3:0=fffff;1=fffff
Note that the configured values are passed as-is to the resctrl filesystem without validation; in case of an error, a warning is logged.
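Since values are not validated before they reach resctrl, an allocator may want to sanity-check them itself. The sketch below parses the formats shown above and checks that a cache bit mask is a single contiguous run of bits, which the kernel documentation requires on Intel platforms. All function names are hypothetical.

```python
from typing import Dict


def parse_schemata_line(line: str) -> Dict[int, str]:
    """Split e.g. 'L3:0=fffff;1=fffff' into {0: 'fffff', 1: 'fffff'}."""
    _resource, _, domains = line.partition(':')
    result = {}
    for item in domains.split(';'):
        if not item:
            continue
        cache_id, _, value = item.partition('=')
        result[int(cache_id)] = value.strip()
    return result


def cache_ways(cbm: str) -> int:
    """Number of cache ways selected by a hexadecimal cache bit mask."""
    return bin(int(cbm, 16)).count('1')


def is_contiguous(cbm: str) -> bool:
    """True when the set bits of the mask form one uninterrupted run."""
    mask = int(cbm, 16)
    return mask != 0 and '0' not in bin(mask)[2:].rstrip('0')


l3 = parse_schemata_line('L3:0=fffff;1=000ff')
print({cache_id: cache_ways(cbm) for cache_id, cbm in l3.items()})  # {0: 20, 1: 8}
print(is_contiguous('f0f'))  # False
```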
Refer to Kernel x86/intel_rdt_ui.txt document for further reference.
type: List[int]
Support for CPU pinning.
Requires the specific isolator cgroups/cpuset enabled for Mesos!
May conflict with the CPU manager feature in Kubernetes!
type: List[int]
Support for memory pinning.
Requires the specific isolator cgroups/cpuset enabled for Mesos!
May conflict with the CPU manager feature in Kubernetes!
type: bool
If set, moves the task's memory pages in use to a NUMA node provided in cpuset_mems.
Refer to Memory migration for further description.
type: int
Attempts to immediately (blocking) move the task's memory pages to the NUMA node provided as an argument.
Valid values are target NUMA node numbers from 0 to (number of memory NUMA nodes - 1).
The Platform object provides enough information to construct raw configuration for RDT resources, including:
- the number of cache ways and the minimum number of cache ways required to allocate,
- the number of sockets,
based on /sys/fs/resctrl/info/ and procfs.
class Platform:
    ...
    rdt_information: RDTInformation
    ...

@dataclass
class RDTInformation:
    cbm_mask: Optional[str]  # based on /sys/fs/resctrl/info/L3/cbm_mask
    min_cbm_bits: Optional[str]  # based on /sys/fs/resctrl/info/L3/min_cbm_bits
    rdt_mb_control_enabled: bool  # based on 'MB:' in /sys/fs/resctrl/schemata
    num_closids: Optional[int]  # based on /sys/fs/resctrl/info/L3/num_closids
    mb_bandwidth_gran: Optional[int]  # based on /sys/fs/resctrl/info/MB/bandwidth_gran
    mb_min_bandwidth: Optional[int]  # based on /sys/fs/resctrl/info/MB/min_bandwidth
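As a sketch of how these fields could be used, the helpers below build an L3 allocation string for a requested number of cache ways. They assume that cbm_mask is a hexadecimal string such as 'fffff', that min_cbm_bits is a decimal string, and that the caller supplies the cache ids. The function names are hypothetical.

```python
from typing import Iterable


def build_l3_cbm(cbm_mask: str, min_cbm_bits: str, ways: int) -> str:
    """Return a contiguous mask with the lowest `ways` bits set, as hex."""
    total_ways = bin(int(cbm_mask, 16)).count('1')
    minimum = int(min_cbm_bits)
    if not minimum <= ways <= total_ways:
        raise ValueError(
            f'ways must be between {minimum} and {total_ways}, got {ways}')
    return format((1 << ways) - 1, 'x')


def l3_schemata(cbm: str, cache_ids: Iterable[int]) -> str:
    """Apply the same mask to every listed cache id."""
    return 'L3:' + ';'.join(f'{cache_id}={cbm}' for cache_id in cache_ids)


cbm = build_l3_cbm(cbm_mask='fffff', min_cbm_bits='1', ways=8)
print(l3_schemata(cbm, cache_ids=[0, 1]))  # L3:0=ff;1=ff
```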
Refer to Kernel x86/intel_rdt_ui.txt document for further reference.
The returned TasksAllocations will be encoded in Prometheus exposition format:
# TYPE allocation gauge
allocation{allocation_type="cpu_quota",cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1.0 1547663933289
allocation{allocation_type="cpu_shares",cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 0.5 1547663933289
allocation{allocation_type="rdt_l3_cache_ways",cores="28",cpus="56",domain_id="0",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1 1547663933289
allocation{allocation_type="rdt_l3_cache_ways",cores="28",cpus="56",domain_id="1",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1 1547663933289
allocation{allocation_type="rdt_l3_mask",cores="28",cpus="56",domain_id="0",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 2 1547663933289
allocation{allocation_type="rdt_l3_mask",cores="28",cpus="56",domain_id="1",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 2 1547663933289
# TYPE allocation_duration gauge
allocation_duration{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 0.002111196517944336 1547663933289
# TYPE allocations_count counter
allocations_count{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 660 1547663933289
# TYPE allocations_ignored_count counter
allocations_ignored_count{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 0 1547663933289
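The encoding of numeric allocations into lines like the ones above can be sketched as follows. This is an illustration of the exposition format only, not the agent's encoder: the function names are hypothetical, labels are emitted in sorted order (label order carries no meaning in the format), and non-numeric allocations such as rdt are skipped because they need their own encoding.

```python
from typing import Dict, Iterator, Union

Number = Union[int, float]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the format requires."""
    return value.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')


def format_sample(name: str, labels: Dict[str, str], value: Number,
                  timestamp_ms: int) -> str:
    rendered = ','.join(
        '{}="{}"'.format(key, escape_label_value(labels[key]))
        for key in sorted(labels)
    )
    return '{}{{{}}} {} {}'.format(name, rendered, value, timestamp_ms)


def allocation_samples(tasks_allocations: Dict[str, dict],
                       common_labels: Dict[str, str],
                       timestamp_ms: int) -> Iterator[str]:
    for task_id, task_allocations in tasks_allocations.items():
        for allocation_type, value in task_allocations.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue  # e.g. rdt is not a plain number
            labels = dict(common_labels, allocation_type=allocation_type,
                          task_id=task_id)
            yield format_sample('allocation', labels, value, timestamp_ms)


for line in allocation_samples({'t1': {'cpu_quota': 1.0, 'cpu_shares': 0.5}},
                               {'host': 'igk-0107'}, 1547663933289):
    print(line)
# allocation{allocation_type="cpu_quota",host="igk-0107",task_id="t1"} 1.0 1547663933289
# allocation{allocation_type="cpu_shares",host="igk-0107",task_id="t1"} 0.5 1547663933289
```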
Please refer to Generating additional labels for tasks.