Allocation algorithm interface

This software is pre-production and should not be deployed to production servers.

The resource allocation interface allows providing a plugin with resource control logic. Such a component can enforce isolation based on platform and task resource usage metrics.

To enable the allocation feature, the agent has to be configured to use the AllocationRunner component. The runner requires an Allocator to be provided. Allocation decisions are based on the results of the Allocator's allocate method.

Example of a minimal configuration that uses AllocationRunner:

# Minimal configuration using the no-op allocator (NOPAllocator)
runner: !AllocationRunner
  measurement_runner: !MeasurementRunner
    node: !MesosNode
  allocator: !NOPAllocator

measurement_runner is responsible for discovering tasks running on the node; this information is provided to the allocator, and the returned decisions are then used to reconfigure resources such as CPU shares/quota, cache or memory bandwidth.

For more information about MeasurementRunner, please refer to the Measurement API.

All information about existing allocations, detected anomalies and other metrics is stored in the corresponding storage classes.

Please refer to the API documentation of AllocationRunner for the full list of available parameters. Allocation-specific settings are grouped in the AllocationConfiguration structure:

@dataclass
class AllocationConfiguration:
    # Default value for the CFS period, cpu.cfs_period_us [ms] (used as denominator).
    cpu_quota_period: Numeric(1000, 1000000) = 1000

    # Multiplier of AllocationType.CPU_SHARES allocation value.
    # E.g. setting 'CPU_SHARES' to 2.0 will set 2000 shares effectively
    # in cgroup cpu controller.
    cpu_shares_unit: Numeric(1000, 1000000) = 1000

    # Default resource allocation for last level cache (L3) and memory bandwidth
    # for root RDT group.
    # Root RDT group is used as default group for all tasks, unless explicitly reconfigured by
    # allocator.
    # `None` (the default value) means no limit (effectively set to maximum available value).
    default_rdt_l3: Str = None
    default_rdt_mb: Str = None
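
For illustration, the structure can be instantiated with its defaults spelled out explicitly (plain Python; in a deployment these values come from the agent's YAML configuration):

# Illustrative only: AllocationConfiguration with its default values made explicit.
allocation_configuration = AllocationConfiguration(
    cpu_quota_period=1000,   # denominator for cpu_quota normalization
    cpu_shares_unit=1000,    # multiplier for cpu_shares normalization
    default_rdt_l3=None,     # no limit for the root RDT group (L3)
    default_rdt_mb=None,     # no limit for the root RDT group (MB)
)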

An Allocator subclass must implement the allocate method with the following signature:

class Allocator(ABC):

    @abstractmethod
    def allocate(
            self,
            platform: Platform,
            tasks_data: TasksData
    ) -> (TasksAllocations, List[Anomaly], List[Metric]):
        ...
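
For illustration, a minimal custom allocator might look as follows. This is only a sketch: the class name is hypothetical, it assumes the types from the signature above are in scope, and it assumes TasksData maps task ids to per-task data as described in the detection document.

# A minimal sketch of a custom Allocator (not part of WCA itself).
# It caps every discovered task at half of the platform CPU capacity
# and reports no anomalies and no extra metrics.
class StaticQuotaAllocator(Allocator):

    def allocate(self, platform, tasks_data):
        # Assumption: TasksData maps task ids to per-task data.
        tasks_allocations = {
            task_id: {AllocationType.QUOTA: 0.5}  # equivalent to 'cpu_quota': 0.5
            for task_id in tasks_data
        }
        return tasks_allocations, [], []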

All types except TasksAllocations are documented in the detection document.

Both TaskAllocations and TasksAllocations structures are simple Python dict types defined as follows:

class AllocationType(str, Enum):
    QUOTA = 'cpu_quota'
    SHARES = 'cpu_shares'
    RDT = 'rdt'
    CPUSET_CPUS = 'cpuset_cpus'
    CPUSET_MEMS = 'cpuset_mems'
    CPUSET_MEMORY_MIGRATE = 'cpuset_memory_migrate'
    MIGRATE_PAGES = 'migrate_pages'

TaskId = str
TaskAllocations = Dict[AllocationType, Union[float, int, RDTAllocation]]
TasksAllocations = Dict[TaskId, TaskAllocations]

# example
tasks_allocations = {
    'some-task-id': {
        'cpu_quota': 0.6,
        'cpu_shares': 0.8,
        'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff', mb='MB:0=20;1=5')
    },
    'other-task-id': {
        'cpu_quota': 0.5,
        'rdt': RDTAllocation(name='hp_group', l3='L3:0=fffff;1=fffff', mb='MB:0=20;1=5')
    },
    'one-another-task-id': {
        'cpu_quota': 0.7,
        'rdt': RDTAllocation(name='be_group', l3='L3:0=000ff;1=000ff', mb='MB:0=1;1=1'),
    },
    'another-task-with-own-rdtgroup': {
        'cpu_quota': 0.7,
        'rdt': RDTAllocation(l3='L3:0=000ff;1=000ff', mb='MB:0=1;1=1'),  # "another-task-with-own-rdtgroup" will be used as `name`
    },
    ...
}

Please refer to the rdt allocation type section below for the definition of the RDTAllocation structure.

TasksAllocations is used as:

  • an input representing currently enforced configuration,
  • an output representing desired allocations that will be applied in the current AllocationRunner iteration.

The allocate function does not need to return TaskAllocations for all tasks. Allocations of omitted tasks will not be affected.

AllocationRunner is stateless and relies on the operating system to store the state.

Note that if the agent is restarted, already applied allocations will not be reset (the current state of allocations on the system will be read and provided as input).

The following built-in allocation types are supported:

  • cpu_quota float - CPU Bandwidth Control quota (normalized),
  • cpu_shares float - CPU shares for the Linux CFS scheduler (normalized),
  • rdt RDTAllocation - Intel RDT resources,
  • cpuset_cpus List[int] - support for CPU pinning (requires a specific isolator for Mesos),
  • cpuset_mems List[int] - support for memory pinning,
  • cpuset_memory_migrate bool - cgroups-based memory migration between NUMA nodes,
  • migrate_pages int - syscall-based memory migration to a NUMA node.

cpu_quota

type: float

cpu_quota is normalized with respect to the whole system capacity (all logical processors) and is applied through the cgroups cpu subsystem using CFS bandwidth control.

The formula for calculating the quota normalized to platform capacity:

effective_cpu_quota = cpu_quota * allocation_configuration.cpu_quota_period * platform.cpus

For example, with the default cpu_quota_period set to 100ms on a machine with 16 logical processors, setting cpu_quota to 0.25 means a hard limit of a quarter of the available CPU resources, which effectively translates into a 400ms quota.

Note that setting cpu_quota:

  • to 1.0 or above disables the hard limit entirely (effectively setting cpu.cfs_quota_us to -1),
  • to 0.0 limits the allowed time to the minimum allowed value (1ms).

CFS "period" is configured statically in AllocationConfiguration.

Refer to the kernel sched-bwc.txt document for further details.
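
To make the normalization concrete, here is a short sketch that reproduces the formula and clamping rules described above (an illustration only; the helper name is not part of WCA):

# Illustrative helper (not part of WCA): cpu_quota normalization and clamping.
def effective_cpu_quota(cpu_quota: float, cpu_quota_period: int, cpus: int) -> int:
    if cpu_quota >= 1.0:
        return -1  # no hard limit (cpu.cfs_quota_us = -1)
    quota = int(cpu_quota * cpu_quota_period * cpus)
    return max(quota, 1)  # clamp to the minimum allowed value (1ms)

effective_cpu_quota(0.25, 100, 16)  # -> 400 (ms), matching the example above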

cpu_shares

type: float

The cpu_shares value is normalized against the configured AllocationConfiguration.cpu_shares_unit:

effective_cpu_shares = cpu_shares * allocation_configuration.cpu_shares_unit

Note that setting cpu_shares:

  • to 1.0 will be translated into AllocationConfiguration.cpu_shares_unit,
  • to 0.0 will be translated into the minimum number of shares allowed by the system (effectively "2").

Refer to the kernel sched-design document for further details.
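
The same can be sketched for shares (an illustration only; the helper name is not part of WCA):

# Illustrative helper (not part of WCA): cpu_shares normalization.
def effective_cpu_shares(cpu_shares: float, cpu_shares_unit: int = 1000) -> int:
    return max(int(cpu_shares * cpu_shares_unit), 2)  # kernel minimum is 2

effective_cpu_shares(1.0)  # -> 1000 (cpu_shares_unit)
effective_cpu_shares(2.0)  # -> 2000
effective_cpu_shares(0.0)  # -> 2 (minimum allowed by the system)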

rdt

type: RDTAllocation

@dataclass
class RDTAllocation:
    name: str = None    # defaults to TaskId from TasksAllocations
    mb: str = None      # optional - when not provided does not change the existing allocation
    l3: str = None      # optional - when not provided does not change the existing allocation

You can use the RDTAllocation class to configure Intel RDT resources.

RDTAllocation wraps the resctrl schemata file. The name property allows specifying the name of a control group. Sharing control groups among tasks helps to conserve the limited number of CLOSids.

The name field is optional; if not provided, the TaskId from the parent TasksAllocations structure will be used.

Allocation of available memory bandwidth for the mb field is given in the following format:

MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1

where bandwidth is expressed as a percentage, as described in the kernel x86/intel_rdt_ui.txt document.

For example:

MB:0=20;1=100

If the Software Controller is available and enabled during mount, the format is:

MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1

where bw_MBps0 expresses bandwidth in MBps.
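
For example (illustrative values):

MB:0=2048;1=4096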

Allocation of the cache bit mask for the l3 field is given in the following format:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

For example:

L3:0=fffff;1=fffff

Note that the configured values are passed as-is to the resctrl filesystem without validation; in case of an error, a warning is logged.

Refer to the kernel x86/intel_rdt_ui.txt document for further details.
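
As an illustration only (the helper below is not part of WCA), the l3 and mb fields end up as lines of the group's schemata file:

# Illustrative sketch (not a WCA function): the l3 and mb fields map directly
# to lines written to the resctrl group's schemata file; a missing field
# leaves the existing allocation for that resource untouched.
def schemata_lines(rdt: RDTAllocation) -> str:
    return '\n'.join(value for value in (rdt.l3, rdt.mb) if value is not None) + '\n'

# RDTAllocation(name='be_group', l3='L3:0=000ff;1=000ff', mb='MB:0=1;1=1')
# would result in writing to /sys/fs/resctrl/be_group/schemata:
#   L3:0=000ff;1=000ff
#   MB:0=1;1=1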

cpuset_cpus

type: List[int]

Support for CPU pinning.

Requires the cgroups/cpuset isolator to be enabled for Mesos!

May conflict with the CPU manager feature in Kubernetes!

cpuset_mems

type: List[int]

Support for memory pinning.

Requires the cgroups/cpuset isolator to be enabled for Mesos!

May conflict with the CPU manager feature in Kubernetes!

cpuset_memory_migrate

type: bool

If set, the task's memory pages already in use are moved to the NUMA nodes provided in cpuset_mems.

Refer to Memory migration for further description.

migrate_pages

type: int

Attempts to immediately (in a blocking manner) move the task's memory pages to the NUMA node provided as an argument.

Possible values are target NUMA node indices from 0 to (number of memory NUMA nodes - 1).
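
For example, a single task's TaskAllocations combining the NUMA-related allocation types could look like this (illustrative values):

# Illustrative TaskAllocations combining the NUMA-related allocation types:
# pin memory to NUMA node 1, enable cgroup-based page migration and request
# an immediate (blocking) migration of already allocated pages to node 1.
task_allocations = {
    'cpuset_mems': [1],
    'cpuset_memory_migrate': True,
    'migrate_pages': 1,
}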

The Platform object provides enough information to construct a raw configuration for RDT resources, including:

  • the number of cache ways and the minimum number of cache ways required per allocation,
  • the number of sockets,

based on /sys/fs/resctrl/info/ and procfs:

class Platform:
    ...
    rdt_information: RDTInformation
    ...

@dataclass
class RDTInformation:
    cbm_mask: Optional[str]  # based on /sys/fs/resctrl/info/L3/cbm_mask
    min_cbm_bits: Optional[str]  # based on /sys/fs/resctrl/info/L3/min_cbm_bits
    rdt_mb_control_enabled: bool  # based on availability of the MB resource (/sys/fs/resctrl/info/MB)
    num_closids: Optional[int]  # based on /sys/fs/resctrl/info/L3/num_closids
    mb_bandwidth_gran: Optional[int]  # based on /sys/fs/resctrl/info/MB/bandwidth_gran
    mb_min_bandwidth: Optional[int]  # based on /sys/fs/resctrl/info/MB/min_bandwidth

Refer to the kernel x86/intel_rdt_ui.txt document for further details.
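
For instance, the number of available cache ways can be derived from cbm_mask and used to build a partial L3 mask. The helpers below are hypothetical (not part of WCA) and ignore min_cbm_bits for brevity:

# Illustrative helpers (not part of WCA): derive the number of cache ways from
# cbm_mask and build an L3 schemata value spanning a given number of low-order
# ways for every cache domain (socket).
def cache_ways(rdt_information) -> int:
    return bin(int(rdt_information.cbm_mask, 16)).count('1')

def l3_mask_for_ways(ways: int, sockets: int) -> str:
    mask = format((1 << ways) - 1, 'x')
    return 'L3:' + ';'.join('%d=%s' % (domain, mask) for domain in range(sockets))

# With cbm_mask = 'fffff' (20 cache ways) and 2 sockets:
l3_mask_for_ways(8, 2)  # -> 'L3:0=ff;1=ff'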

The returned TasksAllocations will be encoded in the Prometheus exposition format:

# TYPE allocation gauge
allocation{allocation_type="cpu_quota",cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1.0 1547663933289
allocation{allocation_type="cpu_shares",cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 0.5 1547663933289
allocation{allocation_type="rdt_l3_cache_ways",cores="28",cpus="56",domain_id="0",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1 1547663933289
allocation{allocation_type="rdt_l3_cache_ways",cores="28",cpus="56",domain_id="1",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 1 1547663933289
allocation{allocation_type="rdt_l3_mask",cores="28",cpus="56",domain_id="0",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 2 1547663933289
allocation{allocation_type="rdt_l3_mask",cores="28",cpus="56",domain_id="1",group_name="be",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2",task_id="root-staging13-stress_ng-default--0-0-6d1f2268-c3dd-44fd-be0b-a83bd86b328d"} 2 1547663933289

# TYPE allocation_duration gauge
allocation_duration{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 0.002111196517944336 1547663933289

# TYPE allocations_count counter
allocations_count{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 660 1547663933289

# TYPE allocations_ignored_count counter
allocations_ignored_count{cores="28",cpus="56",host="igk-0107",wca_version="0.1.dev252+g7f83b7f",sockets="2"} 0 1547663933289

Please refer to Generating additional labels for tasks.