Extending WCA

This software is pre-production and should not be deployed to production servers.

The WCA project contains a simple built-in dependency injection framework that makes it possible to extend existing functionality or add new functionality.

This document contains examples of:

  • a simple Runner that outputs "Hello world!",
  • an HTTP-based Storage component that saves metrics to an external HTTP service using the requests library.

To provide new functionality using an external component, the operator of WCA has to:

  • provide the new component defined as a Python class,
  • register this Python class at startup with the extra command line parameter --register, given as package_name.module_name:class_name (the package name is optional),
  • reference the component name in the configuration file (using the name of the class),
  • make Python module accessible by Python interpreter for import (PYTHONPATH and PEX_INHERIT_PATH environment variables)
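
The package_name.module_name:class_name notation can be resolved with the standard library alone. The following is a minimal sketch of that idea, not WCA's actual loader: the helper name load_component is hypothetical, and a standard-library class stands in for a real component so the example runs without WCA installed.

```python
import importlib


def load_component(spec: str):
    """Resolve 'package.module:ClassName' to the object it names.

    Hypothetical helper for illustration; WCA's own loader may differ.
    """
    module_name, sep, class_name = spec.partition(':')
    if not sep or not module_name or not class_name:
        raise ValueError('expected module:class, got %r' % spec)
    # import_module only succeeds if the module is importable,
    # which is why PYTHONPATH matters in the list above.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Standard-library class used only as a stand-in for a component.
cls = load_component('collections:OrderedDict')
print(cls.__name__)
```

If the module is not on the interpreter's import path, import_module raises ModuleNotFoundError, which is the failure the PYTHONPATH requirement is meant to prevent.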

In this document, a component is a simple Python class that has been registered and can therefore be used in the configuration file.

All WCA features (detection/CMS integration) are based on internal components and use the same mechanism for initialization.

From a high-level standpoint, the main entry point of the application is only responsible for instantiating the Python classes defined in the YAML configuration, then parsing and preparing the logging infrastructure, and finally calling the generic run method on the already created Runner instance. The Runner class is the main vehicle that integrates all other dependent objects.

For example, MeasurementRunner implements a simple loop that uses a Node subclass instance (e.g. MesosNode) to discover locally running tasks, collects metrics for those tasks and then uses a Storage subclass to store those metrics somewhere (e.g. KafkaStorage or LogStorage).

To illustrate, when someone uses WCA with a configuration file like this:

runner: !MeasurementRunner
    node: !MesosNode                # subclass of Node
    metric_storage: !LogStorage     # subclass of Storage
        output_filename: /tmp/logs.txt

it effectively means running the equivalent of this Python code:

runner = MeasurementRunner(
    node=MesosNode(),
    metric_storage=LogStorage(
        output_filename='/tmp/logs.txt',
    ),
)
runner.run()
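
The mapping from configuration to constructor calls can be sketched as a recursive build over a registry of class names. This is an illustration under stated assumptions, not WCA's implementation: REGISTRY and build are hypothetical names, the three classes are empty stand-ins for the real ones, and nested tuples replace the YAML tags.

```python
class LogStorage:  # stand-in for the real LogStorage
    def __init__(self, output_filename):
        self.output_filename = output_filename


class MesosNode:  # stand-in for the real MesosNode
    pass


class MeasurementRunner:  # stand-in for the real MeasurementRunner
    def __init__(self, node, metric_storage):
        self.node = node
        self.metric_storage = metric_storage

    def run(self):
        return 'storing metrics in %s' % self.metric_storage.output_filename


# Hypothetical registry: component name -> class.
REGISTRY = {c.__name__: c for c in (LogStorage, MesosNode, MeasurementRunner)}


def build(description):
    """Recursively turn ('ClassName', {kwargs}) pairs into instances."""
    if isinstance(description, tuple):
        name, kwargs = description
        return REGISTRY[name](**{k: build(v) for k, v in kwargs.items()})
    return description  # plain values (strings, numbers) pass through


config = ('MeasurementRunner', {
    'node': ('MesosNode', {}),
    'metric_storage': ('LogStorage', {'output_filename': '/tmp/logs.txt'}),
})
runner = build(config)
print(runner.run())
```

Children are built before their parents, so each constructor receives ready-to-use instances rather than raw configuration values.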

For example, to provide measure-only mode, anomaly detection mode or resource allocation mode, WCA contains the following components:

  • MeasurementRunner that is only responsible for collecting metrics,
  • DetectionRunner that extends MeasurementRunner to allow anomaly detection and generate additional metrics,
  • AllocationRunner that configures resources based on the provided Allocator component instance.

It is important to note that configuration-based objects (components) are static singletons that live for the whole lifetime of the application and are only accessible by their parent objects.

Let's start with something very basic and create a HelloWorldRunner that just outputs the 'Hello world!' string.

With a Python module hello_world_runner.py containing HelloWorldRunner, a subclass of Runner:

from wca.runners import Runner

class HelloWorldRunner(Runner):

    def run(self):
        print('Hello world!')

you need the following example config file:

runner: !HelloWorldRunner

and then starting WCA like this:

PYTHONPATH=$PWD/examples PEX_INHERIT_PATH=fallback ./dist/wca.pex -c $PWD/configs/extending/hello_world.yaml -r hello_world_runner:HelloWorldRunner

Tip: You can just copy and paste this command; all required example files are already in the project, but you have to build the PEX file first with make.

should output:

Hello world!

To integrate with a custom monitoring system, it is enough to provide a definition of a custom Storage class. The Storage class is a simple interface that exposes just one method, store, as defined below:

class Storage:

    def store(self, metrics: List[Metric]) -> None:
        """store metrics; may throw FailedDeliveryException"""
        ...

where Metric is a simple class with a structure influenced by the Prometheus metric model and the OpenMetrics initiative:

@dataclass
class Metric:
    name: str
    value: float
    labels: Dict[str, str]
    type: str            # gauge/counter
    help: str

This is a simple Storage class that can be used to post metrics serialized as JSON to an external HTTP web service using the POST method:

(full source code here)

import requests
from dataclasses import dataclass
from wca.storage import Storage

@dataclass
class HTTPStorage(Storage):

    http_endpoint: str = 'http://127.0.0.1:8000'

    def store(self, metrics):
        # Send all metrics in one request as a JSON object: name -> value.
        requests.post(
            self.http_endpoint,
            json={metric.name: metric.value for metric in metrics}
        )

then it can be used with MeasurementRunner with the following configuration file:

runner: !MeasurementRunner
  config: !MeasurementRunnerConfig
    node: !StaticNode
      tasks: []                   # this disables any tasks metrics
    metrics_storage: !HTTPStorage

To verify that the data was posted to the HTTP service correctly, start a naive service using socat:

socat - tcp4-listen:8000,fork

and then run WCA like this:

sudo env PYTHONPATH=$PWD/examples PEX_INHERIT_PATH=fallback ./dist/wca.pex -c $PWD/configs/extending/measurement_http_storage.yaml -r http_store:HTTPStorage --root --log http_storage:info

Expected output is:

# from WCA:
2019-06-14 21:51:17,862 INFO     {MainThread} [http_storage] sending!

# from socat:
POST / HTTP/1.1
Host: 127.0.0.1:8000
User-Agent: python-requests/2.21.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 240
Content-Type: application/json

{"wca_up": 1560541957.1652732, "wca_tasks": 0, "wca_memory_usage_bytes": 50159616,
"memory_usage": 1399689216, "cpu_usage_per_cpu": 1205557,
"wca_duration_seconds": 1.0013580322265625e-05,
"wca_duration_seconds_avg": 1.0013580322265625e-05}

Note:

  • sudo is required to enable perf and resctrl based metrics,
  • the --log parameter allows specifying the log level for custom components.
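
The request body shown above is a flat name-to-value mapping, because HTTPStorage builds a dictionary keyed by metric name. One consequence is worth seeing in code: metrics that share a name but differ in labels collapse into a single entry. The sketch below uses a local stand-in for Metric; the metric name cpu_usage and the labelled_payload alternative are hypothetical and only illustrate the trade-off.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metric:  # stand-in with only the fields this example needs
    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


def flat_payload(metrics: List[Metric]) -> dict:
    """Same shape as HTTPStorage: later duplicates overwrite earlier ones."""
    return {metric.name: metric.value for metric in metrics}


def labelled_payload(metrics: List[Metric]) -> list:
    """Hypothetical alternative: one entry per metric, labels preserved."""
    return [{'name': m.name, 'value': m.value, 'labels': m.labels}
            for m in metrics]


metrics = [
    Metric('cpu_usage', 10.0, {'task_id': 'a'}),
    Metric('cpu_usage', 20.0, {'task_id': 'b'}),
]
print(json.dumps(flat_payload(metrics)))  # {"cpu_usage": 20.0}
print(len(labelled_payload(metrics)))     # 2
```

The flat form is fine for the configuration above, where tasks is empty and only platform and internal metrics are sent; a storage meant for per-task metrics would likely need to carry labels as well.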

Depending on the Runner component, different kinds of metrics are produced and sent to different instances of Storage components:

  1. MeasurementRunner uses Storage instance under metrics_storage property to store:

    • platform level resources usage (CPU/memory usage) metrics,
    • internal WCA metrics: number of monitored tasks, number of errors/warnings, health-checks, WCA memory usage,
    • (per-task) perf system based metrics e.g. instructions, cycles
    • (per-task) Intel RDT based metrics e.g. cache usage, memory bandwidth
    • (per-task) cgroup based metrics e.g. CPU/memory usage

    Each of those metrics has additional metadata attached (in form of labels) about:

    • platform topology (sockets/cores/cpus),
    • extra labels defined in WCA configuration file (e.g. own_ip),
    • labels identifying the WCA version (wca_version), the host name (host) and the host CPU model (cpu_model),
    • (only for per-task metrics) task id (task_id) and metadata acquired from orchestration system (Mesos task or Kubernetes pod labels)
  2. DetectionRunner uses Storage subclass instances:

    in metrics_storage property:

    • the same metrics as sent by MeasurementRunner to metrics_storage above,

    in anomalies_storage property:

    • number of anomalies detected by the Allocator class,
    • individual instances of detected anomalies encoded as metrics (more details here)
  3. AllocationRunner uses Storage subclass instances:

    in metrics_storage property:

    • the same metrics as sent by MeasurementRunner to metrics_storage above,

    in anomalies_storage property:

    • the same metrics as sent by DetectionRunner to anomalies_storage above,

    in allocations_storage property:

    • number of resource allocations performed during last iteration,
    • details about performed allocations like: number of CPU shares or CPU quota or cache allocation,
    • more details here

Note that YAML anchors and aliases make it possible to configure the same Storage instance to store all kinds of metrics:

runner: !AllocationRunner
  config: !AllocationRunnerConfig
    metrics_storage: &kafka_storage_instance !KafkaStorage
      topic: all_metrics
      broker_ips:
      - 127.0.0.1:9092
      - 127.0.0.2:9092
      max_timeout_in_seconds: 5.
    anomalies_storage: *kafka_storage_instance
    allocations_storage: *kafka_storage_instance

This approach can help to save resources (like connections), share state or simplify configuration (no need to repeat the same arguments).
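
In Python terms, the anchor and aliases above amount to passing one object to three parameters. The sketch below uses empty stand-in classes (the real KafkaStorage opens broker connections) and a hypothetical instances_created counter to make the sharing visible.

```python
class KafkaStorage:  # stand-in; no broker connection is made here
    instances_created = 0  # hypothetical counter, for illustration only

    def __init__(self, topic):
        KafkaStorage.instances_created += 1
        self.topic = topic


class AllocationRunnerConfig:  # stand-in holding the three storage properties
    def __init__(self, metrics_storage, anomalies_storage, allocations_storage):
        self.metrics_storage = metrics_storage
        self.anomalies_storage = anomalies_storage
        self.allocations_storage = allocations_storage


# Corresponds to the &kafka_storage_instance anchor.
shared = KafkaStorage(topic='all_metrics')

config = AllocationRunnerConfig(
    metrics_storage=shared,
    anomalies_storage=shared,      # corresponds to *kafka_storage_instance
    allocations_storage=shared,    # corresponds to *kafka_storage_instance
)

print(config.metrics_storage is config.allocations_storage)
print(KafkaStorage.instances_created)
```

Repeating the !KafkaStorage block three times instead would create three separate instances, each with its own connections and state.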

If a component requires additional dependencies and you do not want to pollute the system interpreter's libraries, the best way to bundle the new component is to use a PEX file to package all source code including dependencies.

(The requests library from the previous example was available because it is already required by WCA itself.)

pex -D examples python-dateutil==2.8.0 -o hello_world.pex -v

where examples/hello_world_runner_with_dateutil.py contains:

from wca.runners import Runner
from dateutil.utils import today

class HelloWorldRunner(Runner):

    def run(self):
        print('Hello world! Today is %s' % today())

then it is possible to combine the two PEX files into a single environment by using the PEX_PATH environment variable:

PEX_PATH=hello_world.pex ./dist/wca.pex -c $PWD/configs/extending/hello_world.yaml -r hello_world_runner_with_dateutil:HelloWorldRunner

outputs:

Hello world! Today is 2019-06-14 00:00:00

Note that this method works well only if there are no conflicting sub-dependencies (the diamond dependency problem), because only one version will be available at runtime. If there are conflicts, you need to consolidate WCA and your component into a single project (with common requirements) so that conflicts are resolved during the requirements gathering phase. You can check the Platform Resource Manager prm component as an example of such an approach.

Any child object used by any runner can be replaced with an external component, but WCA was designed to be extended by providing the following components:

  • Node class used by all Runners to perform task discovery,
  • Storage classes used to enable persistence for internal metrics (*_storage properties),
  • Detector class to provide anomaly detection logic,
  • Allocator class to provide anomaly detection and anomaly mitigation logic (by resource allocation).