
PDP 45 (Pravega Health Check)

co-jo edited this page Jun 29, 2021 · 28 revisions

Motivation

One of the many challenges in distributed computing environments is determining whether a set of components is 'healthy', i.e. in an operable state and able to perform its expected duties. Many container orchestration systems provide first-class support for various health-related mechanisms. For example:

  • Kubernetes has the notion of readiness, startup and liveness probes [1].
  • Marathon (Mesosphere) provides health checks at different levels of abstraction [2].
  • Docker Swarm implements a more primitive version of (container) health checking by attaching a shell command to be run at regular intervals [3].

There is currently no first-class support for, or integration with, any similar mechanism in Pravega. If a majority of SegmentStore instances are in an Error state (as reported by Kubernetes), the health of a Pravega deployment (PravegaCluster) can be reasoned about indirectly, but there is no interface relaying this information to the orchestration system.

A HealthCheck service aims to bridge this gap by allowing the core components of Pravega to provide their own idiomatic logic for determining the health status of a component (a Java class), and to export any health-related information. This information gives further insight into the status of any given component and, if necessary, enables more advanced lifecycle logic.

Summary

This document will cover the following areas:

Health Service Framework
  • Feature Goals
  • Architecture
  • Class Overview
  • Integration
  • Security
  • Operator Integration

Feature Goals

Pravega can in principle be run on any orchestration system, but only gives first-class support to Kubernetes-based systems. As a result, some of the design decisions have been inspired by Kubernetes concepts, but the design is in no way tied to any specific implementation.

Expanding upon what has already been described in the Motivation section, the following are more concrete feature goals of this Health Service.

  • Readiness Endpoint - A readiness endpoint describes whether the component can be considered 'ready'. A pod should be marked as 'ready' if it is capable of processing requests and should receive traffic from Kubernetes.
  • Liveness Endpoint - A liveness endpoint determines whether the component is 'alive' or not. Kubernetes should attempt to restart (replace) a component that is not alive.
  • Details Endpoint - The information provided by this endpoint is not used to make lifecycle decisions, but is instead a means to export information that the service believes is relevant to assessing the health of the component.
  • Flexible Aggregation Logic - A component (logical or concrete) may have one or more sub-components that provide their own health status. The logic defining whether something (in the aggregate) is considered healthy cannot be static. Various aggregation rules should be defined.
  • Composition - A health component should be able to be composed of one or many child components and can use the health of said child components to determine its own health.

* In this section a 'component' is used in an abstract sense.
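
To make the 'Flexible Aggregation Logic' goal more concrete, the following is a minimal sketch of how a few aggregation rules could reduce the statuses of child components into a single status. The two-value Status enum, the AggregationRule name, and the three rules (ANY, MAJORITY, ALL) are assumptions made for illustration; they are not taken from the framework itself.

```java
import java.util.List;

// Simplified stand-in for the framework's Status enumeration (assumption: only UP and DOWN).
enum Status { UP, DOWN }

// Hypothetical rule set; the names and semantics are illustrative only.
enum AggregationRule {
    ANY,      // healthy if at least one child is UP
    MAJORITY, // healthy if more than half of the children are UP
    ALL;      // healthy only if every child is UP

    /** Reduces the child statuses into a single Status according to this rule. */
    Status reduce(List<Status> children) {
        if (children.isEmpty()) {
            return Status.UP; // assumption: with no children there is nothing unhealthy to report
        }
        long up = children.stream().filter(s -> s == Status.UP).count();
        boolean healthy;
        switch (this) {
            case ANY:
                healthy = up >= 1;
                break;
            case MAJORITY:
                healthy = 2 * up > children.size();
                break;
            default: // ALL
                healthy = up == children.size();
        }
        return healthy ? Status.UP : Status.DOWN;
    }
}
```

With one UP and two DOWN children, ANY yields UP while MAJORITY and ALL yield DOWN, which is why a single static rule would not suit every component.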

HealthService (Framework) Class/Interface Overview

  • Health: This class is the result of performing a health check on a given component. It is the object serialized into JSON to serve the various health-related requests.

    • Status: An enumeration of all the different health states a component can be in.
    • Details: A Map<String, Object> that is populated by the doHealthCheck call of a HealthContributor. This is the information that you want to expose that should not be used in the health Status aggregation.
    • Children: A Map<String, Health> of all the Health results for all the child HealthContributors.
    • Name: The name of the HealthContributor that this Health object was generated for. This name does not contain the fully qualified path of the HealthContributor, only the name provided during registration.
  • HealthContributor: An interface used to convey that some component provides a Health object to be used.

    • HealthContributorImpl: An abstract class (it appears as AbstractHealthContributor in the example code later in this document) that serves as the main interface between the HealthEndpoint and some client. Logically, a HealthContributor is also what performs the actual health check on client components and generates the data used during aggregation.
    • Other HealthContributor objects may be registered as children of any other HealthContributor. If one or more children are registered to a HealthContributor, the StatusAggregatorRule is used to reduce the Health.Status results of all the children into a single Status.
    • In theory this registration allows any topology of HealthContributors to be defined (including cyclic relationships), but cycles are easy to avoid, in part due to the lack of a global registry.
    • To make de-registration simple, a child is removed from its parent during a health check if the child is found to have been closed.
    • The abstract doHealthCheck method is where the Status decision is made and where any details to be exported are provided.
  • HealthServiceManager: The top level interface which encapsulates all the operations needed to run the various health related duties.

    • HealthEndpoint: The HealthEndpoint will provide the abstraction layer between an importing service and how it exposes the underlying health information.
      • The HealthEndpoint exposes two kinds of calls: one that accepts no arguments, querying the static RootHealthContributor, and one that accepts a String id used to request information from an arbitrary HealthContributor.
      • The String id expects a fully-qualified name, containing the full path from the RootHealthContributor to the HealthContributor with name id.
    • The HealthEndpoint provides information from a cached Health object.
  • HealthServiceUpdater: Maintains a thread that performs periodic health checks of the root/service level HealthContributor object, which in turn checks all other contributors.

    • The HealthServiceUpdater is responsible for updating the aforementioned cached Health object.

This framework makes no assumptions about how the information is to be exposed to other machines and is only responsible for maintaining the health state for a single process.

  • Each SegmentStore and Controller instance will have its own HealthServiceManager instance.
  • However, because it is built around the 'Composite Design Pattern', this framework can be used to build another service that aggregates the health state of all PravegaCluster components into a single endpoint.
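
The composite behaviour described in this overview can be sketched as follows. This is a simplified stand-in and not the framework's code: the SketchContributor class, its two-value Status enum, and the fixed "every child must be UP" rule are assumptions chosen to keep the example short. It shows a contributor running its own check, folding in the results of its children, and dropping closed children lazily during the health check.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the framework's Status enumeration (assumption: only UP and DOWN).
enum Status { UP, DOWN }

// Hypothetical class illustrating the composite pattern; not thread-safe.
abstract class SketchContributor implements AutoCloseable {
    private final String name;
    private final List<SketchContributor> children = new ArrayList<>();
    private boolean closed = false;

    SketchContributor(String name) { this.name = name; }

    String getName() { return name; }

    /** Component-specific check, playing the role of doHealthCheck. */
    protected abstract Status doHealthCheck();

    void register(SketchContributor child) { children.add(child); }

    int childCount() { return children.size(); }

    @Override
    public void close() { closed = true; }

    /** Runs the local check, then folds in every open child; closed children are removed on the way. */
    Status getHealth() {
        Status result = doHealthCheck();
        for (Iterator<SketchContributor> it = children.iterator(); it.hasNext(); ) {
            SketchContributor child = it.next();
            if (child.closed) {
                it.remove(); // lazy de-registration during the health check
                continue;
            }
            if (child.getHealth() == Status.DOWN) {
                result = Status.DOWN; // "every child must be UP" rule, chosen for illustration
            }
        }
        return result;
    }
}
```

Because each child recursively calls getHealth on its own children, checking the root is enough to visit the whole tree.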

Architecture/Design Overview

A HealthContributor carries out individual user-defined health queries. It may have one or more child HealthContributors and will aggregate their results. This aggregation creates a Status object representing the combined state of all its child components.

Health information is gathered and aggregated upon each client request, and is also refreshed at regular intervals by the HealthServiceUpdater.
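
The periodic refresh of a cached health result can be sketched with a ScheduledExecutorService, as below. The SketchHealthUpdater class, the use of a plain String as the health result, and the "UNKNOWN" and "DOWN" placeholder values are all assumptions for illustration; the actual HealthServiceUpdater may be structured differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of a periodic updater that keeps the latest health result in a cache.
class SketchHealthUpdater implements AutoCloseable {
    private final Supplier<String> rootCheck; // stands in for the root contributor's health check
    private final AtomicReference<String> latest = new AtomicReference<>("UNKNOWN");
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "health-updater");
        t.setDaemon(true); // do not keep the JVM alive just for health checks
        return t;
    });

    SketchHealthUpdater(Supplier<String> rootCheck) { this.rootCheck = rootCheck; }

    /** Starts refreshing the cached result every intervalMillis milliseconds. */
    void start(long intervalMillis) {
        executor.scheduleAtFixedRate(this::refresh, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Performs one check and stores the result. */
    void refresh() {
        try {
            latest.set(rootCheck.get());
        } catch (RuntimeException e) {
            // scheduleAtFixedRate suppresses later runs if a task throws, so failures are recorded instead.
            latest.set("DOWN");
        }
    }

    /** Returns the cached result without running a new check. */
    String getLatestHealth() { return latest.get(); }

    @Override
    public void close() { executor.shutdownNow(); }
}
```

Serving requests from the cached value keeps the cost of a probe independent of how expensive the individual checks are, at the price of results that may be up to one interval old.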

                         ┌───────── health() ──────────┐
                         │                             │
                         │                             ▼
┌────────────────────────┼────────────────────── HealthService ───────────────────────────────────────────────┐
│                        │                             │                                                      │
│      ┌───────HealthServiceUpdater────────┐           │  ┌─────────── HealthContributor (ROOT)────────────┐  │░░
│      │                 │                 │           │  │                                                │  │░░
│      │                                   │   ┌───────┼─>│ root.getHealth() -> {   (4)                    │  │░░
│      │       scheduleAtFixedRate() ──────┼───┼───────┘  │                                                │  │░░
│      │                                   │   │          │     doHealthCheck() -> Health                  │  │░░
│      │                                   │   │          │                +                               │  │░░
│      │                                   │   │          │     forEach (child -> child.getHealth())       │  │░░
│      └───────────────────────────────────┘   │ (3)      │         ┌──────────────────────────────────┐   │  │░░
│                                              │          │      ┌──┤ HealthContributor (ReadIndex)    │─ ─│─ ┼░─ ─ ─ ─
│                                              │          │      │  └──────────────────────────────────┘   │  │░░     │
│                                              │          │      │  ┌──────────────────────────────────┐   │  │░░
│    ┌─── HealthEndpointInterface ───┐    health()        │      ├──┤ HealthContributor (Durable Log)  │   │  │░░     │
│    │                               │         ▲          │      │  └──────────────────────────────────┘   │  │░░
│    │   * health(String id)         │         │          │      │  ┌──────────────────────────────────┐   │  │░░     │
│    │   * liveness(String id)       ├─────────┘          │      └──┤ HealthContributor (StorageWriter)│   │  │░░
│    │   * details(String id)        │                    │         └──────────────────────────────────┘   │  │░░     │
│    │   * readiness(String id)      ├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ }                                              │  │░░
│    │   * status(String id)         │                    └────────────────────────────────────────────────┘  │░░     │
│    │                               │                                                                        │░░
│    │                               │                                                                        │░░     │
│    │                               │                                                                        │░░
│    │                               │                                                                        │░░     │
│    │                               │                                                                        │░░
│    └───────────────▲───┬───────────┘                                                                        │░░     │
│                    │                                                                                        │░░
│                    │   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                            │░░     │
│                    └────────────────────────────────────────────────────────┐                               │░░
│                                             (2)                             │  │                            │░░     │
│                                                                             │                               │░░
│                                                                             │  │                            │░░     │
└─────────────────────────────────────────────────────────────────────────────┼───────────────────────────────┘░░
  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│░░│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     │
  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░     
                                                                              │  │                                    │
                                               ┌──────────────────────────────┼─────────────────────────────────┐
                                               │   SegmentStore               │  │                              │     │
                                               │                              │                                 │
                                               │   ┌──── HealthContributorEndpoint - / ───┐                     │     │
                                               │   │                                      │                     │
                                               │   │ * /readiness/{id} * /health/{id}     │                     │     │
                                               │   │ * /liveness/{id}  * /status/{id}     │                     │
                                               │   │ * /details/{id}                      │                     │     │
                                               │   │                                      │                     │
                                               │   └──────────────────────────┬──┬────────┘                     │     │
                                               │                              │                                 │
                                               │                              │  │                              │     │
                                  GET (1)      │                              │                                 │
                                               │   ┌────────── HealthEndpoint - / ─────────┐                    │     │
       Client                ┌─────────────────┼── │                                       │                    │
                             │                 │   │ * /details   * /health                │                    │     │
 ╭───────────────────────────┴──────────────╮  │┌ ─│ * /readiness *                        │                    │
 │ ○ ○ ○ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│  │   │ * /liveness  * /status                │                    │     │
 ├──────────────────────────────────────────┤  ││  │                                       │                    │
 │ $ curl .../status                        │  │   └───────────────────────────────────────┘                    │     │
 │                                          │  ││                                                               │
 │  {                                       │  │   (5)                                                          │     │
 │      "status": "UP"                      <─ ┤┘                                                               │
 │  }                                       │  │                     HealthService service;                     │     │
 │                                          │  │                                                                │
 │ $                                        │  │  HealthContributor index = new ReadIndexHealthContributor(); ─ ┼ ─ ─ ┘
 │                                          │  │               service.getRoot().register(index);               │
 │                                          │  │                              ...                               │
 │                                          │  │                                                                │
 │                                          │  └────────────────────────────────────────────────────────────────┘
 │                                          │
 │                                          │
 └──────────────────────────────────────────┘

Interfacing with the Health Service

HealthContributor Creation & Registration

Each and every HealthServiceManager has a static RootHealthContributor that acts as the base contributor. To make a HealthContributor visible to the service you must register it under another HealthContributor that is reachable from the RootHealthContributor provided by the HealthServiceManager. There is no global store containing the visible HealthContributor objects so one must provide a HealthContributor object to attach to (either through a class constructor or other means). In effect a 'Health Tree' is created that nicely mirrors the class relations.

Below is an example where the Throttler class is integrated with the HealthCheck framework.

import io.pravega.shared.health.AbstractHealthContributor;
import java.util.HashMap;
import java.util.Map;

class Throttler implements ..., ... {


    // 1. Provide a HealthContributor that is reachable from the root to attach to.
    //  i) If 'contributor' is the RootHealthContributor, the fully-qualified name is '/segmentstore-throttler'.
    // ii) If not, suppose 'contributor' is instead named 'operation-processor' and is registered under the RootHealthContributor.
    //     The fully-qualified name is then '/operation-processor/segmentstore-throttler'.
    Throttler(..., HealthContributor contributor) {
        contributor.register(new ThrottlerHealthContributor("segmentstore-throttler"));
    }

    ...

    // 2. Create a class extending AbstractHealthContributor and implement the 'doHealthCheck' call.
    class ThrottlerHealthContributor extends AbstractHealthContributor {
        // Assumption: AbstractHealthContributor exposes a constructor that takes the contributor's name.
        ThrottlerHealthContributor(String name) {
            super(name);
        }

        void doHealthCheck(Health.HealthBuilder builder) throws Exception {
            // 3. Define the logic used to determine the 'Status' of this HealthContributor.
            //    The remaining delay is read once so that the status and the details agree.
            long remainingSeconds = currentDelay.get().remaining.getRemaining().getSeconds();
            if (remainingSeconds >= ARBITRARY_CUTOFF) {
                builder.status(Status.DOWN);
            } else {
                builder.status(Status.UP);
            }
            // 4. Provide any other information you want to expose to an operator/client.
            Map<String, Object> details = new HashMap<>();
            details.put("delay-remaining-seconds", remainingSeconds);
            builder.details(details);
        }
    }

}

Consider the first case where 'segmentstore-throttler' is registered under the RootHealthContributor. A call to a REST endpoint exposing the Health of 'segmentstore-throttler' would generate the following JSON response:

{
    "name": "segmentstore-throttler",
    "status": "UP",
    "readiness": "false",
    "liveness": "true",
    "details": {
        "delay-remaining-seconds": "1234"
    }
}

HealthEndpoint API

The HealthEndpoint defines a contract that an implementer of a HealthServiceManager must provide. For the purposes of this PDP, let us assume that the controller has been integrated with this framework and will expose each method as a REST Endpoint (using JSON serialization). Here no id is provided so the information returned is parsed from the Health result of the RootHealthContributor:

  • health() - /v1/health: Returns the serialization of the service wide Health object.
{
    "name": "health-contributor-name",
    "status": "UP",
    "readiness": "true",
    "liveness": "true",
    "children": [
        {}
    ],
    "details": {
        "key": "value",
        //...
        "key": "value"
    }
}

This Health serialization is recursive -- if specified, the children property contains a Health serialization for each HealthContributor registered as a dependency of the top level HealthContributor.

  • readiness() - /v1/health/readiness: Provides the readiness status as defined by the Health result.
{
    "readiness": "true|false"
}
  • liveness() - /v1/health/liveness: Provides the liveness status as defined by the Health result.
{
    "liveness": "true|false"
}

The readiness/liveness routes can be used to implement Kubernetes liveness and readiness probes.
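
As a rough illustration of how these two routes could back Kubernetes HTTP probes, the sketch below serves them with the JDK's built-in HTTP server. Kubernetes treats an HTTP probe response with a status code of at least 200 and below 400 as a success, so the sketch returns 200 when the flag is true and 503 otherwise. The SketchProbeServer class, the choice of 503, and the use of com.sun.net.httpserver are assumptions for illustration; this PDP does not specify which status codes the routes return.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

// Hypothetical sketch built on the JDK's built-in HTTP server; not how Pravega exposes its routes.
class SketchProbeServer {

    /** 200 counts as a probe success for Kubernetes; 503 counts as a failure. */
    static int statusCode(boolean ok) {
        return ok ? 200 : 503;
    }

    /** Builds a body in the shape of the samples in this section, e.g. {"liveness": "true"}. */
    static String body(String field, boolean ok) {
        return "{\"" + field + "\": \"" + ok + "\"}";
    }

    /** Starts a server on the given port (0 picks a free port) that serves both probe routes. */
    static HttpServer start(int port, BooleanSupplier ready, BooleanSupplier alive) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        addRoute(server, "/v1/health/readiness", "readiness", ready);
        addRoute(server, "/v1/health/liveness", "liveness", alive);
        server.start();
        return server;
    }

    private static void addRoute(HttpServer server, String path, String field, BooleanSupplier check) {
        server.createContext(path, exchange -> {
            boolean ok = check.getAsBoolean();
            byte[] payload = body(field, ok).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(statusCode(ok), payload.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(payload);
            }
        });
    }
}
```

A pod specification would then point its readiness and liveness HTTP probes at the two paths on whichever port the server listens on.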

  • details() - /v1/health/details: Fetches the details exported during the health check of the HealthContributor.
{
    "details": {
        "key": "value"
    }
}

Each of these routes should also accept an id parameter which requests the result for any specific HealthContributor.
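
Resolving such an id amounts to walking the health tree from the root, one path segment at a time. The following is a minimal sketch of that lookup under the assumption that ids look like '/operation-processor/segmentstore-throttler'; the SketchNode class and its methods are hypothetical and exist only to illustrate the traversal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical node type used only to illustrate lookup by fully-qualified id.
class SketchNode {
    final String name;
    final Map<String, SketchNode> children = new LinkedHashMap<>();

    SketchNode(String name) { this.name = name; }

    /** Registers a child under this node and returns the child so calls can be chained. */
    SketchNode register(SketchNode child) {
        children.put(child.name, child);
        return child;
    }

    /** Walks from this node (the root) along a '/'-separated id; empty if any segment is missing. */
    Optional<SketchNode> resolve(String id) {
        SketchNode current = this;
        for (String segment : id.split("/")) {
            if (segment.isEmpty()) {
                continue; // skips the leading '/' and tolerates duplicate separators
            }
            current = current.children.get(segment);
            if (current == null) {
                return Optional.empty();
            }
        }
        return Optional.of(current);
    }
}
```

Note that a bare name such as '/segmentstore-throttler' does not match a contributor registered under 'operation-processor'; the full path from the root is required.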

Security

All health information is exported via HTTP routes and therefore we can reuse existing authentication and authorization mechanisms that the controller uses for its HTTP endpoints. To summarize, an AuthHandler service provider is consumed by the AuthHandlerManager to register all implementing classes. Where authentication is needed, the RESTAuthHelper inspects the provided headers and references its AuthHandlerManager to validate the request.

By default the PasswordAuthHandler will be the expected validating AuthHandler, loading its user set from the config/passwd file.

References

  1. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
  2. https://mesosphere.github.io/marathon/docs/health-checks.html
  3. https://docs.docker.com/engine/reference/builder/#healthcheck
  4. https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-health
  5. https://github.com/spring-projects/spring-boot (Much inspiration was taken from spring-boot, however most similarities end at the naming conventions.)