Managed Service Monitoring #101

mike-gangl · 2023-08-15T18:05:50Z

Managed Service Monitoring

"As an operator, I want to monitor the health of various Unity services."

NOTE: S3 bucket is defined here.

An example of the health dashboard from AWS:

per venue, you'd have something like:

Service	Current
Jupyterhub	🟢
Airflow	🟢
Data Catalog	🟢
Application Catalog	🟢

Marketplace options:

Jupyter (integrate)
SPS (integrate)
DS (buckets)
SDAP (integrate)

Each service needs a health endpoint.

This includes the Data Catalog, the Application Catalog, and app-pack-gen health metrics.

UI team to design an endpoint to show the health dashboard: https://apigateway-for-unity/project/venue/health

Questions to answer during planning:

Where is the health dashboard hosted? On the mgmt console? On the UIUX dashboard?
How do we add the health check endpoint to the marketplace description/output?
How will the mgmt console query the health endpoint?
What format will the health endpoint return for the dashboard to consume?
What permissions issues come with accessing the health endpoints?

Health check SSM params should be defined in the project venue as:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Acceptance Criteria

JupyterHub is integrated into the marketplace
- includes health endpoint
Airflow/SPS is integrated into the marketplace
- includes health endpoint
U-DS data bucket is integrated into the marketplace
- includes health endpoint
Shared Service health endpoints
- takes the form: /unity/healthCheck/shared-services/
- Data Catalog
- Algorithm Catalog
- App-pack-gen (?)
- Process Mapper health check
A service can query health endpoints and generate or store a JSON health response (e.g. every 5 minutes)
- Don't overengineer this; it could simply be a lambda that queries health endpoints and creates a JSON document that is stored in a bucket every 5 minutes. We can optimize later.
Unity-py client to request health status from venue endpoint and return results
UIUX developed dashboard for displaying health of existing services
- should respond to dynamic content
- should read the above generated json file (from a bucket, webservice, etc)
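
A minimal sketch of the "simple lambda" idea from the criteria above, offered as an illustration rather than the U-CS implementation. It probes each service and assembles a document in the same shape as the sample JSON in this issue. The `healthUrl` key, the `http_probe` helper, and the rule "any 2xx response means HEALTHY" are assumptions made for the example; writing the result to a bucket and scheduling the run every 5 minutes are left out.

```python
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Dict, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to the URL answers with a 2xx status (assumed rule)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return 200 <= resp.status < 300


def build_health_document(
    services: List[Dict[str, str]],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, list]:
    """Probe every service once and return a report shaped like the sample JSON."""
    checked_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entries = []
    for svc in services:
        try:
            healthy = probe(svc["healthUrl"])  # "healthUrl" is a hypothetical key
        except Exception:
            # Any failure to reach the endpoint is reported as UNHEALTHY.
            healthy = False
        entries.append(
            {
                "service": svc["service"],
                "landingPage": svc["landingPage"],
                "healthChecks": [
                    {
                        "status": "HEALTHY" if healthy else "UNHEALTHY",
                        "date": checked_at,
                    }
                ],
            }
        )
    return {"services": entries}
```

Passing the probe in as a parameter keeps the aggregation logic testable without network access; a Lambda handler could call `build_health_document` and serialize the result with `json.dumps` before storing it.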

Work Tickets

Link to work tickets required to implement the epic

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

TBC

Out of scope but future work:

Historical health (last 7 days)
Alerting a user based on health
Aggregating health - e.g. another dashboard that can monitor health across all venues (e.g. MMO)
Degradation vs healthy/not healthy
Measuring 'uptime' of a service for SLA metrics

{
  "services": [
    {
      "service": "airflow",
      "landingPage":"https://unity.com/project/venue/processing/ui",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "jupyter",
      "landingPage":"https://unity.com/project/venue/ads/jupyter",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "otherService",
      "landingPage":"https://unity.com/project/venue/other_service",
      "healthChecks": [
        {
          "status": "UNHEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    }
  ]
}

In the future, we might add more detail to a healthcheck object, like date of check, error, or a subgraph of other dependencies (database health, api health).

This should also accommodate the 'historical' record we envision in the future, where multiple healthchecks can be shown (e.g. daily health) for a given service.

rtapella · 2024-04-09T18:18:15Z

Would like to see:

recent_healthy: stores the most recent timestamp of a HEALTHY response (same as “date” if status is HEALTHY)
Maybe: “endpoint” or “source” to confirm what’s being checked?
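
One way the suggested recent_healthy value could be derived from a service's healthChecks array, as a sketch only. The function name is hypothetical, and the comparison assumes every "date" uses the fixed-width UTC format from the sample JSON (e.g. 2024-04-09T18:01:08Z), so that string order matches time order.

```python
from typing import Dict, List, Optional


def recent_healthy(health_checks: List[Dict[str, str]]) -> Optional[str]:
    """Return the newest "date" among HEALTHY checks, or None if there is none."""
    healthy_dates = [
        check["date"]
        for check in health_checks
        if check.get("status") == "HEALTHY"
    ]
    # Fixed-width ISO 8601 UTC strings sort chronologically as plain strings.
    return max(healthy_dates) if healthy_dates else None
```

When the latest check is HEALTHY, the result equals that check's "date", matching the description above.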

mike-gangl · 2024-04-09T21:47:40Z

Think about: authorization (who owns the username/password for hitting an authenticated endpoint?) and multiple components for a service area.

future: historical records and tracking 'events'

galenatjpl · 2024-04-10T03:51:35Z

@mike-gangl I updated the diagram and some descriptions, and some work tickets in the above description

mike-gangl · 2024-04-10T16:29:12Z

Updated to include SSM naming parameter:

/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

galenatjpl · 2024-04-10T16:56:52Z

@mike-gangl NOTE: the diagram above is slightly off at this time (it still needs an update to include <MARKETPLACE_ITEM>).

anilnatha · 2024-04-29T22:47:17Z

Regarding the sample JSON Mike posted earlier. Would like to suggest minor changes.

Use camelCase for the keys.
Can we add a title field that is used to display the name of the service in the UI navbar and in the Health Dashboard data grid?
The landingPage URLs should include the protocol, https://...

anilnatha · 2024-04-29T23:40:22Z

Also, in the list of healthchecks, can it be assumed that these will be stored in descending order, i.e. the most recent health check is the zeroth element in that array?

mike-gangl · 2024-04-30T15:10:15Z

I don't think we plan on ordering the events by health check date.
title and service seem interchangeable?
Yeah, the protocol should be a part of the entry. I was just lazy there.
As for camelCase, Google agrees with you, and that's good enough for me.
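
Since the healthChecks array is not guaranteed to be ordered, a client cannot rely on the zeroth element being the most recent. A small sketch of how a consumer might pick the latest check instead; the helper names are hypothetical and the date format is assumed to match the sample JSON.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def parse_check_date(value: str) -> datetime:
    """Parse a timestamp such as 2024-04-09T18:01:08Z into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_check(health_checks: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the check with the newest "date", regardless of array order."""
    if not health_checks:
        return None
    return max(health_checks, key=lambda check: parse_check_date(check["date"]))
```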

mike-gangl · 2024-04-30T15:14:22Z

@hargitayjpl - see my comment above and the new format of the health check response you'll be writing. camelCase is really the only change, as I think you'll simply pass whatever healthcheck value was supplied by the application.

rtapella · 2024-04-30T16:17:09Z

We can use title and service interchangeably as long as we're happy using "service" as the "English" label for the service.

For the keys, if we use camelCase then we can parse them into title case (e.g., "Camel Case")
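
The camelCase to title case idea could look roughly like this. It is only a sketch: the function name is hypothetical, and it splits solely where a lowercase letter or digit is followed by an uppercase letter, so acronyms inside keys are not handled specially.

```python
import re


def camel_to_title(key: str) -> str:
    """Turn a camelCase key such as "landingPage" into "Landing Page"."""
    # Insert a space at each lowercase-or-digit to uppercase boundary.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Capitalize the first letter of each word and leave the rest untouched.
    return " ".join(word[:1].upper() + word[1:] for word in spaced.split())
```

For example, `camel_to_title("camelCase")` gives "Camel Case" and `camel_to_title("healthChecks")` gives "Health Checks".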

galenatjpl · 2024-04-30T18:30:16Z

@mike-gangl @hargitayjpl
I think we really need these two formats:

Shared Services Account components:
/unity/healthCheck/shared-services/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Venue account components:
/unity/healthCheck/<PROJECT>/<VENUE>/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Brandon and I discussed this in a meeting this morning, and we want the health components namespaced by the project/venue they are in. If we simply use something like /unity/healthcheck/airflowUI, it will be ambiguous and cause data overwrite issues.
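
To make the two proposed formats concrete, here is a sketch of a helper that builds the SSM parameter name for either case. The function name and its arguments are hypothetical, and the rule that a name without project and venue belongs to the shared-services account is an assumption drawn from the two formats above.

```python
from typing import Optional


def health_check_param_name(
    marketplace_item: str,
    component_name: str,
    project: Optional[str] = None,
    venue: Optional[str] = None,
) -> str:
    """Build a health check parameter name in one of the two proposed formats."""
    for segment in (marketplace_item, component_name, project, venue):
        if segment is not None and (not segment or "/" in segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    if project is None and venue is None:
        # Shared Services account components.
        return f"/unity/healthCheck/shared-services/{marketplace_item}/{component_name}"
    if project is None or venue is None:
        raise ValueError("project and venue must be given together")
    # Venue account components, namespaced by project and venue.
    return f"/unity/healthCheck/{project}/{venue}/{marketplace_item}/{component_name}"
```

With this shape, two venues that both deploy the same component produce distinct names, which is the overwrite problem the namespacing is meant to avoid.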

mike-gangl · 2024-06-18T21:23:05Z

This is close to being complete. Lambda and crons for proof of concept, management API is up and exposed.

rtapella · 2024-06-18T21:25:34Z

Related UI work: unity-sds/unity-ui#32

rtapella · 2024-06-26T18:51:43Z

Waiting for the U-CS health-endpoint to be ready. Placeholder JSON is being used for the draft implementations of the clients:

unity-sds/unity-py#86
unity-sds/unity-ui#25

galenatjpl · 2024-09-10T17:06:29Z

@brianlee731 should we move this to the current release? This is almost done and I think some other service areas need to integrate into what U-CS built.

mike-gangl added this to Unity Project Board Aug 15, 2023

mike-gangl self-assigned this Aug 15, 2023

mike-gangl converted this from a draft issue Aug 15, 2023

mike-gangl added the Feature label Aug 15, 2023

mike-gangl added U-DS U-CS U-SPS U-AS U-ADS U-UIUX labels Mar 29, 2024

mike-gangl moved this from Design Phase to Todo in Unity Project Board Mar 29, 2024

mike-gangl assigned rtapella Mar 29, 2024

rtapella mentioned this issue Apr 2, 2024

[Feature] Add web-based UI to display status for each registered service in a venue unity-sds/unity-ui#25

Closed

galenatjpl mentioned this issue Apr 8, 2024

Implement lambda in Venue account to periodically gather health status unity-sds/unity-cs#367

Closed

2 tasks

LucaCinquini mentioned this issue Apr 10, 2024

Expose the SPS health endpoint(s) unity-sds/unity-sps#60

Open

rtapella mentioned this issue Apr 10, 2024

Figure out what the json reponse format is unity-sds/unity-cs#371

Closed

rtapella mentioned this issue Apr 22, 2024

develop the health-status page based on mockups and spec unity-sds/unity-ui#29

Open
