Managed Service Monitoring #101

Open
5 of 9 tasks
mike-gangl opened this issue Aug 15, 2023 · 16 comments

@mike-gangl
Contributor

mike-gangl commented Aug 15, 2023

Managed Service Monitoring

"As an operator, I want to monitor the health of various Unity services"
Screenshot 2024-04-09 at 8 34 28 PM
NOTE: S3 bucket is defined here.

An example of the health dashboard from AWS:

Screenshot 2024-03-29 at 1 03 27 PM

Per venue, you'd have something like:

| Service | Current |
| --- | --- |
| JupyterHub | 🟢 |
| Airflow | 🟢 |
| Data Catalog | 🟢 |
| Application Catalog | 🟢 |

Marketplace options:

  • Jupyter (integrate)
  • SPS (integrate)
  • DS (buckets)
  • SDAP (integrate)

Each Service needs a health endpoint

  • includes Data Catalog and Application Catalog, app-pack-gen health metrics

UI team to design endpoint to show health dashboard https://apigateway-for-unity/project/venue/health

Questions to answer during planning:

  • Where is the health dashboard hosted? On the management console? On the UIUX dashboard?
  • How do we add the health check endpoint to the marketplace description/output?
  • How will the management console query the health endpoint?
  • What format will the health endpoint return for the dashboard to consume?
  • Are there permissions issues with accessing the health endpoints?

Health check SSM params should be defined in the project venue as:
/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Acceptance Criteria

  • JupyterHub is integrated into the marketplace
    • includes health endpoint
  • Airflow/SPS is integrated into the marketplace
    • includes health endpoint
  • U-DS data bucket is integrated into the marketplace
    • includes health endpoint
  • Shared Service health endpoints
    • takes the form: /unity/healthCheck/shared-services/
    • Data Catalog
    • Algorithm Catalog
    • App-pack-gen (?)
    • Process Mapper health check
  • A service can query health endpoints and generate or store a JSON health response (e.g. every 5 minutes)
    • Don't overengineer this: it could simply be a Lambda that queries the health endpoints every 5 minutes and writes a JSON document to a bucket. We can optimize later.
  • Unity-py client to request health status from venue endpoint and return results
  • UIUX developed dashboard for displaying health of existing services
    • should respond to dynamic content
    • should read the above generated json file (from a bucket, webservice, etc)
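
The "simple Lambda" idea in the acceptance criteria could look roughly like the sketch below. It is only an illustration under assumptions: the function names, the `endpoints` mapping, and the rule "HTTP 2xx means HEALTHY" are hypothetical and not taken from the Unity code base. Writing the resulting document to a bucket and scheduling the run every 5 minutes are left out.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List
from urllib.request import urlopen


def check_endpoint(url: str, timeout: float = 5.0) -> str:
    """Return "HEALTHY" if the endpoint answers with HTTP 2xx, else "UNHEALTHY".

    Assumption: a plain GET with a 2xx answer is enough to call a service healthy.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return "HEALTHY" if 200 <= response.status < 300 else "UNHEALTHY"
    except Exception:
        # Network errors, timeouts, and HTTP error codes all count as unhealthy.
        return "UNHEALTHY"


def build_health_document(
    endpoints: Dict[str, str],
    checker: Callable[[str], str] = check_endpoint,
) -> str:
    """Query each endpoint once and return the combined JSON document."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    services: List[dict] = []
    for name, url in sorted(endpoints.items()):
        services.append({
            "service": name,
            "healthChecks": [{"status": checker(url), "date": now}],
        })
    return json.dumps({"services": services}, indent=2)
```

The `checker` argument is injectable so the document builder can be exercised without network access, for example `build_health_document({"airflow": "https://example.invalid/health"}, checker=lambda url: "HEALTHY")`.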

Work Tickets

Link to work tickets required to implement the epic

Dependencies

Other epics or outside tickets required for this to work

Associated Risks

links to risk issues associated with this epic

  • TBC

Out of scope but future work:

  • Historical health (last 7 days)
  • Alerting a user based on health
  • Aggregating health - e.g. another dashboard that can monitor health across all venues (e.g. MMO)
  • Degradation vs healthy/not healthy
  • Measuring 'uptime' of a service for SLA metrics

previous


This overlaps with the idea of Common metrics /logs aggregation service #92.

How do we plan to monitor the deployed managed services? I think to evolve into a full multi-tenant system we need to make sure we are monitoring:

Health of a service
Uptime of a service
Degradation (health?) - if it's responding to requests, how fast does it respond?

Think of a single console that can monitor all of the managed services across multiple accounts. What does this look like? How are logs/metrics/events propagated to the "central" dashboard? Or does the dashboard reach into different accounts to view things?


@mike-gangl
Contributor Author

mike-gangl commented Apr 9, 2024

Simple health check response format to be supplied by healthResponse

{
  "services": [
    {
      "service": "airflow",
      "landingPage":"https://unity.com/project/venue/processing/ui",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "jupyter",
      "landingPage":"https://unity.com/project/venue/ads/jupyter",
      "healthChecks": [
        {
          "status": "HEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    },
    {
      "service": "otherService",
      "landingPage":"https://unity.com/project/venue/other_service",
      "healthChecks": [
        {
          "status": "UNHEALTHY",
          "date": "2024-04-09T18:01:08Z"
        }
      ]
    }
  ]
}

In the future, we might add more detail to a healthcheck object, like date of check, error, or a subgraph of other dependencies (database health, api health).

This should also accommodate the 'historical' record we envision in the future, where multiple healthchecks can be shown (e.g. daily health) for a given service.
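
As a rough illustration of how a client might consume the response format above, here is a minimal parsing sketch. The class and function names are hypothetical, and it assumes only the keys shown in the sample (`services`, `service`, `landingPage`, `healthChecks`, `status`, `date`). Keeping `healthChecks` as a list leaves room for the historical case.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HealthCheck:
    status: str
    date: str


@dataclass
class ServiceHealth:
    service: str
    landing_page: str
    health_checks: List[HealthCheck]


def parse_health_response(text: str) -> List[ServiceHealth]:
    """Turn the JSON health document into typed records."""
    payload = json.loads(text)
    result = []
    for entry in payload["services"]:
        checks = [
            HealthCheck(check["status"], check["date"])
            for check in entry.get("healthChecks", [])
        ]
        result.append(
            ServiceHealth(entry["service"], entry.get("landingPage", ""), checks)
        )
    return result


def status_by_service(services: List[ServiceHealth]) -> Dict[str, List[str]]:
    """Map each service name to the list of statuses it reported."""
    return {s.service: [c.status for c in s.health_checks] for s in services}
```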

@rtapella
Collaborator

rtapella commented Apr 9, 2024

Would like to see:

  • recent_healthy: stores the most recent timestamp of a HEALTHY response (same as “date” if status is HEALTHY)
  • Maybe: “endpoint” or “source” to confirm what’s being checked?
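
One possible way to maintain the suggested recent_healthy value between runs is sketched below. This is an assumption about how it could work, not an agreed design: the function name is hypothetical, and it presumes the checker keeps the previous value around (for example from the last stored document).

```python
from typing import Optional


def update_recent_healthy(previous: Optional[str], check: dict) -> Optional[str]:
    """Return the new recent_healthy timestamp after one health check.

    If the check is HEALTHY, recent_healthy equals the check's own date;
    otherwise the previously stored value (possibly None) is carried forward.
    """
    if check["status"] == "HEALTHY":
        return check["date"]
    return previous
```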

@mike-gangl
Contributor Author

mike-gangl commented Apr 9, 2024

Think about: Authorization. Who owns the username/password for hitting an authenticated endpoint? Also consider service areas with multiple components.

future: historical records and tracking 'events'

@galenatjpl

@mike-gangl I updated the diagram and some descriptions, and some work tickets in the above description

@mike-gangl
Contributor Author

Updated to include SSM naming parameter:

/unity/healthCheck/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

@galenatjpl

@mike-gangl NOTE: the diagram above is slightly off at this time (it still needs an update to include <MARKETPLACE_ITEM>).

@anilnatha
Collaborator

anilnatha commented Apr 29, 2024

Regarding the sample JSON Mike posted earlier. Would like to suggest minor changes.

  1. Use camelCase for the keys.
  2. Can we add a title field that is used to display the name of the service in the UI navbar and in the Health Dashboard data grid?
  3. The landingPage URLs should include the protocol, https://...

@anilnatha
Collaborator

Also, in the list of healthchecks, can it be assumed that these will be stored in descending order, i.e. the most recent health check is the zeroth element in that array?

@mike-gangl
Contributor Author

  1. I don't think we plan on ordering the events by health check date.
  2. title and service seem interchangeable?
  3. yeah, the protocol should be a part of the entry. I was just lazy there.
  4. as for camelCase, google agrees with you, and that's good enough for me.
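
Since the healthChecks array is not expected to be ordered by date, a consumer that wants the most recent check would have to find it itself. A minimal sketch, assuming every `date` uses the `YYYY-MM-DDTHH:MM:SSZ` form from the sample response (the function names are hypothetical):

```python
from datetime import datetime, timezone
from typing import List, Optional


def parse_date(value: str) -> datetime:
    """Parse a timestamp such as 2024-04-09T18:01:08Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def latest_check(health_checks: List[dict]) -> Optional[dict]:
    """Return the check with the newest date, or None for an empty list."""
    if not health_checks:
        return None
    return max(health_checks, key=lambda check: parse_date(check["date"]))
```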

@mike-gangl
Contributor Author

@hargitayjpl - see my comment above and the new format of the health check response you'll be writing. camelCase is really the only change, as I think you'll simply pass whatever healthcheck value was supplied by the application.

@rtapella
Collaborator

We can use title and service interchangeably as long as we're happy using "service" as the "English" label for the service.

For the keys, if we use camelCase then we can parse them into title case (e.g., "Camel Case")
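
A small sketch of that camelCase-to-title-case conversion, assuming keys are simple camelCase identifiers like those in the sample response (the function name is hypothetical):

```python
import re


def camel_to_title(key: str) -> str:
    """Convert a camelCase key such as "landingPage" to "Landing Page"."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Capitalize only the first character; the remaining words already start uppercase.
    return spaced[:1].upper() + spaced[1:]
```

For example, `camel_to_title("healthChecks")` gives `"Health Checks"` and `camel_to_title("service")` gives `"Service"`.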

@galenatjpl

galenatjpl commented Apr 30, 2024

@mike-gangl @hargitayjpl
I think we really need these two formats:

Shared Services Account components:
/unity/healthCheck/shared-services/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Venue account components:
/unity/healthCheck/<PROJECT>/<VENUE>/<MARKETPLACE_ITEM>/<COMPONENT_NAME>

Brandon and I discussed this in a meeting this morning, and we want the health components namespaced by the project/venue they are in. If we simply use something like /unity/healthcheck/airflowUI, it will be ambiguous and cause data overwrite issues.
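
To make the two proposed formats concrete, here is a hedged sketch of a helper that builds the parameter name for either case. The function name and the example values are hypothetical; only the two path templates come from the comment above.

```python
from typing import Optional


def health_check_param_name(
    marketplace_item: str,
    component_name: str,
    project: Optional[str] = None,
    venue: Optional[str] = None,
) -> str:
    """Build the SSM parameter name for a health check entry.

    With no project and venue, the shared-services form is used;
    with both, the venue-account form is used.
    """
    if project is None and venue is None:
        return f"/unity/healthCheck/shared-services/{marketplace_item}/{component_name}"
    if not project or not venue:
        # A half-specified namespace would reintroduce the ambiguity being avoided.
        raise ValueError("project and venue must be given together")
    return f"/unity/healthCheck/{project}/{venue}/{marketplace_item}/{component_name}"
```

With made-up values, `health_check_param_name("sps", "airflowUI", project="myproject", venue="dev")` yields `/unity/healthCheck/myproject/dev/sps/airflowUI`, so the same component deployed in two venues no longer maps to the same parameter.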

@mike-gangl
Contributor Author

This is close to being complete. Lambda and crons for proof of concept, management API is up and exposed.

@rtapella
Collaborator

Related UI work: unity-sds/unity-ui#32

@rtapella
Collaborator

Waiting for the U-CS health-endpoint to be ready. Placeholder JSON is being used for the draft implementations of the clients:

unity-sds/unity-py#86
unity-sds/unity-ui#25

@galenatjpl

@brianlee731 should we move this to the current release? This is almost done and I think some other service areas need to integrate into what U-CS built.
