
Suitability of issue tickets as labels for log anomaly detection on MOC/OCP #936

Closed
drbwa opened this issue Jan 31, 2021 · 6 comments
Labels
kind/question Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

drbwa commented Jan 31, 2021

@hemajv @4n4nd Please allow me to ask a number of questions regarding the issue tickets.

Consider an anomaly detector that learns to detect incidents/outages/failures
based on an analysis of a stream of log messages being produced by a system. This
anomaly detector learns in an unsupervised manner. However, we need some
labels (sets of log messages that are indicative of specific issues) to tune
and test the anomaly detector.
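To make concrete how ticket-derived labels could be used to tune and test an unsupervised detector, here is a minimal sketch (the function name and the idea of integer log-window IDs are hypothetical, not part of any existing pipeline): it compares detector-flagged log windows against windows covered by tickets and reports precision and recall.

```python
from typing import Dict, Set

def evaluate_detector(flagged: Set[int], labeled: Set[int]) -> Dict[str, float]:
    """Score detector-flagged log-window IDs against ticket-derived labels."""
    tp = len(flagged & labeled)   # windows correctly flagged as anomalous
    fp = len(flagged - labeled)   # false alarms
    fn = len(labeled - flagged)   # missed incidents
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```

A threshold sweep over such scores is the usual way to tune a detector once labels exist.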

The overarching questions I would like to try and clarify are:

  1. Are the tickets currently generated indicative of a variety of issues that
    can occur in MOC/OCP?
  2. Can the issues for which we generate tickets be expected to be visible in
    log messages?
  3. Does the representation of the issue tickets facilitate automated processing?

Let me try to confirm my understanding.

Are alerts configured based on data available in Prometheus? Are alerts based on
metrics only (i.e. not on logs)?

Which components are being monitored and have alerts defined for them?

As far as I understand, the alerts currently configured will generate issue
tickets for un/availability events. Is this correct? Do we currently alert on
any other issues?

Do we have, or is there, some kind of fault model, or something like a set of
alert templates for monitoring an OCP deployment, that we could use to expand the
set of issues that tickets are created for?

What metadata do tickets make available? For example, I suppose that the
outage start time is represented by the ticket creation time. How are we going
to track the end time of an outage?

Which metadata fields are going to be generated automatically and which other
fields do we expect to be added manually, possibly as free text? In other words, is there a 'schema' for the fields in an issue ticket?

Are all tickets going to be generated automatically or will we also have
manually created tickets (e.g., based on user complaints)? What is the template
for manually created tickets?

/cc @davidohana @eranra @ronenschafferibm

@hemajv hemajv added kind/question Categorizes issue or PR as a support question. and removed kind/question Categorizes issue or PR as a support question. labels Feb 1, 2021

hemajv commented Feb 1, 2021

Thank you for opening the issue @drbwa! 😃 These are some great questions and I'm not sure if we have the answers for all of them yet, but here are my thoughts.

  1. Are the tickets currently generated indicative of a variety of issues that
    can occur in MOC/OCP?

The tickets currently generated are based on the availability alerts we have so far defined here: https://github.com/operate-first/apps/blob/master/odh/base/monitoring/overrides/prometheus-operator/overlays/alerts/prometheus-rules.yaml
Right now, we have basic availability alerts which detect when any of the applications deployed on MOC go down, but we are planning to add more alerts as well.

  2. Can the issues for which we generate tickets be expected to be visible in
    log messages?

The issues are being created by the GitHub receiver, which does generate some logs such as:

[Screenshot from 2021-02-01 showing logs produced by the GitHub receiver]

  3. Does the representation of the issue tickets facilitate automated processing?

I think the following details may help answer this question:

  • Are alerts configured based on data available in Prometheus? Are alerts based on
    metrics only (i.e. not on logs)?

Yes, currently alerts are configured only based on the metrics we have available in Prometheus.

  • Which components are being monitored and have alerts defined for them?

As I pointed out, you can find our alerting rules defined here. The components we have alerts defined for so far are JupyterHub, ArgoCD, Grafana, Prometheus, and Observatorium.

  • As far as I understand, the alerts currently configured will generate issue
    tickets for un/availability events. Is this correct? Do we currently alert on
    any other issues?

Yes, currently we have defined only basic availability alerts, but we aim to define other alerts as well.

  • Do we have, or is there, some kind of fault model, or something like a set of
    alert templates for monitoring an OCP deployment, that we could use to expand the
    set of issues that tickets are created for?

We currently do not have any such template, but other teams at Red Hat, such as the OSD and app-SRE teams, have some well-defined alerts for their monitoring that we are looking into and would like to incorporate.

  • What metadata do tickets make available? For example, I suppose that the
    outage start time is represented by the ticket creation time. How are we going
    to track the end time of an outage?

Here is an example of what an issue looks like: operate-first/sre#49. There are some labels associated with the alert. For the timestamps, we will probably have to look at the logs from the GitHub receiver since they seem to have timestamps of when an alert was resolved.

  • Which metadata fields are going to be generated automatically and which other
    fields do we expect to be added manually, possibly as free text? In other words, is there a 'schema' for the fields in an issue ticket?

Currently, it seems that the GitHub receiver issue templates are static and are defined here: https://github.com/m-lab/alertmanager-github-receiver/blob/master/alerts/template.go#L54. If we want to provide additional fields, I believe we can send the changes upstream for review. I had a similar discussion with the upstream folks here.
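To illustrate what automated processing of such issues might look like, here is a hedged sketch that extracts alert labels from an issue body. The `key = value` line format is an assumption made for illustration, not the actual output of the receiver's template:

```python
import re
from typing import Dict

# Assumed (hypothetical) body format: label lines like "- alertname = TargetDown"
LABEL_RE = re.compile(r"^\s*-?\s*(\w+)\s*=\s*(.+?)\s*$")

def parse_issue_labels(body: str) -> Dict[str, str]:
    """Collect key = value alert labels from an issue body (format assumed)."""
    labels: Dict[str, str] = {}
    for line in body.splitlines():
        m = LABEL_RE.match(line)
        if m:
            labels[m.group(1)] = m.group(2)
    return labels
```

If extra fields were accepted upstream, a parser like this could feed labels directly into a training or evaluation set.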

  • Are all tickets going to be generated automatically or will we also have
    manually created tickets (e.g., based on user complaints)? What is the template
    for manually created tickets?

We aim to have most tickets generated automatically, but as you pointed out, things like user complaints will need to be created manually. I think we are planning to follow templates similar to what we have defined here for our support repo, but we haven't defined them yet.

I hope this helps and please do let us know if you have any other questions!
@4n4nd @HumairAK feel free to add anything else I may have missed.


drbwa commented Feb 2, 2021

Thank you for your answers @hemajv . Let me share some thoughts.

Regardless of what mechanism you choose to use (e.g., GH issues, Jira, ServiceNow), there are three things that I think will serve you well a bit further down the road.

First, these tickets will become a repository of useful information. Ideally, as they accumulate over time, they will allow us to build an understanding of what is going on in the system and where the hotspots are, and to gather statistics on different SRE activities (e.g., MTTA, MTTD, MTTR).

This first point is much easier said than done, but I guess it is something to strive for (and I would be happy to help figure out how to get there). You want to have a repository of tickets for important issues, indicative of problems that SREs managing this or that component, or OCP in general, really care about (or at least, to be able to identify those tickets).

Second, whatever you can do to enable processing these tickets in an automated manner to extract useful information down the road will be very useful.
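For example, if tickets expose creation and resolution timestamps, a statistic such as MTTR falls out mechanically. A minimal sketch, assuming each resolved ticket yields a (created, closed) timestamp pair:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

def mean_time_to_repair(tickets: List[Tuple[datetime, datetime]]) -> timedelta:
    """Approximate MTTR as the mean of (closed - created) over resolved tickets."""
    durations = [closed - created for created, closed in tickets]
    return sum(durations, timedelta()) / len(durations)
```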

Third, and related to this, tickets should make plenty of useful metadata available (in my view). For example: source of the alert (automated or human), cause for triggering the alert, severity and impact of the issue, source of the issue, various durations (time to detect, time to acknowledge, time to mitigate, time to repair), lifecycle events (was the ticket reopened several times, was it opened and closed again automatically or by a human), categories of likely root causes, and so on.
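As a sketch only, the metadata fields above could be captured in a schema along these lines (every field name here is hypothetical, not an existing ticket format):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IncidentTicket:
    """Illustrative schema for the proposed ticket metadata (names hypothetical)."""
    source: str                               # "automated" or "human"
    trigger: str                              # alert rule or user report
    severity: str
    impact: str
    component: str
    created_at: datetime                      # outage start proxy
    acknowledged_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None    # outage end proxy
    reopen_count: int = 0                     # lifecycle signal
    root_cause_category: Optional[str] = None
```

Keeping fields structured like this, rather than free text, is what makes the mining described above tractable.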

I know that the above is in part more philosophical than actionable, but would be happy to help figure out how to get closer to a set of issue/incident tickets that can be mined for useful information.

@sesheta sesheta added the kind/question Categorizes issue or PR as a support question. label Feb 8, 2021

sesheta commented Oct 11, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2021

sesheta commented Nov 11, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 11, 2021

sesheta commented Dec 11, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Dec 11, 2021

sesheta commented Dec 11, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@durandom durandom transferred this issue from operate-first/operations Sep 6, 2022