Suitability of issue tickets as labels for log anomaly detection on MOC/OCP #936
Thank you for opening the issue @drbwa! 😃 These are some great questions and I'm not sure if we have the answers for all of them yet, but here are my thoughts.
The tickets currently generated are based on the availability alerts we have so far defined here: https://github.com/operate-first/apps/blob/master/odh/base/monitoring/overrides/prometheus-operator/overlays/alerts/prometheus-rules.yaml
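As a bit of context (a minimal sketch only; the endpoint URL and job name below are placeholders, not the actual MOC/OCP configuration), availability alerts like these are typically PromQL expressions over metrics such as `up`, which you can also query directly via the Prometheus HTTP API:

```python
import requests

PROM_URL = "https://prometheus.example.com"  # placeholder, not the real MOC endpoint


def is_target_up(job: str) -> bool:
    """Return True if Prometheus reports the scrape target for `job` as up."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f'up{{job="{job}"}}'},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # `up` is 1 when the last scrape of the target succeeded, 0 otherwise.
    return any(float(sample["value"][1]) == 1.0 for sample in results)


print(is_target_up("jupyterhub"))  # hypothetical job name
```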
The issues are being created by the GitHub receiver, which does generate some logs of its own.
I think the following details may help answer your questions:
1. Yes, currently alerts are configured only based on the metrics we have available in Prometheus.
2. As I pointed out, you can find our alerting rules defined here. The components we have alerts defined for so far are JupyterHub, ArgoCD, Grafana, Prometheus, and Observatorium.
3. Yes, currently we have defined only basic availability alerts, but we aim to define other alerts as well.
4. We currently do not have any such template created, but there are other teams at Red Hat, such as the OSD and app-SRE teams, who have some well-defined alerts for their monitoring that we are looking into and would like to incorporate.
5. Here is an example of what an issue looks like: operate-first/sre#49. There are some labels associated with the alert. For the timestamps, we will probably have to look at the logs from the GitHub receiver, since they seem to have timestamps of when an alert was resolved.
6. Currently, it seems that the GitHub receiver issue templates are static and are defined here: https://github.com/m-lab/alertmanager-github-receiver/blob/master/alerts/template.go#L54. If we want to provide additional fields, I believe we can send the changes upstream for them to review and accept. I had a similar discussion with the upstream folks here.
7. We aim to have most tickets generated automatically, but as you pointed out, things like user complaints will need to be created manually. I think we are planning to follow templates similar to what we have defined here.

I hope this helps and please do let us know if you have any other questions!
Thank you for your answers @hemajv. Let me share some thoughts. Regardless of what mechanism you choose to use (e.g., GH issues, Jira, ServiceNow), there are three things that I think will serve you well a bit further down the road.

These tickets will become a repository of useful information over time. Ideally, as they accumulate, they will allow us to build an understanding of what is going on in the system, where the hotspots are, and to gather statistics on different SRE activities (e.g., MTTA, MTTD, MTTR). This first point is much easier said than done, but I guess it is something to strive for (and I would be happy to help figure out how to get there). You want to have a repository of tickets for important issues, indicative of problems that SREs managing this or that component, or OCP in general, really care about (or at least, to be able to identify those tickets).

Second, whatever you can do to enable processing these tickets in an automated manner to extract useful information down the road will be very useful. Related to this, tickets should make loads of useful metadata available (in my view): for example, the source of the alert (automated or human), the cause for triggering the alert, the severity and impact of the issue, the source of the issue, various durations (time to detect, time to acknowledge, time to mitigate, time to repair), lifecycle events (was the ticket reopened several times, was it opened and closed again automatically or by human touch), categories of likely root causes, and so on.

I know that the above is in part more philosophical than actionable, but I would be happy to help figure out how to get closer to a set of issue/incident tickets that can be mined for useful information.
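To make the metadata and statistics point a bit more concrete, here is a minimal sketch (the field names and structure are purely illustrative assumptions on my part, not a proposed schema) of how such ticket metadata could be represented and mined for MTTA/MTTR-style numbers:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class IncidentTicket:
    """Hypothetical ticket metadata; field names are illustrative only."""
    source: str                      # "automated" or "human"
    component: str                   # e.g. "jupyterhub"
    severity: str                    # e.g. "critical"
    detected_at: datetime            # alert fired / ticket created
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    reopened_count: int = 0


def _mean_minutes(deltas):
    return mean(d.total_seconds() / 60 for d in deltas) if deltas else None


def ticket_stats(tickets: List[IncidentTicket]) -> dict:
    """Compute MTTA and MTTR (in minutes) from whatever timestamps are present."""
    tta = [t.acknowledged_at - t.detected_at for t in tickets if t.acknowledged_at]
    ttr = [t.resolved_at - t.detected_at for t in tickets if t.resolved_at]
    return {"MTTA_min": _mean_minutes(tta), "MTTR_min": _mean_minutes(ttr)}


# Example with one hypothetical ticket:
tickets = [
    IncidentTicket("automated", "jupyterhub", "critical",
                   detected_at=datetime(2021, 6, 1, 10, 0),
                   acknowledged_at=datetime(2021, 6, 1, 10, 5),
                   resolved_at=datetime(2021, 6, 1, 10, 45)),
]
print(ticket_stats(tickets))  # {'MTTA_min': 5.0, 'MTTR_min': 45.0}
```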
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@hemajv @4n4nd Please allow me a number of questions regarding the issue tickets.
Consider an anomaly detector that learns to detect incidents/outages/failures
based on an analysis of a stream of log messages being produced by a system. This
anomaly detector learns in an unsupervised manner. However, we need some
labels (sets of log messages that are indicative of specific issues) to tune
and test the anomaly detector.
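As a rough illustration of what such labels could look like (a sketch only; the timestamps and the simple interval-based labeling are my assumptions, not an agreed approach), ticket creation and resolution times could be turned into labels for individual log lines:

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical outage intervals recovered from tickets (ticket created -> resolved).
outages: List[Tuple[datetime, datetime]] = [
    (datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 10, 45)),
]


def label_log_line(timestamp: datetime) -> int:
    """Label a log line as anomalous (1) if it falls inside a known outage window."""
    return int(any(start <= timestamp <= end for start, end in outages))


# A log line emitted during the outage above gets label 1.
print(label_log_line(datetime(2021, 6, 1, 10, 15)))
```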
The overarching questions I would like to try and clarify are:
- Which issues/incidents can occur in MOC/OCP?
- How are these issues reflected in the log messages?
Let me try to confirm my understanding.
1. Are alerts configured based on data available in Prometheus? Are alerts based on metrics only (i.e., not on logs)?
2. Which components are being monitored and have alerts defined for them?
3. As far as I understand, the alerts currently configured will generate issue tickets for un/availability events. Is this correct? Do we currently alert on any other issues?
4. Do we have, or is there, some kind of fault model, or something like a set of alert templates for monitoring an OCP deployment, that we could use to expand the set of issues that tickets are created for?
5. What metadata do tickets make available? For example, I suppose that the outage start time is represented by the ticket creation time. How are we going to track the end time of an outage?
6. Which metadata fields are going to be generated automatically, and which other fields do we expect to be added manually, possibly as free text? In other words, is there a 'schema' for the fields in an issue ticket?
7. Are all tickets going to be generated automatically, or will we also have manually created tickets (e.g., based on user complaints)? What is the template for manually created tickets?
/cc @davidohana @eranra @ronenschafferibm