From 68bf414def4fde87b786c73adc3f5a0e57ac9b2c Mon Sep 17 00:00:00 2001 From: DeDe Morton Date: Tue, 26 Nov 2024 12:30:16 -0800 Subject: [PATCH] Fix paths in serverless procedures (#4579) * Fix paths in serverless procedures * Make docs match menu capitalization (cherry picked from commit 26b6a2f98c10873e548681dd8f32b065287f3ff7) # Conflicts: # docs/en/serverless/alerting/aiops-generate-anomaly-alerts.asciidoc # docs/en/serverless/alerting/create-inventory-threshold-alert-rule.asciidoc # docs/en/serverless/apm/apm-integrate-with-machine-learning.asciidoc # docs/en/serverless/infra-monitoring/analyze-hosts.asciidoc # docs/en/serverless/infra-monitoring/configure-infra-settings.asciidoc # docs/en/serverless/infra-monitoring/detect-metric-anomalies.asciidoc # docs/en/serverless/infra-monitoring/get-started-with-metrics.asciidoc # docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc # docs/en/serverless/machine-learning/aiops-analyze-spikes.asciidoc # docs/en/serverless/machine-learning/aiops-detect-anomalies.asciidoc # docs/en/serverless/machine-learning/aiops-detect-change-points.asciidoc # docs/en/serverless/machine-learning/aiops-tune-anomaly-detection-job.asciidoc # docs/en/serverless/observability-overview.asciidoc # docs/en/serverless/reference/metrics-app-fields.asciidoc --- .../monitor-infra/analyze-hosts.asciidoc | 2 +- .../monitor-infrastructure-and-hosts.asciidoc | 2 +- ...itor-k8s-explore-logs-and-metrics.asciidoc | 2 +- .../aiops-generate-anomaly-alerts.asciidoc | 197 +++++++++++ ...te-inventory-threshold-alert-rule.asciidoc | 175 ++++++++++ ...m-integrate-with-machine-learning.asciidoc | 71 ++++ .../infra-monitoring/analyze-hosts.asciidoc | 318 ++++++++++++++++++ .../configure-infra-settings.asciidoc | 36 ++ .../detect-metric-anomalies.asciidoc | 77 +++++ .../get-started-with-metrics.asciidoc | 61 ++++ .../view-infrastructure-metrics.asciidoc | 148 ++++++++ .../aiops-analyze-spikes.asciidoc | 71 ++++ .../aiops-detect-anomalies.asciidoc 
| 272 +++++++++++++++ .../aiops-detect-change-points.asciidoc | 68 ++++ .../aiops-tune-anomaly-detection-job.asciidoc | 182 ++++++++++ .../observability-overview.asciidoc | 149 ++++++++ .../reference/metrics-app-fields.asciidoc | 295 ++++++++++++++++ 17 files changed, 2123 insertions(+), 3 deletions(-) create mode 100644 docs/en/serverless/alerting/aiops-generate-anomaly-alerts.asciidoc create mode 100644 docs/en/serverless/alerting/create-inventory-threshold-alert-rule.asciidoc create mode 100644 docs/en/serverless/apm/apm-integrate-with-machine-learning.asciidoc create mode 100644 docs/en/serverless/infra-monitoring/analyze-hosts.asciidoc create mode 100644 docs/en/serverless/infra-monitoring/configure-infra-settings.asciidoc create mode 100644 docs/en/serverless/infra-monitoring/detect-metric-anomalies.asciidoc create mode 100644 docs/en/serverless/infra-monitoring/get-started-with-metrics.asciidoc create mode 100644 docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc create mode 100644 docs/en/serverless/machine-learning/aiops-analyze-spikes.asciidoc create mode 100644 docs/en/serverless/machine-learning/aiops-detect-anomalies.asciidoc create mode 100644 docs/en/serverless/machine-learning/aiops-detect-change-points.asciidoc create mode 100644 docs/en/serverless/machine-learning/aiops-tune-anomaly-detection-job.asciidoc create mode 100644 docs/en/serverless/observability-overview.asciidoc create mode 100644 docs/en/serverless/reference/metrics-app-fields.asciidoc diff --git a/docs/en/observability/monitor-infra/analyze-hosts.asciidoc b/docs/en/observability/monitor-infra/analyze-hosts.asciidoc index cbeb26050e..78966c736d 100644 --- a/docs/en/observability/monitor-infra/analyze-hosts.asciidoc +++ b/docs/en/observability/monitor-infra/analyze-hosts.asciidoc @@ -197,7 +197,7 @@ The host details overlay contains the following tabs: include::host-details-partial.asciidoc[] -NOTE: The metrics shown on the **Hosts** page are also available when 
viewing hosts on the **Inventory** page. +NOTE: The metrics shown on the **Hosts** page are also available when viewing hosts on the **Infrastructure inventory** page. [discrete] [[analyze-hosts-why-dashed-lines]] diff --git a/docs/en/observability/monitor-infra/monitor-infrastructure-and-hosts.asciidoc b/docs/en/observability/monitor-infra/monitor-infrastructure-and-hosts.asciidoc index b7ba9a48b7..487514c655 100644 --- a/docs/en/observability/monitor-infra/monitor-infrastructure-and-hosts.asciidoc +++ b/docs/en/observability/monitor-infra/monitor-infrastructure-and-hosts.asciidoc @@ -17,7 +17,7 @@ The {infrastructure-app} provides a few different views of your data. [cols="1,1"] |=== -| **Inventory** +| **Infrastructure inventory** |Provides a metrics-driven view of your entire infrastructure grouped by the resources that you are monitoring. < >> diff --git a/docs/en/observability/monitor-k8s/monitor-k8s-explore-logs-and-metrics.asciidoc b/docs/en/observability/monitor-k8s/monitor-k8s-explore-logs-and-metrics.asciidoc index f694320028..615de18e79 100644 --- a/docs/en/observability/monitor-k8s/monitor-k8s-explore-logs-and-metrics.asciidoc +++ b/docs/en/observability/monitor-k8s/monitor-k8s-explore-logs-and-metrics.asciidoc @@ -15,7 +15,7 @@ Refer to the following sections for more on viewing your data. To view the performance and health metrics collected by {agent}, find **Infrastructure** in the main menu or use the {kibana-ref}/introduction.html#kibana-navigation-search[global search field]. 
-On the **Inventory** page, you can switch between different views to see an +On the **Infrastructure inventory** page, you can switch between different views to see an overview of the containers and pods running on Kubernetes: [role="screenshot"] diff --git a/docs/en/serverless/alerting/aiops-generate-anomaly-alerts.asciidoc b/docs/en/serverless/alerting/aiops-generate-anomaly-alerts.asciidoc new file mode 100644 index 0000000000..c39c296ba9 --- /dev/null +++ b/docs/en/serverless/alerting/aiops-generate-anomaly-alerts.asciidoc @@ -0,0 +1,197 @@ +[[observability-aiops-generate-anomaly-alerts]] += Create an anomaly detection rule + +// :description: Get alerts when anomalies match specific conditions. +// :keywords: serverless, observability, how-to + +++++ +Anomaly detection +++++ + +:role: Editor +:goal: create anomaly detection rules +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +:feature: Anomaly detection alerting +include::../partials/feature-beta.asciidoc[] +:feature!: + +Create an anomaly detection rule to check for anomalies in one or more anomaly detection jobs. +If the conditions of the rule are met, an alert is created, and any actions specified in the rule are triggered. +For example, you can create a rule to check every fifteen minutes for critical anomalies and then alert you by email when they are detected. + +To create an anomaly detection rule: + +. In your {obs-serverless} project, go to **Machine learning** → **Jobs**. +. In the list of anomaly detection jobs, find the job you want to check for anomalies. +Haven't created a job yet? <>. +. From the **Actions** menu next to the job, select **Create alert rule**. +. Specify a name and optional tags for the rule. You can use these tags later to filter alerts. +. Verify that the correct job is selected and configure the alert details: ++ +[role="screenshot"] +image::images/anomaly-detection-alert.png[Anomaly detection alert settings ] +. For the result type: ++ +|=== +| Choose... 
| To generate an alert based on... + +| **Bucket** +| How unusual the anomaly was within the bucket of time + +| **Record** +| What individual anomalies are present in a time range + +| **Influencer** +| The most unusual entities in a time range +|=== +. Adjust the **Severity** to match the anomaly score that will trigger the action. +The anomaly score indicates the significance of a given anomaly compared to previous anomalies. +The default severity threshold is 75, which means every anomaly with an anomaly score of 75 or higher will trigger the associated action. +. (Optional) Turn on **Include interim results** to include results that are created by the anomaly detection job _before_ a bucket is finalized. These results might disappear after the bucket is fully processed. +Include interim results if you want to be notified earlier about a potential anomaly even if it might be a false positive. +. (Optional) Expand and change **Advanced settings**: ++ +|=== +| Setting | Description + +| **Lookback interval** +| The interval used to query previous anomalies during each condition check. Setting the lookback interval lower than the default value might result in missed anomalies. + +| **Number of latest buckets** +| The number of buckets to check to obtain the highest anomaly from all the anomalies that are found during the Lookback interval. An alert is created based on the anomaly with the highest anomaly score from the most anomalous bucket. +|=== +. (Optional) Under **Check the rule condition with an interval**, specify an interval, then click **Test** to check the rule condition with the interval specified. +The button is grayed out if the datafeed is not started. +To test the rule, start the datafeed. +. (Optional) If you want to change how often the condition is evaluated, adjust the **Check every** setting. +. (Optional) Set up **Actions**. +. **Save** your rule. + +[NOTE] +==== +Anomaly detection rules are defined as part of a job. 
+Alerts generated by these rules do not appear on the **Alerts** page. +==== + +[discrete] +[[observability-aiops-generate-anomaly-alerts-add-actions]] +== Add actions + +You can extend your rules with actions that interact with third-party systems, write to logs or indices, or send user notifications. You can add an action to a rule at any time. You can create rules without adding actions, and you can also define multiple actions for a single rule. + +To add actions to rules, you must first create a connector for that service (for example, an email or external incident management system), which you can then use for different rules, each with their own action frequency. + +.Connector types +[%collapsible] +===== +Connectors provide a central place to store connection information for services and integrations with third-party systems. +The following connectors are available when defining actions for alerting rules: + +include::./alerting-connectors.asciidoc[] + +For more information on creating connectors, refer to <>. +===== + +.Action frequency +[%collapsible] +===== +After you select a connector, you must set the action frequency. You can choose to create a **Summary of alerts** on each check interval or on a custom interval. For example, you can send email notifications that summarize the new, ongoing, and recovered alerts every twelve hours. + +Alternatively, you can set the action frequency to **For each alert** and specify the conditions each alert must meet for the action to run. For example, you can send an email only when alert status changes to critical. + +[role="screenshot"] +image::images/alert-action-frequency.png[Configure when a rule is triggered] + +With the **Run when** menu you can choose if an action runs when the anomaly score matched the condition or was recovered. For example, you can add a corresponding action for each state to ensure you are alerted when the anomaly score was matched and also when it recovers. 
+ +[role="screenshot"] +image::images/alert-anomaly-action-frequency-recovered.png[Choose between anomaly score matched condition or recovered] +===== + +.Action variables +[%collapsible] +===== +Use the default notification message or customize it. +You can add more context to the message by clicking the Add variable icon image:images/icons/indexOpen.svg[Add variable] and selecting from a list of available variables. + +[role="screenshot"] +image::images/action-variables-popup.png[Action variables list] + +The following variables are specific to this rule type. +You can also specify {kibana-ref}/rule-action-variables.html[variables common to all rules]. + +`context.anomalyExplorerUrl`:: +URL to open in the Anomaly Explorer. + +`context.isInterim`:: +Indicate if top hits contain interim results. + +`context.jobIds`:: +List of job IDs that triggered the alert. + +`context.message`:: +Alert info message. + +`context.score`:: +Anomaly score at the time of the notification action. + +`context.timestamp`:: +The bucket timestamp of the anomaly. + +`context.timestampIso8601`:: +The bucket timestamp of the anomaly in ISO8601 format. + +`context.topInfluencers`:: +The list of top influencers. Properties include: ++ +`influencer_field_name`::: +The field name of the influencer. + +`influencer_field_value`::: +The entity that influenced, contributed to, or was to blame for the anomaly. + +`score`::: +The influencer score. A normalized score between 0-100 which shows the influencer’s overall contribution to the anomalies. + +`context.topRecords`:: +The list of top records. Properties include: ++ +`actual`::: +The actual value for the bucket. + +`by_field_value`::: +The value of the by field. + +`field_name`::: +Certain functions require a field to operate on, for example, `sum()`. For those functions, this value is the name of the field to be analyzed. + +`function`::: +The function in which the anomaly occurs, as specified in the detector configuration. For example, `max`. 
+ +`over_field_name`::: +The field used to split the data. + +`partition_field_value`::: +The field used to segment the analysis. + +`score`::: +A normalized score between 0-100, which is based on the probability of the anomalousness of this record. + +`typical`::: +The typical value for the bucket, according to analytical modeling. + +===== + +[discrete] +[[observability-aiops-generate-anomaly-alerts-edit-an-anomaly-detection-rule]] +== Edit an anomaly detection rule + +To edit an anomaly detection rule: + +. In your {obs-serverless} project, go to **Machine learning** → **Jobs**. +. Expand the job that uses the rule you want to edit. +. On the **Job settings** tab, under **Alert rules**, click the rule to edit it. diff --git a/docs/en/serverless/alerting/create-inventory-threshold-alert-rule.asciidoc b/docs/en/serverless/alerting/create-inventory-threshold-alert-rule.asciidoc new file mode 100644 index 0000000000..708de3bdc6 --- /dev/null +++ b/docs/en/serverless/alerting/create-inventory-threshold-alert-rule.asciidoc @@ -0,0 +1,175 @@ +[[observability-create-inventory-threshold-alert-rule]] += Create an inventory rule + +// :description: Get alerts when the infrastructure inventory exceeds a defined threshold. +// :keywords: serverless, observability, how-to, alerting + +++++ +Inventory +++++ + +:role: Editor +:goal: create inventory threshold rules +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +Based on the resources listed on the **Infrastructure inventory** page within the {infrastructure-app}, +you can create a threshold rule to notify you when a metric has reached or exceeded a value for a specific +resource or a group of resources within your infrastructure. + +Additionally, each rule can be defined using multiple +conditions that combine metrics and thresholds to create precise notifications and reduce false positives. + +. To access this page, go to **{observability}** -> **Infrastructure**. +. 
On the **Infrastructure inventory** page or the **Metrics Explorer** page, click **Alerts and rules** -> **Infrastructure**. +. Select **Create inventory rule**. + +[TIP] +==== +When you select **Create inventory rule**, the parameters you configured on the **Infrastructure inventory** page will automatically +populate the rule. You can use the Inventory first to view which nodes in your infrastructure you'd +like to be notified about and then quickly create a rule in just a few clicks. +==== + +[discrete] +[[inventory-conditions]] +== Inventory conditions + +Conditions for each rule can be applied to specific metrics relating to the inventory type you select. +You can choose the aggregation type, the metric, and by including a warning threshold value, you can be +alerted on multiple threshold values based on severity scores. When creating the rule, you can still get +notified if no data is returned for the specific metric or if the rule fails to query {es}. + +In this example, Kubernetes Pods is the selected inventory type. The conditions state that you will receive +a critical alert for any pods within the `ingress-nginx` namespace with a memory usage of 95% or above +and a warning alert if memory usage is 90% or above. +The chart shows the results of applying the rule to the last 20 minutes of data. +Note that the chart time range is 20 times the value of the look-back window specified in the `FOR THE LAST` field. + +[role="screenshot"] +image::images/inventory-alert.png[Inventory rule] + +[discrete] +[[action-types-infrastructure]] +== Add actions + +You can extend your rules with actions that interact with third-party systems, write to logs or indices, or send user notifications. You can add an action to a rule at any time. You can create rules without adding actions, and you can also define multiple actions for a single rule. 
+ +To add actions to rules, you must first create a connector for that service (for example, an email or external incident management system), which you can then use for different rules, each with their own action frequency. + +.Connector types +[%collapsible] +===== +Connectors provide a central place to store connection information for services and integrations with third party systems. +The following connectors are available when defining actions for alerting rules: + +include::./alerting-connectors.asciidoc[] + +For more information on creating connectors, refer to <>. +===== + +.Action frequency +[%collapsible] +===== +After you select a connector, you must set the action frequency. You can choose to create a summary of alerts on each check interval or on a custom interval. For example, send email notifications that summarize the new, ongoing, and recovered alerts each hour: + +[role="screenshot"] +image::images/action-alert-summary.png[Action types] + +// NOTE: This is an autogenerated screenshot. Do not edit it directly. + +Alternatively, you can set the action frequency such that you choose how often the action runs (for example, at each check interval, only when the alert status changes, or at a custom action interval). In this case, you define precisely when the alert is triggered by selecting a specific +threshold condition: `Alert`, `Warning`, or `Recovered` (a value that was once above a threshold has now dropped below it). + +[role="screenshot"] +image::images/inventory-threshold-run-when-selection.png[Configure when an alert is triggered] + +// NOTE: This is an autogenerated screenshot. Do not edit it directly. + +You can also further refine the conditions under which actions run by specifying that actions only run when they match a KQL query or when an alert occurs within a specific time frame: + +* **If alert matches query**: Enter a KQL query that defines field-value pairs or query conditions that must be met for notifications to send. 
The query only searches alert documents in the indices specified for the rule. +* **If alert is generated during timeframe**: Set timeframe details. Notifications are only sent if alerts are generated within the timeframe you define. + +[role="screenshot"] +image::images/conditional-alerts.png[Configure a conditional alert] +===== + +.Action variables +[%collapsible] +===== +Use the default notification message or customize it. +You can add more context to the message by clicking the Add variable icon image:images/icons/indexOpen.svg[Add variable] and selecting from a list of available variables. + +[role="screenshot"] +image::images/action-variables-popup.png[Action variables list] + +The following variables are specific to this rule type. +You can also specify {kibana-ref}/rule-action-variables.html[variables common to all rules]. + +`context.alertDetailsUrl`:: +Link to the alert troubleshooting view for further context and details. This will be an empty string if the `server.publicBaseUrl` is not configured. + +`context.alertState`:: +Current state of the alert. + +`context.cloud`:: +The cloud object defined by ECS if available in the source. + +`context.container`:: +The container object defined by ECS if available in the source. + +`context.group`:: +Name of the group reporting data. + +`context.host`:: +The host object defined by ECS if available in the source. + +`context.labels`:: +List of labels associated with the entity where this alert triggered. + +`context.metric`:: +The metric name in the specified condition. Usage: (`ctx.metric.condition0`, `ctx.metric.condition1`, and so on). + +`context.orchestrator`:: +The orchestrator object defined by ECS if available in the source. + +`context.originalAlertState`:: +The state of the alert before it recovered. This is only available in the recovery context. + +`context.originalAlertStateWasALERT`:: +Boolean value of the state of the alert before it recovered. This can be used for template conditions. 
This is only available in the recovery context. + +`context.originalAlertStateWasWARNING`:: +Boolean value of the state of the alert before it recovered. This can be used for template conditions. This is only available in the recovery context. + +`context.reason`:: +A concise description of the reason for the alert. + +`context.tags`:: +List of tags associated with the entity where this alert triggered. + +`context.threshold`:: +The threshold value of the metric for the specified condition. Usage: (`ctx.threshold.condition0`, `ctx.threshold.condition1`, and so on) + +`context.timestamp`:: +A timestamp of when the alert was detected. + +`context.value`:: +The value of the metric in the specified condition. Usage: (`ctx.value.condition0`, `ctx.value.condition1`, and so on). + +`context.viewInAppUrl`:: +Link to the alert source. + +===== + +[discrete] +[[infra-alert-settings]] +== Settings + +With infrastructure threshold rules, it's not possible to set an explicit index pattern as part of the configuration. The index pattern +is instead inferred from **Metrics indices** on the <> page of the {infrastructure-app}. + +With each execution of the rule check, the **Metrics indices** setting is checked, but it is not stored when the rule is created. diff --git a/docs/en/serverless/apm/apm-integrate-with-machine-learning.asciidoc b/docs/en/serverless/apm/apm-integrate-with-machine-learning.asciidoc new file mode 100644 index 0000000000..464161470e --- /dev/null +++ b/docs/en/serverless/apm/apm-integrate-with-machine-learning.asciidoc @@ -0,0 +1,71 @@ +[[observability-apm-integrate-with-machine-learning]] += Integrate with machine learning + +// :keywords: serverless, observability, how-to + +The Machine learning integration initiates a new job predefined to calculate anomaly scores on APM transaction durations. +With this integration, you can quickly pinpoint anomalous transactions and see the health of +any upstream and downstream services. 
+ +Machine learning jobs are created per environment and are based on a service's average response time. +Because jobs are created at the environment level, +you can add new services to your existing environments without the need for additional machine learning jobs. + +Results from machine learning jobs are shown in multiple places throughout the Applications UI: + +* The **Services overview** provides a quick-glance view of the general health of all of your services. ++ +//// +/* TODO: Take this screenshot (no data in oblt now) +![Example view of anomaly scores on response times in the Applications UI](images/machine-learning-integration/apm-service-quick-health.png) */ +//// +* The transaction duration chart will show the expected bounds and add an annotation when the anomaly score is 75 or above. ++ +//// +/* TODO: Take this screenshot (no data in oblt now) +![Example view of anomaly scores on response times in the Applications UI](images/machine-learning-integration/apm-apm-ml-integration.png) */ +//// +* Service Maps will display a color-coded anomaly indicator based on the detected anomaly score. ++ +[role="screenshot"] +image::images/service-maps/service-map-anomaly.png[Example view of anomaly scores on service maps in the Applications UI] + +[discrete] +[[observability-apm-integrate-with-machine-learning-enable-anomaly-detection]] +== Enable anomaly detection + +To enable machine learning anomaly detection: + +. In your {obs-serverless} project, go to any **Applications** page. +. Click **Anomaly detection**. +. Click **Create Job**. +. Machine learning jobs are created at the environment level. +Select all of the service environments that you want to enable anomaly detection in. +Anomalies will surface for all services and transaction types within the selected environments. +. Click **Create Jobs**. + +That's it! After a few minutes, the job will begin calculating results; +it might take additional time for results to appear on your service maps. 
+To manage existing jobs, click **Manage jobs** (or go to **Machine learning** → **Jobs**). + +[discrete] +[[observability-apm-integrate-with-machine-learning-anomaly-detection-warning]] +== Anomaly detection warning + +To make machine learning as easy as possible to set up, +Elastic will warn you when filtered to an environment without a machine learning job. + +//// +/* TODO: Take this screenshot (no data in oblt now) +![Example view of anomaly alert in the Applications UI](images/machine-learning-integration/apm-apm-anomaly-alert.png) */ +//// + +[discrete] +[[observability-apm-integrate-with-machine-learning-unknown-service-health]] +== Unknown service health + +After enabling anomaly detection, service health may display as "Unknown". Here are some reasons why this can occur: + +. No machine learning job exists. See <> to enable anomaly detection and create a machine learning job. +. There is no machine learning data for the job. If you just created the machine learning job you'll need to wait a few minutes for data to be available. Alternatively, if the service or its environment are new, you'll need to wait for more trace data. +. No "request" or "page-load" transaction type exists for this service; service health is only available for these transaction types. diff --git a/docs/en/serverless/infra-monitoring/analyze-hosts.asciidoc b/docs/en/serverless/infra-monitoring/analyze-hosts.asciidoc new file mode 100644 index 0000000000..efa194b710 --- /dev/null +++ b/docs/en/serverless/infra-monitoring/analyze-hosts.asciidoc @@ -0,0 +1,318 @@ +[[observability-analyze-hosts]] += Analyze and compare hosts + +// :description: Get a metrics-driven view of your hosts backed by an easy-to-use interface called Lens. +// :keywords: serverless, observability, how to + +We'd love to get your feedback! +https://docs.google.com/forms/d/e/1FAIpQLScRHG8TIVb1Oq8ZhD4aks3P1TmgiM58TY123QpDCcBz83YC6w/viewform[Tell us what you think!] 
+ +The **Hosts** page provides a metrics-driven view of your infrastructure backed +by an easy-to-use interface called Lens. On the **Hosts** page, you can view +health and performance metrics to help you quickly: + +* Analyze and compare hosts without having to build new dashboards. +* Identify which hosts trigger the most alerts. +* Troubleshoot and resolve issues quickly. +* View historical data to rule out false alerts and identify root causes. +* Filter and search the data to focus on the hosts you care about the most. + +To access the **Hosts** page, in your {obs-serverless} project, go to +**Infrastructure** → **Hosts**. + +[role="screenshot"] +image::images/hosts.png[Screenshot of the Hosts page] + +To learn more about the metrics shown on this page, refer to the <> documentation. + +.Don't see any metrics? +[NOTE] +==== +If you haven't added data yet, click **Add data** to search for and install an Elastic integration. + +Need help getting started? Follow the steps in +<>. +==== + +The **Hosts** page provides several ways to view host metrics: + +* Overview tiles show the number of hosts returned by your search plus +averages of key metrics, including CPU usage, normalized load, and memory usage. +Max disk usage is also shown. +* The Host limit controls the maximum number of hosts shown on the page. The +default is 50, which means the page shows data for the top 50 hosts based on the +most recent timestamps. You can increase the host limit to see data for more +hosts, but doing so may impact query performance. +* The Hosts table shows a breakdown of metrics for each host along with an alert count +for any hosts with active alerts. You may need to page through the list +or change the number of rows displayed on each page to see all of your hosts. +* Each host name is an active link to a <> page, +where you can explore enhanced metrics and other observability data related to the selected host. 
+* Table columns are sortable, but note that the sorting behavior is applied to +the already returned data set. +* The tabs at the bottom of the page show an overview of the metrics, logs, +and alerts for all hosts returned by your search. + +[TIP] +==== +For more information about creating and viewing alerts, refer to <>. +==== + +[discrete] +[[analyze-hosts-filter-view]] +== Filter the Hosts view + +The **Hosts** page provides several mechanisms for filtering the data on the +page: + +* Enter a search query using {kibana-ref}/kuery-query.html[{kib} Query Language] to show metrics that match your search criteria. For example, +to see metrics for hosts running on Linux, enter `host.os.type : "linux"`. +Otherwise, you’ll see metrics for all your monitored hosts (up to the number of +hosts specified by the host limit). +* Select additional criteria to filter the view: ++ +** In the **Operating System** list, select one or more operating systems +to include (or exclude) metrics for hosts running the selected operating systems. +** In the **Cloud Provider** list, select one or more cloud providers to +include (or exclude) metrics for hosts running on the selected cloud providers. +** In the **Service Name** list, select one or more service names to +include (or exclude) metrics for the hosts running the selected services. +Services must be instrumented by APM to be filterable. +This filter is useful for comparing different hosts to determine whether a problem lies +with a service or the host that it is running on. ++ +[TIP] +==== +Filtered results are sorted by _document count_. +Document count is the number of events received by Elastic for the hosts that match your filter criteria. +==== +* Change the date range in the time filter, or click and drag on a +visualization to change the date range. +* Within a visualization, click a point on a line and apply filters to set other +visualizations on the page to the same time and/or host. 
+ +[discrete] +[[analyze-hosts-inspect-data]] +== View metrics + +On the **Metrics** tab, view metrics trending over time, including CPU usage, +normalized load, memory usage, disk usage, and other metrics related to disk IOPs and throughput. +Place your cursor over a line to view metrics at a specific +point in time. From within each visualization, you can choose to open the visualization in Lens. + +To see metrics for a specific host, refer to <>. + +//// +/* TODO: Uncomment this section if/when the inspect option feature is added back in. +
+ +### Inspect and download metrics + +You can access a text-based view of the data underlying +your metrics visualizations and optionally download the data to a +comma-separated (CSV) file. + +Hover your cursor over a visualization, then in the upper-right corner, click +the ellipsis icon to inspect the data. + +![Screenshot showing option to inspect data](../images/hosts-inspect.png) + +In the flyout, click **Download CSV** to download formatted or raw data to a CSV +file. + +Click **View: Data** and notice that you can change the view to **Requests** to explore the request +used to fetch the data and the response returned from {es}. On the **Request** tab, click links +to further inspect and analyze the request in the Dev Console or Search Profiler. */ +//// + +[discrete] +[[analyze-hosts-open-in-lens]] +=== Open in Lens + +Metrics visualizations are powered by Lens, meaning you can continue your +analysis in Lens if you require more flexibility. Hover your cursor over a +visualization, then click the ellipsis icon in the upper-right corner to open +the visualization in Lens. + +[role="screenshot"] +image::images/hosts-open-in-lens.png[Screenshot showing option to open in Lens] + +In Lens, you can examine all the fields and formulas used to create the +visualization, make modifications to the visualization, and save your changes. + +For more information about using Lens, refer to the +{kibana-ref}/lens.html[{kib} documentation about Lens]. + +[discrete] +[[analyze-hosts-view-logs]] +== View logs + +On the **Logs** tab of the **Hosts** page, view logs for the systems you are monitoring and search +for specific log entries. This view shows logs for all of the hosts returned by +the current query. + +[role="screenshot"] +image::images/hosts-logs.png[Screenshot showing Logs view] + +To see logs for a specific host, refer to <>. 
+ +[discrete] +[[analyze-hosts-view-alerts]] +== View alerts + +On the **Alerts** tab of the **Hosts** page, view active alerts to pinpoint problems. Use this view +to figure out which hosts triggered alerts and identify root causes. This view +shows alerts for all of the hosts returned by the current query. + +From the **Actions** menu, you can choose to: + +* Add the alert to a new or existing case. +* View rule details. +* View alert details. + +[role="screenshot"] +image::images/hosts-view-alerts.png[Screenshot showing Alerts view] + +To see alerts for a specific host, refer to <>. + +.Why are alerts missing from the Hosts page? +[NOTE] +==== +If your rules are triggering alerts that don't appear on the **Hosts** page, +edit the rules and make sure they are correctly configured to associate the host name with the alert: + +* For Metric threshold or Custom threshold rules, select `host.name` in the **Group alerts by** field. +* For Inventory rules, select **Host** for the node type under **Conditions**. + +To learn more about creating and managing rules, refer to <>. +==== + +[discrete] +[[view-host-details]] +== View host details + +Without leaving the **Hosts** page, you can view enhanced metrics relating to +each host running in your infrastructure. In the list of hosts, find the host +you want to monitor, then click the **Toggle dialog with details** +icon image:images/expand-icon.png[] to display the host details overlay. + +[TIP] +==== +To expand the overlay and view more detail, click **Open as page** in the upper-right corner. +==== + +The host details overlay contains the following tabs: + +include::../transclusion/host-details.asciidoc[] + +[NOTE] +==== +The metrics shown on the **Hosts** page are also available when viewing hosts on the **Infrastructure inventory** page. +==== + +[discrete] +[[analyze-hosts-why-dashed-lines]] +== Why am I seeing dashed lines in charts? + +There are a few reasons why you may see dashed lines in your charts. 
+ +* <> +* <> +* <> + +[discrete] +[[dashed-interval]] +=== The chart interval is too short + +In this example, the data emission rate is lower than the Lens chart interval. +A dashed line connects the known data points to make it easier to visualize trends in the data. + +[role="screenshot"] +image::images/hosts-dashed.png[Screenshot showing dashed chart] + +The chart interval is automatically set depending on the selected time duration. +To fix this problem, change the selected time range at the top of the page. + +[TIP] +==== +Want to dig in further while maintaining the selected time duration? +Hover over the chart you're interested in and select **Options** → **Open in Lens**. +Once in Lens, you can adjust the chart interval temporarily. +Note that this change is not persisted in the **Hosts** view. +==== + +[discrete] +[[dashed-missing]] +=== Data is missing + +A solid line indicates that the chart interval is set appropriately for the data transmission rate. +In this example, a solid line turns into a dashed line—indicating missing data. +You may want to investigate this time period to determine if there is an outage or issue. + +[role="screenshot"] +image::images/hosts-missing-data.png[Screenshot showing missing data] + +[discrete] +[[observability-analyze-hosts-the-chart-interval-is-too-short-and-data-is-missing]] +=== The chart interval is too short and data is missing + +In the example shown in the screenshot, +the data emission rate is lower than the Lens chart interval **and** there is missing data. + +This missing data can be hard to spot at first glance. +The green boxes outline regular data emissions, while the missing data is outlined in pink. +Similar to the above scenario, you may want to investigate the time period with the missing data +to determine if there is an outage or issue. 
+ +[role="screenshot"] +image::images/hosts-dashed-and-missing.png[Screenshot showing dashed lines and missing data] + +[discrete] +[[observability-analyze-hosts-troubleshooting]] +== Troubleshooting + +//// +/* +Troubleshooting topic template: +Title: Brief description of what the user sees/experiences +Content: +1. What the user sees/experiences (error message, UI, behavior, etc) +2. Why it happens +3. How to fix it +*/ +//// + +[discrete] +[[observability-analyze-hosts-what-does-mean]] +=== What does _this host has been detected by APM_ mean? + +// What the user sees/experiences (error message, UI, behavior, etc) + +In the Hosts view, you might see a question mark icon (image:images/icons/questionInCircle.svg[Question mark icon]) +before a host name with a tooltip note stating that the host has been detected by APM. + +// Why it happens + +When a host is detected by APM, but is not collecting full metrics +(for example, through the https://www.elastic.co/docs/current/integrations/system[system integration]), +it will be listed as a host with the partial metrics collected by APM. + +// How to fix it + +// N/A? + +// What the user sees/experiences (error message, UI, behavior, etc) + +[discrete] +[[observability-analyze-hosts-i-dont-recognize-a-host-name-and-i-see-a-question-mark-icon-next-to-it]] +=== I don't recognize a host name and I see a question mark icon next to it + +// Why it happens + +This could mean that the APM agent has not been configured to use the correct host name. +Instead, the host name might be the container name or the Kubernetes pod name. + +// How to fix it + +To get the correct host name, you need to set some additional configuration options, +specifically `system.kubernetes.node.name` as described in <>. 
diff --git a/docs/en/serverless/infra-monitoring/configure-infra-settings.asciidoc b/docs/en/serverless/infra-monitoring/configure-infra-settings.asciidoc new file mode 100644 index 0000000000..14779f3e48 --- /dev/null +++ b/docs/en/serverless/infra-monitoring/configure-infra-settings.asciidoc @@ -0,0 +1,36 @@ +[[observability-configure-intra-settings]] += Configure settings + +// :description: Learn how to configure infrastructure UI settings. +// :keywords: serverless, observability, how to + +:role: Editor +:goal: configure settings +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +From the main {obs-serverless} menu, go to **Infrastructure** → **Infrastructure inventory** or **Hosts**, +and click the **Settings** link at the top of the page. +The following settings are available: + +|=== +| Setting | Description + +| **Name** +| Name of the source configuration. + +| **Indices** +| {ipm-cap} or patterns used to match {es} indices that contain metrics. The default patterns are `metrics-*,metricbeat-*`. + +| **Machine Learning** +| The minimum severity score required to display anomalies in the Infrastructure UI. The default is 50. + +| **Features** +| Turn new features on and off. +|=== + +Click **Apply** to save your changes. + +If the fields are grayed out and cannot be edited, you may not have sufficient privileges to change the source configuration. diff --git a/docs/en/serverless/infra-monitoring/detect-metric-anomalies.asciidoc b/docs/en/serverless/infra-monitoring/detect-metric-anomalies.asciidoc new file mode 100644 index 0000000000..30c50f2b84 --- /dev/null +++ b/docs/en/serverless/infra-monitoring/detect-metric-anomalies.asciidoc @@ -0,0 +1,77 @@ +[[observability-detect-metric-anomalies]] += Detect metric anomalies + +// :description: Detect and inspect memory usage and network traffic anomalies for hosts and Kubernetes pods. 
+// :keywords: serverless, observability, how to + +:role: Editor +:goal: create {ml} jobs +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +You can create {ml} jobs to detect and inspect memory usage and network traffic anomalies for hosts and Kubernetes pods. + +You can model system memory usage, along with inbound and outbound network traffic across hosts or pods. +You can detect unusual increases in memory usage and unusually high inbound or outbound traffic across hosts or pods. + +[discrete] +[[ml-jobs-hosts]] +== Enable {ml} jobs for hosts or Kubernetes pods + +Create a {ml} job to detect anomalous memory usage and network traffic automatically. + +After creating {ml} jobs, you cannot change their settings. +You can recreate the jobs later. +However, recreating them removes any previously detected anomalies. + +// lint ignore anomaly-detection observability + +. In your {obs-serverless} project, go to **Infrastructure** → **Infrastructure inventory** +and click the **Anomaly detection** link at the top of the page. +. Under **Hosts** or **Kubernetes Pods**, click **Enable** to create a {ml} job. +. Choose a start date for the {ml} analysis. {ml-cap} jobs analyze the last four weeks of data and continue to run indefinitely. +. Select a partition field. +Partitions allow you to create independent models for different groups of data that share similar behavior. +For example, you may want to build separate models for machine type or cloud availability zone so that anomalies are not weighted equally across groups. +. By default, {ml} jobs analyze all of your metric data. +You can filter this list to view only the jobs or metrics that you are interested in. +For example, you can filter by job name and node name to view specific {anomaly-detect} jobs for that host. +. Click **Enable jobs**. +. You're now ready to explore your metric anomalies. Click **Anomalies**.
+ +[role="screenshot"] +image::images/metrics-ml-jobs.png[Infrastructure {ml-app} anomalies] + +The **Anomalies** table displays a list of each single metric {anomaly-detect} job for the specific host or Kubernetes pod. +By default, anomaly jobs are sorted by time to show the most recent job. + +Along with each anomaly job and the node name, +detected anomalies with a severity score equal to 50 or higher are listed. +These scores represent a severity of "warning" or higher in the selected time period. +The **summary** value represents the increase between the actual value and the expected ("typical") value of the metric in the anomaly record result. + +To drill down and analyze the metric anomaly, +select **Actions** → **Open in Anomaly Explorer** to view it in the Anomaly Explorer. +You can also select **Actions** → **Show in Inventory** to view the Inventory page for the host or Kubernetes pod, +filtered by the specific metric. + +[NOTE] +==== +These predefined {anomaly-jobs} use {ml-docs}/ml-rules.html[custom rules]. +To update the rules in the Anomaly Explorer, select **Actions** → **Configure rules**. +The changes take effect only for new results. +If you want to apply the changes to existing results, clone and rerun the job. +==== + +[discrete] +[[history-chart]] +== History chart + +On the **Infrastructure inventory** page, click **Show history** to view the metric values within the selected time frame. +Detected anomalies with an anomaly score equal to 50 or higher are highlighted in red. +To examine the detected anomalies, use the Anomaly Explorer.
+ +[role="screenshot"] +image::images/metrics-history-chart.png[History] diff --git a/docs/en/serverless/infra-monitoring/get-started-with-metrics.asciidoc b/docs/en/serverless/infra-monitoring/get-started-with-metrics.asciidoc new file mode 100644 index 0000000000..cb26b26674 --- /dev/null +++ b/docs/en/serverless/infra-monitoring/get-started-with-metrics.asciidoc @@ -0,0 +1,61 @@ +[[observability-get-started-with-metrics]] += Get started with system metrics + +// :description: Learn how to onboard your system metrics data quickly. +// :keywords: serverless, observability, how-to + +:role: Admin +:goal: onboard system metrics data +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +In this guide you'll learn how to onboard system metrics data from a machine or server, +then observe the data in {obs-serverless}. + +To onboard system metrics data: + +. <>, or open an existing one. +. In your {obs-serverless} project, go to **Project Settings** → **Integrations**. +. Type **System** in the search bar, then select the integration to see more details about it. +. Click **Add System**. +. Follow the in-product steps to install the System integration and deploy an {agent}. +The sequence of steps varies depending on whether you have already installed an integration. ++ +** When configuring the System integration, make sure that **Collect metrics from System instances** is turned on. +** Expand each configuration section to verify that the settings are correct for your host. +For example, you may want to turn on **System core metrics** to get a complete view of your infrastructure. + +Notice that you can also configure the integration to collect logs. + +.What if {agent} is already running on my host? +[NOTE] +==== +Do not try to deploy a second {agent} to the same system. 
+You have a couple of options: + +* **Use the System integration to collect system logs and metrics.** To do this, +uninstall the standalone agent you deployed previously, +then follow the in-product steps to install the System integration and deploy an {agent}. +* **Configure your existing standalone agent to collect metrics.** To do this, +edit the deployed {agent}'s YAML file and add metric inputs to the configuration manually. +Manual configuration is a time-consuming process. +To save time, you can follow the in-product steps that describe how to deploy a standalone {agent}, +and use the generated configuration as the source for the input configurations that you need to add to your standalone config file. +==== + +After the agent is installed and successfully streaming metrics data, +go to **Infrastructure** → **Infrastructure inventory** or **Hosts** to see a metrics-driven view of your infrastructure. +To learn more, refer to <> or <>. + +[discrete] +[[observability-get-started-with-metrics-next-steps]] +== Next steps + +Now that you've added metrics and explored your data, +learn how to onboard other types of data: + +* <> +* <> +* <> diff --git a/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc b/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc new file mode 100644 index 0000000000..a8e48620e5 --- /dev/null +++ b/docs/en/serverless/infra-monitoring/view-infrastructure-metrics.asciidoc @@ -0,0 +1,148 @@ +[[observability-view-infrastructure-metrics]] += View infrastructure metrics by resource type + +// :description: Get a metrics-driven view of your infrastructure grouped by resource type. +// :keywords: serverless, observability, how to + +The **Infrastructure Inventory** page provides a metrics-driven view of your entire infrastructure grouped by +the resources you are monitoring.
All monitored resources emitting +a core set of infrastructure metrics are displayed to give you a quick view of the overall health +of your infrastructure. + +To access the **Infrastructure Inventory** page, in your {obs-serverless} project, +go to **Infrastructure inventory**. + +[role="screenshot"] +image::images/metrics-app.png[Infrastructure UI in {kib}] + +To learn more about the metrics shown on this page, refer to the <>. + +.Don't see any metrics? +[NOTE] +==== +If you haven't added data yet, click **Add data** to search for and install an Elastic integration. + +Need help getting started? Follow the steps in +<>. +==== + +[discrete] +[[filter-resources]] +== Filter the Inventory view + +To get started with your analysis, select the type of resources you want to show +in the high-level view. From the **Show** menu, select one of the following: + +* **Hosts** — the default +* **Kubernetes Pods** +* **Docker Containers** — shows _all_ containers, not just Docker +* **AWS** — includes EC2 instances, S3 buckets, RDS databases, and SQS queues + +When you hover over each resource in the waffle map, the metrics specific to +that resource are displayed. + +You can sort by resource, group the resource by specific fields related to it, and sort by +either name or metric value. For example, you can filter the view to display the memory usage +of your Kubernetes pods, grouped by namespace, and sorted by the memory usage value. + +[role="screenshot"] +image::images/kubernetes-filter.png[Kubernetes pod filtering] + +You can also use the search bar to create structured queries using {kibana-ref}/kuery-query.html[{kib} Query Language]. +For example, enter `host.hostname : "host1"` to view only the information for `host1`. + +To examine the metrics for a specific time, use the time filter to select the date and time. 
+ +[discrete] +[[analyze-hosts-inventory]] +== View host metrics + +By default the **Infrastructure Inventory** page displays a waffle map that shows the hosts you +are monitoring and the current CPU usage for each host. +Alternatively, you can click the **Table view** icon image:images/table-view-icon.png[Table view icon] +to switch to a table view. + +Without leaving the **Infrastructure Inventory** page, you can view enhanced metrics relating to each host +running in your infrastructure. On the waffle map, select a host to display the host details +overlay. + +[TIP] +==== +To expand the overlay and view more detail, click **Open as page** in the upper-right corner. +==== + +The host details overlay contains the following tabs: + +include::../transclusion/host-details.asciidoc[] + +[NOTE] +==== +These metrics are also available when viewing hosts on the **Hosts** +page. +==== + +[discrete] +[[analyze-containers-inventory]] +== View container metrics + +When you select **Docker containers**, the **Infrastructure inventory** page displays a waffle map that shows the containers you +are monitoring and the current CPU usage for each container. +Alternatively, you can click the **Table view** icon image:images/table-view-icon.png[Table view icon] +to switch to a table view. + +Without leaving the **Infrastructure inventory** page, you can view enhanced metrics relating to each container +running in your infrastructure. + +.Why do some containers report 0% or null (-) values in the waffle map? +[NOTE] +==== +The waffle map shows _all_ monitored containers, including containerd, +provided that the data collected from the container has the `container.id` field. +However, the waffle map currently only displays metrics for Docker fields. +This display problem will be resolved in a future release. +==== + +On the waffle map, select a container to display the container details +overlay. 
+ +[TIP] +==== +To expand the overlay and view more detail, click **Open as page** in the upper-right corner. +==== + +The container details overlay contains the following tabs: + +include::../transclusion/container-details.asciidoc[] + +[discrete] +[[analyze-resource-metrics]] +== View metrics for other resources + +When you have searched and filtered for a specific resource, you can drill down to analyze the +metrics relating to it. For example, when viewing Kubernetes Pods in the high-level view, +click the Pod you want to analyze and select **Kubernetes Pod metrics** to see detailed metrics: + +[role="screenshot"] +image::images/pod-metrics.png[Kubernetes pod metrics] + +[discrete] +[[custom-metrics]] +== Add custom metrics + +If the predefined metrics displayed on the Inventory page for each resource are not +sufficient for your specific use case, you can add and define custom metrics. + +Select your resource, and from the **Metric** filter menu, click **Add metric**. + +[role="screenshot"] +image::images/add-custom-metric.png[Add custom metrics] + +[discrete] +[[apm-uptime-integration]] +== Integrate with Logs and APM + +Depending on the features you have installed and configured, you can view logs or traces relating to a specific resource. +For example, in the high-level view, when you click a Kubernetes Pod resource, you can choose: + +* **Kubernetes Pod logs** to <> in the {logs-app}. +* **Kubernetes Pod APM traces** to <> in the {apm-app}. diff --git a/docs/en/serverless/machine-learning/aiops-analyze-spikes.asciidoc b/docs/en/serverless/machine-learning/aiops-analyze-spikes.asciidoc new file mode 100644 index 0000000000..8222dedfb5 --- /dev/null +++ b/docs/en/serverless/machine-learning/aiops-analyze-spikes.asciidoc @@ -0,0 +1,71 @@ +[[observability-aiops-analyze-spikes]] += Analyze log spikes and drops + +// :description: Find and investigate the causes of unusual spikes or drops in log rates. 
+// :keywords: serverless, observability, how-to + +// + +{obs-serverless} provides built-in log rate analysis capabilities, +based on advanced statistical methods, +to help you find and investigate the causes of unusual spikes or drops in log rates. + +To analyze log spikes and drops: + +. In your {obs-serverless} project, go to **Machine learning** → **Log rate analysis**. +. Choose a data view or saved search to access the log data you want to analyze. +. In the histogram chart, click a spike (or drop) and then run the analysis. ++ +[role="screenshot"] +image::images/log-rate-histogram.png[Histogram showing log spikes and drops] ++ +When the analysis runs, it identifies statistically significant field-value combinations that contribute to the spike or drop, +and then displays them in a table: ++ +[role="screenshot"] +image::images/log-rate-analysis-results.png[Table showing log rate analysis results] ++ +Notice that you can optionally turn on **Smart grouping** to summarize the results into groups. +You can also click **Filter fields** to remove fields that are not relevant. ++ +The table shows an indicator of the level of impact and a sparkline showing the shape of the impact in the chart. +. Select a row to display the impact of the field on the histogram chart. +. From the **Actions** menu in the table, you can choose to view the field in **Discover**, +view it in <>, +or copy the table row information to the clipboard as a query filter. + +To pin a table row, click the row, then move the cursor to the histogram chart. +The chart displays a tooltip with exact count values for the pinned field, which enables closer investigation. + +Brushes in the chart show the baseline time range and the deviation in the analyzed data. +You can move the brushes to redefine both the baseline and the deviation and rerun the analysis with the modified values.
+ +[discrete] +[[log-pattern-analysis]] +== Log pattern analysis + +// + +Use log pattern analysis to find patterns in unstructured log messages and examine your data. +When you run a log pattern analysis, it performs categorization analysis on a selected field, +creates categories based on the data, and then displays them together in a chart. +The chart shows the distribution of each category and an example document that matches the category. +Log pattern analysis is useful when you want to examine how often different types of logs appear in your data set. +It also helps you group logs in ways that go beyond what you can achieve with a terms aggregation. + +To run log pattern analysis: + +. Follow the steps under <> to run a log rate analysis. +. From the **Actions** menu, choose **View in Log Pattern Analysis**. +. Select a category field and optionally apply any filters that you want. +. Click **Run pattern analysis**. ++ +The results of the analysis are shown in a table: ++ +[role="screenshot"] +image::images/log-pattern-analysis.png[Log pattern analysis of the message field ] +. From the **Actions** menu, click the plus (or minus) icon to open **Discover** and show (or filter out) the given category there, which helps you to further examine your log messages. + +// TODO: Question: Is the log pattern analysis only available through the log rate analysis UI? + +// TODO: Add some good examples to this topic taken from existing docs or recommendations from reviewers. 
diff --git a/docs/en/serverless/machine-learning/aiops-detect-anomalies.asciidoc b/docs/en/serverless/machine-learning/aiops-detect-anomalies.asciidoc new file mode 100644 index 0000000000..f9ff6fc044 --- /dev/null +++ b/docs/en/serverless/machine-learning/aiops-detect-anomalies.asciidoc @@ -0,0 +1,272 @@ +[[observability-aiops-detect-anomalies]] += Detect anomalies + +// :description: Detect anomalies by comparing real-time and historical data from different sources to look for unusual, problematic patterns. +// :keywords: serverless, observability, how-to + +:role: Editor +:goal: create, run, and view {anomaly-job}s +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +The anomaly detection feature in {obs-serverless} automatically models the normal behavior of your time series data — learning trends, +periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. + +To set up anomaly detection, you create and run anomaly detection jobs. +Anomaly detection jobs use proprietary {ml} algorithms to detect anomalous events or patterns, such as: + +* Anomalies related to temporal deviations in values, counts, or frequencies +* Anomalies related to unusual locations in geographic data +* Statistical rarity +* Unusual behaviors for a member of a population + +To learn more about anomaly detection algorithms, refer to the {ml-docs}/ml-ad-algorithms.html[{ml}] documentation. +Note that the {ml} documentation may contain details that are not valid when using a serverless project. + +.Some terms you might need to know +[NOTE] +==== +A _datafeed_ retrieves time series data from {es} and provides it to an +anomaly detection job for analysis. + +The job uses _buckets_ to divide the time series into batches for processing. +For example, a job may use a bucket span of 1 hour. 
+ +Each {anomaly-job} contains one or more _detectors_, which define the type of +analysis that occurs (for example, `max`, `average`, or `rare` analytical +functions) and the fields that are analyzed. Some of the analytical functions +look for single anomalous data points. For example, `max` identifies the maximum +value that is seen within a bucket. Others perform some aggregation over the +length of the bucket. For example, `mean` calculates the mean of all the data +points seen within the bucket. + +To learn more about anomaly detection, refer to the {ml-docs}/ml-ad-overview.html[{ml}] documentation. +==== + +[discrete] +[[create-anomaly-detection-job]] +== Create and run an anomaly detection job + +. In your {obs-serverless} project, go to **Machine learning** → **Jobs**. +. Click **Create anomaly detection job** (or **Create job** if other jobs exist). +. Choose a data view or saved search to access the data you want to analyze. +. Select the wizard for the type of job you want to create. +The following wizards are available. +You might also see specialized wizards based on the type of data you are analyzing. ++ +[TIP] +==== +In general, it is a good idea to start with single metric anomaly detection jobs for your key performance indicators. +After you examine these simple analysis results, you will have a better idea of what the influencers might be. +Then you can create multi-metric jobs and split the data or create more complex analysis functions as necessary. +==== ++ +-- +Single metric:: +Creates simple jobs that have a single detector. A _detector_ applies an analytical function to specific fields in your data. In addition to limiting the number of detectors, the single metric wizard omits many of the more advanced configuration options. + +Multi-metric:: +Creates jobs that can have more than one detector, which is more efficient than running multiple jobs against the same data. 
+ +Population:: +Creates jobs that detect activity that is unusual compared to the behavior of the population. + +Advanced:: +Creates jobs that can have multiple detectors and enables you to configure all job settings. + +Categorization:: +Creates jobs that group log messages into categories and use `count` or `rare` functions to detect anomalies within them. + +Rare:: +Creates jobs that detect rare occurrences in time series data. Rare jobs use the `rare` or `freq_rare` functions and also detect rare occurrences in populations. + +Geo:: +Creates jobs that detect unusual occurrences in the geographic locations of your data. Your data set must contain geo data. +-- ++ +For more information about job types, refer to the {ml-docs}/ml-anomaly-detection-job-types.html[{ml}] documentation. ++ +.Not sure what type of job to create? +[NOTE] +==== +Before selecting a wizard, click **Data Visualizer** to explore the fields and metrics in your data. +To get the best results, you must understand your data, including its data types and the range and distribution of values. + +In the **Data Visualizer**, use the time filter to select a time period that you’re interested in exploring, +or click **Use full data** to view the full time range of data. +Expand the fields to see details about the range and distribution of values. +When you're done, go back to the first step and create your job. +==== +. Step through the instructions in the job creation wizard to configure your job. +You can accept the default settings for most settings now and <> later. +. If you want the job to start immediately when the job is created, make sure that option is selected on the summary page. +. When you're done, click **Create job**. +When the job runs, the {ml} features analyze the input stream of data, model its behavior, and perform analysis based on the detectors in each job. +When an event occurs outside of the baselines of normal behavior, that event is identified as an anomaly. +. 
After the job is started, click **View results**. + +[discrete] +[[observability-aiops-detect-anomalies-view-the-results]] +== View the results + +After the anomaly detection job has processed some data, +you can view the results in {obs-serverless}. + +[TIP] +==== +Depending on the capacity of your machine, +you might need to wait a few seconds for the analysis to generate initial results. +==== + +If you clicked **View results** after creating the job, the results open in either the **Single Metric Viewer** or **Anomaly Explorer**. +To switch between these tools, click the icons in the upper-left corner of each tool. + +Read the following sections to learn more about these tools: + +* <> +* <> + +[discrete] +[[view-single-metric]] +== View single metric job results + +The **Single Metric Viewer** contains a chart that represents the actual and expected values over time: + +[role="screenshot"] +image::images/anomaly-detection-single-metric-viewer.png[Single Metric Viewer showing analysis] + +* The line in the chart represents the actual data values. +* The shaded area represents the bounds for the expected values. +* The area between the upper and lower bounds represents the most likely values for the model, using a 95% confidence level. +That is to say, there is a 95% chance of the actual value falling within these bounds. +If a value is outside of this area, it will usually be identified as anomalous. + +[TIP] +==== +Expected values are available only if **Enable model plot** was selected under Job Details +when you created the job. +==== + +To explore your data: + +. If the **Single Metric Viewer** is not already open, go to **Machine learning** → **Single metric viewer** and select the job you created. +. In the time filter, specify a time range that covers the majority of the analyzed data points. +. Notice that the model improves as it processes more data.
+At the beginning, the expected range of values is pretty broad, and the model is not capturing the periodicity in the data. +But it quickly learns and begins to reflect the patterns in your data. +The duration of the learning process heavily depends on the characteristics and complexity of the input data. +. Look for anomaly data points, depicted by colored dots or cross symbols, and hover over a data point to see more details about the anomaly. +Note that anomalies with medium or high multi-bucket impact are depicted with a cross symbol instead of a dot. ++ +.How are anomaly scores calculated? +[NOTE] +==== +Any data points outside the range that was predicted by the model are marked +as anomalies. In order to provide a sensible view of the results, an +_anomaly score_ is calculated for each bucket time interval. The anomaly score +is a value from 0 to 100, which indicates the significance of the anomaly +compared to previously seen anomalies. The highly anomalous values are shown in +red and the low scored values are shown in blue. An interval with a high +anomaly score is significant and requires investigation. +For more information about anomaly scores, refer to the {ml-docs}/ml-ad-explain.html[{ml}] documentation. +==== +. (Optional) Annotate your job results by drag-selecting a period of time and entering annotation text. +Annotations are notes that refer to events in a specific time period. +They can be created by the user or generated automatically by the anomaly detection job to reflect model changes and noteworthy occurrences. +. Under **Anomalies**, expand each anomaly to see key details, such as the time, the actual and expected ("typical") values, and their probability. 
+The **Anomaly explanation** section gives you further insights about each anomaly, such as its type and impact, to make it easier to interpret the job results: ++ +[role="screenshot"] +image::images/anomaly-detection-details.png[Single Metric Viewer showing anomaly details ] ++ +By default, the **Anomalies** table contains all anomalies that have a severity of "warning" or higher in the selected section of the timeline. +If you are only interested in critical anomalies, for example, you can change the severity threshold for this table. +. (Optional) From the **Actions** menu in the **Anomalies** table, you can choose to view relevant documents in **Discover** or create a job rule. +Job rules instruct anomaly detectors to change their behavior based on domain-specific knowledge that you provide. +To learn more, refer to <> + +After you have identified anomalies, often the next step is to try to determine +the context of those situations. For example, are there other factors that are +contributing to the problem? Are the anomalies confined to particular +applications or servers? You can begin to troubleshoot these situations by +layering additional jobs or creating multi-metric jobs. + +[discrete] +[[anomaly-explorer]] +== View advanced or multi-metric job results + +Conceptually, you can think of _multi-metric anomaly detection jobs_ as running multiple independent single metric jobs. +By bundling them together in a multi-metric job, however, +you can see an overall score and shared influencers for all the metrics and all the entities in the job. +Multi-metric jobs therefore scale better than having many independent single metric jobs. +They also provide better results when you have influencers that are shared across the detectors. + +.What is an influencer? +[NOTE] +==== +When you create an anomaly detection job, you can identify fields as _influencers_. 
+These are fields that you think contain information about someone or something that influences or contributes to anomalies. +As a best practice, do not pick too many influencers. +For example, you generally do not need more than three. +If you pick many influencers, the results can be overwhelming, and there is some overhead to the analysis. + +To learn more about influencers, refer to the {ml-docs}/ml-ad-run-jobs.html#ml-ad-influencers[{ml}] documentation. +==== + +You can also configure your anomaly detection jobs to split a single time series into multiple time series based on a categorical field. +For example, you could create a job for analyzing response code rates that has a single detector that splits the data based on the `response.keyword`, +and uses the `count` function to determine when the number of events is anomalous. +You might use a job like this if you want to look at both high and low request rates partitioned by response code. + +To view advanced or multi-metric results in the +**Anomaly Explorer**: + +. If the **Anomaly Explorer** is not already open, go to **Machine learning** → **Anomaly explorer** and select the job you created. +. In the time filter, specify a time range that covers the majority of the analyzed data points. +. If you specified influencers during job creation, the view includes a list of the top influencers for all of the detected anomalies in that same time period. +The list includes maximum anomaly scores, which in this case are aggregated for each influencer, for each bucket, across all detectors. +There is also a total sum of the anomaly scores for each influencer. +Use this list to help you narrow down the contributing factors and focus on the most anomalous entities. +. Under **Anomaly timeline**, click a section in the swim lanes to obtain more information about the anomalies in that time period. 
++ +[role="screenshot"] +image::images/anomaly-explorer.png[Anomaly Explorer showing swim lanes with anomaly selected ] ++ +You can see exact times when anomalies occurred. +If there are multiple detectors or metrics in the job, you can see which caught the anomaly. +You can also switch to viewing this time series in the **Single Metric Viewer** by selecting **View series** in the **Actions** menu. +. Under **Anomalies** (in the **Anomaly Explorer**), expand an anomaly to see key details, such as the time, +the actual and expected ("typical") values, and the influencers that contributed to the anomaly: ++ +[role="screenshot"] +image::images/anomaly-detection-multi-metric-details.png[Anomaly Explorer showing anomaly details ] ++ +By default, the **Anomalies** table contains all anomalies that have a severity of "warning" or higher in the selected section of the timeline. +If you are only interested in critical anomalies, for example, you can change the severity threshold for this table. ++ +If your job has multiple detectors, the table aggregates the anomalies to show the highest severity anomaly per detector and entity, +which is the field value that is displayed in the **found for** column. ++ +To view all the anomalies without any aggregation, set the **Interval** to **Show all**. + +[TIP] +==== +The anomaly scores that you see in each section of the **Anomaly Explorer** might differ slightly. +This disparity occurs because for each job there are bucket results, influencer results, and record results. +Anomaly scores are generated for each type of result. +The anomaly timeline uses the bucket-level anomaly scores. +The list of top influencers uses the influencer-level anomaly scores. +The list of anomalies uses the record-level anomaly scores. 
+==== + +[discrete] +[[observability-aiops-detect-anomalies-next-steps]] +== Next steps + +After setting up an anomaly detection job, you may want to: + +* <> +* <> +* <> diff --git a/docs/en/serverless/machine-learning/aiops-detect-change-points.asciidoc b/docs/en/serverless/machine-learning/aiops-detect-change-points.asciidoc new file mode 100644 index 0000000000..19d1e0420a --- /dev/null +++ b/docs/en/serverless/machine-learning/aiops-detect-change-points.asciidoc @@ -0,0 +1,68 @@ +[[observability-aiops-detect-change-points]] += Detect change points + +// :description: Detect distribution changes, trend changes, and other statistically significant change points in a metric of your time series data. +// :keywords: serverless, observability, how-to + +// + +The change point detection feature in {obs-serverless} detects distribution changes, +trend changes, and other statistically significant change points in time series data. +Unlike anomaly detection, change point detection does not require you to configure a job or generate a model. +Instead you select a metric and immediately see a visual representation that splits the time series into two parts, before and after the change point. + +{obs-serverless} uses a {ref}/search-aggregations-change-point-aggregation.html[change point aggregation] +to detect change points. This aggregation can detect change points when: + +* a significant dip or spike occurs +* the overall distribution of values has changed significantly +* there was a statistically significant step up or down in value distribution +* an overall trend change occurs + +To detect change points: + +. In your {obs-serverless} project, go to **Machine learning** → **Change point detection**. +. Choose a data view or saved search to access the data you want to analyze. +. Select a function: **avg**, **max**, **min**, or **sum**. +. In the time filter, specify a time range over which you want to detect change points. +. 
From the **Metric field** list, select a field you want to check for change points. +. (Optional) From the **Split field** list, select a field to split the data by. +If the cardinality of the split field exceeds 10,000, only the first 10,000 values, sorted by document count, are analyzed. +Use this option when you want to investigate the change point across multiple instances, pods, clusters, and so on. +For example, you may want to view CPU utilization split across multiple instances without having to jump across multiple dashboards and visualizations. + +[NOTE] +==== +You can configure a maximum of six combinations of a function applied to a metric field, partitioned by a split field, to identify change points. +==== + +The change point detection feature automatically dissects the time series into multiple points within the given time window, +tests whether the behavior is statistically different before and after each point in time, and then detects a change point if one exists: + +[role="screenshot"] +image::images/change-point-detection.png[Change point detection UI showing change points split by process] + +The resulting view includes: + +* The timestamp of the change point. +* A preview chart. +* The type of change point and its p-value. The p-value indicates the magnitude of the change; lower values indicate more significant changes. +* The name and value of the split field, if used. + +If the analysis is split by a field, a separate chart is shown for every partition that has a detected change point. +The chart displays the type of change point, its value, and the timestamp of the bucket where the change point was detected. + +On the **Change point detection** page, you can also: + +* Select a subset of charts and click **View selected** to view only the selected charts.
++ +[role="screenshot"] +image::images/change-point-detection-view-selected.png[View selected change point detection charts ] +* Filter the results by specific types of change points by using the change point type selector: ++ +[role="screenshot"] +image::images/change-point-detection-filter-by-type.png[Change point detection filter by type list] +* Attach change points to a chart or dashboard by using the context menu: ++ +[role="screenshot"] +image::images/change-point-detection-attach-charts.png[Change point detection add to charts menu] diff --git a/docs/en/serverless/machine-learning/aiops-tune-anomaly-detection-job.asciidoc b/docs/en/serverless/machine-learning/aiops-tune-anomaly-detection-job.asciidoc new file mode 100644 index 0000000000..0de2dae79f --- /dev/null +++ b/docs/en/serverless/machine-learning/aiops-tune-anomaly-detection-job.asciidoc @@ -0,0 +1,182 @@ +[[observability-aiops-tune-anomaly-detection-job]] += Tune your anomaly detection job + +// :description: Tune your job by creating calendars, adding job rules, and defining custom URLs. +// :keywords: serverless, observability, how-to + +:role: Editor +:goal: create calendars, add job rules, and define custom URLs +include::../partials/roles.asciidoc[] +:role!: + +:goal!: + +After you run an anomaly detection job and view the results, +you might find that you need to alter the job configuration or settings. + +To further tune your job, you can: + +* <> that contain a list of scheduled events for which you do not want to generate anomalies, such as planned system outages or public holidays. +* <> that instruct anomaly detectors to change their behavior based on domain-specific knowledge that you provide. +Your job rules can use filter lists, which contain values that you can use to include or exclude events from the {ml} analysis. +* <> to make dashboards and other resources readily available when viewing job results. 
+ +For more information about tuning your job, +refer to the how-to guides in the {ml-docs}/anomaly-how-tos.html[{ml}] documentation. +Note that the {ml} documentation may contain details that are not valid when using a fully managed Elastic project. + +[TIP] +==== +You can also create calendars and add URLs when configuring settings during job creation, +but generally it's easier to start with a simple job and add complexity later. +==== + +[discrete] +[[create-calendars]] +== Create calendars + +Sometimes there are periods when you expect unusual activity to take place, +such as bank holidays, "Black Friday", or planned system outages. +If you identify these events in advance, no anomalies are generated during those periods. +The {ml} model is not adversely affected, and you do not receive spurious results. + +To create a calendar and add scheduled events: + +. In your {obs-serverless} project, go to **Machine learning** → **Settings**. +. Under **Calendars**, click **Create**. +. Enter an ID and description for the calendar. +. Select the jobs you want to apply the calendar to, or turn on **Apply calendar to all jobs**. +. Under **Events**, click **New event** or click **Import events** to import events from an iCalendar (ICS) file: ++ +[role="screenshot"] +image::images/anomaly-detection-create-calendar.png[Create new calendar page] ++ +A scheduled event must have a start time, end time, and calendar ID. +In general, scheduled events are short in duration (typically lasting from a few hours to a day) and occur infrequently. +If you have regularly occurring events, such as weekly maintenance periods, +you do not need to create scheduled events for these circumstances; +they are already handled by the {ml} analytics. +If your ICS file contains recurring events, only the first occurrence is imported. +. When you're done adding events, save your calendar. + +You must identify scheduled events _before_ your anomaly detection job analyzes the data for that time period.
+{ml-cap} results are not updated retroactively. +Bucket results are generated during scheduled events, but they have an anomaly score of zero. + +[TIP] +==== +If you use long or frequent scheduled events, +it might take longer for the {ml} analytics to learn to model your data, +and some anomalous behavior might be missed. +==== + +[discrete] +[[create-job-rules]] +== Create job rules and filters + +By default, anomaly detection is unsupervised, +and the {ml} models have no awareness of the domain of your data. +As a result, anomaly detection jobs might identify events that are statistically significant but are uninteresting when you know the larger context. + +You can customize anomaly detection by creating custom job rules. +_Job rules_ instruct anomaly detectors to change their behavior based on domain-specific knowledge that you provide. +When you create a rule, you can specify conditions, scope, and actions. +When the conditions of a rule are satisfied, its actions are triggered. + +.Example use case for creating a job rule +[NOTE] +==== +If you have an anomaly detector that is analyzing CPU usage, +you might decide you are only interested in anomalies where the CPU usage is greater than a certain threshold. +You can define a rule with conditions and actions that instruct the detector to refrain from generating {ml} results when there are anomalous events related to low CPU usage. +You might also decide to add a scope for the rule so that it applies only to certain machines. +The scope is defined by using {ml} filters. +==== + +_Filters_ contain a list of values that you can use to include or exclude events from the {ml} analysis. +You can use the same filter in multiple anomaly detection jobs. + +.Example use case for creating a filter list +[NOTE] +==== +If you are analyzing web traffic, you might create a filter that contains a list of IP addresses. 
+The list could contain IP addresses that you trust to upload data to your website or to send large amounts of data from behind your firewall. +You can define the rule's scope so that the action triggers only when a specific field in your data matches (or doesn't match) a value in the filter. +This gives you much greater control over which anomalous events affect the {ml} model and appear in the {ml} results. +==== + +To create a job rule, first create any filter lists you want to use in the rule, then configure the rule: + +. In your {obs-serverless} project, go to **Machine learning** → **Settings**. +. (Optional) Create one or more filter lists: ++ +.. Under **Filter lists**, click **Create**. +.. Enter the filter list ID. This is the ID you will select when you want to use the filter list in a job rule. +.. Click **Add item** and enter one item per line. +.. Click **Add** then save the filter list: ++ +[role="screenshot"] +image::images/anomaly-detection-create-filter-list.png[Create filter list] +. Open the job results in the **Single Metric Viewer** or **Anomaly Explorer**. +. From the **Actions** menu in the **Anomalies** table, select **Configure job rules**. ++ +[role="screenshot"] +image::images/anomaly-detection-configure-job-rules.png[Configure job rules menu selection] +. Choose which actions to take when the job rule matches the anomaly: **Skip result**, **Skip model update**, or both. +. Under **Conditions**, add one or more conditions that must be met for the action to be triggered. +. Under **Scope** (if available), add one or more filter lists to limit where the job rule applies. +. Save the job rule. +Note that changes to job rules take effect for new results only. +To apply these changes to existing results, you must clone and rerun the job. + +[discrete] +[[define-custom-urls]] +== Define custom URLs + +You can optionally attach one or more custom URLs to your anomaly detection jobs. 
+Links for these URLs appear in the **Actions** menu of the **Anomalies** table when you view job results in the **Single Metric Viewer** or **Anomaly Explorer**. +Custom URLs can point to dashboards, the Discover app, or external websites. +For example, you can define a custom URL that enables users to drill down to the source data from the results set. + +To add a custom URL to the **Actions** menu: + +. In your {obs-serverless} project, go to **Machine learning** → **Jobs**. +. From the **Actions** menu in the job list, select **Edit job**. +. Select the **Custom URLs** tab, then click **Add custom URL**. +. Enter the label to use for the link text. +. Choose the type of resource you want to link to: ++ +|=== +| If you select... | Do this... + +| {kib} dashboard +| Select the dashboard you want to link to. + +| Discover +| Select the data view to use. + +| Other +| Specify the URL for the external website. +|=== +. Click **Test** to test your link. +. Click **Add**, then save your changes. + +Now when you view job results in the **Single Metric Viewer** or **Anomaly Explorer**, +the **Actions** menu includes the custom link: + +[role="screenshot"] +image::images/anomaly-detection-custom-url.png[Custom URL in the Actions menu] + +[TIP] +==== +You can also use string substitution in custom URLs. +For example, you might have a **Raw data** URL defined as: + +`discover#/?_g=(time:(from:'$earliest$',mode:absolute,to:'$latest$'))&_a=(index:ff959d40-b880-11e8-a6d9-e546fe2bba5f,query:(language:kuery,query:'customer_full_name.keyword:"$customer_full_name.keyword$"'))`. + +The value of the `customer_full_name.keyword` field is passed to the target page when the link is clicked. + +For more information about using string substitution, +refer to the {ml-docs}/ml-configuring-url.html#ml-configuring-url-strings[{ml}] documentation. +Note that the {ml} documentation may contain details that are not valid when using a fully managed Elastic project.
+==== diff --git a/docs/en/serverless/observability-overview.asciidoc b/docs/en/serverless/observability-overview.asciidoc new file mode 100644 index 0000000000..45b4819cbc --- /dev/null +++ b/docs/en/serverless/observability-overview.asciidoc @@ -0,0 +1,149 @@ +[[observability-serverless-observability-overview]] += Observability overview + +// :description: Learn how to accelerate problem resolution with open, flexible, and unified observability powered by advanced machine learning and analytics. +// :keywords: serverless, observability, overview + +{obs-serverless} provides granular insights and context into the behavior of applications running in your environments. +It's an important part of any system that you build and want to monitor. +Being able to detect and fix root cause events quickly within an observable system is a minimum requirement for any analyst. + +{obs-serverless} provides a single stack to unify your logs, metrics, and application traces. +Ingest your data directly to your Observability project, where you can further process and enhance the data, +before visualizing it and adding alerts. + +image::images/serverless-capabilities.svg[{obs-serverless} overview diagram] + +[discrete] +[[apm-overview]] +== Log monitoring + +Analyze log data from your hosts, services, Kubernetes, Apache, and many more. + +In **Logs Explorer** (powered by Discover), you can quickly search and filter your log data, +get information about the structure of the fields, and display your findings in a visualization. + +[role="screenshot"] +image::images/log-explorer-overview.png[Logs Explorer showing log events] + +<> + +// RUM is not supported for this release. + +// Synthetic monitoring is not supported for this release. + +// Universal Profiling is not supported for this release. 
+ +[discrete] +[[observability-serverless-observability-overview-application-performance-monitoring-apm]] +== Application performance monitoring (APM) + +Instrument your code and collect performance data and errors at runtime by installing APM agents for languages such as Java, Go, .NET, and many more. +Then use {obs-serverless} to monitor your software services and applications in real time: + +* Visualize detailed performance information on your services. +* Identify and analyze errors. +* Monitor host-level and APM agent-specific metrics like JVM and Go runtime metrics. + +The **Service** inventory provides a quick, high-level overview of the health and general performance of all instrumented services. + +[role="screenshot"] +image::images/services-inventory.png[Service inventory showing health and performance of instrumented services] + +<> + +[discrete] +[[metrics-overview]] +== Infrastructure monitoring + +Monitor system and service metrics from your servers, Docker, Kubernetes, Prometheus, and other services and applications. + +The **Infrastructure** UI provides two ways to view and analyze metrics across your infrastructure: + +The **Infrastructure inventory** page provides a view of your infrastructure grouped by resource type. + +[role="screenshot"] +image::images/metrics-app.png[{infrastructure-app} in {kib}] + +The **Hosts** page provides a dashboard-like view of your infrastructure and is backed by an easy-to-use interface called Lens. + +[role="screenshot"] +image::images/hosts.png[Screenshot of the Hosts page] + +From either page, you can view health and performance metrics to get visibility into the overall health of your infrastructure. +You can also drill down into details about a specific host, including performance metrics, host metadata, running processes, +and logs.
+ +<> + +[discrete] +[[observability-serverless-observability-overview-synthetic-monitoring]] +== Synthetic monitoring + +Simulate actions and requests that an end user would perform on your site at predefined intervals and in a controlled environment. +The end result is rich, consistent, and repeatable data that you can trend and alert on. + +For more information, see <>. + +[discrete] +[[observability-serverless-observability-overview-alerting]] +== Alerting + +Stay aware of potential issues in your environments with {obs-serverless}’s alerting +and actions feature that integrates with log monitoring and APM. +It provides a set of built-in actions and specific threshold rules +and enables central management of all rules. + +On the **Alerts** page, the **Alerts** table provides a snapshot of alerts occurring within the specified time frame. The table includes the alert status, when it was last updated, the reason for the alert, and more. + +[role="screenshot"] +image::images/observability-alerts-overview.png[Summary of Alerts on the {obs-serverless} overview page] + +<> + +[discrete] +[[observability-serverless-observability-overview-service-level-objectives-slos]] +== Service-level objectives (SLOs) + +Set clear, measurable targets for your service performance, +based on factors like availability, response times, error rates, and other key metrics. +Then monitor and track your SLOs in real time, +using detailed dashboards and alerts that help you quickly identify and troubleshoot issues. + +From the SLO overview list, you can see all of your SLOs and a quick summary of what’s happening in each one: + +[role="screenshot"] +image::images/slo-dashboard.png[Dashboard showing list of SLOs] + +<> + +[discrete] +[[observability-serverless-observability-overview-cases]] +== Cases + +Collect and share information about observability issues by creating cases. 
+Cases allow you to track key investigation details, +add assignees and tags to your cases, set their severity and status, and add alerts, +comments, and visualizations. You can also send cases to third-party systems, +such as ServiceNow and Jira. + +[role="screenshot"] +image::images/cases.png[Screenshot showing list of cases] + +<> + +[discrete] +[[observability-serverless-observability-overview-aiops]] +== Machine learning and AIOps + +Reduce the time and effort required to detect, understand, investigate, and resolve incidents at scale +by leveraging predictive analytics and machine learning: + +* Detect anomalies by comparing real-time and historical data from different sources to look for unusual, problematic patterns. +* Find and investigate the causes of unusual spikes or drops in log rates. +* Detect distribution changes, trend changes, and other statistically significant change points in a metric of your time series data. + +[role="screenshot"] +image::images/log-rate-analysis.png[Log rate analysis page showing log rate spike ] + +<> diff --git a/docs/en/serverless/reference/metrics-app-fields.asciidoc b/docs/en/serverless/reference/metrics-app-fields.asciidoc new file mode 100644 index 0000000000..7da2ceccce --- /dev/null +++ b/docs/en/serverless/reference/metrics-app-fields.asciidoc @@ -0,0 +1,295 @@ +[[observability-infrastructure-monitoring-required-fields]] += Infrastructure app fields + +// :description: Learn about the fields required to display data in the Infrastructure UI. +// :keywords: serverless, observability, reference + +This section lists the fields the Infrastructure UI uses to display data. +Please note that some of the fields listed here are not {ecs-ref}/ecs-reference.html#_what_is_ecs[ECS fields]. + +[discrete] +[[observability-infrastructure-monitoring-required-fields-additional-field-details]] +== Additional field details + +The `event.dataset` field is required to display data properly in some views. 
This field +is a combination of `metricset.module`, which is the {metricbeat} module name, and `metricset.name`, +which is the metricset name. + +To determine each metric's optimal time interval, all charts use `metricset.period`. +If `metricset.period` is not available, the charts fall back to 1-minute intervals. + +[discrete] +[[base-fields]] +== Base fields + +The `base` field set contains all fields that are at the top level. These fields are common across all types of events. + +|=== +| Field | Description | Type + +| `@timestamp` +a| Date/time when the event originated. + +This is the date/time extracted from the event, typically representing when the source generated the event. +If the event source has no original timestamp, this value is typically populated with the time the pipeline first received the event. +Required field for all events. + +Example: `May 27, 2020 @ 15:22:27.982` +| date + +| `message` +a| For log events, the `message` field contains the log message, optimized for viewing in a log viewer. + +For structured logs without an original message field, other fields can be concatenated to form a human-readable summary of the event. + +If multiple messages exist, they can be combined into one message. + +Example: `Hello World` +| text +|=== + +[discrete] +[[host-fields]] +== Hosts fields + +These fields must be mapped to display host data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `host.name` +a| Name of the host. + +It can contain what `hostname` returns on Unix systems, the fully qualified domain name, or a name specified by the user. The sender decides which value to use. + +Example: `MacBook-Elastic.local` +| keyword + +| `host.ip` +| IP of the host that records the event. +| ip +|=== + +[discrete] +[[docker-fields]] +== Docker container fields + +These fields must be mapped to display Docker container data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `container.id` +a| Unique container ID.
+ +Example: `data` +| keyword + +| `container.name` +| Container name. +| keyword + +| `container.ip_address` +a| IP of the container. + +_Not an ECS field_ +| ip +|=== + +[discrete] +[[kubernetes-fields]] +== Kubernetes pod fields + +These fields must be mapped to display Kubernetes pod data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `kubernetes.pod.uid` +a| Kubernetes Pod UID. + +Example: `8454328b-673d-11ea-7d80-21010a840123` + +_Not an ECS field_ +| keyword + +| `kubernetes.pod.name` +a| Kubernetes pod name. + +Example: `nginx-demo` + +_Not an ECS field_ +| keyword + +| `kubernetes.pod.ip` +a| IP of the Kubernetes pod. + +_Not an ECS field_ +| keyword +|=== + +[discrete] +[[aws-ec2-fields]] +== AWS EC2 instance fields + +These fields must be mapped to display EC2 instance data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `cloud.instance.id` +a| Instance ID of the host machine. + +Example: `i-1234567890abcdef0` +| keyword + +| `cloud.instance.name` +| Instance name of the host machine. +| keyword + +| `aws.ec2.instance.public.ip` +a| Instance public IP of the host machine. + +_Not an ECS field_ +| keyword +|=== + +[discrete] +[[aws-s3-fields]] +== AWS S3 bucket fields + +These fields must be mapped to display S3 bucket data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `aws.s3.bucket.name` +a| The name or ID of the AWS S3 bucket. + +_Not an ECS field_ +| keyword +|=== + +[discrete] +[[aws-sqs-fields]] +== AWS SQS queue fields + +These fields must be mapped to display SQS queue data in the {infrastructure-app}. + +|=== +| Field | Description | Type + +| `aws.sqs.queue.name` +a| The name or ID of the AWS SQS queue. + +_Not an ECS field_ +| keyword +|=== + +[discrete] +[[aws-rds-fields]] +== AWS RDS database fields + +These fields must be mapped to display RDS database data in the {infrastructure-app}. 
+ +|=== +| Field | Description | Type + +| `aws.rds.db_instance.arn` +a| Amazon Resource Name (ARN) of the RDS DB instance. + +_Not an ECS field_ +| keyword + +| `aws.rds.db_instance.identifier` +a| Contains a user-supplied database identifier. This identifier is the unique key that identifies a DB instance. + +_Not an ECS field_ +| keyword +|=== + +[discrete] +[[group-inventory-fields]] +== Additional grouping fields + +Depending on which entity you select in the **Infrastructure inventory** view, you can map these additional fields to group entities. + +|=== +| Field | Description | Type + +| `cloud.availability_zone` +a| Availability zone in which this host is running. + +Example: `us-east-1c` +| keyword + +| `cloud.machine.type` +a| Machine type of the host machine. + +Example: `t2.medium` +| keyword + +| `cloud.region` +a| Region in which this host is running. + +Example: `us-east-1` +| keyword + +| `cloud.instance.id` +a| Instance ID of the host machine. + +Example: `i-1234567890abcdef0` +| keyword + +| `cloud.provider` +a| Name of the cloud provider. Example values are `aws`, `azure`, `gcp`, or `digitalocean`. + +Example: `aws` +| keyword + +| `cloud.instance.name` +| Instance name of the host machine. +| keyword + +| `cloud.project.id` +a| Name of the project in Google Cloud. + +_Not an ECS field_ +| keyword + +| `service.type` +a| The type of service that data is collected from. + +The type can be used to group and correlate logs and metrics from one service type. + +For example, the service type for metrics collected from {es} is `elasticsearch`. + +Example: `elasticsearch` + +_Not an ECS field_ +| keyword + +| `host.hostname` +a| Name of the host. This field is required if you want to use {ml-features}. + +It normally contains what the `hostname` command returns on the host machine. + +Example: `Elastic.local` +| keyword + +| `host.os.name` +a| Operating system name, without the version.
+ +Multi-fields: + +os.name.text (type: text) + +Example: `Mac OS X` +| keyword + +| `host.os.kernel` +a| Operating system kernel version as a raw string. + +Example: `4.4.0-112-generic` +| keyword +|===
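As a rough illustration of how the field lists in this reference might be used, the following Python sketch checks whether a sample document carries the fields listed for a given entity type. It is not part of the Infrastructure UI or any Elastic API. The `REQUIRED_FIELDS` table, the entity labels (`host`, `docker`, and so on), and the `has_field` and `missing_fields` helpers are hypothetical names invented for this example. The check inspects a single document only, not the index mappings, and it assumes the document is either fully nested or uses flat dotted keys.

```python
from typing import Any

# Field names copied from the tables in this reference.
# The entity labels used as keys are hypothetical, chosen for this sketch.
REQUIRED_FIELDS: dict[str, list[str]] = {
    "host": ["host.name", "host.ip"],
    "docker": ["container.id", "container.name", "container.ip_address"],
    "kubernetes_pod": [
        "kubernetes.pod.uid",
        "kubernetes.pod.name",
        "kubernetes.pod.ip",
    ],
    "aws_ec2": [
        "cloud.instance.id",
        "cloud.instance.name",
        "aws.ec2.instance.public.ip",
    ],
    "aws_s3": ["aws.s3.bucket.name"],
    "aws_sqs": ["aws.sqs.queue.name"],
    "aws_rds": ["aws.rds.db_instance.arn", "aws.rds.db_instance.identifier"],
}


def has_field(doc: dict[str, Any], path: str) -> bool:
    """Return True if the dotted `path` is present in `doc`.

    Accepts a flat dotted key ({"host.name": ...}) or a fully nested
    object ({"host": {"name": ...}}). Mixed forms are not handled.
    """
    if path in doc:
        return True
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return True


def missing_fields(doc: dict[str, Any], entity: str) -> list[str]:
    """List the fields for `entity` that are absent from `doc`."""
    return [f for f in REQUIRED_FIELDS[entity] if not has_field(doc, f)]


if __name__ == "__main__":
    sample = {"@timestamp": "2020-05-27T15:22:27.982Z", "host": {"name": "web-1"}}
    print(missing_fields(sample, "host"))
```

For the sample document, only `host.ip` is reported as missing, because `host.name` is found through the nested `host` object. A check like this is only a quick sanity test on sample data; whether a field is actually usable in the UI still depends on how it is mapped in the index.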