Enhance Histogram feature implementation of Prometheus Server Integration #5042

Open
gizas opened this issue Jan 18, 2023 · 9 comments
Labels
enhancement New feature or request Integration:prometheus Prometheus Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring]

Comments

@gizas
Contributor

gizas commented Jan 18, 2023

Context

Histograms are one of the metric types supported by the Prometheus toolkit. In general, histograms provide pre-aggregated numerical values grouped into buckets.

In our Prometheus integration, we currently support the histogram type through the Use types option. When this option is enabled, we retrieve the Prometheus metrics categorised as histograms and index them in Elasticsearch. We have identified that supporting histograms in Elasticsearch requires specific pre-processing at index time in our integration package. Additionally, related efforts (1165, 26903) revealed possible enhancements to our code.

Diagnosis

Users have reported differences between the histograms scraped from Prometheus and the ones we store in Elasticsearch. This revealed the extra calculation we perform at ingestion time, and also the need to document and explain the procedure to our users.

Prometheus Buckets scraped:

Bucket Value
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="+Inf"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="0.1"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="0.2"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="0.4"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="1"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="120"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="20"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="3"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="60"} 1
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus", le="8"} 1

Elasticsearch histogram we ingest (retrieved from Kibana Discover):

"prometheus": {
      "prometheus_http_request_duration_seconds": {
        "histogram": {
          "counts": [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ],
          "values": [
            0.05,
            0.15000000000000002,
            0.30000000000000004,
            0.7,
            2,
            5.5,
            14,
            40,
            90,
            180
          ]
        }
      },
      "labels": {
        "handler": "/api/v1/label/:name/values",
        "instance": "prometheus-server-server.kube-system:80",
        "job": "prometheus"
      }
}

Questions that we need to answer:

  • Why are the le bucket values different from the ones we see in Elasticsearch?
  • What is the value in Elasticsearch of the +Inf bucket?
    (Code Ref) In our example: 120 + (120 - 60) = 180, which matches the last value, 180.
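
The values in the Elasticsearch document above are consistent with each finite bucket being reported at the midpoint between the previous upper bound (zero for the first bucket) and its own, and with the +Inf bucket being extrapolated by the width of the last finite bucket. Below is a minimal Python sketch of that arithmetic. It is an illustration under those assumptions, not the integration's actual code, and the function name `bucket_values` is hypothetical:

```python
import math


def bucket_values(upper_bounds):
    """Map Prometheus `le` upper bounds to one representative value per bucket.

    Assumption: finite buckets use the midpoint between the previous upper
    bound (0 for the first bucket) and their own; +Inf is extrapolated as
    last + (last - previous).
    """
    values = []
    prev_upper = 0.0  # upper bound of the bucket before the last finite one
    last_upper = 0.0  # upper bound of the last finite bucket seen so far
    for upper in sorted(upper_bounds):
        if math.isinf(upper):
            # +Inf has no midpoint, so extend by the last finite bucket's width.
            values.append(last_upper + (last_upper - prev_upper))
        else:
            values.append(last_upper + (upper - last_upper) / 2.0)
            prev_upper, last_upper = last_upper, upper
    return values


# `le` bounds from the scraped buckets above, in numeric order.
bounds = [0.1, 0.2, 0.4, 1, 3, 8, 20, 60, 120, math.inf]
print(bucket_values(bounds))
```

For the ten bounds in the example this yields, up to floating-point rounding, the same ten values as the Elasticsearch document, ending in 180 for +Inf.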

Additionally, for prometheus_http_request_duration_seconds, Prometheus exposes prometheus_http_request_duration_seconds_count and prometheus_http_request_duration_seconds_sum, both with value 1 in this example:

prometheus_http_request_duration_seconds_count{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus"} 1

prometheus_http_request_duration_seconds_sum{handler="/api/v1/label/:name/values", instance="localhost:9090", job="prometheus"} 1

The count and sum values are not returned by our code, so they are not present in Elasticsearch. Is there any valid scenario where they might be needed?
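
One scenario where sum and count could matter is computing an exact average: a bucketed histogram only gives an estimate, because every observation is replaced by its bucket's representative value. A small Python sketch with made-up numbers (the observations, bounds, and function names are hypothetical and not taken from the example above):

```python
def exact_mean(total_sum, total_count):
    """Mean computed from Prometheus `_sum` and `_count` samples."""
    return total_sum / total_count if total_count else 0.0


def estimated_mean(values, counts):
    """Mean approximated from an Elasticsearch-style histogram document."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(v * c for v, c in zip(values, counts)) / total


# Hypothetical data: three requests taking 0.01 s, 0.02 s and 0.39 s.
observations = [0.01, 0.02, 0.39]
# With `le` bounds 0.1, 0.2, 0.4 the midpoint values are 0.05, 0.15, 0.3,
# and the per-bucket (non-cumulative) counts are 2, 0, 1.
values = [0.05, 0.15, 0.3]
counts = [2, 0, 1]

print(exact_mean(sum(observations), len(observations)))
print(estimated_mean(values, counts))
```

The two results differ because the estimate only knows which bucket each request fell into, not its actual duration.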

Also, the prometheus_http_request_duration_seconds_histogram field is not available for searching or filtering in Kibana Discover.

Screenshot 2023-01-19 at 2 59 27 PM

Compared with other fields:
Screenshot 2023-01-19 at 3 01 45 PM

Action

This story summarizes the actions needed to enhance Prometheus histogram support in our integration, grouped by category:

Code Enhancements:

  • Account for negative count values inside initial buckets
  • Use the preceding bucket's value for +Inf "le"
  • For the first bucket only: if it has a negative "le", use the value as-is; otherwise use half its value (midpoint to zero)
  • Investigate whether we need to provide sum and count values in addition to the ones we provide now
  • Can we retrieve and index histogram buckets exactly as retrieved from Prometheus? If not, we need to document this; if yes, we need to evaluate whether to support it as a new enhancement in the code. Are there any Elasticsearch limitations that prevent us from doing this?
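
The first three items in this list can be read as adjustments to how bucket values and counts are derived. The Python sketch below shows one possible reading only: the function names are hypothetical, and clamping negative differences to zero is an assumed interpretation of "account for negative count values". The actual change may differ.

```python
import math


def proposed_values(upper_bounds):
    """Representative value per bucket under the proposed rules (one reading)."""
    values = []
    last_upper = None  # no finite bucket seen yet
    for upper in sorted(upper_bounds):
        if math.isinf(upper):
            # +Inf: reuse the preceding bucket's upper bound.
            values.append(last_upper if last_upper is not None else 0.0)
            continue
        if last_upper is None:
            # First bucket: negative bound as-is, otherwise midpoint to zero.
            values.append(upper if upper < 0 else upper / 2.0)
        else:
            # Other buckets: midpoint between neighbouring upper bounds.
            values.append(last_upper + (upper - last_upper) / 2.0)
        last_upper = upper
    return values


def per_bucket_counts(cumulative_counts):
    """De-cumulate bucket counts, clamping negative differences to zero."""
    counts, previous = [], 0
    for current in cumulative_counts:
        counts.append(max(current - previous, 0))
        previous = current
    return counts


print(proposed_values([0.1, 0.2, math.inf]))
print(proposed_values([-1, 1, math.inf]))
print(per_bucket_counts([5, 3, 7]))
```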

Kibana Support:

  • We need to create a visualisation based on histograms, and understand the functions recommended for use with histograms, such as aggregations and buckets.

Check available Use Cases of histograms here

Documentation Enhancement:

Deliverables

  • Relevant code improvements in Prometheus code base
  • Documentation updates that will explain the end-to-end user journey and support of Histogram type

Relevant Links

Useful External links

@gizas gizas added enhancement New feature or request Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team [elastic/obs-cloudnative-monitoring] labels Jan 18, 2023
@ruflin
Contributor

ruflin commented Jan 23, 2023

For the histograms in Elasticsearch that you showed above as sample outputs, are these stored with histogram field type? If yes, is this done by a dynamic template? In the case of APM Agents AFAIK there is a dynamic template name that gets assigned in the ingest pipeline: https://github.com/elastic/apm-server/blob/main/apmpackage/apm/data_stream/app_metrics/elasticsearch/ingest_pipeline/default.yml#L16-L35

Can you get the exact mapping of the index that is created for the fields you listed above? This will also help investigate why the fields do not show up in the query.

@gizas
Contributor Author

gizas commented Jan 27, 2023

Indeed, they are stored as the .histogram type when Use Types is enabled.

Also to add some more details after some more testing:

None of the histograms appear in the filter search box. So maybe I cannot create a filter on histogram-type fields?

@ruflin
Contributor

ruflin commented Jan 30, 2023

then bucket fields are present in Elasticsearch although they come as Unmapped

What does Unmapped mean in this context? Is it a keyword?

None of the histograms appear in the filter search box. So maybe I cannot create a filter on histogram-type fields?

@gizas Can you share the query / filter you would want to run on this histogram? You are correct that you can't filter on a histogram value; you can only run aggregations.
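
As an illustration of aggregation-only usage, a percentiles aggregation can be run directly on a histogram field. The request body below is a sketch built as a Python dict: the field path is assumed from the sample document earlier in this issue, and the aggregation name and chosen percentiles are arbitrary.

```python
import json

# Field path assumed from the sample document shown earlier in this issue.
FIELD = "prometheus.prometheus_http_request_duration_seconds.histogram"

request_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        # Aggregation name is arbitrary.
        "request_duration_percentiles": {
            "percentiles": {"field": FIELD, "percents": [50, 95, 99]}
        }
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON would be sent as the body of a `_search` request against the relevant data stream or index.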

@gizas
Contributor Author

gizas commented Jan 31, 2023

Unmapped= Fields that are not explicitly matched to a field data type

When Use_types is disabled, there is no reference in our mappings file for the .bucket fields mentioned above.

And as for the filter I was trying the simple exists filter, like: prometheus_http_request_duration_seconds.histogram: *

@ruflin
Contributor

ruflin commented Feb 1, 2023

Unmapped= Fields that are not explicitly matched to a field data type

Ok, I assume all these fields fall back to keyword as it is set as default mapping?

And as for the filter I was trying the simple exists filter, like: prometheus_http_request_duration_seconds.histogram: *

Is this just for testing, or is it something you want to use in visualisations? An exists query only works when the field is indexed, which is not the case for histogram fields and likely will not be the case for most metrics in the future. The assumption we follow is that most metrics are used for aggregations rather than filtering. Does this apply here too?

@gizas
Contributor Author

gizas commented Feb 1, 2023

_bucket fields are coming as Numbers, just double checked.

And yes, the filter was just for testing. I guess for now it indeed does not seem to be a valid user scenario.

@gizas
Contributor Author

gizas commented Feb 1, 2023

I was performing the tests with a managed agent and was switching Use_types and Rate_Counters on and off. That is why the fields come as Unmapped until we refresh the whole browser. It seems Kibana has some kind of cache, or does some mapping in the background, and until you refresh you don't get the updates.

So, bottom line: _bucket fields come as Numbers, which are actually mapped as double based on here

And _histogram fields are mapped as histograms based on here.

Documentation for not indexing histograms see here

@ruflin
Contributor

ruflin commented Feb 2, 2023

_bucket fields are coming as Numbers, just double checked.

For these mappings, let's make sure we always refer to the Elasticsearch types / mappings and not Kibana data views, as the ES ones are the ones that count. Glad to see the follow-up details on the actual mapping in Elasticsearch.

@gizas Let me know if there is anything further you need on my end.

@tetianakravchenko
Contributor

Code Enhancements:

  • Account for negative count values inside initial buckets
  • Use the preceding bucket's value for +Inf "le"
  • For the first bucket only: if it has a negative "le", use the value as-is; otherwise use half its value (midpoint to zero)

The 3 points above are covered by elastic/beats#36647

  • Investigate whether we need to provide sum and count values in addition to the ones we provide now
  • Can we retrieve and index histogram buckets exactly as retrieved from Prometheus? If not, we need to document this; if yes, we need to evaluate whether to support it as a new enhancement in the code. Are there any Elasticsearch limitations that prevent us from doing this?

These need investigation.
