Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit OTLP metric histogram buckets to 100 #204

Closed
wants to merge 5 commits into from
Closed

Conversation

thpierce
Copy link
Contributor

@thpierce thpierce commented Jun 10, 2024

Description

Per AWS EMF specifications, metric histograms (numeric arrays) may contain no more than 100 values. Exceeding this results in invalid metric records. For the short term, we have decided that we will simply limit the number of histogram buckets to 100, to avoid this issue.

In this pull request, we are applying this limitation in aws_opentelemetry_configurator by passing in a preferred_aggregation to the OTLPMetricExporters . However, we are also doing a lot more in this PR. We never used to pass this information to the exporters because in the OTEL Python SDK release we currently use, the preferred_aggregation was not used. To work around this, in aws_opentelemetry_distro, we set OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION to base2_exponential_bucket_histogram so that ALL created histograms would use base2exponential. However, this would not allow us to modify the max number of buckets. Fortunately, the root issue was fixed in the upstream in this commit. In this PR, we are creating a patch for OTLPMetricExporter that applies this commit - the diff should be identical except where we had to apply linters (which was separated out in a separate commit to make review easier). Because we no longer require the default histogram config, we have removed it from aws_opentelemtry_distro. A few other notes:

  1. I disabled code coverage for the new patch as there is a lot of patched code that is unchanged and it is not valuable to test
  2. test_instrumentation_patch.py had to be significantly re-written for reasons that are made clear in the class-level comments.
  3. The patch can and should be removed when we take new versions from the upstream.

Testing Done

Modified AwsSpanMetricsProcessor and contract tests to generate data that will reliably produce 160 histogram buckets (essentially in AwsSpanMetricsProcessor, add a loop that adds a bunch of values that we saw trigger this issue, then print everything in contract tests).

Scenario 1: Without any changes (note 161 buckets):

metrics {
  name: "Latency"
  unit: "Milliseconds"
  exponential_histogram {
    data_points {
      attributes {
        key: "aws.local.service"
        value {
          string_value: "aws-application-signals-tests-requests-app"
        }
      }
      attributes {
        key: "aws.local.operation"
        value {
          string_value: "InternalOperation"
        }
      }
      attributes {
        key: "aws.span.kind"
        value {
          string_value: "LOCAL_ROOT"
        }
      }
      start_time_unix_nano: 1718035808451029914
      time_unix_nano: 1718035817029199502
      count: 4629
      sum: 46.733796195585931
      scale: 4
      positive {
        offset: -134, bucket_counts: 88, bucket_counts: 1001, bucket_counts: 285, bucket_counts: 241, bucket_counts: 224, bucket_counts: 109, bucket_counts: 73, bucket_counts: 128, bucket_counts: 127, bucket_counts: 88, bucket_counts: 85, bucket_counts: 104, bucket_counts: 119, bucket_counts: 187, bucket_counts: 155, bucket_counts: 172, bucket_counts: 102, bucket_counts: 109, bucket_counts: 120, bucket_counts: 94, bucket_counts: 81, bucket_counts: 104, bucket_counts: 79, bucket_counts: 72, bucket_counts: 65, bucket_counts: 71, bucket_counts: 62, bucket_counts: 63, bucket_counts: 44, bucket_counts: 54, bucket_counts: 31, bucket_counts: 40, bucket_counts: 23, bucket_counts: 21, bucket_counts: 13, bucket_counts: 8, bucket_counts: 14, bucket_counts: 5, bucket_counts: 4, bucket_counts: 4, bucket_counts: 3, bucket_counts: 2, bucket_counts: 1, bucket_counts: 4, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 2, bucket_counts: 6, bucket_counts: 2, bucket_counts: 5, bucket_counts: 3, bucket_counts: 3, bucket_counts: 4, bucket_counts: 6, bucket_counts: 6, bucket_counts: 6, bucket_counts: 3, bucket_counts: 6, bucket_counts: 4, bucket_counts: 1, bucket_counts: 3, bucket_counts: 3, bucket_counts: 0, bucket_counts: 3, bucket_counts: 1, bucket_counts: 4, bucket_counts: 1, bucket_counts: 2, bucket_counts: 5, bucket_counts: 2, bucket_counts: 4, bucket_counts: 3, bucket_counts: 3, bucket_counts: 1, bucket_counts: 3, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 2, bucket_counts: 3, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 1, bucket_counts: 5, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
      }
      negative { 
        bucket_counts: 0
      }
      min: 0.0030788097239816811
      max: 2.0442737824274104
    }
    data_points {
      attributes {
        key: "aws.local.service"
        value {
          string_value: "aws-application-signals-tests-requests-app"
        }
      }
      attributes {
        key: "aws.local.operation"
        value {
          string_value: "InternalOperation"
        }
      }
      attributes {
        key: "aws.remote.service"
        value {
          string_value: "backend:8080"
        }
      }
      attributes {
        key: "aws.remote.operation"
        value {
          string_value: "GET /backend"
        }
      }
      attributes {
        key: "aws.span.kind"
        value {
          string_value: "CLIENT"
        }
      }
      start_time_unix_nano: 1718035808451029914
      time_unix_nano: 1718035817029199502
      count: 4629
      sum: 46.733796195585931
      scale: 4
      positive {
        offset: -134, bucket_counts: 88, bucket_counts: 1001, bucket_counts: 285, bucket_counts: 241, bucket_counts: 224, bucket_counts: 109, bucket_counts: 73, bucket_counts: 128, bucket_counts: 127, bucket_counts: 88, bucket_counts: 85, bucket_counts: 104, bucket_counts: 119, bucket_counts: 187, bucket_counts: 155, bucket_counts: 172, bucket_counts: 102, bucket_counts: 109, bucket_counts: 120, bucket_counts: 94, bucket_counts: 81, bucket_counts: 104, bucket_counts: 79, bucket_counts: 72, bucket_counts: 65, bucket_counts: 71, bucket_counts: 62, bucket_counts: 63, bucket_counts: 44, bucket_counts: 54, bucket_counts: 31, bucket_counts: 40, bucket_counts: 23, bucket_counts: 21, bucket_counts: 13, bucket_counts: 8, bucket_counts: 14, bucket_counts: 5, bucket_counts: 4, bucket_counts: 4, bucket_counts: 3, bucket_counts: 2, bucket_counts: 1, bucket_counts: 4, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 2, bucket_counts: 6, bucket_counts: 2, bucket_counts: 5, bucket_counts: 3, bucket_counts: 3, bucket_counts: 4, bucket_counts: 6, bucket_counts: 6, bucket_counts: 6, bucket_counts: 3, bucket_counts: 6, bucket_counts: 4, bucket_counts: 1, bucket_counts: 3, bucket_counts: 3, bucket_counts: 0, bucket_counts: 3, bucket_counts: 1, bucket_counts: 4, bucket_counts: 1, bucket_counts: 2, bucket_counts: 5, bucket_counts: 2, bucket_counts: 4, bucket_counts: 3, bucket_counts: 3, bucket_counts: 1, bucket_counts: 3, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 2, bucket_counts: 3, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 1, bucket_counts: 5, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
      }
      negative { 
        bucket_counts: 0
      }
      min: 0.0030788097239816811
      max: 2.0442737824274104
    }
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
  }
}

Scenario 2: Without patch applied (note it's using explicit buckets because we removed OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION logic):

    metrics {
      name: "Latency"
      unit: "Milliseconds"
      histogram {
        data_points {
          attributes {
            key: "aws.local.service"
            value {
              string_value: "aws-application-signals-tests-requests-app"
            }
          }
          attributes {
            key: "aws.local.operation"
            value {
              string_value: "InternalOperation"
            }
          }
          attributes {
            key: "aws.span.kind"
            value {
              string_value: "LOCAL_ROOT"
            }
          }
          start_time_unix_nano: 1718048103153902275
          time_unix_nano: 1718048106929937849
          count: 109
          sum: 15.14611171299984, 
          bucket_counts: 0, bucket_counts: 109, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
          explicit_bounds: 0, explicit_bounds: 5, explicit_bounds: 10, explicit_bounds: 25, explicit_bounds: 50, explicit_bounds: 75, explicit_bounds: 100, explicit_bounds: 250, explicit_bounds: 500, explicit_bounds: 750, explicit_bounds: 1000, explicit_bounds: 2500, explicit_bounds: 5000, explicit_bounds: 7500, explicit_bounds: 10000
          min: 0.0030788097239816811
          max: 2.0442737824274104
        }
        data_points {
          attributes {
            key: "aws.local.service"
            value {
              string_value: "aws-application-signals-tests-requests-app"
            }
          }
          attributes {
            key: "aws.local.operation"
            value {
              string_value: "InternalOperation"
            }
          }
          attributes {
            key: "aws.remote.service"
            value {
              string_value: "backend:8080"
            }
          }
          attributes {
            key: "aws.remote.operation"
            value {
              string_value: "GET /backend"
            }
          }
          attributes {
            key: "aws.span.kind"
            value {
              string_value: "CLIENT"
            }
          }
          start_time_unix_nano: 1718048103153902275
          time_unix_nano: 1718048106929937849
          count: 109
          sum: 15.14611171299984
          bucket_counts: 0, bucket_counts: 109, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
          explicit_bounds: 0, explicit_bounds: 5, explicit_bounds: 10, explicit_bounds: 25, explicit_bounds: 50, explicit_bounds: 75, explicit_bounds: 100, explicit_bounds: 250, explicit_bounds: 500, explicit_bounds: 750, explicit_bounds: 1000, explicit_bounds: 2500, explicit_bounds: 5000, explicit_bounds: 7500, explicit_bounds: 10000
          min: 0.0030788097239816811
          max: 2.0442737824274104
        }
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
      }
    }
  }

Scenario 3: With changes and patch (note 100 buckets):

metrics {
  name: "Latency"
  unit: "Milliseconds"
  exponential_histogram {
    data_points {
      attributes {
        key: "aws.local.service"
        value {
          string_value: "aws-application-signals-tests-requests-app"
        }
      }
      attributes {
        key: "aws.local.operation"
        value {
          string_value: "InternalOperation"
        }
      }
      attributes {
        key: "aws.span.kind"
        value {
          string_value: "LOCAL_ROOT"
        }
      }
      start_time_unix_nano: 1718036741890154426
      time_unix_nano: 1718036750422319159
      count: 4629
      sum: 46.733796195585931
      scale: 3
      positive {
        offset: -67, bucket_counts: 1089, bucket_counts: 526, bucket_counts: 333, bucket_counts: 201, bucket_counts: 215, bucket_counts: 189, bucket_counts: 306, bucket_counts: 327, bucket_counts: 211, bucket_counts: 214, bucket_counts: 185, bucket_counts: 151, bucket_counts: 136, bucket_counts: 125, bucket_counts: 98, bucket_counts: 71, bucket_counts: 44, bucket_counts: 21, bucket_counts: 19, bucket_counts: 8, bucket_counts: 5, bucket_counts: 5, bucket_counts: 4, bucket_counts: 3, bucket_counts: 2, bucket_counts: 1, bucket_counts: 8, bucket_counts: 7, bucket_counts: 6, bucket_counts: 10, bucket_counts: 12, bucket_counts: 9, bucket_counts: 5, bucket_counts: 6, bucket_counts: 3, bucket_counts: 5, bucket_counts: 3, bucket_counts: 7, bucket_counts: 7, bucket_counts: 4, bucket_counts: 5, bucket_counts: 3, bucket_counts: 2, bucket_counts: 3, bucket_counts: 0, bucket_counts: 4, bucket_counts: 4, bucket_counts: 0, bucket_counts: 3, bucket_counts: 5, bucket_counts: 1, bucket_counts: 1, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
      }
      negative { 
        bucket_counts: 0
      }
      min: 0.0030788097239816811
      max: 2.0442737824274104
    }
    data_points {
      attributes {
        key: "aws.local.service"
        value {
          string_value: "aws-application-signals-tests-requests-app"
        }
      }
      attributes {
        key: "aws.local.operation"
        value {
          string_value: "InternalOperation"
        }
      }
      attributes {
        key: "aws.remote.service"
        value {
          string_value: "backend:8080"
        }
      }
      attributes {
        key: "aws.remote.operation"
        value {
          string_value: "GET /backend"
        }
      }
      attributes {
        key: "aws.span.kind"
        value {
          string_value: "CLIENT"
        }
      }
      start_time_unix_nano: 1718036741890154426
      time_unix_nano: 1718036750422319159
      count: 4629
      sum: 46.733796195585931
      scale: 3
      positive {
        offset: -67, bucket_counts: 1089, bucket_counts: 526, bucket_counts: 333, bucket_counts: 201, bucket_counts: 215, bucket_counts: 189, bucket_counts: 306, bucket_counts: 327, bucket_counts: 211, bucket_counts: 214, bucket_counts: 185, bucket_counts: 151, bucket_counts: 136, bucket_counts: 125, bucket_counts: 98, bucket_counts: 71, bucket_counts: 44, bucket_counts: 21, bucket_counts: 19, bucket_counts: 8, bucket_counts: 5, bucket_counts: 5, bucket_counts: 4, bucket_counts: 3, bucket_counts: 2, bucket_counts: 1, bucket_counts: 8, bucket_counts: 7, bucket_counts: 6, bucket_counts: 10, bucket_counts: 12, bucket_counts: 9, bucket_counts: 5, bucket_counts: 6, bucket_counts: 3, bucket_counts: 5, bucket_counts: 3, bucket_counts: 7, bucket_counts: 7, bucket_counts: 4, bucket_counts: 5, bucket_counts: 3, bucket_counts: 2, bucket_counts: 3, bucket_counts: 0, bucket_counts: 4, bucket_counts: 4, bucket_counts: 0, bucket_counts: 3, bucket_counts: 5, bucket_counts: 1, bucket_counts: 1, bucket_counts: 2, bucket_counts: 2, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 1, bucket_counts: 1, bucket_counts: 2, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 2, bucket_counts: 0, bucket_counts: 1, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0, bucket_counts: 0
      }
      negative { 
        bucket_counts: 0
      }
      min: 0.0030788097239816811
      max: 2.0442737824274104
    }
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
  }
}

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thpierce thpierce requested a review from a team as a code owner June 10, 2024 19:42
@thpierce
Copy link
Contributor Author

We decided against this approach. Closing out.

@thpierce thpierce closed this Jun 13, 2024
@thpierce thpierce deleted the histogram branch June 13, 2024 18:54
@thpierce thpierce mentioned this pull request Jun 15, 2024
thpierce added a commit that referenced this pull request Jun 17, 2024
Follow up from
#204,
which had to be discarded, but changes to
`test_instrumentation_patch.py` still have value. Class had to be
significantly re-written for reasons that are made clear in the
class-level comments.

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant