Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split EMF log with larger than 100 buckets. #242

Merged
merged 11 commits into from
Nov 20, 2024

Conversation

zzhlogin
Copy link

@zzhlogin zzhlogin commented Oct 18, 2024

Description:
In Application Signals, we utilize Base2 Exponential Bucket Histogram to aggregate and send latency data, with a default max number of buckets 160. In EMF exporter, these buckets are mapped to "Target members" in EMF log entries.
However, CloudWatch EMF logs impose a limit of 100 target members, beyond which EMF processors will mark the record as invalid, resulting in missing metrics and customer-facing errors reported via the EMFValidationErrors metric.

In this PR, we split histograms to two sub EMF logs with the following change:

  1. Add an extra attribute metricIndex to groupedMetricMetadata : Current EMF exporter aggregate incoming metrics into groupedMetrics before converting to log events, where the groupKey is generated based on the groupedMetricMetadata including: metric namespace, timestamp, log group name, etc. After splitting, the two new metrics will share exactly the same key. Adding an extra metric metadata for key generation can prevent the second metric from dropping.
  2. If the total buckets exceed 100, the exponential histogram metric are split into into multiple data points as needed,
    each containing a maximum of 100 buckets, to comply with CloudWatch EMF log constraints.
    For each split data point:
  • Min and Max values are recalculated based on the bucket boundary within that specific split.
  • Sum is only assigned to the first split to ensure the total sum of the datapoints after aggregation is correct.
  • Count is accumulated based on the bucket counts within each split.

Testing:
The change is tested by generating traffic with more than 100 buckets, and the emf log with larger than 100 values are eliminated after the change:
Screenshot 2024-10-18 at 3 04 45 PM

Also in E2E test, logging the Min/Max/Count to confirm the behaviour is expected.

@zzhlogin zzhlogin requested a review from mxiamxia as a code owner October 18, 2024 20:52
Copy link

@bjrara bjrara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. Some high level comments:

  1. Intuitively, I think that the first split would contain as many datapoints as possible. Let's say in the future, if the EMF exporter is gonna support "transactional exporting", i.e. the second batch will only export if the first one succeeds. In this case, we want the majority to be exported in the first to reduce the risk of failures in the second. Switching the number should not impact how many data you would traverse in the splitting logics. As for now, it's not a big concern because each exporting is independent.
  2. Shall we also include histogramDataPointSlice?
  3. The test cases don't seem to cover enough use cases, e.g. with positive empty, with negative empty, with 0 empty, with non-splitting validation when the total size is smaller than 100. Please add more cases to verify the logics.
  4. Regarding the only one test case you added, having a large array base for result validation is good, but how did we confirm that the expectedDatapoint is what we really want? It's hard to judge the accuracy.

exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/metric_translator.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint_test.go Outdated Show resolved Hide resolved
@zzhlogin
Copy link
Author

Thanks for the contribution. Some high level comments:

  1. Intuitively, I think that the first split would contain as many datapoints as possible. Let's say in the future, if the EMF exporter is gonna support "transactional exporting", i.e. the second batch will only export if the first one succeeds. In this case, we want the majority to be exported in the first to reduce the risk of failures in the second. Switching the number should not impact how many data you would traverse in the splitting logics. As for now, it's not a big concern because each exporting is independent.

Updated the code to set the first bucket has in max on 100 values, and return the datapoint in the right order to avoid confusion.

  1. Shall we also include histogramDataPointSlice?

The histogramDataPointSlice.CalculateDeltaDatapoints only return metrics Stats, including count, sum, max, min, there is no values and counts returned, the members won't exceed 100.

  1. The test cases don't seem to cover enough use cases, e.g. with positive empty, with negative empty, with 0 empty, with non-splitting validation when the total size is smaller than 100. Please add more cases to verify the logics.

Added all these test cases. (The verification principle follows the explanation in 4. below)

  1. Regarding the only one test case you added, having a large array base for result validation is good, but how did we confirm that the expectedDatapoint is what we really want? It's hard to judge the accuracy.

In the test case, we generate in total of 121 different buckets, with different number of values counts. The total count is 3662 (the result is calculated locally by iterating over all buckets), and the sum is hard coded to 1000 (to verify we directly use metric sum in the first split).

  • For the first split: Max is retrieved from Metric.Max value, and the Min is retrieved from the 100th bucket lower boundary (I confirmed by locally print out the 100th bucket and retrieve the value). It has exactly 100 values it is checked in expectedDatapoint1: value and Counts.
  • Similar For second split, Min is retrieved from Metric.Min value, Max is the same as first split's Min. It has exactly 21 values it is checked in expectedDatapoint2: value and Counts.

exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
jefchien
jefchien previously approved these changes Nov 8, 2024
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/metric_translator.go Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
exporter/awsemfexporter/datapoint.go Outdated Show resolved Hide resolved
Copy link

@vastin vastin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

jefchien
jefchien previously approved these changes Nov 11, 2024
jefchien
jefchien previously approved these changes Nov 12, 2024
mxiamxia
mxiamxia previously approved these changes Nov 12, 2024
@zzhlogin zzhlogin dismissed stale reviews from mxiamxia and jefchien via 8aa6806 November 13, 2024 04:38
@jefchien jefchien merged commit 9c1ddd2 into amazon-contributing:aws-cwa-dev Nov 20, 2024
141 of 146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants