[metricbeat] [gcp] group metrics by dimensions #36682
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
Force-pushed from a6937fe to c083a9d (Compare)
@gpop63, I can't add you as a reviewer since you're the original author of this PR, but please consider yourself a reviewer! 😇
@kaiyan-sheng, I learned you are reviewing the PR under the radar! 😄 You suggest delaying collection to the longest ingest delay to avoid adding the batch ID. It makes sense, and it's something we evaluated during development. Since some metrics have a 5-minute delay, we opted for the batch ID, but we're open to taking a different trade-off. Please add your thoughts on this topic here!
Sorry for the delay on reviewing this PR! I was thinking of finding the largest ingest delay among all metrics and then using that largest delay to calculate startTime and endTime in the `getTimeIntervalAligner` function. This way we should get all the data points in a single collection and avoid using `metric_names_fingerprint`.
```diff
--- a/x-pack/metricbeat/module/gcp/metrics/metrics_requester.go
+++ b/x-pack/metricbeat/module/gcp/metrics/metrics_requester.go
@@ -79,6 +79,14 @@ func (r *metricsRequester) Metrics(ctx context.Context, serviceName string, alig
 	var wg sync.WaitGroup
 	results := make([]timeSeriesWithAligner, 0)

+	largestDelay := 0 * time.Second
+	for _, meta := range metricsToCollect {
+		metricMeta := meta
+		if metricMeta.ingestDelay > largestDelay {
+			largestDelay = metricMeta.ingestDelay
+		}
+	}
+
 	for mt, meta := range metricsToCollect {
 		wg.Add(1)
@@ -87,7 +95,7 @@ func (r *metricsRequester) Metrics(ctx context.Context, serviceName string, alig
 			defer wg.Done()
 			r.logger.Debugf("For metricType %s, metricMeta = %d, aligner = %s", mt, metricMeta, aligner)
-			interval, aligner := getTimeIntervalAligner(metricMeta.ingestDelay, metricMeta.samplePeriod, r.config.period, aligner)
+			interval, aligner := getTimeIntervalAligner(largestDelay, metricMeta.samplePeriod, r.config.period, aligner)
```
```diff
--- a/x-pack/metricbeat/module/gcp/metrics/timeseries.go
+++ b/x-pack/metricbeat/module/gcp/metrics/timeseries.go
@@ -9,8 +9,6 @@ import (
 	"crypto/sha256"
 	"encoding/hex"
 	"fmt"
-	"strings"
-
 	"github.com/elastic/beats/v7/metricbeat/mb"
 	"github.com/elastic/beats/v7/x-pack/metricbeat/module/gcp"
 	"github.com/elastic/elastic-agent-libs/mapstr"
@@ -145,9 +143,9 @@ func createEventsFromGroups(service string, groups map[string][]KeyValuePoint) [
 	// Hashes metric names string using SHA-256 to always have
 	// a constant length value and avoid overflowing the
 	// current TSDB dimension field limit (1024).
-	metricNamesHash := hash(strings.Join(metricNames, ","))
-
-	_, _ = event.RootFields.Put("event.metric_names_hash", metricNamesHash)
+	//metricNamesHash := hash(strings.Join(metricNames, ","))
+	//
+	//_, _ = event.ModuleFields.Put("metric_names_fingerprint", metricNamesHash)
```
@gpop63 tried out the `largestDelay` approach and it seems to work for TSDB (thank you so much for testing!):

All 39971 documents taken from index .ds-metrics-gcp.gke-default-2023.10.31-000001 were successfully placed to index tsdb-index-enabled.
All 1792 documents taken from index .ds-metrics-gcp.compute-default-2023.10.31-000001 were successfully placed to index tsdb-index-enabled.

Just trying to figure out a way of fixing this issue without introducing a new field 🙂
I think all the changes are sound and reasonable. As I'm not working on this actively and I have little familiarity with TSDB I'm just commenting with my 👍, not adding my approval.
Yep, I get it. As said, this was one of the options on the table 😇 The one aspect that made me opt for adding the metric names field is that Elasticsearch will use the same approach to handle this issue transparently for users; some other metricsets—like Prometheus—have the same problem of collecting metrics for the same data point over multiple collections. However, GCP reliably gives us the metadata (delay and sampling timings), and that provides an opportunity other metricsets do not have. Since time intervals are calculated on a per-metric basis (using the ingest delay and sampling period), I expect the same results. And @gpop63's tests confirm this is the case. If we all agree it's the best option, I will apply the change to the branch and rebuild a custom agent so we can run a final round of tests on all the metricsets to double-check. @gpop63, WDYT?
@kaiyan-sheng I see, and I really appreciate it! Multiple perspectives and evaluation criteria constantly improve the final result!
The dimensionsKey contains all the dimension field values we want to use to group the time series. We need to add the timestamp to the key so that we only group together time series with the same timestamp.
# Update grouping key

The dimensionsKey contains all the dimension field values we want to use to group the time series. We need to add the timestamp to the key so that we only group time series with the same timestamp.

# Add `event.created` field

We need to add an extra dimension to avoid data loss on TSDB, since GCP metrics with the same @timestamp become visible with different "ingest delay" values. For the full context, read elastic/integrations#6568 (comment)

# Drop ID() function

Remove the `ID()` function from the Metadata Collector. Since we are unifying the metric grouping logic for all metric types, we don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables to make their role and purpose clearer. We can remove this part if it does not improve clarity.
We cannot use the `event.created` field because TSDB does not allow the `date` field type for dimensions.
The `event.batch_id` field, with its random values, is a poor choice as a dimension field for a time series database: it would create a new time series on each iteration, which is terrible. The `event.metric_names` field keeps the values to a recurring set of field names, all having the same ingest delay.
A single GCP metric name can be quite long, for example `subscription.streaming_pull_mod_ack_deadline_message_operation.count`, and we may collect enough metrics to overflow the current length limit for dimension fields. We hash the metric names value using SHA-256 to get a constant-length value. I'm not 100% sure SHA-256 is the best option for this use case; we don't have strong cryptographic needs, so we can probably use a simpler algorithm to save computing cycles while getting a shorter hash value. We also drop `event.metric_names` because it is no longer needed.
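As a sketch, the hashing step looks roughly like this. Sorting the names first is an assumption added here to keep the fingerprint stable across collections; the PR's actual code joins the names as collected:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// metricNamesFingerprint hashes the joined metric names so the value
// always has a fixed length (64 hex characters), well under the
// 1024-character TSDB dimension field limit.
func metricNamesFingerprint(names []string) string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return hex.EncodeToString(sum[:])
}

func main() {
	fp := metricNamesFingerprint([]string{
		"subscription.streaming_pull_mod_ack_deadline_message_operation.count",
		"subscription.pull_message_operation.count",
	})
	fmt.Println(len(fp)) // → 64
}
```

However many (and however long) the metric names are, the dimension value stays 64 characters.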
Replace it with `gcp.metric_names_fingerprint` to align this metricset with what we are doing on other metricsets (for example, the prometheus metricset).
Force-pushed from ff3be23 to f655e50 (Compare)
The metricset now collects metric values using the largest ingest delay interval instead of each metric's individual ingest delay. By using the largest ingest delay, the metricset gets all the metric values for each data point in a single collection. Drop the `gcp.metric_names_fingerprint` field because it's no longer needed.
Force-pushed from f655e50 to 0d984da (Compare)
All tests seem to pass for the metrics data streams we target. Tests: gke
compute
redis
pubsub
cloudrun
dataproc
storage
loadbalancing
cloudsql postgresql
cloudsql mysql
cloudsql sqlserver
firestore
* group metrics

* Add timestamp to the grouping key

The dimensionsKey contains all dimension field values we want to use to group the time series. We need to add the timestamp to the key, so we only group together time series with the same timestamp.

* Update grouping key, add event.created, drop ID()

# Update grouping key

The dimensionsKey contains all dimension field values we want to use to group the time series. We need to add the timestamp to the key, so we only group time series with the same timestamp.

# Drop ID() function

Remove the `ID()` function from the Metadata Collector. Since we are unifying the metric grouping logic for all metric types, we don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables with the purpose of making their role and purpose more clear. We can remove this part if it does not improve clarity.

---------

Co-authored-by: Maurizio Branca <[email protected]>
Why
The existing metrics grouping logic does not play well with TSDB, and we need to adjust the grouping logic to avoid data loss when we enable TSDB.
What
Overview

- Unify the grouping logic
- Add a new `event.batch_id` field

Unify the grouping logic

Unify the grouping logic for all metricsets to group metrics by @timestamp and a selection of ECS and label fields. Here's the complete list of the fields we are using for grouping:

@timestamp
cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region
Labels fields

We can use these fields as dimensions in the TSDB configuration.

Add a new `event.batch_id` field

Each GCP metric has a variable ingest delay. For example, container memory usage is available immediately, with a zero ingest delay; container CPU usage, instead, is available with a 2-minute ingest delay. So, collecting memory and CPU usage for the same timestamp requires multiple collections (by default, the metricset collects metrics every 60 seconds).

Metrics like memory and CPU usage have identical dimension values for the same container. However, the metricset can't group them, since it collected them over different collections due to the ingest delay. If unhandled, this situation can cause data loss when TSDB is enabled. More details are available in a GitHub issue comment.

To address this problem, the metricset adds a new `event.batch_id` field that we can use as a dimension. The metricset generates a UUID as the batch ID during each collection and stores it in the `event.batch_id` field.

Renaming

I changed the names of some structs, functions, and variables. I intended to clarify their role and purpose, but I may be biased. If they don't improve clarity, I will pick different names or revert to the original ones.
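A sketch of the per-collection batch ID described above. The PR generates a UUID; here a random hex ID stands in to keep the example dependency-free, and `newBatchID` is a hypothetical name:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newBatchID returns a random ID that would be shared by every event
// produced in one collection, so events from the same batch can be
// grouped together even though their metrics arrived with different
// ingest delays.
func newBatchID() (string, error) {
	b := make([]byte, 16) // 128 bits of randomness, like a UUID
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	id, err := newBatchID()
	if err != nil {
		panic(err)
	}
	// Every event in this collection would carry event.batch_id = id.
	fmt.Println(len(id)) // → 32
}
```

Note that, as discussed in the review above, a fresh random ID per collection creates a new time series on every iteration, which is why the batch ID was ultimately dropped in favor of the largest-ingest-delay approach.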
Checklist
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist
Related issues