[metricbeat] [gcp] group metrics by dimensions #36682
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
Force-pushed from a6937fe to c083a9d (Compare)
@gpop63, I can't add you as a reviewer since you're the original author of this PR, but please consider yourself a reviewer! 😇
@kaiyan-sheng, I learned you are reviewing the PR under the radar! 😄 You suggest delaying collection to the longest ingest delay to avoid adding the batch ID. It makes sense, and it's something we evaluated during development. Since some metrics have a 5-minute delay, we opted for the batch ID, but we're open to taking a different trade-off. Please add your thoughts on this topic here!
Sorry for the delay on reviewing this PR! I was thinking of finding the largest ingest delay among all metrics and then using that largest delay to calculate startTime and endTime in the `getTimeIntervalAligner` function. This way we should get all the data points in a single collection and avoid using `metric_names_fingerprint`.
```diff
--- a/x-pack/metricbeat/module/gcp/metrics/metrics_requester.go
+++ b/x-pack/metricbeat/module/gcp/metrics/metrics_requester.go
@@ -79,6 +79,14 @@ func (r *metricsRequester) Metrics(ctx context.Context, serviceName string, alig
 	var wg sync.WaitGroup
 	results := make([]timeSeriesWithAligner, 0)

+	largestDelay := 0 * time.Second
+	for _, meta := range metricsToCollect {
+		metricMeta := meta
+		if metricMeta.ingestDelay > largestDelay {
+			largestDelay = metricMeta.ingestDelay
+		}
+	}
+
 	for mt, meta := range metricsToCollect {
 		wg.Add(1)
@@ -87,7 +95,7 @@ func (r *metricsRequester) Metrics(ctx context.Context, serviceName string, alig
 			defer wg.Done()
 			r.logger.Debugf("For metricType %s, metricMeta = %d, aligner = %s", mt, metricMeta, aligner)
-			interval, aligner := getTimeIntervalAligner(metricMeta.ingestDelay, metricMeta.samplePeriod, r.config.period, aligner)
+			interval, aligner := getTimeIntervalAligner(largestDelay, metricMeta.samplePeriod, r.config.period, aligner)
```
```diff
--- a/x-pack/metricbeat/module/gcp/metrics/timeseries.go
+++ b/x-pack/metricbeat/module/gcp/metrics/timeseries.go
@@ -9,8 +9,6 @@ import (
 	"crypto/sha256"
 	"encoding/hex"
 	"fmt"
-	"strings"
-
 	"github.com/elastic/beats/v7/metricbeat/mb"
 	"github.com/elastic/beats/v7/x-pack/metricbeat/module/gcp"
 	"github.com/elastic/elastic-agent-libs/mapstr"
@@ -145,9 +143,9 @@ func createEventsFromGroups(service string, groups map[string][]KeyValuePoint) [
 	// Hashes metric names string using SHA-256 to always have
 	// a constant length value and avoid overflowing the
 	// current TSDB dimension field limit (1024).
-	metricNamesHash := hash(strings.Join(metricNames, ","))
-
-	_, _ = event.RootFields.Put("event.metric_names_hash", metricNamesHash)
+	//metricNamesHash := hash(strings.Join(metricNames, ","))
+	//
+	//_, _ = event.ModuleFields.Put("metric_names_fingerprint", metricNamesHash)
```
@gpop63 tried out the `largestDelay` approach and it seems to work for TSDB (thank you so much for testing!):

All 39971 documents taken from index .ds-metrics-gcp.gke-default-2023.10.31-000001 were successfully placed to index tsdb-index-enabled.
All 1792 documents taken from index .ds-metrics-gcp.compute-default-2023.10.31-000001 were successfully placed to index tsdb-index-enabled.

Just trying to figure out a way of fixing this issue without introducing a new field 🙂
I think all the changes are sound and reasonable. As I'm not working on this actively and I have little familiarity with TSDB I'm just commenting with my 👍, not adding my approval.
Yep, I get it. As said, this was one of the options on the table 😇 The one aspect that made me opt for adding the metric names field is that Elasticsearch will use the same approach to handle this issue transparently for users; some other metricsets—like Prometheus—have the same problem of collecting metrics for the same data point over multiple collections. However, GCP reliably gives us the metadata (delay and sampling timings), and that provides an opportunity other metricsets do not have. Since time intervals are calculated on a per-metric basis (using the ingest delay and sampling period), I expect the same results. And @gpop63's tests confirm this is the case. If we all agree it's the best option, I will apply the change to the branch and rebuild a custom agent so we can run a final round of tests on all the metricsets to double-check. @gpop63, WDYT?
@kaiyan-sheng I see, and I really appreciate it! Multiple perspectives and evaluation criteria constantly improve the final result!
The dimensionsKey contains all the dimension field values we want to use to group the time series. We need to add the timestamp to the key so that we only group together time series with the same timestamp.
# Update grouping key

The dimensionsKey contains all the dimension field values we want to use to group the time series. We need to add the timestamp to the key so that we only group time series with the same timestamp.

# Add `event.created` field

We need to add an extra dimension to avoid data loss on TSDB, since GCP metrics with the same @timestamp become visible with different "ingest delay" values. For the full context, read elastic/integrations#6568 (comment)

# Drop ID() function

Remove the `ID()` function from the Metadata Collector. Since we are unifying the metric grouping logic for all metric types, we don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables to make their role and purpose clearer. We can remove this part if it does not improve clarity.
We cannot use the `event.created` field because TSDB does not allow the `date` field type for dimensions.
The `event.batch_id` field, with its random values, is a poor choice as a dimension field for a time series database: it would create a new time series on each iteration, which is terrible. The `event.metric_names` field keeps the values to a recurring set of field names, all having the same ingest delay.
A single GCP metric name can be quite long, for example `subscription.streaming_pull_mod_ack_deadline_message_operation.count`, and we may collect enough metrics to overflow the current length limit for dimension fields. We hash the metric names value using SHA-256 to get a constant-length value. I'm not 100% sure SHA-256 is the best option for this use case; we don't have strong cryptographic needs, so we can probably use a simpler algorithm to save computing cycles while getting a shorter hash value. We also drop `event.metric_names` because it is no longer needed.
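As a sketch, the hashing step looks roughly like this. Sorting the names first is an assumption added here to keep the fingerprint stable across collections; the PR's actual code joins the names as collected:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// metricNamesFingerprint hashes the joined metric names so the value
// always has a fixed length (64 hex characters), well under the
// 1024-character TSDB dimension field limit.
func metricNamesFingerprint(names []string) string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return hex.EncodeToString(sum[:])
}

func main() {
	fp := metricNamesFingerprint([]string{
		"subscription.streaming_pull_mod_ack_deadline_message_operation.count",
		"subscription.pull_message_operation.count",
	})
	fmt.Println(len(fp)) // → 64
}
```

However many (and however long) the metric names are, the dimension value stays 64 characters.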
Replace it with `gcp.metric_names_fingerprint` to align this metricset with what we are doing on other metricsets (for example, the prometheus metricset).
Force-pushed from ff3be23 to f655e50 (Compare)
The metricset now collects metric values using the largest ingest delay interval instead of each metric's individual ingest delay. By using the largest ingest delay, the metricset gets all the metric values for each data point in a single collection. Drop the `gcp.metric_names_fingerprint` field because it's no longer needed.
Force-pushed from f655e50 to 0d984da (Compare)
All tests seem to pass for the metrics data streams we target. Tests: gke
compute
redis
pubsub
cloudrun
dataproc
storage
loadbalancing
cloudsql postgresql
cloudsql mysql
cloudsql sqlserver
firestore
* group metrics

* Add timestamp to the grouping key

The dimensionsKey contains all dimension field values we want to use to group the time series. We need to add the timestamp to the key, so we only group together time series with the same timestamp.

* Update grouping key, add event.created, drop ID()

# Update grouping key

The dimensionsKey contains all dimension field values we want to use to group the time series. We need to add the timestamp to the key, so we only group time series with the same timestamp.

# Drop ID() function

Remove the `ID()` function from the Metadata Collector. Since we are unifying the metric grouping logic for all metric types, we don't need to keep the `ID()` function anymore.

# Renaming

I also renamed some structs, functions, and variables with the purpose of making their role and purpose more clear. We can remove this part if it does not improve clarity.

---------

Co-authored-by: Maurizio Branca <[email protected]>
Why
The existing metrics grouping logic does not play well with TSDB, and we need to adjust the grouping logic to avoid data loss when we enable TSDB.
What
Overview

- Unify the grouping logic
- Add a new `event.batch_id` field

Unify the grouping logic

Unify the grouping logic for all metricsets to group metrics by @timestamp and a selection of ECS and label fields. Here's the complete list of the fields we are using for grouping:

@timestamp
cloud.account.id
cloud.availability_zone
cloud.instance.id
cloud.provider
cloud.region
Labels fields

We can use these fields as dimensions in the TSDB configuration.

Add a new `event.batch_id` field

Each GCP metric has a variable ingest delay. For example, container memory usage is available immediately, with a zero ingest delay; container CPU usage, instead, is available with a 2-minute ingest delay. So, collecting memory and CPU usage for the same timestamp requires multiple collections (by default, the metricset collects metrics every 60 seconds).

Metrics like memory and CPU usage have identical dimension values for the same container. However, the metricset can't group them, since it collected them over different collections due to the ingest delay. If unhandled, this situation can cause data loss when TSDB is enabled. More details are available in a GitHub issue comment.

To address this problem, the metricset adds a new `event.batch_id` field that we can use as a dimension. The metricset generates a UUID as the batch ID during each collection and stores it in the `event.batch_id` field.

Renaming

I changed the names of some structs, functions, and variables. I intended to clarify their role and purpose, but I may be biased. If they don't improve clarity, I will pick different names or revert to the original ones.
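A sketch of the per-collection batch ID described above. The PR generates a UUID; here a random hex ID stands in to keep the example dependency-free, and `newBatchID` is a hypothetical name:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newBatchID returns a random ID that would be shared by every event
// produced in one collection, so events from the same batch can be
// grouped together even though their metrics arrived with different
// ingest delays.
func newBatchID() (string, error) {
	b := make([]byte, 16) // 128 bits of randomness, like a UUID
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	id, err := newBatchID()
	if err != nil {
		panic(err)
	}
	// Every event in this collection would carry event.batch_id = id.
	fmt.Println(len(id)) // → 32
}
```

Note that, as discussed in the review above, a fresh random ID per collection creates a new time series on every iteration, which is why the batch ID was ultimately dropped in favor of the largest-ingest-delay approach.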
Checklist
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist
Related issues