Exporter time series errors do not include metric names #443

Open
jsirianni opened this issue Jun 23, 2022 · 13 comments
Labels
enhancement (New feature or request), priority: p2

Comments

@jsirianni

jsirianni commented Jun 23, 2022

When working with the Google Cloud exporter, it would be helpful if time series errors returned the name of the metric(s) being rejected by the API. A system will sometimes have hundreds of metrics, with only a subset of them being rejected. This is very difficult to track down, because the user has to open Metrics Explorer and inspect every single metric to find one with spotty data.

Error

Jun 23 14:28:14 oiq-otelcollector-1 observiq-otel-collector[1673]: 2022-06-23T14:28:14.721Z error exporterhelper/queued_retry.go:149 Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: generic_node{location:global,node_id:oiq-otelcollector-1,namespace:oiq-otelcollector-1} timeSeries[0-26,28,30-138]: custom.googleapis.com/node_cpu_seconds_total{cpu0,appfluentbit2,modeidle,hostnamefluent-bit2}; Field timeSeries[27].points[0].interval.end_time had an invalid value of \"2022-05-23T03:12:21.352793-07:00\": Data points cannot be written more than 25h10s in the past.; Field timeSeries[29].points[0].interval.end_time had an invalid value of \"2022-06-06T02:45:37.375261-07:00\": Data points cannot be written more than 25h10s in the past.\nerror details: name = Unknown desc = total_point_count:139 success_point_count:117 errors:{status:{code:9} point_count:20} errors:{status:{code:3} point_count:2}"}

This error indicates a real problem, but does not reliably include the names of the metrics being rejected: entries such as timeSeries[27] and timeSeries[29] are identified only by index.

@jsirianni
Author

We see the following error frequently as well. It is reported as duplicate time series, when really it is just multiple systems sending identical metrics without a uniquely identifying resource attribute such as host.name.

{"level":"error","ts":"2022-06-17T13:01:26.349-0400","caller":"exporterhelper/queued_retry.go:149","msg":"Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors","kind":"exporter","name":"googlecloud","error":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[1] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[3] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[5] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.\nerror details: name = Unknown desc = total_point_count:6 success_point_count:3 errors:{status:{code:3} point_count:3}","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:149\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/metrics.go:132\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry_inmemory.go:119\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:82\ngo.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2\n\t/opt/homebrew/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:69"}

@Dylan-M

Dylan-M commented Jun 28, 2022

Regarding that second one, and really all errors in general, it is extremely beneficial to know which metric is causing the error, especially in environments with hundreds of metrics.

@jsirianni
Author

I am running into this again today. The issue is clearly in my configuration, but it is impossible to narrow down. The error I am getting exceeds 65k characters.

@jsuereth added the enhancement label Aug 22, 2022
@jsuereth
Contributor

Unfortunately, this long error is an issue with the Cloud Monitoring Metrics API. The only way to solve it for this project would be to parse the error message and attempt to produce a better one. Instead we'll escalate this against the Cloud Monitoring API itself.

@Dylan-M

Dylan-M commented Sep 15, 2022

Any progress on this? I encountered it again yesterday.

@dashpole
Contributor

Sorry, still no updates. I'll check with the Cloud Monitoring API team again to see if they have any updates.

@dashpole
Contributor

dashpole commented Feb 6, 2023

Still no updates.

@Dylan-M

Dylan-M commented Feb 6, 2023

@dashpole Thanks, we're still seeing this, so it is still an important issue.

@dashpole
Contributor

Still no updates. If others run into the same issue, feel free to thumbs-up the original comment.

@Dylan-M

Dylan-M commented Aug 14, 2023

@dashpole We still see this regularly, and it actually causes a deeper issue. If the persistent queue is enabled, these failures are put into the queue and retried over and over again. This causes problems all over the place, such as repeated API failures, extreme log growth, and of course the persistent queue also growing on disk.

@dashpole
Contributor

If you are using the collector exporter, we do not recommend enabling the retry on failure setting (which we default to false). The exporter (well, really the cloud monitoring client library) has a (relatively) intelligent retry mechanism already built in, which should avoid spamming logs. This issue is just tracking making the error response more helpful.

If you are experiencing other issues, feel free to open a new issue in this repo.

@Dylan-M

Dylan-M commented Aug 14, 2023

As you say, those other issues are preventable with settings tuning. However, I was not aware that the library had a separate retry mechanism. We should probably have an internal discussion on this. Thank you for the insight.

@dashpole
Contributor

Source for the retry settings built into the client: https://github.com/googleapis/google-cloud-go/blob/main/monitoring/apiv3/metric_client.go#L63. The settings differ per API call. CreateTimeSeries does not retry.


4 participants