Backlog spikes reported every hour #850
Comments
@pjm17971 Hi Peter, we've been seeing a number of issues lately that involved upgrading the nodejs-pubsub version from one of the 0.x releases to the latest one. The common suggestion I've seen is to try switching to the C++ gRPC bindings, so you might give that a try.
(Or however you instantiate the PubSub object, you can provide I'll look at this some more, but I wanted to give you something quick to try!
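For readers trying the suggestion above, here is a minimal sketch of handing the C++ gRPC bindings to the client. It assumes the `grpc` package is installed next to `@google-cloud/pubsub` and that the client constructor accepts a `grpc` option. The helper name `createPubSubWithGrpc` is hypothetical and only exists so the construction can be exercised without real credentials.

```javascript
// Hypothetical helper: builds a Pub/Sub client that uses the supplied gRPC
// implementation instead of the default pure-JavaScript one.
function createPubSubWithGrpc(PubSubCtor, grpcImpl, options = {}) {
  // Assumption: the client constructor takes the implementation as `grpc`.
  return new PubSubCtor({ ...options, grpc: grpcImpl });
}

// Typical wiring (needs `npm install @google-cloud/pubsub grpc`):
//   const { PubSub } = require('@google-cloud/pubsub');
//   const grpc = require('grpc');
//   const pubsub = createPubSubWithGrpc(PubSub, grpc);
```

Passing the implementation in explicitly makes it easy to switch back later by dropping the second argument's source package, without touching the rest of the worker.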
We have seen similar behaviour when using the
@feywind Thanks for the info. If we give it a try I'll report back here if it worked or not. Is there an issue on the grpc javascript side that we can track so we'll know when to switch this back out?
@pjm17971 If you have a chance, can you try turning on the gRPC debug flags with these environment variables?
GRPC_TRACE=all
Specifically with the configuration described as problematic in the original post, if possible. Thanks!
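As a rough sketch of the request above, the tracing variables can be injected when the worker process is launched. Only `GRPC_TRACE=all` survives in the comment; `GRPC_VERBOSITY=DEBUG` is added here as an assumption based on gRPC's documented environment variables. The helper name `withGrpcTracing` and the `worker.js` entry point are hypothetical.

```javascript
// Returns a copy of `env` with gRPC debug tracing switched on.
// The input object is left untouched.
function withGrpcTracing(env = process.env) {
  return {
    ...env,
    GRPC_TRACE: 'all',       // taken from the comment above
    GRPC_VERBOSITY: 'DEBUG', // assumption: not present in the surviving text
  };
}

// Usage sketch: start the worker as a child process with tracing enabled.
//   const { spawn } = require('child_process');
//   spawn(process.execPath, ['worker.js'], {
//     env: withGrpcTracing(),
//     stdio: 'inherit',
//   });
```

Setting the variables before the process starts, rather than assigning `process.env` at runtime, is the safer choice, since the gRPC library may read them when it is first loaded.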
Just chiming in to let you know that I'm experiencing precisely the same behavior as @pjm17971 with a couple of subscribers running on GKE. Here are the versions and the respective behavior that I have been observing:
Providing the C++ That said, with the
@pjm17971 have you been observing these errors as well? I was not able to reliably correlate these crashes with the spikes in unacked messages though. Also, according to this grpc issue the crash problem might have been fixed in
I just realized that in the original issue description the spikes appear exactly on each full hour. This is not the case in my setup. In my setup the spikes appear at irregular intervals, around 2-5 times per day.
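To tell the two patterns apart (spikes on the full hour versus irregular spikes), one option is to measure how close each spike timestamp falls to an hour boundary. This is only an illustrative sketch: the function names, the two-minute tolerance, and the idea of exporting spike timestamps as epoch milliseconds are all assumptions, not something taken from the thread.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Distance in milliseconds from `timestampMs` to the nearest full hour.
// Works on epoch milliseconds, so it assumes a whole-hour timezone offset.
function distanceToFullHour(timestampMs) {
  const offset = ((timestampMs % HOUR_MS) + HOUR_MS) % HOUR_MS;
  return Math.min(offset, HOUR_MS - offset);
}

// Fraction of spikes that fall within `toleranceMs` of a full hour.
// Returns 0 for an empty list.
function fractionOnTheHour(spikeTimestampsMs, toleranceMs = 2 * 60 * 1000) {
  if (spikeTimestampsMs.length === 0) return 0;
  const hits = spikeTimestampsMs.filter(
    (t) => distanceToFullHour(t) <= toleranceMs
  );
  return hits.length / spikeTimestampsMs.length;
}
```

A value close to 1 would match the hourly pattern from the original report, while a low value would match the irregular pattern described in this comment.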
Just a quick confirmation that the issue is gone for me since switching back to the C++ bindings. Need to wait for #890 until I can try out
@feywind Sorry I didn't reply, but kind of a non-update from my end: we went back to the C++ gRPC implementation and our process behaved fine. We couldn't easily reproduce the problem in a non-production environment, so we did not attempt to deploy again with the bad library version. We can try a new version of grpc-js again down the road if a fix is believed to be in place, and give feedback on whether it works for us. Sounds like #890 kind of has this problem covered.
Thanks, everyone! There's a PR out now to update to the grpc-js that should fix the issues in #890. I'll update here again once that's merged, and an update to nodejs-pubsub should then pull it in.
I have removed the C++ bindings again and upgraded to So for the time being I conclude that
is finally a stable setup for my use case again! @pjm17971 I think the above versions are a setup that is worth trying out.
Thanks for the update! I'm working on getting the default grpc-js version pushed up to 0.6.18. There might be a
We run many workers on GKE that essentially pull messages from a Pubsub subscription at a rate of about 120k/min, do some processing and write the result into Bigtable. As part of a recent change we upgraded this Pubsub client library from 0.29.1 to 1.2 and immediately started to see alerts.
What we started to see after this upgrade was spikes in the reported backlog (and oldest unacked message age). These happened hourly. However, our service appeared not to suffer, and continued to output its product at a steady rate.
Here is an overview of this upgrade as seen in Stackdriver (running Pubsub v1.2 highlighted in red, then after 10am Friday we reverted JUST the pubsub version and the process returned to normal):
Zooming into approx Friday 12am until noon, and showing backlog at the top and oldest message at the bottom:
It is pretty clearly something that happens every hour.
I know there's another GitHub issue for memory spikes, but at least as far as we can tell that's not the case for us. In fact, I don't think we actually saw a real impact on processing output. This assessment is based on: 1) we didn't see lag in our downstream client, which is usually the case with actual backlogs, and 2) we didn't see an increase in worker CPU when the backlog recovered. The biggest problem is that we use the alerts on this subscription as our main indicator that the process may be experiencing problems for some reason.
Environment details
@google-cloud/pubsub
version: 1.2
Steps to reproduce
Not sure we do anything special. Each worker instance creates a simple client which processes messages. We run this on GKE, with one node instance per pod. Approximately 64 workers are all pulling from the same subscription. We generally just ack messages regardless of successful processing because in this application it's ok to just drop them (a separate process will take care of it).
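The worker pattern described above can be sketched roughly as follows. This is not the reporter's actual code: the helper name `attachAckAlwaysHandler`, the subscription name, and `writeToBigtable` are hypothetical. It relies only on the subscription being an event emitter that delivers `message` events whose payload has an `ack()` method.

```javascript
// Hypothetical helper: processes each message and acks it afterwards,
// whether or not processing succeeded (dropped work is handled elsewhere).
function attachAckAlwaysHandler(subscription, processMessage, onError = console.error) {
  subscription.on('message', async (message) => {
    try {
      await processMessage(message);
    } catch (err) {
      // Report the failure but do not nack; the message is dropped on purpose.
      onError(err);
    } finally {
      message.ack();
    }
  });
  // Surface stream-level errors instead of letting them crash the process.
  subscription.on('error', onError);
  return subscription;
}

// Usage sketch (names are placeholders):
//   const { PubSub } = require('@google-cloud/pubsub');
//   const subscription = new PubSub().subscription('my-subscription');
//   attachAckAlwaysHandler(subscription, writeToBigtable);
```

Acking in `finally` keeps the "ack regardless of outcome" behaviour in one place, so a processing failure never leaves a message outstanding.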
Hope this is helpful. Let me know if I can provide any additional data.
Thanks!