Subscriber maxes out CPU after exactly 60 minutes on GKE #890
Comments
same is happening to us. Thought it was a bad implementation... |
We are also seeing the same issue. We've been having a lot of strange behavior with this library. |
Thanks everyone, especially for that log output! I find this to be particularly suspicious:
I'm working on trying to build some kind of repro to pass on to the grpc team. We have only one Node library that's defaulting to the gRPC C++ bindings at this point (spanner), so there has been some hesitation about switching this one back to it by default, but it also seems to be affecting a lot of people. (Also, the C++ one requires node-gyp, which may not be available in some environments, so adding a hard dependency to it is possibly problematic.) The current thought is that it may have something to do with this library's more extensive use of streaming APIs. So progress is being made there. Thanks for the patience and debugging help! |
We are seeing a similar issue where our node process gets killed if a subscription has been idle for 60 minutes - without any logs. |
Can you clarify when you restarted the process relative to the final timestamped log message? Was the log cut off because you restarted, or was there a time gap after the last log message and before the manual restart? |
The log you posted shows that the grpc library establishes a connection and some requests are made, then exactly an hour later the connection drops and a new connection is established (this is a normal thing that the grpc library is designed to handle). Then, right after the new connection is established, the log cuts off, saying that the process was restarted. Can you see timestamps for when GKE restarted the process, and how they relate to the log lines in the file? That would tell me whether the problem I should be looking for is more like "something weird happened while reconnecting and that triggered GKE to restart it" or "the process entered an infinite loop without logging anything, and GKE eventually detected it". |
I only restarted about 20 mins after the last log message (see the graph in the original post, where the bump in CPU usage goes down again). Between the last log message in my gist and the restart there were no further log messages retrievable from kubectl or Google cloud logging and the process consumed all CPU that Kubernetes allowed. |
My observation is: after 60 minutes of repeated log output every 30 seconds all log output stops and the node process jumps to max available CPU usage for no apparent reason. The log output in the gist shows everything retrieved from that pod which formally lived about 20 minutes longer than the last line of logs. |
Just to make sure we're not missing any details, our initialization logic is like this:
const pubsub = new PubSub();
const subscription = pubsub.subscription(subscriptionName, {
flowControl: {
maxBytes: Math.floor(getenv.int('MEMORY_LIMIT_BYTES') * 0.5),
},
});
subscription.on('message', handleMessage); where |
@murgatroid99 I tried it with early versions of I have updated the gist to include the log output of the same pod with the earlier versions of |
I have published |
Whoops, I'm having trouble downloading the logs; there are megabytes of:
during the period where the process runs with maxed out CPU. So basically I can highly recommend not running (Edited to include multiple subsequent log outputs to illustrate that they are only milliseconds apart) |
@murgatroid99 I managed to download 140MB of logs from Google Cloud Logging and put the part until the infinite loop begins into the gist. |
@ctavan Thanks for the continued debug info! This actually is starting to sound like something we were talking about as a possibility for a cause. Basically, after some time (I think it might've been 30 minutes) the stream is closed, and we attempt to reconnect it. If something prevents that connection, it's not supposed to busy loop reconnecting, but that might be what it's doing. A gRPC fix might be useful if that's what it is (this guy - grpc/grpc-node#1271) but we also talked about putting in a workaround in the nodejs-pubsub library if that proves more elusive. |
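For readers unfamiliar with the distinction drawn above, here is a minimal sketch of reconnecting with exponential backoff instead of busy looping. It is not the code used by grpc-js or nodejs-pubsub; `reconnectWithBackoff`, `connect`, and every option name are hypothetical and exist only for illustration.

```javascript
// Sketch only: `connect`, `sleep` and `random` are hypothetical injected
// functions, not part of @grpc/grpc-js or @google-cloud/pubsub.
async function reconnectWithBackoff(connect, options = {}) {
  const {
    initialDelayMs = 1000, // upper bound of the first wait
    maxDelayMs = 60000,    // cap so waits do not grow without limit
    multiplier = 2,        // growth factor applied after each failure
    maxAttempts = 10,      // give up after this many attempts
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    random = Math.random,  // injectable so tests can be deterministic
  } = options;

  let delayMs = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Return as soon as one attempt succeeds.
      return await connect(attempt);
    } catch (err) {
      // Out of attempts: surface the last error to the caller.
      if (attempt === maxAttempts) throw err;
      // Wait a randomized fraction of the current bound before retrying,
      // so a persistent failure cannot turn into a tight loop.
      await sleep(random() * delayMs);
      delayMs = Math.min(delayMs * multiplier, maxDelayMs);
    }
  }
}
```

The key property is that every failed attempt is followed by a wait that grows up to a cap, so a connection stuck in a bad state costs a few retries per minute rather than all available CPU.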
I had a feeling that the problem was something like that, but I tried to mitigate a slightly different potential issue. I didn't realize it would loop that tightly and generate that many log lines, though I do want to note that in normal circumstances, when you're not encountering this bug, you'll usually get about one extra log line per request. What's happening here is that a connection has been established, but something is in an invalid state that is preventing it from actually starting a stream, so it's just repeatedly retrying. The reason it only happens on 0.6.16 and above is that it's caused by grpc/grpc-node#1251, which was supposed to fix googleapis/nodejs-firestore#890. I'll try to get a new version out on Monday that fixes this. For now, if you can force it to use 0.6.15 you'll probably be fine. |
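One possible way to force the 0.6.15 version mentioned above is Yarn's selective dependency resolutions. This fragment is a sketch under two assumptions that the comment does not state: that the project is installed with Yarn, and that the affected transitive dependency is the `@grpc/grpc-js` package.

```json
{
  "resolutions": {
    "@grpc/grpc-js": "0.6.15"
  }
}
```

The fragment would be merged into the application's own `package.json`, followed by a fresh `yarn install`. It is worth removing again once a fixed release is available, since a pin like this also blocks later fixes.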
I have now published |
@murgatroid99 I've upgraded to For the sake of completeness I have added the log output of the pod (still running after 160 minutes) to the gist. I'll leave this issue open in case you want to update the transitive dependencies of |
@ctavan I will take a look at updating the versions in nodejs-pubsub. Thank you for all the help in testing this stuff! |
Had a similar issue and switched to the older C++ |
@jperasmus Did you try out the latest version of the library lately, which pulls in |
@feywind I did try with the latest version at the time, which was after Christoph posted about 0.6.18. I will try with the latest v1 grpc-js client on one of our test environments after v2 of this library includes it. |
@jperasmus Checking back in - have you had better luck with the grpc-js client 1.x? That should be picked up here now (by way of gax updates). |
Hi @feywind thanks for following up. I actually moved to a different company where we are not using this package anymore, but before I left, I did run an experiment for almost 2 weeks with the v1 |
Okay, cool! If anyone is still running into this with |
Environment details
@google-cloud/pubsub
version:[email protected]
/[email protected]
Steps to reproduce
I have a pod that subscribes to a subscription which is 100% idle (i.e. no messages published on that topic) during the observation period.
Exactly 60 minutes after the pod started it eats up all available CPU until the pod is deleted manually to force a restart.
This behavior does not appear when using the C++ gRPC bindings as suggested in #850 (comment)
I have also enabled debug output through
I've put the full debug output (as fetched from Google Cloud Logging) of the entire lifetime of the pod in this gist.
There is no further output after the last line even though the pod kept running for a while, as can be seen from the graph above: the bump in CPU usage is the time where the pod eats up all available CPU and doesn't seem to log anything. (Times in the logs correspond to the times in the graph above with 1h timezone difference, so 10:51 in the logs = 11:51 in the graph).
I can reproduce this problem with any of my subscriber pods.
I believe it is very likely that this problem also falls into the category of #868.