Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@grpc/grpc-js channel and memory leak (with a minimal reproduction) #1085

Closed
merlinnot opened this issue Oct 24, 2019 · 22 comments
Closed

@grpc/grpc-js channel and memory leak (with a minimal reproduction) #1085

merlinnot opened this issue Oct 24, 2019 · 22 comments
Assignees

Comments

@merlinnot
Copy link
Contributor

Problem description

@grpc/grpc-js v0.6.9 seemingly leaks both memory and open network connections.

I did my best to provide a minimal possible reproduction by using gRPC directly, without any other layers on top, such as Google SDKs. Total size of the reproduction is ~50 LOC of readable TypeScript and uses @grpc/grpc-js, @grpc/proto-loader, ramda and path modules and packages exclusively. I hope it will make it easier to pinpoint and resolve an issue.

I'd appreciate if you could confirm that the issue is reproducible by following the steps below.

Reproduction steps

  1. Download reproduction-grpc.zip.
  2. Unzip the file.
  3. Navigate to the extracted directory.
  4. Run npm install.
  5. Run Firestore Emulator, either through gcloud or using a Docker container.
  6. Adjust the hostname and port in configuration.ts file if necessary.
  7. Run node -r ts-node/register --expose-gc index.ts.
  8. Heap size will be logged, see it steadily growing. Garbage collection is forced before every log statement to represent accurate results.
  9. Run lsof -i tcp:4500 -n -P, adjust the port in this script to the one which Firestore Emulator is running on. A list of open sockets will be displayed.
  10. Wait 10 seconds, repeat step 9. Wait 10 seconds, repeat step 9. Wait 10 seconds, ... A steadily increasing number of open sockets can be observed.

Environment

  • OS name, version and architecture: Darwin Kernel Version 19.0.0: Wed Sep 25 21:13:50 PDT 2019; root:xnu-6153.11.26~4/RELEASE_X86_64 x86_64
  • Node version: v10.16.3
  • Node installation method: nvm
  • If applicable, compiler version: -
  • Package name and version: @grpc/grpc-js v0.6.9

Additional context

This might possibly be a root cause of googleapis/nodejs-firestore#768.

@murgatroid99
Copy link
Member

When I run this, I do not see the many sockets you mentioned, and heap snapshots indicate that only 2 subchannels are ever created, which is the correct behavior. The behavior I see here matches the results in my comment on the linked issue: googleapis/nodejs-firestore#768 (comment). The primary cause of the leak is loading the proto file again for every request.

@merlinnot
Copy link
Contributor Author

Here's a result I see on my system after running it for ~1 minute:

Screen Shot 2019-10-24 at 21 43 47

As you can see, process ID is exactly the same. I understand that the behavior might be different depending in your environment. I'll check if I'll be able to reproduce it in a CI, so you would see it in an environment that can be easily recreated.

@murgatroid99
Copy link
Member

I don't actually see why that behavior would be different depending on the environment. None of the subchannel reuse logic depends on anything environmental that I can think of.

@merlinnot
Copy link
Contributor Author

merlinnot commented Oct 24, 2019

I managed to reproduce it in the CI (GitHub Actions), on a Linux VM. It's in a public repository and GitHub Actions are free, so you can clone it and experiment yourself very easily: https://github.com/merlinnot/grpc-node-leak-reproduction/commit/94c1eaa40ba8c48a203a9b4f86a51e6aacba87b7/checks?check_suite_id=280340116#step:6:1

The script that is in there runs lsof -i :8080 every five seconds and starts the repro.

  1. In first few logs there's just one connection.
  2. In line 34 of step 6 you can see a first duplicate.
  3. In line 67 you can see 3 sockets.
  4. In line 91 you can see 4 sockets.
    ...
  5. In line 2142 you can see 24 sockets and the workflow times out.

I can make it run longer if you'd like, number of sockets keeps increasing.

@murgatroid99
Copy link
Member

murgatroid99 commented Oct 24, 2019

OK, I was seeing different results because I didn't have the firestore emulator set up properly, so no connections were getting established and the script was just running anyway.

I can now reproduce the increasing number of open connections. It looks like connections are not getting properly cleaned up when the corresponding Subchannel is removed. That is a bug, and it probably explains the average few hundred bytes per call leak that I mentioned in my linked comment, but it is not a significant contributor to the memory leak. It's just one or two leaked Http2Session objects every 10 seconds. As previously stated, the main problem is that you are loading the proto file for every request.

Ironically, if you didn't close the clients, or you reused a single client, you wouldn't see that connection leak at all. It's only happening because each subchannel periodically has no references other than from the SubchannelPool, so when the SubchannelPool timer runs it deletes those subchannels and new ones need to be created.

@bcoe
Copy link
Contributor

bcoe commented Oct 29, 2019

@merlinnot thank you for your awesome work isolating this 🎉

@merlinnot
Copy link
Contributor Author

merlinnot commented Nov 1, 2019

@murgatroid99 I tested v0.6.10 and I have several observations:

  • Socket leak is no longer reproducible using the reproduction provided above.
  • Socket leak still exists in my project using @google-cloud/firestore (although this dep is pinned in google-gax, I forced it to run on v0.6.10).
  • Memory leak is still reproducible using the reproduction provided above.

I'll try to provide yet another reproduction, but next week the soonest. Memory leak is still reproducible, so I think it's best to keep this issue open.

@murgatroid99
Copy link
Member

I just published @grpc/proto-loader version 0.5.3 that might fix the memory leak.

@bcoe
Copy link
Contributor

bcoe commented Nov 4, 2019

@merlinnot if possible, let us know if this does the trick for squashing some of the memory leaks you've been able to reproduce, at which point we can upstream the work to gax.

@merlinnot
Copy link
Contributor Author

On it! I'll catch up with some critical tasks first and will test it straight away, ~45 minutes.

@merlinnot
Copy link
Contributor Author

I ran the tests with @grpc/[email protected] and @grpc/[email protected]:

Screen Shot 2019-11-04 at 20 36 52

Screen Shot 2019-11-04 at 20 37 20

The socket leak is still there, I'll have to work on isolating it (hundreds of open connections):

Screen Shot 2019-11-04 at 20 36 15

When it comes to the memory leak, it's definitely better! Here is the before:

Screen Shot 2019-11-04 at 20 41 57

And here is the after:

Screen Shot 2019-11-04 at 20 35 34

500 MBs saved! I'll now work on reproducing the socket leak again.

@merlinnot
Copy link
Contributor Author

Also, the socket leak doesn't necessarily mean that there is an issue with @grpc/grpc-js, at least until I can provide a reproduction. This might be as well some mistake on my side. I personally don't think it should be blocking for releasing a new version, we can always ship yet another one with more fixes.

@merlinnot
Copy link
Contributor Author

Actually, the memory leak is still reproducible using the code provided above! @murgatroid99, could you please double check if I'm correct? Maybe I missed something, but it seems to me that it's still there.

@murgatroid99
Copy link
Member

Using your original leak reproduction code, I definitely don't see the socket leak. After running the code for several minutes, I still see exactly two sockets open in the lsof output and I see exactly two Node / Http2Session objects in the heap dump, so I am fairly confident that the socket leak is fixed.

I do still see a steady memory leak, but none of the objects grpc creates seem to be increasing. The objects in the heap dumps that are increasing in quantity over time all seem to be V8 internal data structures related to compiling code. It may be the case that loading a .proto file is inherently slightly leaky, because it is not just generating data, it actually generates classes and functions, which V8 needs to compile internally.

@bcoe
Copy link
Contributor

bcoe commented Nov 19, 2019

@merlinnot I've released a testing version of @google-cloud/pubsub as @google-cloud/pubsub@next (1.1.6-beta.0). If you have the cycles, would you mind testing this out and letting us know if you see improvements to the memory issues you were seeing?

@merlinnot
Copy link
Contributor Author

Hey, sorry for the late reply, I was waiting for another API to land (closing clients in Firestore and PubSub), but then I got caught into discussions about that one and forgot about your question.

The current status is that we've discovered yet another reason for leaks: Firestore uses multiple gRPC clients and these are not being closed, more here: googleapis/nodejs-firestore#769, googleapis/gapic-generator-typescript#126. So now I'm waiting for libraries to pick it up, so we have another source of leaks fixed, and then... I'll continue debugging.

@murgatroid99
Copy link
Member

With the fixes linked from this issue, unused open clients/channels shouldn't be a significant source of memory or socket leaks. Channels constructed with the same arguments should use the same underlying subchannels, subchannels that disconnect should be dropped from all channels and eventually released entirely, and unused channels should not create new subchannels.

@merlinnot
Copy link
Contributor Author

I personally don't have a full vertical overview of all of the components and their internals, so I'm trying to help the best way I can: by providing reproductions, contributing fixes to issues I'm able to identify myself and opening issues for discussion, which most usually have a good reasoning behind them, like the one linked above.

What I understood from the internals of Firestore is that there is a pool of clients, but clients are sometimes dereferenced without properly closing them, see https://github.com/googleapis/nodejs-firestore/blob/a4efa097ef8de9ca4944356ab8767ddaa94c4188/dev/src/pool.ts.

Please see this thread (which was not opened by me), where a lot of different people experience massive memory leaks when using newer versions of Firestore. My experience is the same. I'm using VMs with 28.8 GBs of RAM for processes that I'd normally expect to run on ~2-3 GBs. This generates much higher costs. Others have the same problem: googleapis/nodejs-firestore#768 (comment).

I tested the same process after mocking the library (so it didn't even load its files) and the leak was gone, so I'm fairly certain it's somewhere there.

At this point I can't tell if that's an issue with gRPC itself, Firestore or any other of its dependencies. I'm just trying to hammer out all of the leaks I see, step by step. I do understand that you have the knowledge to assess how much of an impact might a particular leak have, but I don't. So as soon as I find one, I report, get it fixed and look for a next one.

@bcoe
Copy link
Contributor

bcoe commented Dec 5, 2019

@merlinnot your hard work helping make this important library to the community better isn't unnoticed.

Thank you!

@merlinnot
Copy link
Contributor Author

It looks much better after the fixes, I can't reproduce any leaks now. I hope that after implementation of the terminate method in @google-cloud/pubsub I'll be leak-free 🎉 If not, I'll work on a repro and open another issue. Thanks guys!

@bcoe
Copy link
Contributor

bcoe commented Jan 8, 2020

@merlinnot is there a tracking issue on @google-cloud/pubsub for this? Thanks for all your help.

@merlinnot
Copy link
Contributor Author

Yes, there is: googleapis/nodejs-pubsub#817

@lock lock bot locked as resolved and limited conversation to collaborators Apr 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants