Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sub-channels): clear the cleanup interval when all channels are unrefed. #1083

Merged
merged 7 commits into from
Nov 6, 2019

Conversation

JrSchild
Copy link
Contributor

Fixes #1080

This forces a cleanup of subchannels whenever a channel is closed and disables the background task if there are no more refed subchannels. This ensure that if all clients using grpc are closed, there are no background tasks running.

Co-authored-by: Natan Sągol [email protected]

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 24, 2019

CLA Check
The committers are authorized under a signed CLA.

@thelinuxfoundation
Copy link

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@JrSchild
Copy link
Contributor Author

I signed it

@murgatroid99
Copy link
Member

Have you confirmed that this problem is not just a Jest bug? My own tests always terminate quickly without even explicitly closing channels.

@merlinnot
Copy link
Contributor

There might be a bit of an overstatement here: it probably won't fix #1083. However it will still fix one of the issues we're experiencing.

This setInterval won't prevent process from exiting, however it will keep running and consuming memory (preventing garbage collection of multiple, multiple objects, as it keeps a reference to a subchannel pool) if:

  • you don't use gRPC anymore, but don't exit the process.
  • you import the file multiple times, clearing require cache - which is a common practice in tests
  • you run it in more interesting scenarios, like using multiple Node.js VMs in one process.

We're trying to introduce some gradual changes to help resolve issues with the library we are facing at the moment. I hope that this change is small enough to be reviewed with confidence and given the examples above you'll see a point of incorporating it into the library.

We tested it on our internal projects too and it seems to work perfectly fine.

@murgatroid99
Copy link
Member

Clearing the require cache is something I hadn't considered. I'm OK with turning the timer on and off based on whether there is at least one subchannel in the pool, but I don't like forcing the pool to clear out every time a channel is closed. Part of the point here is to allow subchannel reuse to improve connection time, including when a channel is closed and then a new one is opened. If all of the channels are closed, then the subchannel pool should still get completely cleared out the next time the timer runs.

@merlinnot
Copy link
Contributor

Great. To align on this, so me and my colleagues can improve the PR tomorrow:

The expectation is that instead of trying to unref as much as possible and check if we unrefed everything, we should first check if we can unref everything and do so only if that's the case?

@murgatroid99
Copy link
Member

I'm not suggesting any logic change. I'm suggesting using the existing logic, but turning the timer on if there is a subchannel in the pool and off when the last subchannel is removed from the pool. A natural consequence of this is that if there are no open channels, every subchannel will be removed and then the timer will be disabled.

We can also talk about changing the REF_CHECK_INTERVAL number. The current choice was arbitrary. A lower number cleans everything up faster, but also uses more CPU.

@merlinnot
Copy link
Contributor

I'm not sure if that would solve the issue. Let me clarify my expectations first, so we're on the same page:

If all gRPC clients are closed (using the close method), all resources are freed immediately, synchronously.

Do you agree that it should be the case?

@murgatroid99
Copy link
Member

No, I do not agree with that. It is beneficial to have subchannels stick around for a brief time after the last channel closes in case a new channel is created.

@merlinnot
Copy link
Contributor

Could you share the use cases that you have in mind that would benefit from the current behavior?

My understanding it is that it would help if users are rapidly opening and deliberately closing connections, using the exact same options (addresses, credentials, ...). Are there some other use cases?

@murgatroid99
Copy link
Member

Yes, the use-case for this is that someone closes a client connected to a specific target, and then later (but not much later) creates a new identical client with the same target.

@merlinnot
Copy link
Contributor

Do you feel it's an expected way of using the library? Or is it an optimization towards users misusing it?

I'm starting to wonder if we should have this timer at all, maybe we should only clean it up when a client is explicitly closed?

@murgatroid99
Copy link
Member

The timer is still needed to handle clients that aren't closed explicitly and subchannels that become unused that were previously used by still-active channels.

@merlinnot
Copy link
Contributor

Thanks for the explanation.

Do you think that explicitly closing the client and recreating it repeatedly is an expected way of using this library?

@murgatroid99
Copy link
Member

No, closing and recreating identical clients is not an expected or intended use of this library. I'm retracting my previous comment here; I agree now that there is not significant inherent value in preserving subchannels past the channel lifetime.

Still, I believe that the timer is necessary for other usage patterns and that cleaning up the subchannel pool on channel close should be essentially unnecessary. So far, as far as I am aware, the only benefit to cleaning up immediately is to avoid this problem with Jest, and I do not want to start addressing issues in other libraries and tools in our code.

My preference then is to keep the functionality simple by only having the timer. I do still support turning the timer off while the pool has no subchannels. In addition, memory could be further saved by creating a new empty pool object every time the subchannel pool becomes empty.

@merlinnot
Copy link
Contributor

Maybe using Jest in the issue wasn’t the best idea, I didn’t intend to focus on it. It affects any testing setup if you clean require cache or you have memory leak detection implemented.

Conceptually it seems that “I have no resources allocated if I cleaned everything up” would be an expected behavior. Maybe we could find some middle ground here, so we can have this behavior, but it’s implemented in a way that is more to your liking? Do you feel that the current implementation is too complex? Maybe we should split it in two parts, so that it’s easier to review?

@murgatroid99
Copy link
Member

I simply don't think it's necessary to clean everything up as soon as channels close. I don't think there is a significant impact even if you clean the require cache a hundred times in one process, and I expect that this will be noise in most leak checks. I really don't think it matters that much if resources are cleaned up immediately after a channel closes, as opposed to several seconds later. They're already unrefed, so they won't hold the process open.

@merlinnot
Copy link
Contributor

I understand that it doesn’t matter for you. I think we have both pros and cons here.

Pros:

  • memory leak detection tools are happy
  • there’s lower memory consumption when you close the client, but don’t exit the process
  • there’s less CPU used when you close the client, but don’t exit the process
  • doesn’t leave any resources behind when require cache is cleaned

Cons:

  • the current implementation is changed, possibly causing an increase in complexity

Are there any things I should add to this list?

Maybe we could work together to come up with an implementation that is simpler and more clean? I tried to make the smallest amount of changes possible, take care of naming and nice structure of the code, but I have a feeling that it’s not up to your standards? Could you maybe provide me with some more detailed feedback if you feel like we can improve this PR in such a way that you’d be comfortable with merging it?

@murgatroid99
Copy link
Member

Here's how I see it:

Option 1 (the current behavior of this PR): Use the timer and a call at channel close to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:

  • Slightly less memory usage for a few seconds after each channel is closed.
  • Leak detection tools may be happier if the process ends immediately after closing a channel.
  • If channels are explicitly closed, no resources are held for a few seconds after the require cache is cleared.

Option 2 (my proposal): Use only the timer to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:

  • Slightly less CPU usage when closing a channel.
  • Simpler behavior.

For me the relative simplicity of the second option outweighs the brief and occasional value of the first.

@merlinnot
Copy link
Contributor

For us, as a user of this library, a solution proposed by you would force us to either:

  • wait for 10 seconds after each test
  • use fake timers in every single test

The first option would dramatically lower our productivity and increase costs of CIs. The second option would increase complexity of the test setup and might interfere with timers used by other libraries, which might result in unexpected behavior and inability to test real behavior of these libraries.

Maybe we could make eager cleanup optional if you’re concerned by additional CPU usage?

grpc.setOptions({ eagerSubchannelCleanup: true })

I’m trying to find a solution that would allow my company to keep using this library.

@murgatroid99
Copy link
Member

I don't understand these requirements you are describing. My own tests finish promptly without even explicitly closing clients, because these objects we're discussing are unrefed. And I don't see what a fake timer would do.

@merlinnot
Copy link
Contributor

I understand that your particular testing setup exits and I do acknowledge that unrefing allows the process to exit cleanly if there are no other tasks in the event loop.

However:

  1. If you'd be using memory leak detection tools which compare memory usage before and after each test, you'd have to either (based on the assumption that we'd choose to go with your option - disable timers on an interval, no eagerly): wait for 10 seconds or use fake timers to forcefully run the timer.
  2. If you'd be cleaning require cache, you'd have a similar situation, where multiple of these timers would pile up over 10 seconds (if a test takes 0,5s, you'd have a maximum of 20 of these timers running).

@murgatroid99
Copy link
Member

Are you saying that you are currently using one of these memory leak detection tools and the only excess memory usage it is detecting is those Subchannel instances and associated object that they own?

And I honestly don't see the problem with temporarily having 20+ timers pending. They might consume a little CPU when they run the last time but then that's it.

@merlinnot
Copy link
Contributor

No, I'm not saying it's the only leak it detects.

As my company is using this library, we'd like to contribute to it the best way we can. As an example, I recently provided a meaningful reproduction that allowed to fix the issue we and other users were facing: #1085 (that's what is being released today).

This PR provides a solution to one of the issues which is affecting us. We'd like to continue investing into this library by both providing reproductions of issues and creating pull requests to fix some of these issues if we are able to pinpoint a root cause.

@murgatroid99
Copy link
Member

I am just trying to understand what actual problem would be solved by the change you propose.

If the goal is to fix Jest reporting an error because it doesn't understand that the timer is unrefed as reported in #1080, that should be taken up with Jest; I don't want to publish code that primarily addresses bugs in other libraries.

If there is some other problem separate from #1080 that is preventing tests from finishing cleanly, tell me what that problem is. It may not even be related to the subchannel pool.

If this library was blocking the adoption of leak checking in tests because it was the only memory sticking around at the end of the test, I would consider that a strong reason to make this change or one like it. But if it's just one leak among many then don't think it's grpc's responsibility to make fixes for that purpose.

If the existence of many pending timers as a result of clearing the require cache is causing significant problems with running tests, tell me what those problems are and I'll be more inclined to make this change.

@merlinnot
Copy link
Contributor

Sure. Here's an illustration, simple JavaScript that you can actually execute with --expose-gc flag (node --expose-gc ./script.js):

const assert = require('assert');

let heapUsed = NaN;

const beforeEach = () => {
  global.gc();

  heapUsed = process.memoryUsage().heapUsed;
};

const afterEach = () => {
  global.gc();

  console.log(`Before: ${heapUsed}, after: ${process.memoryUsage().heapUsed}.`);
};

const definitelyNotLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });

  assert(buffer.length === 1048576, "Math doesn't work anymore.");
};

const definitelyLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });
  setTimeout(() => {
    console.log(
      `Hey look, test finished, but I'm still running! - ${buffer.size}`,
    );
  }, 500).unref();
};

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();

You should see an output similar to:

Before: 3502640, after: 3448280.
Before: 3489640, after: 3481360.
Before: 3487184, after: 3484216.
Before: 3484328, after: 11887168.
Before: 11887280, after: 20277448.

As you can see, the memory usage is radically growing with setTimeout, even tho it's unrefed. That's because it holds references to some objects defined outside of it's callback, preventing these from being garbage collected. The same happens in our case, where this timer not only consumes memory by itself, but also keeps all of the referenced objects in memory.

From what I've seen most memory leak detection tools uses V8's C++ API directly, as it exposes more functionalities, but I hope this simple example will help you understand the issue.

@murgatroid99
Copy link
Member

OK, yes, I do understand that a timer that closes over things will hold those things in memory for as long as the timer is pending. And I know that unrefing the timer doesn't change that because that is addressing a different problem. And I do understand that you can write a function that checks how much memory is allocated and see that more memory is allocated.

What I am trying to understand here is the actual practical impact of these temporary memory usage spikes, in terms of test failures or integration problems or application issues or users' bug reports or whatever.

@merlinnot
Copy link
Contributor

If you have such a tool enabled, ever test which uses gRPC, even if cleaned up correctly (.close etc) will fail the memory check unless you (given we’d start disabling the timer in a way you proposed):

  • wait 10 seconds after each test
  • use fake timers to run this timer forcefully

What’s more, if you clean require cache, memory allocated by gRPC will start to pile up (as I described above, if the test execution takes half a second, you have up to 20 pending timers). In our scenario tests are running on a 32-core VM in full parallelization and isolation. That means that we’d have 32 * 20 = 640 gRPCs allocated at the same time, which is a significant additional memory pressure.

I didn’t see any other library that would behave in such a way, that after deliberately closing it, resources would remain allocated (even for a few seconds). Do you know some other libraries that do so?

@murgatroid99
Copy link
Member

I understand that if you were to use such a tool and set it up that way, its report would include those objects. I just want to know what the actual impact of this is. Is this preventing you from using such a memory leak detection tool effectively?

As for the memory usage, subchannels use something on the order of kilobytes or tens of kilobytes of memory. So in the situation you describe, you will see a peak of something on the order of several megabytes of total excess memory usage. That shouldn't be a problem on just about any modern system.

@merlinnot
Copy link
Contributor

merlinnot commented Nov 1, 2019

Yes, it does prevent usage of such tools. It's very hard to determine where a leak comes from, as it's usually a chain of references from one thing to another. You probably know how hard it is sometimes to get some understanding of different memory leaks if you use a debugger, it's also hard to determine for these tools where does a memory leak come from, so you just get a "yes/no" response. At least I personally didn't ever come across a tool that somehow allows you to exclude some paths.

Situation with native grpc was a little different, as it's a native add-on and memory leak detection tools don't keep track of these, only actual JS heap.

So at the moment I can't enable memory leak detection for any tests, as we're using Google SDKs extensively, especially Firestore and PubSub (with emulators). Having an actually clean state (synchronously) would allow me and other developers to use these tools. It's really useful for us to detect memory leaks in the CI :) I know at my company we're a little paranoid about tests, but given the speed we're developing at (sometimes dozens of features and deployment daily) and having a "master = production" mentality, we really need to take a good care of our tests.

We're trying our best to contribute to open source, especially the tools we use daily. You can take a look at my public history (e.g. 25 merged and 1 open PR to firebase-functions) and my company's public profile with PubSub, Firestore and Flatbuffers containers, GitHub Actions etc.

We created this PR as it's an important functionality for us and we prefer to contribute to the community instead of just using forks, as we believe it's a more sustainable way for the community and our company. We also believe that it would ease adoption of memory leak detection tools for others, which might be a good idea for many.

If this implementation is not something that is acceptable for you, maybe you could propose a different solution (e.g. what I mentioned above, to add it as an option)? We'd be happy to discuss it and adopt our approach.

Copy link
Member

@murgatroid99 murgatroid99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for all of the back-and-forth. I feel sufficiently convinced now that cleaning the pool up when channels close is the best way to improve this behavior for some use-cases.

@@ -288,6 +288,10 @@ export class ChannelImplementation implements Channel {
close() {
this.resolvingLoadBalancer.destroy();
this.updateState(ConnectivityState.SHUTDOWN);

if (this.subchannelPool !== undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type-wise, this.subchannelPool can't be undefined, so this check is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 93f8169.

const allSubchannelsUnrefed = this.unrefUnusedSubchannels();

if (allSubchannelsUnrefed && this.cleanupTimer !== undefined) {
clearInterval(this.cleanupTimer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition should be that the timer is started if there is at least one subchannel in the pool, and stopped if there are no longer any subchannels in the pool. The timer should not be disabled just because a single channel is closed. So, this check should be in unrefUnusedSubchannels.That allows a simplification where unrefUnusedSubchannels returns nothing, and is public, and is the function that Channel#close calls directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one.
Fixed in 5f271de

/**
* A timer of a task performing a periodic subchannel cleanup.
*/
private cleanupTimer: NodeJS.Timer | undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly prefer not allowing undefined here, and instead having a separate boolean indicating whether the timer is running. I find it cleaner type-wise to allow as few types as possible. The downside is that the constructor needs to do cleanupTimer=... and clearInterval(cleanupTimer) and starting the timer is slightly more code, but the upside is that you never need to check the value of cleanupTimer itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you prefer it to be more explicit, rather than having undefined to indicate that it's not running? I'd personally prefer to also destroy the reference to the timer if we're not running it anymore, as it ideally should not be used anymore. I thought we might be more explicit, but destroy the reference too:

type CleanupTask = { running: true; timer: Nodejs.Timer } | { running: false };

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I like keeping the timer reference is it allows the code for disabling the timer to work even if the timer is already disabled, which simplifies code:

clearInterval(timer);
running = false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @murgatroid99,

I couldn't find any documentation, explaining if the callback of setInterval is garbage collected when clearInterval is called on its timer. This makes me uncertain if we should keep the timer on the instance or if we should remove it. I consulted with my colleagues and no one was certain about it. So I think it might cause more confusion than clarity.

What do you think, maybe we could leave it as is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point. One alternative would be to do the same thing I suggested in the constructor and replace it with a do-nothing interval and then immediately cancel it, but that kind of removes the benefit I mentioned so I'm OK with keeping it as it is here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, actually, this is a little minor, but can we have the alternate type be null instead of undefined. I dislike using undefined as an explicit type, and null would be more consistent with other code in this project. It shouldn't change much; this line can just be private cleanupTimer: NodeJS.Timer | null = null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to null and merged master, as far as I remember you made a fix to macOS builds.

* Ensure that the cleanup task is spawned.
*/
ensureCleanupTask(): void {
if (this.global === true && this.cleanupTimer === undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=== true shouldn't be needed here. global is always a boolean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 93f8169.

@merlinnot
Copy link
Contributor

Thanks for the feedback! We'll rebase on master and address your comments on Monday :)

@murgatroid99
Copy link
Member

I don't think you need to rebase. The relevant parts of these files haven't changed since you opened this PR.

@JrSchild
Copy link
Contributor Author

JrSchild commented Nov 6, 2019

@murgatroid99 Thank you for your feedback, this should be ready for another review.

@thelinuxfoundation
Copy link

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@merlinnot
Copy link
Contributor

@thelinuxfoundation I signed it.

@merlinnot
Copy link
Contributor

@murgatroid99 Could you re-run Kokoro?

@murgatroid99 murgatroid99 merged commit a7567f0 into grpc:master Nov 6, 2019
@merlinnot
Copy link
Contributor

Yay, thanks! 😃

@murgatroid99
Copy link
Member

I have published this change in grpc-js 0.6.11.

@JrSchild
Copy link
Contributor Author

JrSchild commented Nov 7, 2019

Thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Feb 5, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connections remain open after shut down
5 participants