-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ActiveRecord::ConnectionTimeoutError "all pooled connections were in use" after enabling concurrency #469
Comments
🎉 This issue has been resolved in version 1.6.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
@MattFenelon moving our convo from #470 into here. In short I'm still having a ton of trouble with web server lockups after this change and it's hard to figure out what exactly is going on… but it seems like without this change everything is a-ok. My DB pool values and the sideload thread max values seem to be reasonable, and yet sometimes things will just hang causing the webserver to not respond to requests. I think we need to roll this thing back and investigate further on an unreleased branch. The part that worries me most is that there's no error message or any sort of clue that this is the issue, meaning there could be people out there with apps breaking without a clear cut cause. Also for me in order to get things running properly again it has required a server restart, which is a real bummer. I'm going to put out a new patch release without these changes, and then maybe we can work together to try and resolve what's going on by working off a branch? |
…n what's causing thread pool hangs. refs #469 revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (#471)" revert: "add thread pool and concurrency_max_threads configuration option (#470)" This reverts commit 51fb51c. This reverts commit 697d761.
Sorry to hear that @jkeen. I see you've reverted, that makes sense until we can clear this up. In terms of diagnosing:
|
…handle on what's causing thread pool hangs. refs graphiti-api#469" This reverts commit 7941b6f.
This option allows to limit the maximum number of resources that can be sideloaded concurrently. With a properly configured connection pool, this ensures that the activerecord's connection pool is not exhausted by the sideloading process. The thread pool configuration is based on ActiveRecord's global_thread_pool_async_query_executor. This was previously attempted but there were reports of deadlocks. This code differs from the original by using Concurrency::Delay assigned to a constant instead of a regular Mutex. The Delay+constant is how concurrent-ruby sets up their global thread pools so it's more likely to be correct. Closes graphiti-api#469 See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287 See: graphiti-api#470
The fact that it's locking up completely makes me think it's a deadlock related to the use of the Mutex. I've gone with a different approach in #472 but I haven't been able to test those changes yet. Do you want to give it a try? I can test them in about a week or so. |
…n what's causing thread pool hangs. refs graphiti-api#469 revert: "thread pool scope and mutex need to be global across all instances of Scope for it to be a global thread pool (graphiti-api#471)" revert: "add thread pool and concurrency_max_threads configuration option (graphiti-api#470)" This reverts commit 51fb51c. This reverts commit 697d761.
## [1.6.3](graphiti-api/graphiti@v1.6.2...v1.6.3) (2024-03-26) ### Bug Fixes * Remove thread pool executor logic until we get a better handle on what's causing thread pool hangs. refs [graphiti-api#469](graphiti-api#469) ([7941b6f](graphiti-api@7941b6f)), closes [graphiti-api#471](graphiti-api#471) [graphiti-api#470](graphiti-api#470)
For the staging site that kept running into issues yesterday: DB_POOL = 30 It's running on a VPS and when inspecting the process list it didn't seem like any processes had CPU spiked, or anything else that might usually jump out at me as a problem. I wasn't sure what I should be looking for in spotting any hung threads from the command line, even. Any pointers there? I can give #472 a try this week sometime. Appreciate the follow up |
I think I've found the issue. It only seems to happen when side-loading children of child resources. Does that tally up with the side loading you're doing, e.g. For the sake of an example, let's imagine a thread pool of 1 and an unbounded queue:
Though it's technically a deadlock the graphiti code uses sleep to wait for the promises so it manifests itself as a hang. In my PR I replaced the sleep with I think it can be fixed by making the child resource sideload logic non-blocking. That would remove the need for the child resources to wait, which would then free up the threads to be used for other promises in the queue. But it probably needs some kind of queue to allow the parent resource to know when all child resources are loaded.
A few ifs there but I'm going to try this approach and see how I get on. I've added a successfully failing test case that shows the deadlock: e4de93e |
And just for the sake of clarity—how is that scenario you outlined working now in the released version of graphiti? I guess what I'm trying to understand is if the strategy you're describing works, does it yield in performance gains, or does it yield a more safe approach to what is currently happening by default? |
Sure I can do that. There's no (real) limit to the threads that the released version of graphiti can use. Instead the problem is that the unbounded threads eventually can exhaust the available database connection pool as ActiveRecord has a 1 thread = 1 connection policy. For this failure example, we have a puma thread pool of 1 and a database connection pool of 1:
In the real world this is less likely to happen because there have to be enough web threads running at the same time to exhaust the database pool. For example, consider we have 5 puma threads and a database connection pool of 5:
And so on. If someone has limited traffic, limited sideloads, or a large database connection pool, they may never see this issue. A naive way to solve it is to just increase the database connection pool but the AR error will raise its head eventually with enough traffic and sideloads. |
@MattFenelon I was adjusting values for db pool, max threads, and web_concurrency today when I ran square into this exact problem! I never ran into it before because of how my database.yml is setup where |
@jkeen Good to know! I agree it's very much based on each app's configuration and resource lay out. With low sideloads and/or a large database connection pool, it's unlikely to ever be a problem. That said, the default for new Rails' apps is to use RAILS_MAX_THREADS for the database connection pool. An app with that configuration will run into the problem sooner or later. |
I recently enabled concurrency on a system and it resulted in a lot of
ActiveRecord::ConnectionTimeoutError
errors.I believe it's because ActiveRecord has a 1 thread = 1 connection policy and the database connection pool for this system is limited to the number of puma threads. As soon as the system receives enough requests to use up the connection pool, the concurrently side-loaded resource threads are unable to get a connection from the pool, and vice versa.
But I'm surprised no one has come across this issue before so perhaps I've misunderstood the problem.
One solution might be to increase the system's database connection pool size. This would allow Graphiti's threads to have enough connections to not exhaust the thread pool. The difficulty would be in sizing the db pool. Given each request can sideload
x
resources, the total number of database connections required at any given time would bex * number of concurrent requests
.Ideally though I think it would make sense to allow the thread pool size to be configured.
Concurrent::Promise
, used to side-loaded resources, has by default a thread pool equal to java's int max.What do you think?
The text was updated successfully, but these errors were encountered: