Add thread pool with promises to limit concurrent sideloading #472
base: master
Conversation
…handle on what's causing thread pool hangs. refs graphiti-api#469"

This reverts commit 7941b6f.
This option allows limiting the maximum number of resources that can be sideloaded concurrently. With a properly configured connection pool, this ensures that ActiveRecord's connection pool is not exhausted by the sideloading process. The thread pool configuration is based on ActiveRecord's global_thread_pool_async_query_executor. This was previously attempted, but there were reports of deadlocks. This code differs from the original by using a Concurrent::Delay assigned to a constant instead of a regular Mutex. The Delay-plus-constant approach is how concurrent-ruby sets up its own global thread pools, so it's more likely to be correct. Closes graphiti-api#469 See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287 See: graphiti-api#470
Still getting a lockup. This sounded possibly related: ruby-concurrency/concurrent-ruby#933. And it looks like graphiti only requires concurrent-ruby 1.0. But one thing I realized: one request that's causing a hang has 13 sideloads… which is a lot, admittedly. But still, if there are more sideloads than there are threads, I think that might be a trouble spot?
Thanks for testing that, @jkeen. Good find on that bug. Upgrading the gem is definitely something to consider, but as you say, it doesn't seem to have solved the issue. I'm surprised we're having issues, as the code is very similar to that used by ActiveRecord for the new async queries. So perhaps there's something in the way that Graphiti is calling the thread pool that is causing problems. I suspect that's more the case than there being a bug in concurrent-ruby's thread pool. 13 sideloads seems a lot, but the default thread pool has 4 threads and a queue of 4 * 4 (16), and if the queue is exceeded it's meant to use the calling thread to run the request. I'm wondering if the :caller_runs option might be where the problem lies. Are all of the sideloads running ActiveRecord queries? I think I might need to add some logging to debug the issue.
All the sideloads run ActiveRecord queries, and some are pulling nested relationships off the sideloads. I think figuring out a good logging solution would be a great improvement—I dug around the ruby-concurrency docs briefly and couldn't make heads or tails of how one might log when a new thread is getting created or hung up. But if we had some good logs, this issue might be much easier to diagnose. As far as the other options for fallback, I haven't tried :abort or :drop. I would assume abort would make the request throw a 500, and drop would… return incomplete data?
This includes some logging for diagnostics to show what's happening when the deadlock occurs.

```
Graphiti::Scope
  #resolve_sideloads
    when the requested sideload exists on the resource
      with concurrency
        with nested sideloads greater than Graphiti.config.concurrency_max_threads
thread 6220: employees queuing positions
thread 6220: employees waiting on [:positions]
thread 6240: running positions
thread 6240: positions queuing department
thread 6240: positions waiting on [:department]
          does not deadlock (FAILED - 1)

Failures:

  1) Graphiti::Scope#resolve_sideloads when the requested sideload exists on the resource with concurrency with nested sideloads greater than Graphiti.config.concurrency_max_threads does not deadlock
     Failure/Error: expect { instance.resolve_sideloads(results) }.not_to raise_error

       expected no Exception, got #<fatal:"No live threads left. Deadlock?\n2 threads, 2 sleeps current:0x00007f7e6f7b1780 main thread:...or.rb:339 sleep_forever>\n rb_thread_t:0x00007f7e6f7b1780 native:0x000070000cfb4000 int:0\n \n"> with backtrace:
         # ./lib/graphiti/scope.rb:78:in `resolve_sideloads'
         # ./spec/scope_spec.rb:145:in `block (7 levels) in <top (required)>'
         # ./spec/scope_spec.rb:145:in `block (6 levels) in <top (required)>'
         # ./spec/scope_spec.rb:145:in `block (6 levels) in <top (required)>'
```
I've added a failing test case that shows the deadlock:
Excellent!! Nice work on that 🥇
Not sure if I want to keep this but adding for now.
Rails assigns a connection to a thread so the close call has to happen on that same thread for the thread's connection to be closed.
@resource.resolve will return either a promise or a real value
I misunderstood what @resource.resolve would call, assuming that the `scope.to_a` referred to the Scope class. It actually refers to the base_scope, which for ActiveRecord will be an AR query.
Pass the arguments to the promise so it runs with the correct values. I've also disabled the adapter close for now as I'm not sure how to close it without breaking other promises that might run on the same thread in the pool.
@jkeen do you know how to get the rails tests to run locally? I'm struggling to debug the failures there. I've tried running:
Which fails when trying to install the sqlite3 gem
The installed sqlite3 is the default that comes with macOS
I've also tried using a later version.
@MattFenelon Sqlite is notorious for throwing build errors. I feel like I've spent countless hours of my life troubleshooting system-specific compilation errors with sqlite in ruby and node projects, and with all that time in the space I should be an expert, yet every time it comes along I'm no better equipped than I was the last time. Did you try installing the gem separately first? This has also helped me in the past: https://github.com/sparklemotion/sqlite3-ruby/blob/main/INSTALLATION.md
Sounds about right 😅 I've managed to work around it by upgrading to version 2 of the sqlite3 gem as that uses native binaries. I also had to upgrade rails to the main branch version (8) but that's got me to a point where I can debug the rails tests. |
To debug the rails tests. I'll remove this once the tests are passing.
I'll remove this once the PR is ready
This ensures that any thread locals that were available to the Graphiti resolving thread are also accessible to the sideload threads.
This ensures that any fiber locals that were available to the Graphiti resolving thread are also accessible to the sideload threads.
Otherwise the thread/fiber locals are reset back to the same values.
Avoid memory leaks where the thread/fiber pool stores every local seen.
For Ruby 3.3 and head on GitHub Actions, the following error happens when accessing Fiber.current.storage.keys:

```
Failure/Error:
  fiber_storage = Fiber.current.storage.keys.each_with_object({}) do |key, memo|
    memo[key] = Fiber[key]
  end

NoMethodError:
  undefined method 'keys' for nil
```

I don't think this should be happening, as Fiber.current.storage is meant to always return a hash, even with Fiber.new(storage: nil) or Fiber.current.storage = nil.
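A defensive version of the snapshot shown in that failure is one way to work around it: guard against `Fiber.current.storage` unexpectedly returning nil, and against Rubies without fiber storage at all (it arrived in 3.2). This is a sketch of the workaround, not necessarily what the PR ships.

```ruby
# Guarded snapshot of fiber-local storage. The `|| {}` absorbs the nil
# observed on Ruby 3.3/head in CI; the respond_to? check covers Ruby < 3.2,
# where Fiber storage does not exist.
fiber_storage =
  if defined?(Fiber.current) && Fiber.current.respond_to?(:storage)
    (Fiber.current.storage || {}).keys.each_with_object({}) do |key, memo|
      memo[key] = Fiber[key]
    end
  else
    {}
  end
```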
This proves that #flat is required to flatten the promise returned by sideload.future_resolve that is itself called from within a promise.
Collecting stats on the thread pool can be useful to understand the state of the thread pool, for example, whether a lot of tasks are having to be queued.
Multiple sideloads can error concurrently. In that case, raise the first error and ignore the rest. This will allow normal error handling to deal with an expected error, rather than the Concurrent::MultipleErrors class that concurrent-ruby raises.
I don't have a repro of the no-exception case, but I have seen this error in testing:

```
exception class/object expected
.rescue { |err| raise err }
^^^
```
With synchronous loading, the first error is raised and loading is aborted. With async, multiple errors can happen together and we need to pick one. Pick the first to match how synchronous errors happen.
@jkeen I'm very close with this now and hope to be able to open the PR for review next week. I have had the patch running in production in two services for more than a week with no problems. The last thing I would like to check is that the ActiveSupport::Notifications instrumentation I added for the thread pool works as expected. I'm hoping to be able to test that this week.
@MattFenelon Great news! I've been following along and it looks like things have progressed a bunch. Looking forward to testing this out whenever you're ready. Is the notifications/global_thread_pool_stats part of the puzzle for letting someone know if configurations need to be changed? I hope we can make the thread-pool sizing calculation as simple and foolproof as possible.
@jkeen This is ready now! I'm interested in hearing your thoughts.
Exactly! The thread pool is a point of contention. All sideloads across all requests go through the thread pool. For instance, with the default configuration of 4 threads, if one request is using up all 4 threads to sideload, the next request will have to wait until one of the threads is available to sideload its own resources. It's not exactly as clear-cut as that, because threads free up all of the time, but it gives a simple overview. It's a trade-off, but it ensures the safety of the system by guaranteeing that it will never use more database connections than are available. With the thread pool stats notification, the state of the thread pool can be observed at run time in order to tune it to each system's exact needs. That's for advanced use cases though. The default of 4 seems like a reasonable starting point for most applications, particularly as it's the default for Rails' load_async thread pool.
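For anyone wanting to tune this, a hypothetical configuration sketch based on the option names that appear in this PR's specs (`Graphiti.config.concurrency_max_threads`); the exact configuration API is an assumption:

```ruby
# Hypothetical initializer sketch -- option names follow the PR's specs,
# the surrounding API may differ.
Graphiti.config.concurrency = true
Graphiti.config.concurrency_max_threads = 4 # matches Rails' load_async default
```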
@jkeen I was thinking more about the impact this change will have once merged. It might be surprising for users of the gem to find the use of threads changed. How about I configure the thread pool to work the same way as before, i.e. the threads would be unbounded? I could perhaps also add a deprecation flag to signify the problems with setting concurrency without concurrency_max_threads. What do you think?
@MattFenelon If it's the right thing to do ultimately (and it seems like it is, given the hidden problems you explained about the current method), I'm fine with it being a change, bumping the minor version. If you think the new way is going to break stuff unexpectedly, we could even bump the major version. The one thing I don't want is some great improvement that requires someone to find a hidden option or read the documentation in order to benefit from it; the best way should be the default—and it seems like what you've been working on is the best way. Would you agree?
A minor bump feels ok to me. Internally it's a decent size change but it shouldn't break anything or require any changes on the part of consumers of the gem.
Yes, I agree. This change makes the default configuration safer by fixing #469. It's also likely to improve performance. With the current setup of an unbounded number of threads, there can be high contention for resources, which particularly affects the latency of the long tail of requests. In my testing on my local system, I found even a small increase of one thread to the thread pool (4 to 5) decreased the overall requests per second the system could handle. It sounds like we're in agreement, so if you're happy with the change, I think this PR is good to go.
Previously, sideloading in Graphiti used an unbounded number of threads,
which could decrease performance due to resource contention and exhaust the
ActiveRecord connection pool, leading to timeouts and connection errors.
This commit introduces a thread pool based on ActiveRecord's
global_thread_pool_async_query_executor, limiting the maximum number of
resources that can be sideloaded concurrently. With a properly configured
connection pool, this ensures that the ActiveRecord connection pool is not
exhausted by the sideloading process.
Promises are used to manage recursive links between threads, avoiding
deadlocks that were reported in earlier attempts.
Single-threaded mode is reintroduced with a new implementation that uses
promises in both async and sync modes. The difference lies in whether the
thread pool is set to synchronous or not.
Parent thread's local variables are copied to the sideloading threads,
ensuring that any thread locals available to the Graphiti resolving thread
are also accessible to the sideload threads. Similarly, parent fiber's local
variables are copied to the sideloading threads, preserving fiber-local
context during sideloading.
Thread pool statistics broadcasting is added for observability. Collecting
stats on the thread pool can be useful to understand the state of the thread
pool, for example, if many tasks are being queued.
Bump the concurrent-ruby gem from version 1.0 to 1.2+, which includes a fix
for worker threads getting stuck when using the caller_runs policy
(ruby-concurrency/concurrent-ruby#933).
TODO:
- ResourceProxy#data to another method for concurrent…
- resolve_sideloads return a promise?

Closes #469
See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287
See: #470