
Add thread pool with promises to limit concurrent sideloading #472

Open
wants to merge 55 commits into base: master
Conversation

MattFenelon
Contributor

@MattFenelon MattFenelon commented Mar 27, 2024

Previously, sideloading in Graphiti used an unbounded number of threads,
which could decrease performance due to resource contention and exhaust the
ActiveRecord connection pool, leading to timeouts and connection errors.

This commit introduces a thread pool based on ActiveRecord's
global_thread_pool_async_query_executor, limiting the maximum number of
resources that can be sideloaded concurrently. With a properly configured
connection pool, this ensures that the ActiveRecord connection pool is not
exhausted by the sideloading process.

Promises are used to manage recursive links between threads, avoiding
deadlocks that were reported in earlier attempts.

Single-threaded mode is reintroduced with a new implementation that uses
promises in both async and sync modes. The difference lies in whether the
thread pool is set to synchronous or not.
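As a rough illustration of the bounded-pool idea (a stdlib sketch, not Graphiti's implementation, which builds on concurrent-ruby; `TinyPool` and its API are hypothetical): work is queued up to a limit, and overflow runs on the calling thread, analogous to concurrent-ruby's `:caller_runs` fallback policy.

```ruby
# Illustrative stdlib sketch of a bounded pool with a caller-runs overflow,
# loosely mirroring Concurrent::ThreadPoolExecutor with
# fallback_policy: :caller_runs. Not Graphiti's actual code.
class TinyPool
  def initialize(max_threads:, max_queue:)
    @max_queue = max_queue
    @queue = SizedQueue.new(max_queue)
    @workers = Array.new(max_threads) do
      Thread.new do
        # A nil sentinel tells a worker to exit.
        while (task = @queue.pop)
          task.call
        end
      end
    end
  end

  # Enqueue the task if there is room; otherwise run it on the caller's
  # thread so submission never blocks indefinitely. (The size check and
  # the push are not atomic; fine for a sketch, not for production.)
  def post(&task)
    if @queue.size < @max_queue
      @queue.push(task)
    else
      task.call
    end
  end

  def shutdown
    @workers.size.times { @queue.push(nil) }
    @workers.each(&:join)
  end
end
```

Bounding the pool this way is what keeps the number of in-flight sideloads, and thus checked-out database connections, below a fixed ceiling.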

Parent thread's local variables are copied to the sideloading threads,
ensuring that any thread locals available to the Graphiti resolving thread
are also accessible to the sideload threads. Similarly, parent fiber's local
variables are copied to the sideloading threads, preserving fiber-local
context during sideloading.
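A minimal sketch of that copying step (illustrative only, not Graphiti's actual code). Note that in Ruby, `Thread#[]`/`Thread#keys` access fiber-locals, while `Thread#thread_variable_get` accesses true thread-locals, so both kinds are copied:

```ruby
# Spawn a thread that sees the parent's fiber-locals and thread-locals.
# spawn_with_locals is a hypothetical helper name.
def spawn_with_locals(&block)
  parent = Thread.current
  fiber_locals = parent.keys.each_with_object({}) { |k, m| m[k] = parent[k] }
  thread_locals = parent.thread_variables.each_with_object({}) do |k, m|
    m[k] = parent.thread_variable_get(k)
  end
  Thread.new do
    # Re-install the captured locals before running the work.
    fiber_locals.each { |k, v| Thread.current[k] = v }
    thread_locals.each { |k, v| Thread.current.thread_variable_set(k, v) }
    block.call
  end
end
```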

Thread pool statistics broadcasting is added for observability. Collecting
stats on the thread pool can be useful to understand the state of the thread
pool, for example, if many tasks are being queued.

Bump concurrent-ruby gem from version 1.0 to 1.2+, which includes a fix
for worker threads getting stuck when using the caller_runs policy
(ruby-concurrency/concurrent-ruby#933).

TODO

  • Fix deadlock
  • Fix rails integration
  • Refactor ResourceProxy#data to another method for concurrent
  • Does save break, i.e. does resolve_sideloads return a promise?
  • Close adapter
  • Thread locals
  • Thread pool stats broadcast (to be tested)

Closes #469

See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287
See: #470

…handle on what's causing thread pool hangs. refs graphiti-api#469"

This reverts commit 7941b6f.
This option allows limiting the maximum number of resources that can be
sideloaded concurrently. With a properly configured connection pool,
this ensures that ActiveRecord's connection pool is not exhausted by
the sideloading process.

The thread pool configuration is based on ActiveRecord's
global_thread_pool_async_query_executor.

This was previously attempted but there were reports of deadlocks. This
code differs from the original by using a Concurrent::Delay assigned to a
constant instead of a regular Mutex. The Delay-plus-constant approach is how
concurrent-ruby sets up its own global thread pools, so it's more likely to
be correct.

Closes graphiti-api#469

See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287
See: graphiti-api#470
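The Delay-plus-constant pattern can be sketched with a stdlib stand-in (`LazyValue` is hypothetical; the real code uses concurrent-ruby's `Concurrent::Delay`): the value is computed once, lazily, under a lock.

```ruby
# Minimal stand-in for Concurrent::Delay: a thread-safe, lazily computed,
# memoized value. Illustrative only.
class LazyValue
  def initialize(&block)
    @mutex = Mutex.new
    @block = block
  end

  def value
    # Memoize under the mutex so concurrent callers all observe the
    # result of a single computation.
    @mutex.synchronize { @computed ||= @block.call }
  end
end

# Assigning the holder to a constant means the pool is built on first
# use, not at load time (stand-in object in place of a real pool).
GLOBAL_POOL = LazyValue.new { Object.new }
```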
@jkeen
Collaborator

jkeen commented Mar 28, 2024

Still getting a lockup. This sounded possibly related: ruby-concurrency/concurrent-ruby#933

And it looks like graphiti only requires concurrent-ruby ~> 1.0, when really it should be > 1.2 to include that fix? However, I tried including that and trying it out and it didn't seem to help.

But one thing I realized: one request that's causing a hang has 13 sideloads… which is a lot, admittedly. But still, if there are more sideloads than there are threads, I think that might be a trouble spot?

@MattFenelon
Contributor Author

Still getting a lockup. This sounded possibly related: ruby-concurrency/concurrent-ruby#933

And it looks like graphiti only requires concurrent-ruby ~> 1.0, when really it should be > 1.2 to include that fix? However, I tried including that and trying it out and it didn't seem to help.

But one thing I realized: one request that's causing a hang has 13 sideloads… which is a lot, admittedly. But still, if there are more sideloads than there are threads, I think that might be a trouble spot?

Thanks for testing that, @jkeen. Good find on that bug. Upgrading the gem is definitely something to consider but as you say, it doesn't seem to have solved the issue.

I'm surprised we're having issues as the code is very similar to that used by ActiveRecord for the new async queries. So perhaps there's something in the way that Graphiti is calling the thread pool that is causing problems. I suspect that's more the case than there being a bug in concurrent-ruby's thread pool.

13 sideloads seems like a lot, but the default thread pool has 4 threads and a queue of 4 * 4 (16), and if the queue is full it's meant to use the calling thread to run the request. I'm wondering if the :caller_runs option might be where the problem lies.

Are all of the sideloads running ActiveRecord queries?

I think I might need to add some logging to debug the issue.

@jkeen
Collaborator

jkeen commented Mar 28, 2024

All the sideloads run active record queries, and some are pulling nested relationships off the sideloads, as in includes=items,items.comments, etc.

I think figuring out a good logging solution would be a great improvement. I dug around the ruby-concurrency docs briefly and couldn't make heads or tails of how one might log when a new thread is getting created or hung up. But if we had some good logs, this issue might be much easier to diagnose.

As far as the other options for fallback, I haven't tried :abort or :drop. I would assume abort would make the request throw a 500, and drop would… return incomplete data?

This includes some logging for diagnostics to show what's happening when
the deadlock occurs.

```
Graphiti::Scope
  #resolve_sideloads
    when the requested sideload exists on the resource
      with concurrency
        with nested sideloads greater than Graphiti.config.concurrency_max_threads
thread 6220: employees queuing positions
thread 6220: employees waiting on [:positions]
thread 6240: running positions
thread 6240: positions queuing department
thread 6240: positions waiting on [:department]
          does not deadlock (FAILED - 1)

Failures:

  1) Graphiti::Scope#resolve_sideloads when the requested sideload exists on the resource with concurrency with nested sideloads greater than Graphiti.config.concurrency_max_threads does not deadlock
     Failure/Error: expect { instance.resolve_sideloads(results) }.not_to raise_error

       expected no Exception, got #<fatal:"No live threads left. Deadlock?\n2 threads, 2 sleeps current:0x00007f7e6f7b1780 main thread:...or.rb:339 sleep_forever>\n   rb_thread_t:0x00007f7e6f7b1780 native:0x000070000cfb4000 int:0\n   \n"> with backtrace:
         # ./lib/graphiti/scope.rb:78:in `resolve_sideloads'
         # ./spec/scope_spec.rb:145:in `block (7 levels) in <top (required)>'
         # ./spec/scope_spec.rb:145:in `block (6 levels) in <top (required)>'
```
@MattFenelon
Contributor Author

I've added a failing test case that shows the deadlock:

e4de93e

@jkeen
Collaborator

jkeen commented Apr 12, 2024

Excellent!! Nice work on that 🥇

…s-configuration-option

# Conflicts:
#	lib/graphiti/scope.rb
Not sure if I want to keep this but adding for now.
Rails assigns a connection to a thread so the close call has to happen
on that same thread for the thread's connection to be closed.
@resource.resolve will return either a promise or a real value
I misunderstood what @resource.resolve would call, assuming that the
`scope.to_a` referred to the Scope class. It actually refers to the
base_scope which for AR will be an AR query.
Pass the arguments to the promise so it runs with the correct values.

I've also disabled the adapter close for now as I'm not sure how to
close it without breaking other promises that might run on the same
thread in the pool.
@MattFenelon
Contributor Author

@jkeen do you know how to get the rails tests to run locally? I'm struggling to debug the failures there.

I've tried running:

$ bundle exec appraisal install

Which fails when trying to install the sqlite3 gem

An error occurred while installing sqlite3 (1.4.4), and Bundler cannot continue.

The installed sqlite3 is the default that comes with macOS

$ sqlite3 --version
3.32.2 2021-07-12 15:00:17 bcd014c473794b09f61fbc0f4d9488365b023f16123b278dbbd49948c27calt2

I've also tried using a later version.

@jkeen
Collaborator

jkeen commented May 21, 2024

@MattFenelon Sqlite is notorious for throwing build errors and I feel like I've spent countless hours of my life troubleshooting system-specific compilation errors with sqlite with ruby and node projects, and with the time in that space I should be an expert, and yet every time it comes along I'm no better equipped than I was the last time.

Did you try installing the gem separately first? This has also helped me in the past: https://github.com/sparklemotion/sqlite3-ruby/blob/main/INSTALLATION.md

@MattFenelon
Contributor Author

@MattFenelon Sqlite is notorious for throwing build errors and I feel like I've spent countless hours of my life troubleshooting system-specific compilation errors with sqlite with ruby and node projects, and with the time in that space I should be an expert, and yet every time it comes along I'm no better equipped than I was the last time.

Did you try installing the gem separately first? This has also helped me in the past: https://github.com/sparklemotion/sqlite3-ruby/blob/main/INSTALLATION.md

Sounds about right 😅

I've managed to work around it by upgrading to version 2 of the sqlite3 gem as that uses native binaries. I also had to upgrade rails to the main branch version (8) but that's got me to a point where I can debug the rails tests.

To debug the rails tests. I'll remove this once the tests are passing.
I'll remove this once the PR is ready
MattFenelon and others added 18 commits September 11, 2024 17:54
This ensures that any thread locals that were available to the Graphiti
resolving thread are also accessible to the sideload threads.
This ensures that any fiber locals that were available to the Graphiti
resolving thread are also accessible to the sideload threads.
Otherwise the thread/fiber locals are reset back to the same values.
This avoids memory leaks where the pool's threads would otherwise store every thread/fiber local they have ever seen.
For ruby 3.3 and head on GitHub actions, the following error happens
when accessing Fiber.current.storage.keys:

```
Failure/Error:
  fiber_storage = Fiber.current.storage.keys.each_with_object({}) do |key, memo|
    memo[key] = Fiber[key]
  end

NoMethodError:
   undefined method 'keys' for nil
```

I don't think this should be happening as Fiber.current.storage is meant
to always return a hash, even with Fiber.new(storage: nil) or
Fiber.current.storage = nil.
This proves that #flat is required to flatten the promise returned by
sideload.future_resolve that is itself called from within a promise.
Collecting stats on the thread pool can be useful for understanding its
state, for example, whether many tasks are having to be queued.
Multiple sideloads can error concurrently. In that case, raise the first
error and ignore the rest. This allows normal error handling to deal
with an expected error, rather than the Concurrent::MultipleErrors class
that concurrent-ruby raises.
I don't have a repro of the "no exception" failure but have seen this
error in testing:

```
exception class/object expected .rescue { |err| raise err } ^^^
```
With synchronous loading, the first error is raised and loading is
aborted.

With async, multiple errors can happen together and we need to pick one.

Pick the first to match how synchronous errors happen.
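A sketch of that policy (illustrative, stdlib only; not the PR's promise-based code): collect failures from concurrent tasks and raise a single error, the way synchronous loading would, rather than an aggregate such as Concurrent::MultipleErrors.

```ruby
# Run tasks on threads, gather any failures, and surface exactly one.
# resolve_concurrently is a hypothetical helper name.
def resolve_concurrently(tasks)
  errors = Thread::Queue.new
  tasks.map { |task|
    Thread.new do
      task.call
    rescue => e
      errors << e
    end
  }.each(&:join)
  # Raise the first collected error so callers handle a plain exception.
  raise errors.pop unless errors.empty?
end
```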
@MattFenelon
Contributor Author

@jkeen I'm very close with this now and hope to be able to open the PR for review next week. I have had the patch running in production in two services for more than a week with no problems. The last thing I would like to check is that the ActiveSupport::Notification I added for the thread pool works as expected. I'm hoping to be able to test that this week.

@jkeen
Collaborator

jkeen commented Sep 26, 2024

@MattFenelon Great news! I've been following along and it looks like things have progressed a bunch. Looking forward to testing this out whenever you're ready. Is the notifications/global_thread_pool_stats part of the puzzle for letting someone know if configurations need to be changed?

I hope we can make the thread-pool sizing calculation as simple and foolproof as possible.

@MattFenelon MattFenelon marked this pull request as ready for review September 27, 2024 15:14
@MattFenelon
Contributor Author

@jkeen This is ready now! I'm interested in hearing your thoughts.

Is the notifications/global_thread_pool_stats part of the puzzle for letting someone know if configurations need to be changed?

Exactly! The thread pool is a point of contention: all sideloads across all requests go through it. For instance, with the default configuration of 4 threads, if one request is using all 4 threads to sideload, the next request has to wait until a thread becomes available to sideload its own resources. It's not quite as clear-cut as that, because threads free up all the time, but it gives a simple overview.

It's a trade-off, but it ensures the safety of the system by guaranteeing that it will never use more database connections than are available.

With the thread pool stats notification, the state of the thread pool can be observed at run time in order to tune to each system's exact needs. If graphiti.thread_pool_queue_length is showing that tasks are sitting in the queue often, that's perhaps a sign that an extra thread or two would be beneficial.

That's for advanced use cases though. The default of 4 seems like a reasonable starting point for most applications, particularly as it's the default for Rails' load_async thread pool.

@MattFenelon MattFenelon changed the title Add thread pool and concurrency max threads configuration option Add thread pool with promises to limit concurrent sideloading Sep 27, 2024
@MattFenelon
Contributor Author

@jkeen I was thinking more about the impact this change will have once merged. It might be surprising for users of the gem to find the use of threads changed. How about I configure the thread pool to work the same way as before, i.e. the threads would be unbounded? I could perhaps also add a deprecated flag to signify the problems with setting concurrency without concurrency_max_threads. What do you think?

@jkeen
Collaborator

jkeen commented Oct 15, 2024

@MattFenelon If it's the right thing to do ultimately (and it seems like it is given the hidden problems you explained about the current method) I'm fine with it being a change, bumping the minor version. If you think the new way is going to break stuff unexpectedly then we could bump the major version even.

The one thing I don't want to do is have some great improvement that requires someone to find a hidden option or read the documentation in order to find and benefit from; the best way should be the default—and it seems like what you've been working on is the best way. Would you agree?

@MattFenelon
Contributor Author

MattFenelon commented Oct 16, 2024

...If you think the new way is going to break stuff unexpectedly then we could bump the major version even.

A minor bump feels ok to me. Internally it's a decent size change but it shouldn't break anything or require any changes on the part of consumers of the gem.

The one thing I don't want to do is have some great improvement that requires someone to find a hidden option or read the documentation in order to find and benefit from; the best way should be the default—and it seems like what you've been working on is the best way. Would you agree?

Yes, I agree. This change makes the default configuration safer by fixing #469. It's also likely to improve performance: with the current setup of an unbounded number of threads, there can be high contention for resources, which particularly affects the latency of the long tail of requests. In my testing on my local system, I found that even adding a single thread to the pool (4 to 5) decreased the overall requests per second the system could handle.

It sounds like we're in agreement so, if you're happy with the change, I think this PR is good to go.


Successfully merging this pull request may close these issues.

ActiveRecord::ConnectionTimeoutError "all pooled connections were in use" after enabling concurrency