Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support different lock retry configurations for different schema_statements? #67

Open
camallen opened this issue Jan 28, 2025 · 5 comments

Comments

@camallen
Copy link
Contributor

camallen commented Jan 28, 2025

Some schema statements have different lock requirements and trying to create a generic lock retry mechanism that fits different needs can lead to timeouts and retry exhaustion which can then fail the migration process.

For instance when attempting to add_reference_concurrently(foreign_key: true) we add an index concurrently and then add a foreign key.

Adding an index concurrently results in a ShareUpdateExclusiveLock which allow reads on the table. However that process must wait for all existing transactions to finish before it can successfully build the index.

by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate.

On a busy database with a longer statement timeout that index creation may fail with a shorter lock_timeout, depending on how the query patterns are on the database at any given moment (noting that ratcheting down the statement_timeout is ideal but currently unfeasible).

Then adding the FK constraint uses AccessExclusiveLock which blocks any query behind it so we really want a very short lock_timeout for this statement but with lots of retries :(

Nothing is allowed to run concurrently with ALTER TABLE ADD CONSTRAINT

Our current problem is supporting these types of schema statements with the existing lock retry mechanism https://github.com/fatkodima/online_migrations/blob/master/docs/configuring.md#lock-timeout-retries for a migration run (on deploy).

Ideally we don't have to teach all our developers about the intricacies of database locks and how to split migrations up with overridden lock mechanics in migrations, they are better focussed on domain modelling and not deploying these migrations to a busy database.

Is there any way we can modify the LockRetrier implementation to support these conflicting use cases?

Perhaps we can add different lock retries on the config and dynamically select them via the schema statement, e.g. add_index?

Or allow the lock_retrier to reflect on the migration and auto select the lock retrier implementation to use?

@fatkodima
Copy link
Owner

I didn't get what the problem is. We use the default lock retrier and add_reference_concurrently with no problems yet on a busy database. Maybe if you can describe a specific case, I can help.

Lock retrier uses small lock timeouts with many attempts, so this won't block the database for a long time. Both adding an index and a foreign key are made idempotent in this gem, so if you use add_reference_concurrently and index was created, but for some reason foreign key addition failed and the migration is retried/redeployed, then the index creation will be skipped this time.

Also, you can configure different lock retriers by assigning to config.lock_retrier with existing classes or custom class. Or configure a custom check which suggests using something special.

@camallen
Copy link
Contributor Author

camallen commented Jan 28, 2025

Our problem is a lock retrier with a short timeout is having troubles creating an index concurrently due to it having to wait for any existing transaction id, possibly minutes.

That combined with the need for short timeouts for strong locking operations like adding a FK leaves our lock retrier implementation in a middle ground where we hit lock_timeouts while waiting for long running transactions on the index builds. Perhaps an example setup will help.

Currently we have a constant lock retrier configured like, it's a targetted to avoiding blocking queries with strong locks while trying to allow time for index builds.

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 9,
  delay: 10.seconds,
  lock_timeout: 5.seconds
)

When adding an index with this config and there are some long running queries we're hitting PG::LockNotAvailable statement timeout errors.

The index builds fine with a retrier configuration like the following as it allows the lock to wait for any existing transaction that could modify or use the index to terminate.

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 4,
  delay: 30.seconds,
  lock_timeout: 3.minutes
)

However with that configuration we have problematic blocking operations when adding foreign keys on busy tables as the queries can pile up behind the strong locks that schema statement uses. Ideally when adding FKs with strong locks we'd use a retrier configuration like

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 60,
  delay: 5.seconds,
  lock_timeout: 0.5.seconds
)

So we have two distinct retrier configurations depending on the type of schema statement the migration is running. Ideally having an internal mechanism in this gem that detects the type of schema statement and applies a tailored lock retier mechanism based where it can, as developers can get it wrong and cause downtime selecting the wrong retrier configuration.

Hopefully this helps clarify the problem we're experiencing with the current system. I'd be happy to work on this implementation myself but I couldn't see a clean pathway to doing so with the current gem api.

@fatkodima
Copy link
Owner

Thanks for the explanation. The easiest solution I can currently think of is extending the implementation of lock retrier in the gem to also accept the current command and its arguments and allow people to do some logic depending on it.

Something like:

class MyLockRetrier < LockRetrier
  def lock_timeout(_attempt, command: nil, arguments: nil)
    if command == :add_index
      30
    else
      5
    end
  end
  # other methods
end

Will that work?

@camallen
Copy link
Contributor Author

camallen commented Jan 28, 2025

Yes that would work! Having the lock_timeout api extended to include the command and arguments would be ideal. I can reflect on the command internally and return the correct timeout.

Ideally the same API surface exists for the retries and delay. E.g. (re)building an index might be compute intensive with larger delays, similarly less lots of retries when adding FKs but less for adding an index.

Sadly, I couldn't quite see how to inject the command to the with_lock_retries method to do it myself.

@fatkodima
Copy link
Owner

Should be done somewhere here

def method_missing(method, *args, &block)
if ar_schema?
super
elsif command_checker.check(method, *args, &block)
if in_transaction?
super
elsif method == :with_lock_retries
connection.with_lock_retries(*args, &block)
else
connection.with_lock_retries { super }
end
end
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants