Support different lock retry configurations for different schema_statements? #67

camallen · 2025-01-28T07:10:44Z

Some schema statements have different lock requirements and trying to create a generic lock retry mechanism that fits different needs can lead to timeouts and retry exhaustion which can then fail the migration process.

For instance when attempting to add_reference_concurrently(foreign_key: true) we add an index concurrently and then add a foreign key.

Adding an index concurrently results in a ShareUpdateExclusiveLock which allow reads on the table. However that process must wait for all existing transactions to finish before it can successfully build the index.

by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate.

On a busy database with a longer statement timeout that index creation may fail with a shorter lock_timeout, depending on how the query patterns are on the database at any given moment (noting that ratcheting down the statement_timeout is ideal but currently unfeasible).

Then adding the FK constraint uses AccessExclusiveLock which blocks any query behind it so we really want a very short lock_timeout for this statement but with lots of retries :(

Nothing is allowed to run concurrently with ALTER TABLE ADD CONSTRAINT

Our current problem is supporting these types of schema statements with the existing lock retry mechanism https://github.com/fatkodima/online_migrations/blob/master/docs/configuring.md#lock-timeout-retries for a migration run (on deploy).

Ideally we don't have to teach all our developers about the intricacies of database locks and how to split migrations up with overridden lock mechanics in migrations, they are better focussed on domain modelling and not deploying these migrations to a busy database.

Is there any way we can modify the LockRetrier implementation to support these conflicting use cases?

Perhaps we can add different lock retries on the config and dynamically select them via the schema statement, e.g. add_index?

Or allow the lock_retrier to reflect on the migration and auto select the lock retrier implementation to use?

The text was updated successfully, but these errors were encountered:

fatkodima · 2025-01-28T10:09:43Z

I didn't get what the problem is. We use the default lock retrier and add_reference_concurrently with no problems yet on a busy database. Maybe if you can describe a specific case, I can help.

Lock retrier uses small lock timeouts with many attempts, so this won't block the database for a long time. Both adding an index and a foreign key are made idempotent in this gem, so if you use add_reference_concurrently and index was created, but for some reason foreign key addition failed and the migration is retried/redeployed, then the index creation will be skipped this time.

Also, you can configure different lock retriers by assigning to config.lock_retrier with existing classes or custom class. Or configure a custom check which suggests using something special.

camallen · 2025-01-28T10:32:52Z

Our problem is a lock retrier with a short timeout is having troubles creating an index concurrently due to it having to wait for any existing transaction id, possibly minutes.

That combined with the need for short timeouts for strong locking operations like adding a FK leaves our lock retrier implementation in a middle ground where we hit lock_timeouts while waiting for long running transactions on the index builds. Perhaps an example setup will help.

Currently we have a constant lock retrier configured like, it's a targetted to avoiding blocking queries with strong locks while trying to allow time for index builds.

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 9,
  delay: 10.seconds,
  lock_timeout: 5.seconds
)

When adding an index with this config and there are some long running queries we're hitting PG::LockNotAvailable statement timeout errors.

The index builds fine with a retrier configuration like the following as it allows the lock to wait for any existing transaction that could modify or use the index to terminate.

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 4,
  delay: 30.seconds,
  lock_timeout: 3.minutes
)

However with that configuration we have problematic blocking operations when adding foreign keys on busy tables as the queries can pile up behind the strong locks that schema statement uses. Ideally when adding FKs with strong locks we'd use a retrier configuration like

config.lock_retrier = OnlineMigrations::ConstantLockRetrier.new(
  attempts: 60,
  delay: 5.seconds,
  lock_timeout: 0.5.seconds
)

So we have two distinct retrier configurations depending on the type of schema statement the migration is running. Ideally having an internal mechanism in this gem that detects the type of schema statement and applies a tailored lock retier mechanism based where it can, as developers can get it wrong and cause downtime selecting the wrong retrier configuration.

Hopefully this helps clarify the problem we're experiencing with the current system. I'd be happy to work on this implementation myself but I couldn't see a clean pathway to doing so with the current gem api.

fatkodima · 2025-01-28T11:15:40Z

Thanks for the explanation. The easiest solution I can currently think of is extending the implementation of lock retrier in the gem to also accept the current command and its arguments and allow people to do some logic depending on it.

Something like:

class MyLockRetrier < LockRetrier
  def lock_timeout(_attempt, command: nil, arguments: nil)
    if command == :add_index
      30
    else
      5
    end
  end
  # other methods
end

Will that work?

camallen · 2025-01-28T12:03:23Z

Yes that would work! Having the lock_timeout api extended to include the command and arguments would be ideal. I can reflect on the command internally and return the correct timeout.

Ideally the same API surface exists for the retries and delay. E.g. (re)building an index might be compute intensive with larger delays, similarly less lots of retries when adding FKs but less for adding an index.

Sadly, I couldn't quite see how to inject the command to the with_lock_retries method to do it myself.

fatkodima · 2025-01-28T12:06:24Z

Should be done somewhere here

online_migrations/lib/online_migrations/migration.rb

Lines 19 to 31 in 02a0445

    
           def method_missing(method, *args, &block) 
        
             if ar_schema? 
        
               super 
        
             elsif command_checker.check(method, *args, &block) 
        
               if in_transaction? 
        
                 super 
        
               elsif method == :with_lock_retries 
        
                 connection.with_lock_retries(*args, &block) 
        
               else 
        
                 connection.with_lock_retries { super } 
        
               end 
        
             end 
        
           end

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support different lock retry configurations for different schema_statements? #67

Support different lock retry configurations for different schema_statements? #67

camallen commented Jan 28, 2025 •

edited

Loading

fatkodima commented Jan 28, 2025

camallen commented Jan 28, 2025 •

edited

Loading

fatkodima commented Jan 28, 2025

camallen commented Jan 28, 2025 •

edited

Loading

fatkodima commented Jan 28, 2025

Support different lock retry configurations for different schema_statements? #67

Support different lock retry configurations for different schema_statements? #67

Comments

camallen commented Jan 28, 2025 • edited Loading

fatkodima commented Jan 28, 2025

camallen commented Jan 28, 2025 • edited Loading

fatkodima commented Jan 28, 2025

camallen commented Jan 28, 2025 • edited Loading

fatkodima commented Jan 28, 2025

camallen commented Jan 28, 2025 •

edited

Loading

camallen commented Jan 28, 2025 •

edited

Loading

camallen commented Jan 28, 2025 •

edited

Loading