Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rails transaction retry after rollback #258

Open
Piioo opened this issue Dec 9, 2022 · 12 comments · May be fixed by #364
Open

Rails transaction retry after rollback #258

Piioo opened this issue Dec 9, 2022 · 12 comments · May be fixed by #364

Comments

@Piioo
Copy link

Piioo commented Dec 9, 2022

Hallo,

I have not much informations, but maybe you see something I miss.

We have a transaction where we save 7 objects in different tables.
On our staging/production system sometimes the transaction commit breaks with an error.
The transaction is rollbacked and rails try to retry it.

Our problem is, that 2nd (retried) commit saves only 3 of 7 objects in the DB without any error.

def execute
    @admin = ResourceOwner::User.new(admin_params)
    @company = build_company
    ActiveRecord::Base.transaction do
      @company.save!
      @company.additional_info.save!
      @admin.save!
      ConfirmationToken::PasswordResetToken.prepare_token(@admin, expires_at: nil)
      @admin.add_role(:company_admin, company)
    end
  end

And here the errors we get from the Database:

2022-12-09T07:16:57.974550705+01:00 stdout F {"sql":"COMMIT","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"exception":["ActiveRecord::SerializationFailure","PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): \"sql txn\" meta={id=7caee3f1 key=/Table/72/1/820825936153870355 pri=0.00425340 epo=0 ts=1670566614.790254963,2 min=1670566602.063413353,0 seq=52} lock=true stat=PENDING rts=1670566602.063413353,0 wto=false gul=1670566602.563413353,0\nHINT:  See: https://www.cockroachlabs.com/docs/v21.1/transaction-retry-error-reference.html#retry_serializable\n"],"exception_object":"PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): \"sql txn\" meta={id=7caee3f1 key=/Table/72/1/820825936153870355 pri=0.00425340 epo=0 ts=1670566614.790254963,2 min=1670566602.063413353,0 seq=52} lock=true stat=PENDING rts=1670566602.063413353,0 wto=false gul=1670566602.563413353,0\nHINT:  See: https://www.cockroachlabs.com/docs/v21.1/transaction-retry-error-reference.html#retry_serializable\n","duration":45.32,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request","exception"],"@timestamp":"2022-12-09T06:16:57.973Z","@version":"1"}

2022-12-09T07:16:57.98031866+01:00 stdout F {"sql":"ROLLBACK","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"exception":["ActiveRecord::StatementInvalid","PG::InternalError: ERROR:  there is no transaction in progress\n"],"exception_object":"PG::InternalError: ERROR:  there is no transaction in progress\n","duration":5.12,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request","exception"],"@timestamp":"2022-12-09T06:16:57.979Z","@version":"1"}

2022-12-09T07:16:58.216634327+01:00 stdout F {"sql":"BEGIN","name":"TRANSACTION","binds":[],"type_casted_binds":[],"statement_name":null,"duration":4.69,"request_id":"25d5d2166a7737339f1bd0c0f73fd54e","source":"100.65.49.24","tags":["request"],"@timestamp":"2022-12-09T06:16:58.215Z","@version":"1"}

After this rails try to retry begins a new transaction (last log message) and saves only token and role. There are no insert queries for admin or company.

Maybe the others has already an id (because token and role needs the id from the admin) and for this reason they are not saved.
Or company and admin are in cache and token and role not.
Token and Role may be only created when the admin.id is present.

Im trying to write a rspec test, but fail to trigger a retry transaction.

Have you a clue what we may do?

@Piioo
Copy link
Author

Piioo commented Jan 26, 2023

The Problem is, that some of the objects keep there IDs and persisted status, so that Rails thinks in the retry, I do not have to save them, because they are already persisted

@Piioo
Copy link
Author

Piioo commented Jan 26, 2023

I think the bug comes form here:
https://github.com/cockroachdb/activerecord-cockroachdb-adapter/blob/master/lib/active_record/connection_adapters/cockroachdb/transaction_manager.rb#L13-L21

This code catches the ActiveRecord::StatementInvalid error before Rails may do this in https://github.com/rails/rails/blob/v7.0.4.2/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L322-L329

And now the transaction.state.invalidate! and rollback_transaction is not running, so rails do not try to save the objects again.

(Right now its not possible for me to create a test, so I cant fix this :/ )

@shfung
Copy link

shfung commented Jul 26, 2024

I think we have encountered the same problem as well, where a db transaction is being retried but the record memory status is not being reset, hence the record is not saved, but the after_save callback was fired.

We think that this problem can be reproduced fairly reliabliy using the following steps:

  1. run the following setup script in a console and don't exit out of the console after that
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Base.configurations = {
  "cockroachdb" => {
    adapter: "cockroachdb",
    host: "localhost",
    port: 26256,
    user: "root",
    database: "crdb_rollback_test",
  }
}
ActiveRecord::Base.establish_connection(:cockroachdb)

ActiveRecord::Schema.define do
  create_table :customers, force: true do |t|
    t.string :name
  end

  create_table :customer_data, force: true do |t|
    t.belongs_to :customer, null: false, foreign_key: true
    t.string :city
    t.datetime :deleted_at
  end
end

class CockroachDbApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :cockroachdb, reading: :cockroachdb }
end

class Customer < CockroachDbApplicationRecord
end

class CustomerData < CockroachDbApplicationRecord
  belongs_to :customer
  after_save :mark_as_current

  # we want to set all other CustomerData to have deleted_at
  def mark_as_current
    if deleted_at.blank?
      ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
      puts 'mark_as_current sleeping'
      sleep 5
      puts 'mark_as_current sleeping end'
      CustomerData.where(id: ids).update_all(deleted_at: Time.current)
    end
  end
end

customer = Customer.create!(name: 'Name')
CustomerData.create!(customer_id: customer.id, city: 'SF')
customer_data = CustomerData.new(customer_id: customer.id, city: 'NYC')
  1. Open a separate console and run the following script to connect to the same crdb database, and and don't exit out of the console after that either
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Base.configurations = {
  "cockroachdb" => {
    adapter: "cockroachdb",
    host: "localhost",
    port: 26256,
    user: "root",
    database: "crdb_rollback_test",
  }
}
ActiveRecord::Base.establish_connection(:cockroachdb)

class Customer < CockroachDbApplicationRecord
end

class CustomerData < CockroachDbApplicationRecord
  belongs_to :customer
  after_save :mark_as_current

  def mark_as_current
    if deleted_at.blank?
      ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
      CustomerData.where(id: ids).update_all(deleted_at: Time.current)
    end
  end
end

customer = Customer.first
customer_data = CustomerData.new(customer_id: customer.id, city: 'NYC')
  1. Go back to the first console, continue to run the following script
customer_data.save!
  1. Once you see the 'mark_as_current sleeping' display in the first console, immediately run the following in the 2nd console
customer_data.save!

In the 2nd console, you should see the Transaction being committed without any problem
But in the 1st console with the sleep, you should see the Transaction being committed but after that being rolled back, then only the two SQL statements from the mark_as_current callback are displayed without any additional "CustomerData Create"
If you now run

CustomerData.where(customer_id: customer.id, deleted_at: nil)

in either console, it will return an empty array, which is the wrong behavior.

@BuonOmo Can you help look into it with the above steps to reproduce? Thanks

@BuonOmo
Copy link
Collaborator

BuonOmo commented Jul 29, 2024

@shfung I reproduce the behaviour you are saying (empty array). I'll make it simpler and in a single script to execute in tests. However, it feels like the behaviour is working quite normally. You are not taking advantage of any transaction, I understand how this could fail using any kind of database.

So two things:

  • I'll try to see if there is an implementation issue in this library (or crdb)
  • I'll show you below an implementation example I would do

Async Design

run an after_commit script that triggers an async job (I recommend Sidekiq). This job uses a lock by customer_id (redis redlock for instance) runs something like:

def perform(customer_id, cd_id) # should be wrapped in a transaction
  Locker.lock("c#{customer_id}-cd#{cd_id}) do # should raise to re-enqueue job if locked
    ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: cd_id).pluck(:id)
    CustomerData.where(id: ids).update_all(deleted_at: Time.current)
    CustomerData.where(id: cd_id).update_all(deleted_at: nil)
  end
end 

Note that I don't fully understand your use case, as I can see that you can only delete data. And my script is kind of un-deleting. Another solution could be something like:

cd = CustomerData.find(cd_id)
# Only delete older ones.
CustomerData.where(customer_id: customer_id).where("created_at < ?", cd.created_at).update_all(deleted_at: Time.current)

Or just not have this column and use the latest customer data as the valid one...

Simple design

Use an update lock around the customer (depending on your usual queries, this can lead to overhead):

    def mark_as_current
      if deleted_at.blank?
        Customer.find(customer_id).with_lock do
          ids = CustomerData.where(customer_id: customer_id, deleted_at: nil).where.not(id: id).pluck(:id)
          CustomerData.where(id: ids).update_all(deleted_at: Time.current)
        end
      end
    end

Note that the first data created will get the priority in that case. And your could consider a FOR UPDATE NOWAIT lock to avoid overhead and raise directly.


@Piioo I did not see your message, I think it is quite important, and related to the bug I am currently trying to solve in #333. Thank you for the links, I'll have a look at this while finishing my PR. And we'll see if this fixes everything. Basically, the client retry logic is messing with rails internal logic

@shfung
Copy link

shfung commented Jul 29, 2024

@BuonOmo Thanks for the suggestions. We notice some places for improvement in our own implementation that we made changes to, but we still believe that there is a underlying retry issue caused by the gem.

Can you elaborate more on what you meant by You are not taking advantage of any transaction, I understand how this could fail using any kind of database?

The after_save callback should be executed in the same db transaction as the save!, that's why when the db transaction is being retried, the after_save callback is being retried as well. We tried the same script under MySQL setup with SERIALIZABLE isolation level, and the race condition didn't happen.

Looking into the code, it seems that the db transaction errored out when it's being committed here in the ensure block of the within_new_transaction method of the TransactionManager object: https://github.com/rails/rails/blob/6f0d1ad14b92b9f5906e44740fce8b4f1c7075dc/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L565
When it failes, it raises an

ActiveRecord::SerializationFailure:"PG::TRSerializationFailure: ERROR:  restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE  - failed preemptive refresh due to conflicting locks on...

exception that is caught by the rescue Exception underneath it: https://github.com/rails/rails/blob/6f0d1ad14b92b9f5906e44740fce8b4f1c7075dc/activerecord/lib/active_record/connection_adapters/abstract/transaction.rb#L569-L572 and it will try to rollback the transaction. But this PG::TRSerializationFailure: ERROR is a special error that indicates the db transaction failed and should not be interacted with anymore. So, when the code tries to rollback the transaction, it will get another error:

PG::NoActiveSqlTransaction: ERROR:  there is no transaction in progress

This exception got raised to the patch by the this gem, and triggered the retry:

rescue ActiveRecord::StatementInvalid => error
raise unless retryable? error
raise if attempts >= @connection.max_transaction_retries
sleep_rand_seconds(attempts)
within_new_transaction(isolation: isolation, joinable: joinable, attempts: attempts + 1) { yield }
end

Unfortunately, because the rollback failed, ActiveRecord actually didn't rollback the status of the db record in memory. So, when the db transaction is being retried, ActiveRecord thinks that the CustomerData was saved already and didn't insert it into the db, but it still execute the after save callback to delete the other CustomerData rows.

@BuonOmo
Copy link
Collaborator

BuonOmo commented Aug 2, 2024

There is a underlying retry issue caused by the gem.

That's for sure, that is the main thing I'm investigating on these days! But I am not sure resolving this issue will fully solve your design problems... These are tricky race condition issues that you'd need to test precisely !

You are not taking advantage of any transaction, I understand how this could fail using any kind of database

Well if there is a wrapping transaction my comment is useless. The reproduction code you send wasn't using any, hence my comment. Maybe your codebase does wrap every http call in a transaction (default behaviour for rails if I remember correctly).

@arolling-sa
Copy link

We also encountered this bug while testing our application under load. It occurred when we made several concurrent requests to the same Create API (with different values); all the requests returned a success response with an ID for the new record, but only one or two records would actually be committed to the database. After a great deal of debugging, we came to exactly the same conclusion as @shfung did in their second comment: when Rails attempts to rollback the transaction, a new exception is raised and the state of the in-memory record is not reset. According to our testing, this can affect any model with a transactional callback that performs a query on the same table.

TLDR: When a commit fails due a PG::TRSerializationFailure error, attempting to rollback the transaction raises a PG::NoActiveSqlTransaction error and Rails does not rollback the in-memory records. This can be fixed via a new monkeypatch that catches this specific exception and ensures transaction.rollback_records is called before retrying. See below for details, links, and a single file reproduction script.

Clarifications:

  • The theory proposed by the initial reporter is not quite correct; the CRDB within_new_transaction monkeypatch is indeed catching the errors instead of the Rails rescue block, but due to Rails' ensure the transaction is still committed and an attempt is made to roll it back. That call to rollback_transaction then performs transaction.rollback, raises the PG::NoActiveSqlTransaction error as detailed by @shfung, and never reaches transaction.rollback_records which resets the record state.
  • The reply from @BuonOmo on July 29th states that You are not taking advantage of any transaction; on August 2nd this was restated as The reproduction code you send wasn't using any, hence my comment. Maybe your codebase does wrap every http call in a transaction (default behaviour for rails if I remember correctly).. This assumption is incorrect, as the example code uses an after_save callback and the whole callback chain is wrapped in a transaction. The only exceptions are the two transactional callbacks, after_commit and after_rollback.

Sequence of events, using the CustomerData example

  1. The call to customer_data.save! in the first console (CD1) starts the save transaction with an INSERT on the customer_data table, returning the new record's ID.
  2. This triggers the after_save callback mark_as_current, which queries the same table then starts a sleep.
  3. Meanwhile the second console runs customer_data.save! with a different new record (CD2), which starts a transaction and calls INSERT on the same table, returning a different new ID.
  4. It continues to the after_save callback, which queries the table, finds no matching (committed) records, and completes by committing CD2.
  5. CD1 finishes its sleep and resumes mark_as_current, then tries to commit the new record. This triggers a PG::TRSerializationFailure because the table is now dirty - a new record has been added by a different process and there are conflicting locks.
  6. Rails catches this error and calls rollback_transaction, which is supposed to first rollback the DB transaction and then rollback the state of the in-memory record. However, in this case there is no need to rollback the DB transaction and it raises a PG::NoActiveSqlTransaction exception, so the transaction.rollback_records step gets skipped.
  7. This exception is caught by the CRDB TransactionManager monkeypatch, which retries the transaction.
  8. ActiveRecord believes that the in-memory record for CD1 has been inserted, so it doesn't try it again, but it still runs the after_save callback which queries the table and updates the deleted_at for the CD2 record, then commits the transaction.
  9. Final result: CD1 has an ID but was never committed; CD2 has an ID and was committed then marked as deleted.

Reproduction Script

@mmalek-sa and I found that this bug can be trivially reproduced using a model validation and concurrent processes, without any need to explicitly sleep. Since it is essentially a race condition, one random record will commit properly while the others will report a successful save but silently fail to commit.

# frozen_string_literal: true

Gem.paths = { "GEM_HOME" => Gem.user_dir }
require 'bundler/inline'

gemfile(true) do
  source 'https://rubygems.org'

  gem 'activerecord', "7.1.4"

  gem 'activerecord-cockroachdb-adapter', "~> 7.1.X"

  gem 'securerandom'
end

require 'active_record'
require 'activerecord-cockroachdb-adapter'
require 'minitest/autorun'
require 'logger'
require 'securerandom'

# You might need to change the database name or connection string.
ActiveRecord::Base.establish_connection('cockroachdb://root@localhost:26255/testdb?sslmode=disable')
ActiveRecord::Base.logger = Logger.new(STDOUT)

ActiveRecord::Schema.define do
  create_table :customers, force: true do |t|
    t.string :name
    t.string :username
  end
end

class Customer < ActiveRecord::Base
  validate :validate_unique_username

  def validate_unique_username
    duplicate = Customer.where(username: username).any?
    errors.add("Duplicate username!") if duplicate
  end

  def to_s
    "#{id}:#{name}:#{username}"
  end
end

# This monkey patch fixes the bug.
# ActiveRecord::ConnectionAdapters::TransactionManager.class_eval do
#   def rollback_transaction(transaction = nil)
#     @connection.lock.synchronize do
#       transaction ||= @stack.last
#       begin
#         transaction.rollback
#       rescue ActiveRecord::StatementInvalid => err 
#         # This is important to make ActiveRecord aware the record was not inserted/saved
#         # Otherwise AR will assume save was successful and it doesn't retry the transaction
#         # See this for more details:
#         # https://github.com/cockroachdb/activerecord-cockroachdb-adapter/issues/258#issuecomment-2256633329
#         if err.cause.is_a?(PG::NoActiveSqlTransaction)
#           puts "rollback_transaction exception PG::NoActiveSqlTransaction, rolling back records #{transaction.records}"
#           transaction.rollback_records
#         end
#
#         raise
#       ensure
#         @stack.pop if @stack.last == transaction
#       end
#       transaction.rollback_records
#     end
#   end
# end

class BugTest < ActiveSupport::TestCase
  def test_concurrent_customer_insert_with_processes
    concurrency_level = 3
    time_to_start = 2.seconds.from_now
    pids = []

    concurrency_level.times do |i|
      pids << Process.fork do
        sleep_time = (time_to_start - Time.current).to_f
        sleep(sleep_time) if sleep_time > 0

        begin
          c = Customer.new
          c.name = "Customer #{i + 1}"
          c.username = SecureRandom.uuid
          c.save!
          puts "Inserted customer with id #{c.id} successfully!"
        rescue => e
          puts "Process #{Process.pid} - Error: #{e.message}"
        end
      end
    end

    pids.each { |pid| Process.wait(pid) }

    puts "Customers #{Customer.all.map { |c| c.to_s }}"
    assert_equal concurrency_level, Customer.count
  end
end

Proposed solution

  • Essentially, all of this is the expected (and desired) behavior except the way it skips the transaction.rollback_records step that updates the state of the in-memory ActiveRecord record. Rather than a retry logic error, this is a state logic error.
  • Monkeypatching the rollback_transaction method to catch this specific exception and rolling back the records before re-raising appears to fix the problem.
  • NOTE: this code is for Rails 7.1.x and later; the relevant method is somewhat different in Rails 7.0.x and earlier.
ActiveRecord::ConnectionAdapters::TransactionManager.class_eval do
  def rollback_transaction(transaction = nil)
    @connection.lock.synchronize do
      transaction ||= @stack.last
      begin
        transaction.rollback
      rescue ActiveRecord::StatementInvalid => err 
        # This is important to make Active Record aware the record was not inserted/saved
        # Otherwise Active Record will assume save was successful and it doesn't retry the transaction
        # See this thread for more details:
        # https://github.com/cockroachdb/activerecord-cockroachdb-adapter/issues/258#issuecomment-2256633329
        transaction.rollback_records if err.cause.is_a?(PG::NoActiveSqlTransaction)

        raise
      ensure
        @stack.pop if @stack.last == transaction
      end
      transaction.rollback_records
    end
  end
end

@BuonOmo
Copy link
Collaborator

BuonOmo commented Jan 30, 2025

@arolling-sa thank you for the detailed response. Unfortunately, your reproduction is giving me even worse locally: a segv in the pg gem. I'll be on both issues fast. I believe the segv could be only related to my laptop by a mac m1.

I'll try to do both things in parallel:

  • fix the segv locally
  • create a new PR based on your reproduction to see if we can reproduce the bug in CI.

BuonOmo added a commit that referenced this issue Jan 30, 2025
@BuonOmo BuonOmo linked a pull request Jan 30, 2025 that will close this issue
BuonOmo added a commit that referenced this issue Jan 30, 2025
BuonOmo added a commit that referenced this issue Jan 30, 2025
BuonOmo added a commit that referenced this issue Jan 30, 2025
BuonOmo added a commit that referenced this issue Jan 30, 2025
BuonOmo added a commit that referenced this issue Jan 30, 2025
@arolling-sa
Copy link

@BuonOmo That's odd, I was testing this on a MacBook M1 as well (I believe @mmalek-sa uses a Macbook M2) and neither of us saw a segv; it could be something specific to your environment. Also, we were using version 7.1.1 of the activerecord-cockroachdb-adapter and version 1.5.8 of pg.

@BuonOmo
Copy link
Collaborator

BuonOmo commented Jan 30, 2025

@arolling-sa I'm using 1.5.9. But the segv is also present in 1.5.9. This could also have to do with postgres version (Using pg_config --version gives me PostgreSQL 14.15 (Homebrew)). Or maybe cockroach version (I doubt it, but I have locally v24.3.2)

EDIT: found the segv culprit ged/ruby-pg#311 (comment), local solution is PGGSSENCMODE=disable

BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
@BuonOmo
Copy link
Collaborator

BuonOmo commented Jan 31, 2025

I implemented the test and solution. But I want to check one last thing before we push. This seems to only fail in crdb 24.1 and I want to understand why. Can you confirm you are using this version @arolling-sa ?

Ok it might not be that it fails only in 24.1 but that the test does not work aaall the time. I'll figure out a way to make it more robust. And in 23.2 there's no issue

BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
BuonOmo added a commit that referenced this issue Jan 31, 2025
@arolling-sa
Copy link

@BuonOmo - I was using 23.1.25 when I reproduced it, actually, and I believe my colleague tested against multiple CRDB versions. And the initial report that started this thread was from December 2022 - they don't mention the version, but according to the release page the latest possible version they could have been using was 22.2.0. As this is essentially a race condition, it may be affected by local resources and configuration. Have you tried increasing the concurrency level to make it more likely to fail?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants