This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Fix retries and failover #320

Open
wants to merge 15 commits into
base: master

Conversation

dawid-sklodowski

This pull request fixes the failover and retry mechanisms.

Changes in detail:

  • Refactoring: move the with_retry method to Cluster -- it belongs there because it operates on the cluster.
  • Introduce retries on write operations -- this makes sense because:
    • Update is idempotent
    • Delete is idempotent (it deletes the rows matching the query)
    • Insert -- in the worst case we could end up with duplicated data. However, moped is used by mongoid, which always inserts rows with _id already present, so a duplicated insert raises a unique index violation on _id, which is fine.
  • Fix the failover mechanism -- Node#flush was using ensure_connected, which handles failover, but the processing of database replies after executing operations (and the raising of errors based on them) happened outside the ensure_connected block, so the failover mechanism wasn't exercised in most of the cases it was meant for.
  • Remove the Reconfigure failover mechanism -- it raised new exceptions but did not retry. Simply retrying should be good enough.
  • Refactoring: move the recognition of some errors from the Errors class to the Reply class, so that error recognition lives in one place.
  • Fix the refresh mechanism -- a node that was successfully refreshed is no longer considered down.
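
To illustrate the Node#flush point above, here is a minimal, self-contained sketch. It is not Moped's code: `SketchNode`, `flush_before`, `flush_after`, `check!` and the two error classes are hypothetical stand-ins, and the "server replies" are canned hashes. It only shows why an error raised while processing replies must happen inside the `ensure_connected` block for failover handling to see it.

```ruby
# Hypothetical stand-ins for illustration; not Moped's actual classes.
class ConnectionFailure < StandardError; end
class NotMasterError < StandardError; end

class SketchNode
  attr_reader :failovers

  def initialize(replies)
    @replies = replies # canned replies the fake "server" returns
    @failovers = 0     # how many times failover handling was triggered
  end

  # Runs the block; failover handling only sees errors raised inside it.
  def ensure_connected
    yield
  rescue ConnectionFailure, NotMasterError
    @failovers += 1
    raise
  end

  # Before: replies are fetched inside the block but checked outside it,
  # so an error derived from a reply bypasses failover handling.
  def flush_before
    replies = ensure_connected { @replies }
    replies.each { |reply| check!(reply) }
  end

  # After: reply processing happens inside the guarded block.
  def flush_after
    ensure_connected do
      @replies.each { |reply| check!(reply) }
    end
  end

  private

  # Raises when a reply reports that the node is no longer primary.
  def check!(reply)
    raise NotMasterError, reply["err"] if reply["err"] == "not master"
  end
end
```

With a reply of `{ "err" => "not master" }`, both methods raise `NotMasterError`, but only `flush_after` increments `failovers`.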

The outcome of these changes is that you can kill or restart mongo replica-set nodes in whatever order and as often as you like. You can even stop all of them for a couple of seconds (how long is governed by retry_count and retry_interval) and the application will recover without losing any operations or throwing errors.
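
As a rough sketch of how retry_count and retry_interval could bound the recovery window, assuming a retry loop on the cluster object. `SketchCluster`, its `refresh` method and `ConnectionFailure` are hypothetical names used for illustration; the real with_retry in this pull request may differ.

```ruby
# Hypothetical stand-ins for illustration; not Moped's actual classes.
class ConnectionFailure < StandardError; end

class SketchCluster
  attr_reader :refreshes

  def initialize(retry_count:, retry_interval:)
    @retry_count = retry_count       # retries allowed after the first attempt
    @retry_interval = retry_interval # seconds to wait between attempts
    @refreshes = 0
  end

  # Runs the block, retrying on connection failures until retries run out.
  def with_retry
    attempts_left = @retry_count
    begin
      yield
    rescue ConnectionFailure
      raise if attempts_left <= 0
      attempts_left -= 1
      sleep(@retry_interval)
      refresh # give the cluster a chance to discover a new primary
      retry
    end
  end

  # Placeholder for re-discovering the replica-set topology.
  def refresh
    @refreshes += 1
  end
end
```

Under these assumptions the application tolerates an outage of roughly retry_count * retry_interval seconds before the error reaches the caller.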

@matsimitsu

Pushed this to our staging and it seems to work great with authentication failures / stepdowns etc. (and SSL enabled)!

@durran
Member

durran commented Sep 23, 2014

Looks good to me. @arthurnn What do you think?

@zarqman
Contributor

zarqman commented Oct 1, 2014

+1
Would really like to see one of the PRs that addresses failover pulled soon.

@matsimitsu

Found one more issue: if you have a replica set and want to re-sync a node (for example because of disk usage), then while that node is in the STARTUP2 state the connection fails with the following error:

2014-10-04T10:43:49.441Z 9291 TID-oulq07iok WARN: The operation: #<Moped::Protocol::Commands::Authenticate
  @length=167
  @request_id=54119
  @response_to=0
  @op_code=2004
  @flags=[]
  @full_collection_name="production.$cmd"
  @skip=0
  @limit=-1
  @selector={:authenticate=>1, :user=>"xx", :nonce=>"xx", :key=>"xx"}
  @fields=nil>
failed with error 18: "auth failed"

See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.

Steps taken:

  • shut down mongodb on a node in a replica set
  • remove the mongodb data files
  • start mongodb
  • mongodb will now re-sync the data from another node while in the STARTUP2 state

It will keep on retrying to authenticate on this node causing constant failures.
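
One possible guard, sketched here only as an illustration and not part of this pull request: skip nodes that are neither primary nor secondary before attempting authentication. The helper name `usable_member?` is hypothetical, and the hashes imitate the `ismaster` and `secondary` fields of an isMaster command reply. The assumption is that a node still in initial sync reports both as false.

```ruby
# Hypothetical helper; the reply hash imitates fields of an isMaster response.
def usable_member?(is_master_reply)
  is_master_reply["ismaster"] == true || is_master_reply["secondary"] == true
end

# Assumed replies for illustration only.
primary_reply  = { "ismaster" => true,  "secondary" => false }
startup2_reply = { "ismaster" => false, "secondary" => false }

candidates = { "node-a" => primary_reply, "node-b" => startup2_reply }

# Only nodes that pass the check would be authenticated against.
usable = candidates.select { |_name, reply| usable_member?(reply) }.keys
```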

@rakusai

rakusai commented Jan 8, 2015

+1, this seems to fix issue #268, too.

@jperichon

+1 this works for me. Anybody using it in production?

@zarqman
Contributor

zarqman commented Jan 21, 2015

@jperichon, we've been using it successfully in production for 3+ months. We added a couple of patches on top of it to fix up things it missed. Haven't seen any problems with the included commits though--they've been great.

7 participants