Fix retries and failover #320

dawid-sklodowski · 2014-09-22T14:54:13Z

This pull request fixes the failover and retry mechanisms.

Changes in detail:

Refactoring: move the with_retry method to Cluster -- it belongs there because it operates on the cluster.
Introduce retries on write operations -- this makes sense because:
- Update is idempotent
- Delete is idempotent (it deletes the rows that match the query)
- Insert -- in the worst case we could end up with duplicated data. However, moped is used by mongoid, which always inserts rows with _id already present, so a duplicated insert will raise a unique index violation on _id, which is fine.
Fixes the failover mechanism -- Node#flush was using ensure_connected, which involves failover. However, the processing of database messages after executing operations (and raising errors based on them) happened outside the ensure_connected block, so the failover mechanism wasn't exercised in most of the cases it was meant for.
Removes the Reconfigure failover mechanism -- it was raising new exceptions but not retrying. It should be good enough to just retry.
Refactoring: move the recognition mechanism for some errors from the Errors class to the Reply class, so that error recognition is in one place.
Fixes the refresh mechanism -- if a node was successfully refreshed, it isn't down any more.
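
To illustrate the failover fix described above, here is a minimal sketch in Ruby. It is not Moped's actual code: SketchNode, ReplyError, send_operations, check!, flush_before and flush_after are hypothetical stand-ins, and only the name ensure_connected is borrowed from the description. It shows why an error raised while processing the reply has to be raised inside the guarded block for failover to see it.

```ruby
# Hypothetical stand-ins; none of this is taken from Moped's source.
class ReplyError < StandardError; end

class SketchNode
  attr_reader :failovers

  def initialize
    @failovers = 0
  end

  # Runs the block; a ReplyError raised inside it triggers the
  # (simulated) failover handling instead of reaching the caller.
  def ensure_connected
    yield
  rescue ReplyError
    @failovers += 1
    :failed_over
  end

  # Simulated wire call: the server answers, but with an error document.
  def send_operations
    { "err" => "not master" }
  end

  # Turns an error document into an exception.
  def check!(reply)
    raise ReplyError, reply["err"] if reply["err"]
    reply
  end

  # Before: the reply is checked outside the guarded block,
  # so the error escapes and failover never runs.
  def flush_before
    reply = ensure_connected { send_operations }
    check!(reply)
  end

  # After: the reply is checked inside the guarded block,
  # so the failover handling sees the error.
  def flush_after
    ensure_connected { check!(send_operations) }
  end
end
```

With this sketch, flush_before raises ReplyError to the caller without touching the failover counter, while flush_after routes the same error through the failover handling.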

The outcome of these changes is that you can kill / restart mongo replica-set nodes in whatever order and as often as you like. You can even stop all of them for a couple of seconds (how long is governed by retry_count and retry_interval) and the application will be able to recover without losing any operations or throwing errors.
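
As a rough illustration of how retry_count and retry_interval bound the outage an application can ride out, here is a minimal sketch of a retry loop. It is an assumption-laden stand-in, not the with_retry implementation from this pull request: SketchCluster, ConnectionFailure and the refresh counter are hypothetical, and only the two option names come from the description above.

```ruby
# Hypothetical error class standing in for connection-level failures.
class ConnectionFailure < StandardError; end

# Minimal cluster-like object; only the option names retry_count and
# retry_interval are taken from the pull request text.
class SketchCluster
  attr_reader :refreshes

  def initialize(retry_count:, retry_interval:)
    @retry_count = retry_count
    @retry_interval = retry_interval
    @refreshes = 0
  end

  # Stand-in for rediscovering which nodes are up.
  def refresh
    @refreshes += 1
  end

  # Runs the block, retrying up to retry_count times after a
  # ConnectionFailure and waiting retry_interval seconds between attempts.
  def with_retry
    attempts_left = @retry_count
    begin
      yield
    rescue ConnectionFailure
      raise if attempts_left <= 0
      attempts_left -= 1
      sleep(@retry_interval)
      refresh
      retry
    end
  end
end
```

In this sketch the longest outage that can be absorbed is roughly retry_count * retry_interval seconds; once the attempts are used up, the original error is re-raised to the caller.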

matsimitsu · 2014-09-23T12:21:52Z

Pushed this to our staging and it seems to work great with authentication failures / stepdowns etc. (and SSL enabled)!

durran · 2014-09-23T14:51:06Z

Looks good to me. @arthurnn What do you think?

zarqman · 2014-10-01T22:24:42Z

+1
Would really like to see one of the PRs that addresses failover pulled soon.

matsimitsu · 2014-10-04T10:47:11Z

Found one more issue: if you have a replica set and you want to re-sync a node (because of disk usage) while the node is in the STARTUP2 state, the connection will fail with the following error:

2014-10-04T10:43:49.441Z 9291 TID-oulq07iok WARN: The operation: #<Moped::Protocol::Commands::Authenticate
  @length=167
  @request_id=54119
  @response_to=0
  @op_code=2004
  @flags=[]
  @full_collection_name="production.$cmd"
  @skip=0
  @limit=-1
  @selector={:authenticate=>1, :user=>"xx", :nonce=>"xx", :key=>"xx"}
  @fields=nil>
failed with error 18: "auth failed"

See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.

Steps taken:

shut down mongodb on a node in a replica set
remove the mongodb data files
start mongodb
mongodb will now re-sync the data from another node while in the STARTUP2 state

It will keep retrying to authenticate against this node, causing constant failures.

rakusai · 2015-01-08T06:50:14Z

+1 this seems to fix issue #268, too

jperichon · 2015-01-21T01:28:41Z

+1 this works for me. Anybody using it in production?

zarqman · 2015-01-21T04:29:03Z

@jperichon, we've been using it successfully in production for 3+ months. We added a couple of patches on top of it to fix up things it missed. Haven't seen any problems with the included commits though--they've been great.

dawid-sklodowski added 12 commits September 4, 2014 16:37

Refactoring: move Selectable#with_retry to Cluster#with_retry

b1079af

Retry on writes

b72c097

Failover working

439ea78

Cleanup

8edebd7

Cleanup + fixing specs

57ce75c

Fixed read specs

977d776

Refreshed node is not down!

a11bb3a

Fix node spec

455a69d

Dont change Ruby version

cc79920

Cluster#with_retry using Failover::STRATEGIES

e41100f

Fixing Typo

a81effe

Dont do unless .. else ..

827bedc

dawid-sklodowski mentioned this pull request Sep 22, 2014

Failover authentication fixes #311

Closed

Additional specs

2be410c

dawid-sklodowski force-pushed the fix-retries-and-failover branch from cade947 to 2be410c September 22, 2014 15:23

dawid-sklodowski mentioned this pull request Sep 22, 2014

2.0.0 fix retries operations #315

Closed

try to do tests more stable

edd9eed

Revert "try to do tests more stable"

4c2a0a2

This reverts commit edd9eed.

zarqman mentioned this pull request Oct 4, 2014

Fix 16550 "not authorized", TypeError, and others #324

Merged
