There is a race condition in riak_repl_keylist_server:bloom_fold during fullsync which can throw a function_clause error, but the fullsync manager treats this as a normal finish of the partition replication. As a result, a partition can be silently skipped and the user has no way to notice that not all keys were replicated to the sink cluster.
The Cause
In keylist fullsync, bloom_fold, the fold function running on a vnode worker, waits for a resume_pause message after sending a batch to the sink node. Somehow another fold message was received by this waiting worker (see the last message in crash.log). The other receive branch then fires, and the ?TRACE macro returns the atom ok, which becomes the new accumulator and causes the vnode worker to crash on the next bloom_fold call.
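The mechanism can be sketched as follows. This is a minimal illustration, not the actual riak_repl source: the module name and the maybe_pause/1 helper are hypothetical, the resume message name is taken from the pending bloom_resume message visible in the crash report below, and the ?TRACE define stands in for riak_repl's macro, which evaluates to the atom ok when tracing is compiled out.

```erlang
%% Minimal sketch of the failure pattern; not the actual riak_repl source.
%% The module name and maybe_pause/1 helper are hypothetical. ?TRACE stands
%% in for riak_repl's macro, which evaluates to the atom `ok` when tracing
%% is compiled out.
-module(bloom_fold_sketch).
-export([bloom_fold/3]).

-define(TRACE(_Stmt), ok).

%% Fold callback invoked by the backend for every {Bucket, Key} / Value pair.
bloom_fold({_Bucket, _Key}, _Value, Acc) when is_tuple(Acc) ->
    %% ...hash the key into the bloom filter, batch objects to the sink...
    maybe_pause(Acc).

%% After sending a batch, the worker blocks until the keylist server
%% allows it to continue.
maybe_pause(Acc) ->
    receive
        bloom_resume ->
            Acc;                                  %% intended path: accumulator preserved
        _Other ->
            ?TRACE({unexpected_message, _Other})  %% this branch evaluates to `ok`,
                                                  %% which becomes the next accumulator
    end.
```

The next invocation of bloom_fold/3 by the backend fold then receives ok as its accumulator, no clause matches, and the worker exits with the function_clause error shown in error.log and crash.log below.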
Reproduction Steps
No reliable reproduction steps have been found.
Occurrence Frequency
This has been observed occasionally by a customer; for them it appears to happen at random.
error.log
2016-04-07 23:22:01.094 [error] <0.9942.66> gen_server <0.9942.66> terminated with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675
2016-04-07 23:22:01.111 [error] <0.9942.66> CRASH REPORT Process <0.9942.66> with 0 neighbours exited with reason: no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in gen_server:terminate/6 line 744
2016-04-07 23:22:01.111 [error] <0.1131.0> Supervisor {<0.1131.0>,poolboy_sup} had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.9942.66> exit with reason no function clause matching riak_repl_keylist_server:bloom_fold({<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<121,107,250,67,143,8,73,...>>}, <<53,1,0,0,0,34,131,108,0,0,0,1,104,2,109,0,0,0,8,239,219,125,147,226,32,130,220,104,2,97,1,110,...>>, ok) line 675 in context child_terminated
crash.log
Binary data in the report below was replaced with <<"omitted binary">>.
2016-04-13 10:40:31 =ERROR REPORT====
** Generic server <0.916.0> terminating
** Last message in was {'$gen_cast',{work,{fold,#Fun<riak_cs_kv_multi_backend.9.110104299>,#Fun<riak_kv_vnode.35.88487897>},{raw,#Ref<0.0.4.162629>,<0.22168.4>},<0.884.0>}}
** When Server state == {state,riak_kv_worker,{state,1118962191081472546749696200048404186924073353216}}
** Reason for termination ==
** {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
2016-04-13 10:40:31 =CRASH REPORT====
crasher:
initial call: riak_core_vnode_worker:init/1
pid: <0.916.0>
registered_name: []
exception exit: {{function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,744}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ancestors: [<0.887.0>,<0.885.0>,<0.884.0>,<0.680.0>,riak_core_vnode_sup,riak_core_sup,<0.220.0>]
messages: [bloom_resume]
links: [<0.887.0>,<0.885.0>]
dictionary: [{bitcask_file_mod,bitcask_file},{bitcask_time_fudge,no_testing}]
trap_exit: false
status: running
heap_size: 6772
stack_size: 27
reductions: 20747615
neighbours:
2016-04-13 10:40:31 =SUPERVISOR REPORT====
Supervisor: {<0.887.0>,poolboy_sup}
Context: child_terminated
Reason: {function_clause,[{riak_repl_keylist_server,bloom_fold,[{<<48,98,58,16,121,107,99,149,64,231,6,8,234,204,240,88,62,111,225>>,<<35,217,96,165,236,160,69,118,179,100,39,110,92,174,247,160,0,0,0,0>>},<<"omitted binary">>,ok],[{file,"src/riak_repl_keylist_server.erl"},{line,675}]},{bitcask_fileops,fold_int_loop,5,[{file,"src/bitcask_fileops.erl"},{line,554}]},{bitcask_fileops,fold_file_loop,8,[{file,"src/bitcask_fileops.erl"},{line,720}]},{bitcask_fileops,fold,3,[{file,"src/bitcask_fileops.erl"},{line,391}]},{bitcask,subfold,3,[{file,"src/bitcask.erl"},{line,506}]},{bitcask_nifs,keydir_frozen,4,[{file,"src/bitcask_nifs.erl"},{line,304}]},{riak_kv_bitcask_backend,'-fold_objects/4-fun-0-',5,[{file,"src/riak_kv_bitcask_backend.erl"},{line,351}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]}]}
Offender: [{pid,<0.916.0>},{name,riak_core_vnode_worker},{mfargs,{riak_core_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]