fix(FullScan): choose non-running_nemesis node #9370

Draft · temichus wants to merge 1 commit into master from node_operations_lock
Conversation

temichus
Contributor

@temichus temichus commented Nov 26, 2024

This commit has the following changes:

1. Introduce a common targed_node_lock mechanism that can be used in nemesis and scan operations.

2. The FullScan operation now runs only on a node that is free of nemeses.

3. Change all node.running_nemesis settings to use the common set/unset_running_nemesis methods from the common targed_node_lock file (except unit tests).

fixes: #9284

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@temichus
Contributor Author

@fruch, @soyacz, @aleksbykov Could you please take a look at this idea before I start testing it?

@temichus temichus force-pushed the node_operations_lock branch 5 times, most recently from 19d83c6 to 169726c Compare November 26, 2024 13:21
@temichus temichus added the backport/2023.1 (Need to backport to 2023.1), backport/2024.1 (Need backport to 2024.1), backport/2024.2 (Need backport to 2024.2), backport/6.1 (Need backport to 6.1), and backport/6.2 labels Nov 26, 2024
Contributor

@soyacz soyacz left a comment


Generally LGTM; still needs testing, as setting the target node is messy (e.g. we set the target node in disrupt_method_wrapper and unset it in the run method).

sdcm/nemesis.py Outdated
@@ -453,10 +431,10 @@ def set_target_node(self, dc_idx: Optional[int] = None, rack: Optional[int] = No
self.target_node = random.choice(nodes)

if current_disruption:
-    self.target_node.running_nemesis = current_disruption
+    set_running_nemesis(self.target_node, current_disruption, self.disruptive)
Contributor


_get_target_nodes still works the old way anyway: it never selects any node that runs any nemesis, regardless of disruptiveness. So selecting a node during nemesis startup will work the old way.
I'm not saying it's wrong, just noting it.

@@ -28,7 +29,7 @@
from sdcm.utils.decorators import retrying, Retry

if TYPE_CHECKING:
-    from sdcm.cluster import BaseNode
+    pass
Contributor


just remove the whole if

sdcm/nemesis.py Outdated
@@ -381,14 +362,11 @@ def publish_event(self, disrupt, status=True, data=None):
DisruptionEvent(nemesis_name=disrupt, severity=severity, **data).publish()

def set_current_running_nemesis(self, node):
-    with NEMESIS_TARGET_SELECTION_LOCK:
-        node.running_nemesis = self.current_disruption
+    set_running_nemesis(node, self.current_disruption, self.disruptive)
Contributor


How would self.disruptive work here?

All of those attributes are set only on the concrete nemesis classes, not on the Nemesis class.

aleksbykov
aleksbykov previously approved these changes Nov 26, 2024
Contributor

@aleksbykov aleksbykov left a comment


Short remarks.
Also, because running_nemesis is now a list, we need to be careful when passing the nemesis name to the (un)set_running_nemesis functions.



@contextmanager
def run_nemesis(node_list: list[BaseNode], nemesis_label: str, is_disruptive=True):
Contributor


I think it's better to have is_disruptive=False by default, as in set_running_nemesis.
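For context, a minimal sketch of how such a run_nemesis context manager and the set/unset_running_nemesis helpers could look. The lock name, the record format stored in node.running_nemesis, and the selection logic are assumptions for illustration, not the exact code in this PR:

```python
import random
import threading
from contextlib import contextmanager

# Hypothetical module-level lock guarding target-node selection
# (the PR keeps this in a common targed_node_lock module).
TARGET_NODE_SELECTION_LOCK = threading.Lock()


def set_running_nemesis(node, nemesis_label, is_disruptive=False):
    # Assumption: node.running_nemesis is a list of (label, is_disruptive) records.
    node.running_nemesis.append((nemesis_label, is_disruptive))


def unset_running_nemesis(node, nemesis_label):
    node.running_nemesis = [rec for rec in node.running_nemesis if rec[0] != nemesis_label]


@contextmanager
def run_nemesis(node_list, nemesis_label, is_disruptive=True):
    """Pick a node that is currently free of any nemesis, mark it busy for the
    duration of the block, and release it afterwards."""
    with TARGET_NODE_SELECTION_LOCK:
        free_nodes = [node for node in node_list if not node.running_nemesis]
        if not free_nodes:
            raise RuntimeError(f"no free node to run '{nemesis_label}' on")
        node = random.choice(free_nodes)
        set_running_nemesis(node, nemesis_label, is_disruptive)
    try:
        yield node
    finally:
        with TARGET_NODE_SELECTION_LOCK:
            unset_running_nemesis(node, nemesis_label)
```

A scan thread could then do something like `with run_nemesis(cluster.nodes, "FullScanAggregatesOperation", is_disruptive=False) as node: ...` and be sure that no nemesis grabs the same node while the scan runs.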

Comment on lines 31 to 32
if TYPE_CHECKING:
from sdcm.cluster import BaseNode
pass
Contributor


Maybe it is not needed at all.

@temichus temichus force-pushed the node_operations_lock branch 3 times, most recently from 33fd377 to 6c1792d Compare November 27, 2024 12:20
@temichus temichus changed the title fix(FullScan): choose only non-disruptive node fix(FullScan): choose non-running_nemesis node Nov 27, 2024
@temichus temichus force-pushed the node_operations_lock branch from 6c1792d to 793489d Compare November 27, 2024 12:26
@temichus
Contributor Author

temichus commented Nov 27, 2024

@soyacz @fruch @aleksbykov, thank you for your comments. I've changed the logic so that run_nemesis chooses only a node with no running nemesis, which makes FullScan safe when several nemeses are running. But I don't think it will help for RollingRestartCluster, because it 'disrupts' all nodes in the cluster. What do you think about updating disrupt_rolling_restart_cluster to restart only nodes with no running nemesis?

@temichus temichus force-pushed the node_operations_lock branch 3 times, most recently from 27ead1f to 6c23dee Compare December 1, 2024 11:25
@fruch
Contributor

fruch commented Dec 4, 2024

@temichus

Have you tried just raising the timeout and connection_timeout for the client used for the fullscans,
so it can tolerate nodes being down?

In cassandra-stress and all the other tools we have retries for this purpose; do fullscans have them?

(FYI, you took this one yourself, I hadn't seen it before you raised this PR)

@temichus
Contributor Author

temichus commented Dec 10, 2024

@temichus

Have you tried just raising the timeout and connection_timeout for the client used for the fullscans, so it can tolerate nodes being down?

In cassandra-stress and all the other tools we have retries for this purpose; do fullscans have them?

(FYI, you took this one yourself, I hadn't seen it before you raised this PR)

@fruch

I tried.

2024-12-09 14:14:30.612: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=fb92eb18-41e3-4bf0-8113-38bb4165e433 during_nemesis=RollingRestartCluster duration=40m0s node=longevity-10gb-3h-node-ope-db-node-e76e436f-1 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: OperationTimedOut('errors={<Host: 10.4.2.76:9042 eu-west>: <Error from server: code=0000 [Server error] message="mapreduce_service is shutting down">}, last_host=10.4.2.234:9042'),session execution timeout 2400 is exceeded
2024-12-09 12:54:30.158: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=b1c5ce17-c805-43cd-9dda-7b0cd1e3952a during_nemesis=RollingRestartCluster duration=5m2s node=longevity-10gb-3h-node-ope-db-node-e76e436f-3 select_from=keyspace1.standard1 message=NoHostAvailable('Unable to complete the operation against any hosts', {<Host: 10.4.2.76:9042 eu-west>: ConnectionShutdown('Connection to 10.4.2.76:9042 was closed'), <Host: 10.4.2.234:9042 eu-west>: ConnectionShutdown('Connection to 10.4.2.234:19042 was closed'), <Host: 10.4.1.48:9042 eu-west>: ConnectionShutdown('Connection to 10.4.1.48:19042 was closed')})
2024-12-09 15:43:42.270 <2024-12-09 15:43:42.246>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=b001cd25-cf7c-4069-a1eb-56bb6a741610: type=RUNTIME_ERROR regex=std::runtime_error line_number=12014 node=longevity-10gb-3h-node-ope-db-node-e76e436f-3
2024-12-09T15:43:42.246+00:00 longevity-10gb-3h-node-ope-db-node-e76e436f-3      !ERR | scylla[9204]:  [shard 0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))

But I agree with you that the statement "fullscan broken by rolling restart" sounds strange (maybe even like a bug), so I will continue to investigate.

@temichus temichus force-pushed the node_operations_lock branch 5 times, most recently from 3eb3f14 to 1184ff1 Compare December 10, 2024 16:36
@temichus
Contributor Author

temichus commented Dec 10, 2024

@fruch @soyacz @aleksbykov

Following @fruch's suggestion, I've gone back to the solution of locking nodes one by one during RollingRestartCluster and increasing timeouts.

The last test run shows only 2 error messages, but all nemeses pass:

2024-12-10 13:54:40.470: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=5ff46034-9a64-4978-a370-a35effc5c1f4 during_nemesis=RollingRestartCluster duration=40m0s node=longevity-10gb-3h-node-ope-db-node-d1a01caa-3 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: OperationTimedOut("errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.0.234:9042"),session execution timeout 2400 is exceeded

could be #8705

and

2024-12-10 11:22:38.945: (FullScanAggregateEvent Severity.ERROR) period_type=end event_id=eadc36fd-dd71-4b8c-adc5-46e1f6b664ff duration=1m56s node=longevity-10gb-3h-node-ope-db-node-d1a01caa-2 select_from=keyspace1.standard1 message=FullScanAggregatesOperation operation failed: Fullscan failed - 'mapreduce_service_requests_dispatched_to_other_nodes' was not triggered

is scylladb/scylladb#21578 - I will change it to a warning in the next PR

@temichus temichus marked this pull request as ready for review December 10, 2024 17:12
@fruch
Contributor

fruch commented Dec 10, 2024

@temichus
https://github.com/scylladb/scylla-enterprise/issues/4740 is something that happens during upgrade; I don't understand how it's related here...

@temichus
Contributor Author

@temichus scylladb/scylla-enterprise#4740 is something that happens during upgrade; I don't understand how it's related here...

Wrong link, it should be scylladb/scylladb#21578

sdcm/nemesis.py Outdated
@@ -984,7 +963,12 @@ def disrupt_soft_reboot_node(self):

@decorate_with_context(ignore_ycsb_connection_refused)
def disrupt_rolling_restart_cluster(self, random_order=False):
-    self.cluster.restart_scylla(random_order=random_order)
+    try:
+        for node in self.cluster.nodes:
Contributor


It's not the only place in SCT where we have a rolling restart.

Contributor


  • _enable_disable_table_encryption
  • disrupt_mgmt_restore

Contributor


maybe we should consider moving this logic into restart_scylla

Contributor Author


node.restart_scylla and cluster.restart_scylla are used in many places (test_functional.py, mng_cli_test.py, audit.py, artifacts_test.py, stop_compaction_test.py), but the lock mechanism is needed only when nemeses are running (in longevity tests). So let's keep it for nemeses only (at least for now).

I will update _enable_disable_table_encryption and disrupt_mgmt_restore.
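For illustration, a rough sketch of what the node locking inside disrupt_rolling_restart_cluster could look like, reusing the hypothetical set_running_nemesis/unset_running_nemesis helpers sketched earlier. This is an assumption based on the truncated hunk above and on change 4 of the commit message, not the exact diff in the PR:

```python
# Sketch only: a method inside the Nemesis class in sdcm/nemesis.py.
def disrupt_rolling_restart_cluster(self, random_order=False):
    locked = []
    try:
        # Lock every node up front so concurrent threads (e.g. the FullScan
        # thread) will not pick a node while the rolling restart is in progress.
        for node in self.cluster.nodes:
            set_running_nemesis(node, self.current_disruption, is_disruptive=True)
            locked.append(node)
        self.cluster.restart_scylla(random_order=random_order)
    finally:
        for node in locked:
            unset_running_nemesis(node, self.current_disruption)
```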

@roydahan
Contributor

First, does it solve the issue?

@temichus temichus force-pushed the node_operations_lock branch from 1184ff1 to e3e7af0 Compare December 11, 2024 12:34
@temichus
Contributor Author

First, does it solve the issue?

Yes, there is no conflict anymore between FullScan and RollingRestart. Each chooses only a "free" node and locks it while performing an operation on it.

This commit has the following changes:

1. Introduce a common targed_node_lock mechanism that can be used in nemesis and scan operations.

2. The FullScan operation now runs only on a node that is free of nemeses.

3. Change all node.running_nemesis settings to use the common set/unset_running_nemesis methods from the common targed_node_lock file (except unit tests).

4. Change the disrupt_rolling_restart_cluster nemesis to lock all nodes in the cluster before performing the restart.

fixes: scylladb#9284
@temichus temichus force-pushed the node_operations_lock branch from e3e7af0 to 267cc0a Compare December 11, 2024 12:39
@temichus
Contributor Author

After a discussion with @fruch: the lock mechanism is not the best approach, because nemeses will simply not run and will wait for the lock instead. We need a solution to let the full scan know (somehow) that the node was rebooted during the full scan.

cc @aleksbykov @roydahan

@temichus temichus marked this pull request as draft December 11, 2024 16:27
@roydahan
Contributor

I'm suggesting again to have a set of errors we can ignore, approved by the relevant developer, and to apply it in all relevant nemeses.

@fruch
Contributor

fruch commented Dec 11, 2024

I'm suggesting again to have a set of errors we can ignore, approved by the relevant developer, and to apply it in all relevant nemeses.

  1. This is about a failure in fullscan, not in a nemesis.

  2. We already have code there that ignores fullscan results when a nemesis targeted the node that was the coordinator.

  3. The problem we are facing is that in a rolling restart you target all nodes without letting other parts of the code know about it, so the full scan thread doesn't know it should ignore the failure.

The lock suggested in this PR has a high chance of causing the nemesis doing the rolling restart to be skipped in parallel-nemesis cases.

I suggested that we need some way to identify that nodes were restarted during the fullscan operation, but this is not trivial with the current event system.

One other option: since fullscan is just a stress tool that we implemented in Python, maybe we should take the same approach and apply retries with exponential backoff.
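As a sketch of that last option, retries with exponential backoff around a fullscan operation could look roughly like this. The helper name and the choice of retriable driver exceptions (the ones seen in the logs above) are illustrative, not existing SCT code:

```python
import time


def run_with_backoff(operation, retriable_exceptions, max_attempts=5,
                     base_delay=1.0, max_delay=60.0):
    """Call `operation` (a no-argument callable), retrying on the given
    exceptions with exponentially growing delays between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retriable_exceptions:
            if attempt == max_attempts:
                raise
            time.sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


# Hypothetical usage: retry a fullscan query on the driver errors seen above.
# from cassandra import OperationTimedOut
# from cassandra.cluster import NoHostAvailable
# run_with_backoff(lambda: session.execute(fullscan_query, timeout=2400),
#                  (OperationTimedOut, NoHostAvailable))
```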

@roydahan
Copy link
Contributor

One other option: since fullscan is just a stress tool that we implemented in Python, maybe we should take the same approach and apply retries with exponential backoff.

This is an option, but it may hide problems.

@roydahan
Contributor

We already have code there that ignores fullscan results when a nemesis targeted the node that was the coordinator.

So what's the problem with also ignoring the specific "map reduce" error we are getting now?

@fruch
Contributor

fruch commented Dec 12, 2024

We already have code there that ignores fullscan results when a nemesis targeted the node that was the coordinator.

So what's the problem with also ignoring the specific "map reduce" error we are getting now?

The problem is that the ignore logic kicks in only if fullscan was using the same node as the current nemesis target node, which isn't the case in rolling restarts, because we target all nodes.

If we're just going to ignore the validation of the map reduce metric, we might as well remove it completely.
But it's only one of the symptoms we are seeing; we also have failures of the fullscan queries themselves.

@fruch
Contributor

fruch commented Dec 12, 2024

One other option: since fullscan is just a stress tool that we implemented in Python, maybe we should take the same approach and apply retries with exponential backoff.

This is an option, but it may hide problems.

How is it different from the problems it might hide in c-s when we are retrying?

Or from our current implementation, which turns anything into a warning if it happens during a nemesis targeting the same node?

We use retries when there's an inherent, expected cause of failure that we can specifically match. Consider if we had implemented fullscan as a stress tool in multiple languages: we would have implemented retries, wouldn't we?

@roydahan
Contributor

The problem is that the ignore logic kicks in only if fullscan was using the same node as the current nemesis target node, which isn't the case in rolling restarts, because we target all nodes.

If we're just going to ignore the validation of the map reduce metric, we might as well remove it completely. But it's only one of the symptoms we are seeing; we also have failures of the fullscan queries themselves.

We don't need to ignore the validation, we just need to ignore that specific error we're getting now.
If there are more errors, we need to report issues about them and ignore them only if that's what's decided.
In that case, what's the problem with running full scans inside a context manager that ignores this error?
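A minimal sketch of such a context manager, assuming we match on the error text from the logs above; the helper and the fullscan entry point are hypothetical, and SCT's own event-filtering machinery may be the better place for this:

```python
from contextlib import contextmanager


@contextmanager
def ignore_error_containing(substring):
    """Swallow exceptions whose message contains `substring`; re-raise anything else."""
    try:
        yield
    except Exception as exc:  # intentionally broad for the sketch
        if substring not in str(exc):
            raise


# Hypothetical usage: ignore only the specific map-reduce shutdown error.
# with ignore_error_containing("mapreduce_service is shutting down"):
#     run_fullscan_aggregate()  # hypothetical fullscan entry point
```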

@roydahan
Contributor

How is it different from the problems it might hide in c-s when we are retrying?

It's a background thread that one hardly notices whether it runs or not.
E.g. when we had the aggregation issue in the field and we didn't notice.

Or from our current implementation, which turns anything into a warning if it happens during a nemesis targeting the same node?

We use retries when there's an inherent, expected cause of failure that we can specifically match. Consider if we had implemented fullscan as a stress tool in multiple languages: we would have implemented retries, wouldn't we?

You may use retries that will hide 99% of the problems, and you wouldn't know; you can't tell from the monitor or anywhere else.

Labels
backport/6.1 (Need backport to 6.1), backport/6.2, backport/2023.1 (Need to backport to 2023.1), backport/2024.1 (Need backport to 2024.1), backport/2024.2 (Need backport to 2024.2)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Fullscanoperation thread to choose only alive node
5 participants