IGNITE-23054 Improve cluster status REST endpoint #4614

Pochatkin · 2024-10-22T11:33:22Z

https://issues.apache.org/jira/browse/IGNITE-23054

Cluster status REST and CLI methods are improved:

Added distinct states: healthy, CMG majority lost, metastore majority lost.
When the CMG majority is lost, the endpoint returns the corresponding status instead of a timeout exception.
Improved the integration tests' cluster class: added the ability to select different nodes for the CMG and MS groups.

valepakh · 2024-10-31T09:22:02Z

...pache/ignite/internal/cli/commands/cluster/status/ItClusterStatusCommandInitializedTest.java

+        assertOutput("cluster", 2, "Metastore majority lost", cmgNodes(), metastoreNodes());
+
+        CLUSTER.startNode(0);
+        Thread.sleep(10000);


Let's change this to await().untilAsserted()
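
For illustration, here is a minimal plain-Java sketch of what polling until an assertion passes does, compared with a fixed `Thread.sleep(10000)`. It is a stand-in written so that it runs without extra dependencies, not Awaitility itself; `PollingAssert` and its parameters are hypothetical names. With Awaitility on the test classpath, the equivalent call would look roughly like `await().atMost(Duration.ofSeconds(30)).untilAsserted(() -> assertOutput(...))`.

```java
import java.time.Duration;

/** Plain-Java stand-in for polling until an assertion passes (hypothetical helper, not Awaitility). */
final class PollingAssert {
    private PollingAssert() {
    }

    /**
     * Re-runs {@code assertion} every {@code pollInterval} until it stops throwing
     * {@link AssertionError} or {@code timeout} elapses, in which case the last error is rethrown.
     */
    static void untilAsserted(Duration timeout, Duration pollInterval, Runnable assertion) {
        long deadline = System.nanoTime() + timeout.toNanos();

        while (true) {
            try {
                assertion.run();
                return; // The condition holds: stop waiting immediately.
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // Timed out: surface the last failure.
                }
            }

            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the assertion", e);
            }
        }
    }
}
```

The test then finishes as soon as the expected output appears instead of always paying the full ten seconds, and it fails with the real assertion message if the state is never reached.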

valepakh · 2024-10-31T09:22:47Z

...i/src/integrationTest/java/org/apache/ignite/internal/cli/commands/sql/ItSqlCommandTest.java

@@ -33,6 +33,8 @@ class ItSqlCommandTest extends CliSqlCommandTestBase {
    void nonExistingFile() {
        execute("sql", "--file", "nonexisting", "--jdbc-url", JDBC_URL);

+        CLUSTER.stopNode(0);


Why is this needed?

valepakh · 2024-10-31T09:29:40Z

.../cli/src/main/java/org/apache/ignite/internal/cli/call/cluster/status/ClusterStatusCall.java


 /**
 * Call to get cluster status.
 */
 @Singleton
-public class ClusterStatusCall implements Call<UrlCallInput, ClusterStatus> {
+public class ClusterStatusCall implements Call<UrlCallInput, ClusterState> {
+    private static final int READ_TIMEOUT = 50_000;


Why does this need to be 50 seconds? As I understand it, this depends on the timeout settings for cluster state retrieval. Can we at least point to the place where this is configured in the CMGManager?

valepakh · 2024-10-31T09:30:35Z

modules/rest-api/src/main/java/org/apache/ignite/internal/rest/api/cluster/ClusterState.java

@@ -64,24 +73,28 @@ public class ClusterState {
    @JsonCreator
    public ClusterState(
            @JsonProperty("cmgNodes") Collection<String> cmgNodes,
-            @JsonProperty("msNodes") Collection<String> msNodes,
+            @JsonProperty("msNodes")  Collection<String> msNodes,


Suggested change

-            @JsonProperty("msNodes")  Collection<String> msNodes,
+            @JsonProperty("msNodes") Collection<String> msNodes,

valepakh · 2024-10-31T09:34:55Z

.../rest/src/main/java/org/apache/ignite/internal/rest/cluster/ClusterManagementController.java

        );
    }

+    private ClusterStatus mapClusterStatus(org.apache.ignite.internal.cluster.management.ClusterState clusterState) {
+        Set<String> metaStorageNodes = clusterState.metaStorageNodes();
+        long presentedMetaStorageNodes = topologyService.allMembers().stream()


Suggested change

-        long presentedMetaStorageNodes = topologyService.allMembers().stream()
+        long presentMetaStorageNodes = topologyService.allMembers().stream()

valepakh · 2024-10-31T09:35:12Z

.../rest/src/main/java/org/apache/ignite/internal/rest/cluster/ClusterManagementController.java

+                .filter(metaStorageNodes::contains)
+                .count();
+
+        if (presentedMetaStorageNodes <= metaStorageNodes.size() / 2) {


Suggested change

-        if (presentedMetaStorageNodes <= metaStorageNodes.size() / 2) {
+        if (presentMetaStorageNodes <= metaStorageNodes.size() / 2) {
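
As a side note on the condition itself, the integer division makes it a strict-majority check: the majority counts as lost when the present nodes are at most half of the group. A minimal sketch of that arithmetic follows; `QuorumMath` and `majorityLost` are hypothetical names used only for illustration.

```java
/** Strict-majority arithmetic used by the check above (hypothetical helper for illustration). */
final class QuorumMath {
    private QuorumMath() {
    }

    /** Returns true when {@code presentNodes} is not strictly more than half of {@code groupSize}. */
    static boolean majorityLost(long presentNodes, int groupSize) {
        // Integer division: 3 / 2 == 1 and 4 / 2 == 2.
        return presentNodes <= groupSize / 2;
    }
}
```

For a group of 3, one present node means the majority is lost and two means it is kept. For a group of 4, two present nodes are still not a majority.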

valepakh · 2024-11-01T10:07:59Z

.../cli/src/main/java/org/apache/ignite/internal/cli/call/cluster/status/ClusterStatusCall.java

+public class ClusterStatusCall implements Call<UrlCallInput, ClusterState> {
+    /**
+     * We need to overlap timeout from raft client.
+     * We can't determine timeout value because it's configurable for each node


Maybe we can read the node configuration here and just add a couple of seconds to the configured timeout?
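
A minimal sketch of that idea, assuming the server-side timeout has already been obtained somehow (how it is read from the node configuration is not shown here). `ReadTimeouts`, `MARGIN`, and the two-second value are hypothetical and only illustrate the "configured timeout plus a couple of seconds" rule.

```java
import java.time.Duration;

/** Derives the client read timeout from the server-side timeout (hypothetical helper). */
final class ReadTimeouts {
    /** Safety margin added on top of the server-side timeout (illustrative value). */
    static final Duration MARGIN = Duration.ofSeconds(2);

    private ReadTimeouts() {
    }

    /** Returns a read timeout that outlasts the given server-side timeout by {@link #MARGIN}. */
    static Duration readTimeoutFor(Duration configuredServerTimeout) {
        if (configuredServerTimeout.isNegative() || configuredServerTimeout.isZero()) {
            throw new IllegalArgumentException("Timeout must be positive: " + configuredServerTimeout);
        }

        return configuredServerTimeout.plus(MARGIN);
    }
}
```

This way the client would only give up after the server has had a chance to report the lost majority, and the value would follow the configuration instead of being hard-coded.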

rpuch · 2024-11-01T11:36:59Z

modules/cli/src/main/java/org/apache/ignite/internal/cli/call/cluster/status/ClusterState.java


 /**
 * Class that represents the cluster status.
 */
-public class ClusterStatus {
+public class ClusterState {


We already have an entity called 'cluster state' in the CMG domain. This one is a different entity and yet it's called identically. Could a distinct name be devised? Something like ClusterRuntimeState, maybe.

This is a CLI output DTO object, so I have added a postfix.

rpuch · 2024-11-01T11:38:02Z

modules/cli/src/main/java/org/apache/ignite/internal/cli/call/cluster/status/ClusterState.java

        this.cmgNodes = cmgNodes;
        this.metadataStorageNodes = metadataStorageNodes;


How about making defensive copies? Better safe than sorry, and this class should not have a huge number of instances.

We use the builder approach in the CLI module.
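
For reference, a minimal sketch of what the defensive copies would look like if they were made in the constructor. `ClusterNodesSnapshot` is a hypothetical class, not the DTO from this PR, and it assumes neither collection is null or contains null elements, since `List.copyOf` rejects both.

```java
import java.util.Collection;
import java.util.List;

/** Immutable holder that copies its inputs (hypothetical class for illustration). */
final class ClusterNodesSnapshot {
    private final List<String> cmgNodes;
    private final List<String> metadataStorageNodes;

    ClusterNodesSnapshot(Collection<String> cmgNodes, Collection<String> metadataStorageNodes) {
        // List.copyOf returns an unmodifiable copy, so later changes to the arguments are not visible here.
        this.cmgNodes = List.copyOf(cmgNodes);
        this.metadataStorageNodes = List.copyOf(metadataStorageNodes);
    }

    List<String> cmgNodes() {
        return cmgNodes;
    }

    List<String> metadataStorageNodes() {
        return metadataStorageNodes;
    }
}
```

With a builder the same copies could be made in `build()`, so the two approaches are not mutually exclusive.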

rpuch · 2024-11-01T11:41:56Z

modules/cli/src/main/java/org/apache/ignite/internal/cli/decorators/ClusterStatusDecorator.java

+            case MS_MAJORITY_LOST:
+                return fg(Color.RED).mark("Metastore majority lost");
+            case HEALTHY:
+                return fg(Color.GREEN).mark("active");
+            case CMG_MAJORITY_LOST:
+                return fg(Color.RED).mark("CMG majority lost");


The ordering is a bit strange: why is HEALTHY in the middle?

rpuch · 2024-11-01T11:46:24Z

modules/rest-api/src/main/java/org/apache/ignite/internal/rest/api/cluster/ClusterState.java

    private final ClusterTag clusterTag;

    @Schema(description = "IDs the cluster had before.")
    @IgniteToStringInclude
    private final @Nullable List<UUID> formerClusterIds;

+    @Schema(description = "Cluster status.",
+            requiredMode = RequiredMode.REQUIRED)
+    private final ClusterStatus clusterStatus;


It seems that entities from different domains are mixed here. This ClusterState class represents the CMG cluster state, that is, the state stored in the CMG that relates to the cluster itself. ClusterStatus, however, is volatile: it may change at any time during runtime, even if the CMG cluster state does not change.

I believe we should not mix them together. Could a distinct API endpoint (and CLI command) be added to access this volatile information?

rpuch · 2024-11-01T11:46:55Z

modules/rest-api/src/main/java/org/apache/ignite/internal/rest/api/cluster/ClusterStatus.java

+
+    /**
+     * The metastore group has lost its majority. Almost all of the cluster functions are inoperative.
+     * To restore their operation, it is necessary to return the majority to the metastore group.


Suggested change

-     * To restore their operation, it is necessary to return the majority to the metastore group.
+     * To restore their operation, it is necessary to restore the majority to the metastore group.

rpuch · 2024-11-01T11:47:12Z

modules/rest-api/src/main/java/org/apache/ignite/internal/rest/api/cluster/ClusterStatus.java

+    MS_MAJORITY_LOST,
+
+    /**
+     * The cluster management group has lost its majority. The cluster is completely inoperative until the majority is returned.


Suggested change

-     * The cluster management group has lost its majority. The cluster is completely inoperative until the majority is returned.
+     * The cluster management group has lost its majority. The cluster is completely inoperative until the majority is restored.

rpuch · 2024-11-01T11:50:45Z

.../rest/src/main/java/org/apache/ignite/internal/rest/cluster/ClusterManagementController.java

+                .count();
+
+        if (presentMetaStorageNodes <= metaStorageNodes.size() / 2) {
+            return ClusterStatus.MS_MAJORITY_LOST;


Having enough nodes in the physical topology does not necessarily imply that we have an MS majority. For instance, something at a level above the network might prevent a leader from being elected. The most accurate way would be to try executing a read command against the MS group (like reading its current revision) and handle a timeout.
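
A minimal sketch of such a probe, assuming the read is exposed as a `CompletableFuture`. `MajorityProbe`, its `Status` enum, and the `readRevision` supplier are hypothetical stand-ins; the real metastorage API is not shown in this thread.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Detects a lost majority by issuing a read and handling the timeout (hypothetical sketch). */
final class MajorityProbe {
    enum Status { HEALTHY, MS_MAJORITY_LOST }

    private MajorityProbe() {
    }

    /** Runs the read and maps a timeout to {@link Status#MS_MAJORITY_LOST}. */
    static Status probe(Supplier<CompletableFuture<Long>> readRevision, Duration timeout) {
        try {
            // A read that completes in time means the group was able to serve it.
            readRevision.get().get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Status.HEALTHY;
        } catch (TimeoutException e) {
            // No answer within the timeout: report the lost majority instead of failing the request.
            return Status.MS_MAJORITY_LOST;
        } catch (ExecutionException e) {
            // The read failed for another reason; this sketch does not classify such failures.
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Unlike counting nodes in the physical topology, this covers the case where all nodes are reachable but no leader can be elected, at the cost of waiting for the timeout in the failure case.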

IGNITE-23054 Improve cluster status REST endpoint

e40a9e8

Pochatkin requested a review from PakhomovAlexander October 22, 2024 11:35

Pochatkin added 2 commits October 22, 2024 14:39

Fix timeout

312daa8

Fix tests

7fd3564

valepakh reviewed Oct 31, 2024

Pochatkin requested a review from valepakh November 1, 2024 10:00

Fix comments and test

909cc35

Pochatkin requested review from rpuch and valepakh and removed request for valepakh November 1, 2024 10:00

valepakh reviewed Nov 1, 2024

rpuch reviewed Nov 1, 2024

Pochatkin added 2 commits November 20, 2024 16:35

Fix comments

124a6f7

Merge branch 'main' into IGNITE-23054

35eaa84

# Conflicts:
#   modules/metastorage/src/main/java/org/apache/ignite/internal/metastorage/impl/MetaStorageManagerImpl.java
#   modules/metastorage/src/main/java/org/apache/ignite/internal/metastorage/impl/MetaStorageService.java
