Commit bbae8b5

renetapopova, tselmegbaasan, and NataliaIvakina authored
Docs for the rafted status check procedure. (#1823) (#1832)
Cherry-picked #1823 and #1827

---------

Co-authored-by: Tselmeg Baasan <37698237+tselmegbaasan@users.noreply.github.com>
Co-authored-by: NataliaIvakina <82437520+NataliaIvakina@users.noreply.github.com>
1 parent 6934912 commit bbae8b5

3 files changed: +79 -0 lines changed

modules/ROOT/content-nav.adoc

+1
@@ -148,6 +148,7 @@
 *** xref:clustering/monitoring/show-servers-monitoring.adoc[]
 *** xref:clustering/monitoring/show-databases-monitoring.adoc[]
 *** xref:clustering/monitoring/endpoints.adoc[]
+*** xref:clustering/monitoring/status-check.adoc[]
 ** xref:clustering/disaster-recovery.adoc[]
 //** xref:clustering/internals.adoc[]
 ** xref:clustering/settings.adoc[]

modules/ROOT/pages/clustering/index.adoc

+1
@@ -19,6 +19,7 @@ This chapter describes the following:
 ** xref:clustering/monitoring/show-servers-monitoring.adoc[Monitor servers] -- The tools available for monitoring the servers in a cluster.
 ** xref:clustering/monitoring/show-databases-monitoring.adoc[Monitor databases] -- The tools available for monitoring the databases in a cluster.
 ** xref:clustering/monitoring/endpoints.adoc[Monitor cluster endpoints for status information] -- The endpoints and semantics of endpoints used to monitor the health of the cluster.
+** xref:clustering/monitoring/status-check.adoc[Cluster status check] label:new[Introduced in 5.24] -- The procedure that checks which databases are up-to-date and can participate in a successful replication.
 * xref:clustering/disaster-recovery.adoc[Disaster recovery] -- How to recover a cluster in the event of a disaster.
 * xref:clustering/settings.adoc[Settings reference] -- A summary of the most important cluster settings.
 * xref:clustering/server-syntax.adoc[Server commands reference] -- Reference of Cypher administrative commands to add and manage servers.
@@ -0,0 +1,77 @@
+:description: This section describes how to monitor a database's availability with the help of the cluster status check procedure.
+
+:page-role: enterprise-edition new-5.24
+[[cluster-status-check]]
+= Cluster status check
+
+Neo4j 5.24 introduces the xref:reference/procedures.adoc#procedure_dbms_cluster_statusCheck[`dbms.cluster.statusCheck()`] procedure, which monitors the ability of clustered databases to replicate; in most cases, this means being able to write to the database.
+You can also use the procedure to check which members are up-to-date and can participate in a successful replication.
+This makes it useful for determining the fault tolerance of a clustered database as well.
+Finally, you can use it to identify the leader of the cluster.
+
+[NOTE]
+====
+The member on which the procedure is called replicates a dummy transaction in the same cluster as the real transactions, and verifies that it can be replicated and applied.
+
+Because the status check doesn't replicate an actual transaction, a successful check doesn't guarantee that the database is available for writes.
+Apart from replication, other steps in the write path can block a transaction from being applied, for example, issues in the database.
+However, a successful check indicates that the cluster is healthy, which in most cases means that the database is available for writes.
+====
+
+[[procedure-syntax]]
+== Syntax
+
+[source, shell]
+----
+CALL dbms.cluster.statusCheck(databases :: LIST<STRING>, timeoutMilliseconds = null :: INTEGER)
+----
+
+* *databases:* the list of databases for which the status check should run.
+Providing an empty list runs the status check for all *clustered* databases on that server; that is, the status check does not run on singles or secondaries.
+* *timeoutMilliseconds:* specifies how long the replication may take.
+The default value is 1000 milliseconds.
+If replication takes longer than this timeout, the procedure reports the replication as unsuccessful.
+
+
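To make the two arguments concrete, here is a minimal Python sketch that assembles a parameterized call. The helper name `build_status_check_call` and the parameter names `$databases` and `$timeoutMs` are illustrative assumptions, not part of Neo4j; only the procedure name and the argument order come from the syntax above. The sketch also assumes that passing `null` for the timeout behaves like omitting it, which the `= null` default in the signature suggests but does not prove. Running the query through a driver is left out because it needs a live cluster.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_status_check_call(
    databases: List[str], timeout_ms: Optional[int] = None
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher query string and parameter map for the status check.

    Hypothetical helper: only the procedure name and the order of its two
    arguments are taken from the documented syntax.
    """
    # An empty list requests all clustered databases on the server.
    # None is sent as null, assumed here to select the default timeout.
    query = "CALL dbms.cluster.statusCheck($databases, $timeoutMs)"
    params = {"databases": list(databases), "timeoutMs": timeout_ms}
    return query, params


query, params = build_status_check_call(["neo4j"], timeout_ms=2000)
print(query)
print(params)
```

Keeping the database list and timeout as parameters, rather than formatting them into the query string, avoids quoting problems with database names.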
+The procedure returns one row for each primary member of each requested database, where each row consists of:
+
+* *database:* the database for which the `status check entry` was replicated.
+* *serverId:* the server ID of the primary member, regardless of whether it participated in a successful replication of the `status check entry`.
+* *serverName:* the server name of the primary member.
+* *address:* the Bolt address of the primary member.
+* *replicationSuccessful:* indicates whether the server on which the procedure is run can replicate a transaction.
++
+** `TRUE` -- if this server managed to replicate the dummy transaction to a majority of cluster members within the given timeout.
+** `FALSE` -- if it failed to replicate within the timeout.
+The value is the same in every row for a given database.
+A failed replication can mean either a real issue in the cluster (for example, no leader) or that this server is too far behind in applying transactions and can't replicate.
+* *memberStatus:* shows the status of each primary member.
+It can be `APPLYING`, `REPLICATING`, or `UNAVAILABLE`.
++
+** `APPLYING` means that the member can replicate and is actively applying transactions.
+** `REPLICATING` means that the member can participate in replicating, but can't apply.
+This state is uncommon, but may happen while waiting for the database to start and accept transactions.
+* *recognisedLeader:* shows the server ID of the leader as perceived by each primary member.
+* *recognisedLeaderTerm:* shows the term of the leader as perceived by each primary member.
+If members report different leaders, trust the one with the highest term.
+* *requester:* is `TRUE` for the server on which the procedure is run, and `FALSE` for the remaining servers.
+* *error:* contains the error message, if there is one.
+For example, one or more of the requested databases may not exist on the requester.
+
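The rule for conflicting leader reports (trust the highest term) can be sketched in Python as follows. The function name `trusted_leader` and the sample server IDs are made up for illustration; the rows are assumed to be plain dictionaries keyed by the column names listed above and to belong to a single database. Skipping rows with a missing leader or term is a defensive assumption, since the source does not say what unavailable members report.

```python
from typing import Any, Dict, List, Optional


def trusted_leader(rows: List[Dict[str, Any]]) -> Optional[str]:
    """Return the leader reported with the highest term, or None if no row names one."""
    # Assumption: a member that knows no leader reports None in these columns.
    candidates = [
        row
        for row in rows
        if row.get("recognisedLeader") is not None
        and row.get("recognisedLeaderTerm") is not None
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda row: row["recognisedLeaderTerm"])
    return newest["recognisedLeader"]


# Hypothetical rows: server-c still believes in an older leader (term 5).
rows = [
    {"serverId": "server-a", "recognisedLeader": "server-a", "recognisedLeaderTerm": 7},
    {"serverId": "server-b", "recognisedLeader": "server-a", "recognisedLeaderTerm": 7},
    {"serverId": "server-c", "recognisedLeader": "server-c", "recognisedLeaderTerm": 5},
]
print(trusted_leader(rows))  # server-a
```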
+In general, use the `replicationSuccessful` field to determine overall write availability, and the `memberStatus` field to see whether the database is fault-tolerant.
+
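As a rough illustration of this rule of thumb, the Python sketch below summarizes the rows returned for one database. The function name `summarize_status` and the output keys are hypothetical. The fault-tolerance figure is an assumption based on ordinary majority arithmetic (healthy members minus the majority needed to replicate), not a formula stated in this documentation.

```python
from typing import Any, Dict, List

# Statuses that count toward fault tolerance.
HEALTHY_STATUSES = {"APPLYING", "REPLICATING"}


def summarize_status(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize the status check rows of a single database."""
    members = len(rows)
    healthy = sum(1 for row in rows if row["memberStatus"] in HEALTHY_STATUSES)
    # Assumed majority rule: more than half of the primaries must take part.
    majority = members // 2 + 1
    return {
        # The flag is the same in every row, so all() simply reads it.
        "write_available": members > 0 and all(row["replicationSuccessful"] for row in rows),
        "healthy_members": healthy,
        # How many more healthy members could be lost while keeping a majority.
        "tolerated_failures": max(healthy - majority, 0),
    }


# Hypothetical three-member database with one member unavailable.
example_rows = [
    {"memberStatus": "APPLYING", "replicationSuccessful": True},
    {"memberStatus": "APPLYING", "replicationSuccessful": True},
    {"memberStatus": "UNAVAILABLE", "replicationSuccessful": True},
]
print(summarize_status(example_rows))
```

In this example the database can still replicate, but it can't lose another member without losing its majority, which is the situation the `memberStatus` column helps to spot.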
+[NOTE]
+====
+Members that are `REPLICATING` are good from a data safety point of view.
+They can participate in replication and keep the data durable until it is applied.
+They are also up-to-date and therefore eligible to become leaders.
+As a result, they add to the fault tolerance.
+
+Members that are `APPLYING` have all the qualities of `REPLICATING` members, so they too add to the fault tolerance.
+In addition, they are applying transactions to the database, which is required for writing transactions and for reading with bookmarks in a timely manner.
+
+Lastly, `UNAVAILABLE` members are either too far behind or unreachable.
+They are unhealthy and do not add to the fault tolerance.
+====
+
+
