|
| 1 | +:description: This section describes how to monitor a database's availability with the help of the cluster status check procedure. |
| 2 | + |
| 3 | +:page-role: enterprise-edition new-5.24 |
| 4 | +[[cluster-status-check]] |
| 5 | += Cluster status check |
| 6 | + |
| 7 | +Neo4j 5.24 introduces the xref:reference/procedures.adoc#procedure_dbms_cluster_statusCheck[`dbms.cluster.statusCheck()`] procedure, which you can use to monitor whether a clustered database is able to replicate. In most cases, this means being able to write to the database. |
| 8 | +You can also use the procedure to check which members are up-to-date and can participate in a successful replication. |
| 9 | +This makes it useful for determining the fault-tolerance of a clustered database as well. |
| 10 | +Finally, you can use it to determine the leader of the cluster. |
| 11 | + |
| 12 | +[NOTE] |
| 13 | +==== |
| 14 | +The member on which the procedure is called replicates a dummy transaction in the same cluster as the real transactions, and verifies that it can be replicated and applied. |
| 15 | + |
| 16 | +Since the status check doesn't replicate an actual transaction, a successful check does not guarantee that the database is write-available. |
| 17 | +Apart from replication, there are other stages in the write path that can block a transaction from being applied, e.g., issues in the database. |
| 18 | +However, a successful check indicates that the cluster is healthy, which in most cases means that the database is write-available. |
| 19 | +==== |
| 20 | + |
| 21 | +[[procedure-syntax]] |
| 22 | +== Syntax |
| 23 | + |
| 24 | +[source, shell] |
| 25 | +---- |
| 26 | +CALL dbms.cluster.statusCheck(databases :: LIST<STRING>, timeoutMilliseconds = null :: INTEGER) |
| 27 | +---- |
| 28 | + |
| 29 | +* *databases:* the list of databases for which the status check should run. |
| 30 | +Providing an empty list runs the status check for all *clustered* databases on that server, i.e. the status check won't run on singles or secondaries. |
| 31 | +* *timeoutMilliseconds:* specifies how long the replication may take. |
| 32 | +The default value is 1000 milliseconds. |
| 33 | +If replication takes longer than this timeout, the procedure reports the replication as unsuccessful. |
| 34 | + |
| 35 | + |
| 36 | +The procedure returns one row for each primary member of each requested database. Each row consists of: |
| 37 | + |
| 38 | +* *database:* the database for which the `status check entry` was replicated. |
| 39 | +* *serverId:* the server ID of the primary member, regardless of whether it participated in a successful replication of the `status check entry`. |
| 40 | +* *serverName:* the server name of each primary member. |
| 41 | +* *address:* the Bolt address of each primary member. |
| 42 | +* *replicationSuccessful:* indicates if the server (on which the procedure is run) can replicate a transaction. |
| 43 | ++ |
| 44 | +** `TRUE` -- if this server managed to replicate the dummy transaction to a majority of cluster members within the given timeout. |
| 45 | +** `FALSE` -- if it failed to replicate within the timeout. |
| 46 | +The value is the same for every row in the column. |
| 47 | +A failed replication can mean either a real issue in the cluster (e.g., no leader) or that this server is too far behind in applying transactions and can't replicate. |
| 48 | +* *memberStatus:* shows the status of each primary member. |
| 49 | +It can be `APPLYING`, `REPLICATING`, or `UNAVAILABLE`. |
| 50 | ++ |
| 51 | +** `APPLYING` means that the member can replicate and is actively applying transactions. |
| 52 | +** `REPLICATING` means that the member can participate in replicating, but can't apply. |
| 53 | +This state is uncommon, but may happen while waiting for the database to start and accept transactions. |
| 54 | +* *recognisedLeader:* shows the server ID of the leader as perceived by each primary member. |
| 55 | +* *recognisedLeaderTerm:* shows the term of the leader as perceived by each primary member. |
| 56 | +If the members report different leaders, the one with the highest term should be trusted. |
| 57 | +* *requester:* is `TRUE` for the server on which the procedure is run, and `FALSE` for the remaining servers. |
| 58 | +* *error:* contains the error message if there is one. |
| 59 | +For example, an error is reported if one or more of the requested databases don't exist on the requester. |
| 60 | + |
| 61 | +In general, you can use the `replicationSuccessful` field to determine overall write-availability, whereas the `memberStatus` field can be checked in order to see whether the database is fault-tolerant or not. |
| 62 | + |
| 63 | +[NOTE] |
| 64 | +==== |
| 65 | +Members that are `REPLICATING` are good from a data safety point of view. |
| 66 | +They can participate in replication and store the data durably until it is applied. |
| 67 | +They are also up-to-date and therefore eligible leaders. |
| 68 | +As a result, they add to the fault-tolerance. |
| 69 | + |
| 70 | +Members that are `APPLYING` have all the qualities of `REPLICATING` members, so they too add to the fault-tolerance. |
| 71 | +In addition, they are applying transactions to the database, which is required for writing transactions and for reading with bookmarks in a timely manner. |
| 72 | + |
| 73 | +Lastly, `UNAVAILABLE` members are either too far behind or unreachable. |
| 74 | +They are unhealthy and cannot add to the fault-tolerance. |
| 75 | +==== |
| 76 | + |
| 77 | + |
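
The guidance in this section on reading `replicationSuccessful`, `memberStatus`, and `recognisedLeaderTerm` could be applied client-side roughly as sketched below. This is only an illustration, not part of the procedure or of any Neo4j driver: the function name `summarize_status`, the sample server IDs, and the idea of passing rows as plain dictionaries are hypothetical. The fault-tolerance figure assumes a simple majority rule (healthy members minus the majority size), which goes beyond what the page states.

```python
# Hypothetical helper: interprets rows shaped like the columns described above.
# The rows are assumed to belong to a single database.

HEALTHY_STATUSES = {"APPLYING", "REPLICATING"}


def summarize_status(rows):
    """Summarize status check rows for one database."""
    # replicationSuccessful is reported identically in every row,
    # so any row could be used; all() also guards against an empty result.
    write_available = bool(rows) and all(r["replicationSuccessful"] for r in rows)

    # APPLYING and REPLICATING members add to the fault-tolerance.
    healthy = sum(1 for r in rows if r["memberStatus"] in HEALTHY_STATUSES)

    # Assumption: a majority of primaries is needed to replicate, so the
    # number of members that can still be lost is healthy minus majority.
    majority = len(rows) // 2 + 1
    fault_tolerance = max(healthy - majority, 0)

    # If members disagree on the leader, trust the one with the highest term.
    with_term = [r for r in rows if r.get("recognisedLeaderTerm") is not None]
    leader = (
        max(with_term, key=lambda r: r["recognisedLeaderTerm"])["recognisedLeader"]
        if with_term
        else None
    )

    return {
        "write_available": write_available,
        "healthy_members": healthy,
        "fault_tolerance": fault_tolerance,
        "leader": leader,
    }


# Made-up sample rows for a three-member database; IDs and terms are invented.
rows = [
    {"serverId": "server-1", "replicationSuccessful": True, "memberStatus": "APPLYING",
     "recognisedLeader": "server-2", "recognisedLeaderTerm": 5, "requester": True},
    {"serverId": "server-2", "replicationSuccessful": True, "memberStatus": "APPLYING",
     "recognisedLeader": "server-2", "recognisedLeaderTerm": 5, "requester": False},
    {"serverId": "server-3", "replicationSuccessful": True, "memberStatus": "UNAVAILABLE",
     "recognisedLeader": "server-1", "recognisedLeaderTerm": 3, "requester": False},
]

summary = summarize_status(rows)
print(summary)
```

In the sample, one member is `UNAVAILABLE`, so the two healthy members exactly form a majority and no further member can be lost under the assumed rule.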