docs: add region failover section #1056

Merged Jul 16, 2024 (24 commits)

Commits
6a2b994
docs: add region failover section
WenyXu Jul 12, 2024
42f959a
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 12, 2024
af5d394
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 12, 2024
30bf8bf
fix: fix typo
WenyXu Jul 12, 2024
6c55849
chore: apply suggestions from CR
WenyXu Jul 15, 2024
3b0d9b8
fix: fix typo
WenyXu Jul 15, 2024
adace2a
refactor: refine read amplification definition
WenyXu Jul 15, 2024
edfc913
chore: fix typo
WenyXu Jul 15, 2024
6155665
chore: apply suggestions from CR
WenyXu Jul 15, 2024
78510cb
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 15, 2024
f249bdd
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 15, 2024
76957fd
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 15, 2024
3c3260b
Update docs/nightly/en/user-guide/operations/region-failover.md
WenyXu Jul 15, 2024
9ca03d8
chore: apply suggestions from CR
WenyXu Jul 15, 2024
3661768
chore: apply suggestions from CR
WenyXu Jul 15, 2024
b4c1a61
chore: apply suggestions from CR
WenyXu Jul 15, 2024
d0e37c3
docs: add zh part
WenyXu Jul 16, 2024
2be66b3
chore: apply suggestions from CR
WenyXu Jul 16, 2024
326996f
Apply suggestions from code review
nicecui Jul 16, 2024
3ad907f
typo
nicecui Jul 16, 2024
6a3d663
docs: add zh part
WenyXu Jul 16, 2024
531b7ec
chore: lint markdown
WenyXu Jul 16, 2024
ea6c64d
Merge branch 'main' into feat/failover
nicecui Jul 16, 2024
c65a978
refine the docs
nicecui Jul 16, 2024
1 change: 0 additions & 1 deletion docs/auto-imports.d.ts
@@ -1,7 +1,6 @@
/* eslint-disable */
/* prettier-ignore */
// @ts-nocheck
// noinspection JSUnusedGlobalSymbols
// Generated by unplugin-auto-import
export {}
declare global {
1 change: 1 addition & 0 deletions docs/nightly/en/summary.yml
@@ -89,6 +89,7 @@
- quick-start
- cluster-deployment
- region-migration
- region-failover
- monitoring
- tracing
# TODO
154 changes: 117 additions & 37 deletions docs/nightly/en/user-guide/operations/configuration.md

Large diffs are not rendered by default.

86 changes: 86 additions & 0 deletions docs/nightly/en/user-guide/operations/region-failover.md
@@ -0,0 +1,86 @@
# Region Failover

Region Failover provides the ability to recover regions from region failures without losing data. This is implemented via [Region Migration](/user-guide/operations/region-migration).

## Enable Region Failover

:::warning Warning
This feature is only available on GreptimeDB running in distributed mode, and requires:

- Using Kafka WAL
- Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)
:::

### Via configuration file

Set `enable_region_failover=true` in the [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) configuration file.
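As a sketch, the relevant fragment of the metasrv configuration file would look like the following (a minimal illustration; the Kafka WAL and storage settings that the warning above also requires are omitted here):

```toml
# metasrv.toml: enable automatic region failover.
# Only the option named in this guide is shown; any other options
# in your configuration file stay as they are.
enable_region_failover = true
```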

### Via GreptimeDB Operator

Set `meta.enableRegionFailover=true`, e.g.:
```bash
helm install greptimedb greptime/greptimedb-cluster \
--set meta.enableRegionFailover=true \
...
```

## The recovery time of Region Failover

The recovery time of Region Failover depends on:

- the number of regions per topic
- the read throughput of the Kafka cluster

:::tip Note

As a best practice, [the number of topics/partitions supported by a Kafka cluster is limited](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) (exceeding this number can degrade Kafka cluster performance).
Therefore, we allow multiple regions to share a single topic as the WAL.
:::

### Read amplification

The data belonging to a specific region consists of data files plus data in the WAL (typically `WAL[LastCheckpoint...Latest]`). The failover of a specific region only requires reading that region's WAL data to reconstruct the memory state, which is called region replaying. However, if multiple regions share a single topic, replaying the data of a specific region from the topic requires filtering out unrelated data (i.e., data from other regions). **This means replaying the data of a specific region from the topic requires reading more data than the actual size of that region's data in the topic, a phenomenon known as read amplification**.

Although multiple regions share the same topic, allowing the Datanode to support more regions, the cost of this approach is read amplification during WAL replay.

For example, if you configure 128 topics for [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) and the whole cluster holds 1024 regions (physical regions), every 8 regions will share one topic.

![Read Amplification](/remote-wal-read-amplification.png)

<p style="text-align: center;"><b>(Figure 1: recovering Region 3 requires reading redundant data 7 times larger than its actual size)</b></p>


A simple model to estimate the read amplification factor (replay data size/actual data size):

- For a single topic, if we try to replay all regions that belong to the topic, the amplification factor would be 7+6+...+1 = 28 times. (The region WAL data distribution is shown in Figure 1. Replaying Region 3 reads redundant data 7 times larger than its actual size; replaying Region 6 reads 6 times; and so on.)
- When recovering 100 regions (requiring about 13 topics), the amplification factor is approximately 28 \* 13 = 364 times.
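This estimation model can be sketched in a few lines of Python (a back-of-the-envelope script, not part of GreptimeDB; the 0.5GB unflushed data size matches the assumption used in the tables below):

```python
import math

UNFLUSHED_GB = 0.5  # assumed unflushed WAL data size per recovery


def amplification(regions_per_topic: int, total_regions: int = 100):
    """Estimate read amplification when replaying WAL data from shared topics."""
    n = regions_per_topic
    # Within one topic, replaying a region also reads the WAL entries of the
    # other regions sharing it: the worst-placed region reads (n-1)x redundant
    # data, the next 6x, ... summing to n*(n-1)/2 for the whole topic.
    single_topic_factor = n * (n - 1) // 2
    topics = math.ceil(total_regions / n)
    total_factor = single_topic_factor * topics
    # Replay size = actual data plus the redundant data read alongside it.
    replay_gb = UNFLUSHED_GB * (total_factor + 1)
    return single_topic_factor, topics, total_factor, replay_gb

# e.g. 8 regions per topic: 28x per topic, 13 topics, 364x total, 182.5 GB
```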

| Number of regions per Topic | Number of topics required for 100 Regions | Single topic read amplification factor | Total read amplification factor | Replay data size (GB) |
| --------------------------- | ----------------------------------------- | -------------------------------------- | ---------------------------------- | ---------------- |
| 1 | 100 | 0 | 0 | 0.5 |
| 2 | 50 | 1 | 50 | 25.5 |
| 4 | 25 | 6 | 150 | 75.5 |
| 8 | 13 | 28 | 364 | 182.5 |
| 16 | 7 | 120 | 840 | 420.5 |

**If the Kafka cluster can provide 300MB/s read throughput, recovering 100 regions requires approximately 10 minutes (182.5GB / 0.3GB/s ≈ 608 seconds).**
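As a quick check of this arithmetic (a sketch only; real recovery also includes scheduling and migration overhead that this ignores):

```python
def recovery_time_secs(replay_gb: float, throughput_gb_per_s: float) -> int:
    """Rough recovery time: total WAL data read divided by Kafka read throughput."""
    return round(replay_gb / throughput_gb_per_s)

# 182.5 GB at 300 MB/s (0.3 GB/s): roughly 608 seconds, about 10 minutes
print(recovery_time_secs(182.5, 0.3))
```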


### More examples

| Number of regions per Topic | Replay data size (GB) | Recovery time at 300MB/s Kafka throughput (secs) | Recovery time at 1000MB/s Kafka throughput (secs) |
| --------------------------- | ---------------- | --------------------------------------------- | ---------------------------------------------- |
| 1 | 0.5 | 2 | 1 |
| 2 | 25.5 | 85 | 26 |
| 4 | 75.5 | 252 | 76 |
| 8 | 182.5 | 608 | 183 |
| 16 | 420.5 | 1402 | 421 |

<sub>\*: Assuming the unflushed data size is 0.5GB.</sub>
<sub>\*\*: Replay data size: the total size of WAL data that needs to be read to reconstruct the memory state.</sub>

### Suggestions for improving recovery time

We have calculated recovery times under different numbers of regions per topic for reference.
In actual scenarios, the read amplification may be larger than this model suggests.
If you are very sensitive to recovery time, we recommend that each region has its own topic (i.e., the number of regions per topic is 1).

@@ -3,7 +3,7 @@
Region Migration allows users to move regions between Datanodes.

:::warning Warning
This feature is only available on GreptimeDB running on cluster mode and
This feature is only available on GreptimeDB running on distributed mode and
- Using Kafka WAL
- Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)

85 changes: 85 additions & 0 deletions docs/nightly/zh/user-guide/operations/region-failover.md
@@ -0,0 +1,85 @@
# Region Failover

Region Failover provides the ability to recover from region failures without losing data. This is implemented via [Region Migration](/user-guide/operations/region-migration).

## Enable Region Failover

:::warning Warning
This feature is only available on GreptimeDB running in cluster mode, and requires:

- Using Kafka WAL
- Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)
:::

### Via configuration file

Set `enable_region_failover=true` in the [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) configuration file.

### Via GreptimeDB Operator

Set `meta.enableRegionFailover=true`, e.g.:

```bash
helm install greptimedb greptime/greptimedb-cluster \
--set meta.enableRegionFailover=true \
...
```

## Recovery time of Region Failover

The recovery time of Region Failover depends on:

- the number of regions per topic
- the read throughput of the Kafka cluster

:::tip Note

As a best practice, [the number of topics/partitions supported by a Kafka cluster is limited](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) (exceeding this number can degrade Kafka cluster performance).
Therefore, we allow multiple regions to share a single topic as the WAL.
:::

### Read amplification

The data belonging to a specific region consists of data files plus data in the WAL (typically `WAL[LastCheckpoint...Latest]`). Recovering a specific region only requires reading that region's WAL data to reconstruct the memory state, which is called region replaying. However, if multiple regions share a single topic, replaying the data of a specific region from the topic requires filtering out unrelated data (i.e., data from other regions). **This means replaying the data of a specific region from the topic requires reading more data than the actual size of that region's WAL data, a phenomenon known as read amplification**.

Although sharing a single topic among multiple regions allows the Datanode to support more regions, the cost of this approach is read amplification during region replaying.

For example, if you configure 128 topics for [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) and the whole cluster holds 1024 regions (physical regions), every 8 regions will share one topic.

![Read Amplification](/remote-wal-read-amplification.png)

<p style="text-align: center;"><b>(Figure 1: recovering Region 3 requires reading redundant data 7 times larger than its actual size)</b></p>

A simple model to estimate the read amplification factor (replay data size / actual data size):

- For a single topic, if we try to replay all regions that belong to the topic, the amplification factor would be 7+6+...+1 = 28 times. (Figure 1 shows the region WAL data distribution. Replaying Region 3 reads redundant data 7 times larger than its actual size; replaying Region 6 reads 6 times; and so on.)
- When recovering 100 regions (requiring about 13 topics), the amplification factor is approximately 28 \* 13 = 364 times.

| Number of regions per Topic | Number of topics required for 100 Regions | Single topic read amplification factor | Total read amplification factor | Replay data size (GB) |
| ------------------------- | ----------------------------- | --------------------- | ------------ | ------------------ |
| 1 | 100 | 0 | 0 | 0.5 |
| 2 | 50 | 1 | 50 | 25.5 |
| 4 | 25 | 6 | 150 | 75.5 |
| 8 | 13 | 28 | 364 | 182.5 |
| 16 | 7 | 120 | 840 | 420.5 |

**If the Kafka cluster can provide 300MB/s read throughput, recovering 100 regions requires approximately 10 minutes (182.5GB / 0.3GB/s ≈ 608 seconds).**

### More examples

| Number of regions per Topic | Replay data size (GB) | Recovery time at 300MB/s Kafka throughput (secs) | Recovery time at 1000MB/s Kafka throughput (secs) |
| ---------------- | ------------------ | ------------------------------------ | ------------------------------------- |
| 1 | 0.5 | 2 | 1 |
| 2 | 25.5 | 85 | 26 |
| 4 | 75.5 | 252 | 76 |
| 8 | 182.5 | 608 | 183 |
| 16 | 420.5 | 1402 | 421 |

<sub>\*: Assuming the unflushed data size is 0.5GB.</sub>
<sub>\*\*: Replay data size: the total size of WAL data that needs to be read to reconstruct the memory state.</sub>

### Suggestions for improving recovery time

We have calculated recovery times under different numbers of regions per topic for reference.
In actual scenarios, the read amplification may be larger than this model suggests.
If you are very sensitive to recovery time, we recommend that each region has its own topic (i.e., the number of regions per topic is 1).
Binary file added docs/public/remote-wal-read-amplification.png