GreptimeTeam · nicecui · Jul 16, 2024 · Jul 12, 2024
diff --git a/docs/auto-imports.d.ts b/docs/auto-imports.d.ts
@@ -1,7 +1,6 @@
 /* eslint-disable */
 /* prettier-ignore */
 // @ts-nocheck
-// noinspection JSUnusedGlobalSymbols
 // Generated by unplugin-auto-import
 export {}
 declare global {

diff --git a/docs/nightly/en/summary.yml b/docs/nightly/en/summary.yml
@@ -89,6 +89,7 @@
       - quick-start
       - cluster-deployment
     - region-migration
+    - region-failover
     - monitoring
     - tracing
     # TODO

@@ -0,0 +1,81 @@
+# Region Failover
+
+Region Failover provides the ability to recover regions from region failures without losing data. This is implemented via [Region Migration](/user-guide/operations/region-migration).
+
+## Enable Region Failover
+
+This feature is only available when GreptimeDB runs in distributed mode and meets the following requirements:
+
+- Using Kafka WAL
+- Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)
+
+### Via configuration file
+
+Set `enable_region_failover=true` in the [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) configuration file.
+
+### Via GreptimeDB Operator
+
+Set `meta.enableRegionFailover=true`, e.g.,
+
+```bash
+helm install greptimedb greptime/greptimedb-cluster \
+  --set meta.enableRegionFailover=true \
+  ...
+```
+
+## The recovery time of Region Failover
+
+The recovery time of Region Failover depends on:
+
+- The number of regions per topic.
+- The read throughput of the Kafka cluster.
+
+### The read amplification
+
+In best practices, [the number of topics/partitions supported by a Kafka cluster is limited](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) (exceeding this number can degrade Kafka cluster performance).
+Therefore, we allow multiple regions to share a single topic as the WAL.
+However, this may cause read amplification.
+
+The data belonging to a specific region consists of data files plus data in the WAL (typically `WAL[LastCheckpoint...Latest]`). The failover of a specific region only requires reading that region's WAL data to reconstruct the memory state, which is called region replaying. However, if multiple regions share a single topic, replaying data for a specific region from the topic requires filtering out unrelated data (i.e., data from other regions). As a result, the replay reads more data from the topic than the actual size of the region's data in it, a phenomenon known as read amplification.
+
+Sharing a topic among multiple regions allows the Datanode to support more regions, but the cost of this approach is read amplification during WAL replay.
+
+For example, if 128 topics are configured for [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) and the whole cluster holds 1024 regions (physical regions), every 8 regions will share one topic.
+
+![Read Amplification](/remote-wal-read-amplification.png)
+
+<p style="text-align: center;"><b>(Figure 1: Recovering Region 3 requires reading redundant data 7 times larger than its actual size)</b></p>
+
+
+A simple model to estimate the read amplification factor (replay data size/actual data size):
+
+- For a single topic, if we replay all regions that belong to the topic, the amplification factor is 7+6+...+1 = 28 times. (Figure 1 shows the distribution of Region WAL data. Replaying Region 3 reads redundant data 7 times larger than its actual size, Region 6 reads 6 times, and so on.)
+- When recovering 100 regions (requiring about 13 topics), the amplification factor is approximately 28 \* 13 = 364 times.
+
+Assume we have 100 regions to recover and the total actual data size of all regions is 0.5 GB. The following table shows the replay data size based on the number of regions per topic.
+
+| Number of regions per Topic | Number of topics required for 100 Regions | Single topic read amplification factor | Total read amplification factor    | Replay data size (GB) |
+| --------------------------- | ----------------------------------------- | -------------------------------------- | ---------------------------------- | --------------------- |
+| 1                           | 100                                       | 0                                      | 0                                  | 0.5              |
+| 2                           | 50                                        | 1                                      | 50                                 | 25.5             |
+| 4                           | 25                                        | 6                                      | 150                                | 75.5             |
+| 8                           | 13                                        | 28                                     | 364                                | 182.5            |
+| 16                          | 7                                         | 120                                    | 840                                | 420.5            |
+
+
+The following table shows the recovery time of 100 regions under different read throughput conditions of the Kafka cluster. For example, with a read throughput of 300 MB/s and 8 regions per topic, recovering 100 regions takes approximately 10 minutes (182.5 GB ÷ 0.3 GB/s ≈ 608 s ≈ 10 minutes).
+
+| Number of regions per Topic | Replay data size (GB) | Kafka throughput 300 MB/s - Recovery time (secs) | Kafka throughput 1000 MB/s - Recovery time (secs) |
+| --------------------------- | --------------------- | ------------------------------------------------ | ------------------------------------------------- |
+| 1                           | 0.5              | 2                                             | 1                                              |
+| 2                           | 25.5             | 85                                            | 26                                             |
+| 4                           | 75.5             | 252                                           | 76                                             |
+| 8                           | 182.5            | 608                                           | 183                                            |
+| 16                          | 420.5            | 1402                                          | 421                                            |
+
+
+### Suggestions for improving recovery time
+
+In the above example, we calculated the recovery time for different numbers of regions per topic for reference.
+In actual scenarios, the read amplification may be larger than this model predicts.
+If you are very sensitive to recovery time, we recommend that each region have its own topic (i.e., the number of regions per topic is 1).
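
The read amplification tables in the Region Failover page above can be reproduced with a short script. The following is a minimal Python sketch, not part of GreptimeDB; the function names are hypothetical, and the replay size follows the table's own arithmetic (actual size plus the total factor times the actual size).

```python
def single_topic_factor(regions_per_topic: int) -> int:
    """Amplification when replaying every region of one topic: (n-1) + (n-2) + ... + 1."""
    return regions_per_topic * (regions_per_topic - 1) // 2


def topics_required(regions: int, regions_per_topic: int) -> int:
    """Number of topics needed to hold `regions`, rounded up."""
    return (regions + regions_per_topic - 1) // regions_per_topic


def total_factor(regions: int, regions_per_topic: int) -> int:
    """Single-topic factor multiplied by the number of topics involved."""
    return single_topic_factor(regions_per_topic) * topics_required(regions, regions_per_topic)


def replay_size_gb(actual_gb: float, regions: int, regions_per_topic: int) -> float:
    """Replay size as computed in the table: actual data plus amplified reads."""
    return actual_gb * (1 + total_factor(regions, regions_per_topic))


if __name__ == "__main__":
    # Same scenario as the doc: 100 regions, 0.5 GB of actual data.
    for n in (1, 2, 4, 8, 16):
        print(n, topics_required(100, n), single_topic_factor(n),
              total_factor(100, n), replay_size_gb(0.5, 100, n))
```

With 8 regions per topic this gives 13 topics, a total factor of 364, and a replay size of 182.5 GB, matching the table row.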
+
@@ -3,7 +3,7 @@
 Region Migration allows users to move regions between the Datanode.
 
 :::warning Warning
-This feature is only available on GreptimeDB running on cluster mode and 
+This feature is only available on GreptimeDB running in distributed mode and
 - Using Kafka WAL
 - Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)
 

@@ -0,0 +1,81 @@
+# Region Failover
+
+Region Failover provides the ability to recover from Region failures without losing data. This is implemented via [Region Migration](/user-guide/operations/region-migration).
+
+## Enable Region Failover
+
+
+This feature is only available when GreptimeDB runs in cluster mode and the following conditions are met:
+
+- Using Kafka WAL
+- Using [shared storage](/user-guide/operations/configuration.md#storage-options) (e.g., AWS S3)
+
+
+### Via configuration file
+
+Set `enable_region_failover=true` in the [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) configuration file.
+
+### Via GreptimeDB Operator
+
+Set `meta.enableRegionFailover=true`, e.g.,
+
+```bash
+helm install greptimedb greptime/greptimedb-cluster \
+  --set meta.enableRegionFailover=true \
+  ...
+```
+
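+## Recovery time of Region Failover
+
+The recovery time of Region Failover depends on:
+
+- The number of regions per topic
+- The read throughput of the Kafka cluster
+
+
+### Read amplification
+
+In best practices, [the number of topics/partitions supported by a Kafka cluster is limited](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) (exceeding this number may degrade Kafka cluster performance).
+Therefore, GreptimeDB allows multiple regions to share a single topic as the WAL. However, this may cause read amplification.
+
+The data belonging to a specific Region consists of data files plus data in the WAL (typically WAL[LastCheckpoint...Latest]). The failover of a specific Region only requires reading that Region's WAL data to rebuild the memory state, which is called region replaying. However, if multiple Regions share a single Topic, replaying data for a specific Region from the Topic requires filtering out unrelated data (i.e., data from other Regions). This means that replaying a specific Region from the Topic reads more data than the actual size of that Region's WAL data, a phenomenon known as read amplification.
+
+Although sharing one Topic among multiple Regions allows the Datanode to support more Regions, the cost of this approach is read amplification during Region replay.
+
+For example, if 128 Topics are configured for [metasrv](/user-guide/operations/configuration.md#metasrv-only-configuration) and the whole cluster holds 1024 Regions (physical Regions), every 8 Regions will share one Topic.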
+
+![Read Amplification](/remote-wal-read-amplification.png)
+
+<p style="text-align: center;"><b>(Figure 1: Recovering Region 3 requires reading redundant data 7 times larger than its actual size)</b></p>
+
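+A simple model to estimate the read amplification factor (replay data size/actual data size):
+
+- For a single Topic, if we replay all Regions that belong to the Topic, the amplification factor is 7+6+...+1 = 28 times. (Figure 1 shows the distribution of Region WAL data. Replaying Region 3 reads about 7 times its actual size; replaying Region 6 reads about 6 times its actual size, and so on.)
+- When recovering 100 Regions (requiring about 13 Topics), the amplification factor is approximately 28 \* 13 = 364 times.
+
+Assume we have 100 Regions to recover and the actual data size of all Regions is 0.5 GB. The following table shows the total replay data size based on the number of Regions per Topic.
+
+| Number of Regions per Topic | Number of Topics required for 100 Regions | Single Topic read amplification factor | Total read amplification factor | Replay data size (GB) |
+| --------------------------- | ----------------------------------------- | -------------------------------------- | ------------------------------- | --------------------- |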
+| 1                         | 100                           | 0                     | 0            | 0.5                |
+| 2                         | 50                            | 1                     | 50           | 25.5               |
+| 4                         | 25                            | 6                     | 150          | 75.5               |
+| 8                         | 13                            | 28                    | 364          | 182.5              |
+| 16                        | 7                             | 120                   | 840          | 420.5              |
+
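+The following table shows the recovery time of 100 Regions under different read throughput conditions of the Kafka cluster. For example, with a read throughput of 300 MB/s, recovering 100 Regions takes approximately 10 minutes (182.5 GB ÷ 0.3 GB/s ≈ 10 minutes).
+
+| Number of Regions per Topic | Replay data size (GB) | Kafka throughput 300 MB/s - Recovery time (secs) | Kafka throughput 1000 MB/s - Recovery time (secs) |
+| --------------------------- | --------------------- | ------------------------------------------------ | ------------------------------------------------- |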
+| 1                | 0.5                | 2                                    | 1                                     |
+| 2                | 25.5               | 85                                   | 26                                    |
+| 4                | 75.5               | 252                                  | 76                                    |
+| 8                | 182.5              | 608                                  | 183                                   |
+| 16               | 420.5              | 1402                                 | 421                                   |
+
+
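+### Suggestions for improving recovery time
+
+In the above example, we calculated the recovery time for different numbers of Regions per Topic for reference.
+In actual scenarios, the read amplification may be more severe than this model predicts.
+If you are very sensitive to recovery time, we recommend that each Region have its own Topic (i.e., the number of Regions per Topic is 1).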
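
The recovery-time tables above divide the replay data size by the Kafka read throughput. The following minimal Python sketch reproduces those figures; the function name is hypothetical, 1 GB is taken as 1000 MB, and rounding half up to the nearest second is an assumption chosen because it matches the table values.

```python
def recovery_time_secs(replay_size_gb: float, throughput_mb_per_s: int) -> int:
    """Estimated recovery time: replay size divided by read throughput."""
    replay_mb = replay_size_gb * 1000  # assumes 1 GB = 1000 MB, as in the doc's example
    # Round half up to whole seconds (assumption that reproduces the table).
    return int(replay_mb / throughput_mb_per_s + 0.5)


if __name__ == "__main__":
    # Replay sizes (GB) taken from the table for 1, 2, 4, 8, and 16 regions per topic.
    for size_gb in (0.5, 25.5, 75.5, 182.5, 420.5):
        print(size_gb, recovery_time_secs(size_gb, 300), recovery_time_secs(size_gb, 1000))
```

This estimate only covers the time spent reading from Kafka; as the text notes, real read amplification may be larger than the model.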