Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support redis(standalone replica) pitr #961

Merged
merged 5 commits into from
Sep 10, 2024

Conversation

Chiwency
Copy link
Contributor

@Chiwency Chiwency commented Aug 19, 2024

We leverage the AOF backup feature from Redis version 7.0 onwards to implement Point-In-Time Recovery (PITR). By enabling the aof-timestamp-enabled yes and aof-disable-auto-gc no parameters, we activate AOF timestamp annotations while disabling the automatic cleanup of historical AOF files. A StatefulSet, which is continuously responsible for the ongoing backup process, manages and tracks the AOF files.

The continuous backup process works as follows:

  1. Historical AOF files in the data directory are compressed, packaged, and pushed to the backup repository.
  2. The latest AOF file is backed up to the repository directory and continuously updated. When Redis triggers AOF rewriting, the file is compressed and archived as described in step 1.
  3. The backup files in the repository are named using the format ${base_file_ctime}.${seq}.${suffix}, where base_file_ctime is the creation time of the base aof file, marking the start of this particular backup package. This timestamp creates a continuous timeline for all backups. The seq is a sequence number used for recovery in case of a backup pod failure.

The recovery process works as follows:

  1. Based on DP_RESTORE_TIME, the backup package with the base_file_ctime closest to DP_RESTORE_TIME is selected. This base file's data is used as the starting point. Then, AOF files are processed according to their timestamp annotations , up until DP_RESTORE_TIME is reached.

@ldming
Copy link
Collaborator

ldming commented Aug 27, 2024

@Y-Rookie could you take a look?

@@ -43,9 +43,10 @@ appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-disable-auto-gc no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the default value of this parameter? And what is its purpose? I couldn't find it in the official redis.conf file. This issue: redis/redis#10561 mentions that it is just a testing parameter that exists in the internal code.

Copy link
Contributor Author

@Chiwency Chiwency Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default value is no, I mentioned this parameter just to clarify that I used this parameter. I actually set the parameter to yes in the backup target pod only while the users trigger the PITR. As I see, using the testing parameter doesn't affect the existing cluster and the users who don't need PITR.

@shanshanying
Copy link
Contributor

Hi @Chiwency ,

after some discussion with my colleges. we suggested:

  1. add new parameters to configConstraint (are they static/dynamic/immutble parameters) to make sure they are updated in the expected way.
  2. do not modify existing parameters' default values. otherwise, other users, who are using redis now, will be affected.
  3. instead, write a brief description on hwo to trigger a PITR. e.g create cluster -> reconfiguration (to update parametes) -> pitr.

@Chiwency
Copy link
Contributor Author

Chiwency commented Aug 27, 2024

Hi @Chiwency ,

after some discussion with my colleges. we suggested:

  1. add new parameters to configConstraint (are they static/dynamic/immutble parameters) to make sure they are updated in the expected way.
  2. do not modify existing parameters' default values. otherwise, other users, who are using redis now, will be affected.
  3. instead, write a brief description on hwo to trigger a PITR. e.g create cluster -> reconfiguration (to update parametes) -> pitr.

@shanshanying I get it.

  1. And I will revert the aof-timestamp-enabled to default value no
  2. I have a question about point 2, to implement PIRT which is unsupported in Redis, for users who open PITR's continuous backup, the backup script will change the two parameters aof-disable-auto-gc and aof-timestamp-enabled to yes for target redis instance, is OK? It won't affect the users who don's use PITR and existing cluster.
  3. For point 3, the two parameters just take effect in the target one instance, users don't know which instance is selected to backup, so I prefer to set them by script.

@shanshanying
Copy link
Contributor

shanshanying commented Aug 28, 2024

Hi @Chiwency,

  1. I have a question about point 2...

The point is: if the change of parameters will affect exsitng RUNNING redis clusters, we'd better change is in a more EXPLICIT way: tell users how to modify the values mannuly, either by OpsRequest, or edit some configmap, and write a doc to explain the side-effect.
I think we can update parameters through our "Reconfig" OpsRequest. Isn't it?

@Chiwency
Copy link
Contributor Author

The point is: if the change of parameters will affect exsitng RUNNING redis clusters, we'd better change is in a more EXPLICIT way: tell users how to modify the values mannuly, either by OpsRequest, or edit some configmap, and write a doc to explain the side-effect. I think we can update parameters through our "Reconfig" OpsRequest. Isn't it?

Hi @shanshanying
How can I understand the meaning of "affect exsitng RUNNING redis clusters"? As I revert the config file changes, users who don't trigger PITR backup will not be affected, and there will be no any change. But when users apply PITR to a exsiting RUNNING redis clusters, there are two scenarios at this point:

  1. IMPLICT parameter modification
    User trigger PITR by only one command, then the script will change the value of aof-timestamp-enabled and aof-disable-auto-gc. These two parameters are only applied to the AOF logs. From the user's perspective, the parameter changes will not be noticeable.
kbcli cluster update <redis-cluster-name> --backup-enabled=true --backup-method=aof
  1. EXPLICIT parameter modification
    User trigger PITR by "Reconfig" the aof-timestamp-enabled yes first, then trigger PITR backup.
kbcli cluster edit-config <redis-cluster-name>
kbcli cluster update <redis-cluster-name> --backup-enabled=true --backup-method=aof

Whether it's Option 1 or Option 2, documentation will be provided later to explain the changes. My key question here is whether these parameters should be managed by the users, do these parameter modifications count as affecting the existing cluster?
As I see, "update parameters through our "Reconfig" OpsRequest" may be an additional burden for the users, including understanding and operating. And PITR feature relies on these two parameters, so the two command's in Option 2 actually an atomic command.
What is your opinion on this?

@nayutah
Copy link
Contributor

nayutah commented Sep 3, 2024

The point is: if the change of parameters will affect exsitng RUNNING redis clusters, we'd better change is in a more EXPLICIT way: tell users how to modify the values mannuly, either by OpsRequest, or edit some configmap, and write a doc to explain the side-effect. I think we can update parameters through our "Reconfig" OpsRequest. Isn't it?

Hi @shanshanying How can I understand the meaning of "affect exsitng RUNNING redis clusters"? As I revert the config file changes, users who don't trigger PITR backup will not be affected, and there will be no any change. But when users apply PITR to a exsiting RUNNING redis clusters, there are two scenarios at this point:

  1. IMPLICT parameter modification
    User trigger PITR by only one command, then the script will change the value of aof-timestamp-enabled and aof-disable-auto-gc. These two parameters are only applied to the AOF logs. From the user's perspective, the parameter changes will not be noticeable.
kbcli cluster update <redis-cluster-name> --backup-enabled=true --backup-method=aof
  1. EXPLICIT parameter modification
    User trigger PITR by "Reconfig" the aof-timestamp-enabled yes first, then trigger PITR backup.
kbcli cluster edit-config <redis-cluster-name>
kbcli cluster update <redis-cluster-name> --backup-enabled=true --backup-method=aof

Whether it's Option 1 or Option 2, documentation will be provided later to explain the changes. My key question here is whether these parameters should be managed by the users, do these parameter modifications count as affecting the existing cluster? As I see, "update parameters through our "Reconfig" OpsRequest" may be an additional burden for the users, including understanding and operating. And PITR feature relies on these two parameters, so the two command's in Option 2 actually an atomic command. What is your opinion on this?

two independent operations is better for open-source users, first to set the flag, and then use the PITR, if one use PITR with aof timestamp disabled, please return an error code and exits immediately.

DP_save_backup_status_info "${total_size}" "${start_time}" "$(date +%s)"
}

function check_conf() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check the aof-timestamp-enabled here, it can fail and return an error when both either of them is disabled. Prompt the user to manually edit the config items with 'kbcli edit-config' or 'kubectl + reconfigure ops'

@Chiwency
Copy link
Contributor Author

Chiwency commented Sep 4, 2024

two independent operations is better for open-source users, first to set the flag, and then use the PITR, if one use PITR with aof timestamp disabled, please return an error code and exits immediately.

Hi @nayutah
As kbcli doesn't offer complex logic validation releate querying the API server, it relies on validation by the manager after generating the opsReq. Therefore, we cannot return error codes immediately in kbcli. Instead, errors are detected through cluster status queries. The same applies to the restore-in-time parameter. The time validation process is placed in manager controller. I think enhancing the validation capabilities of kbcli is a future optimization direction.

@ldming ldming merged commit 9e2368f into apecloud:release-0.9 Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants