Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose WAL-G backup by start time when cloning and deleting #1

Conversation

@OlleLarsson
Copy link

Thanks! Will take a look early next week!

Copy link

@OlleLarsson OlleLarsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link

@davidumea davidumea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@simonklb
Copy link
Author

simonklb commented Jun 18, 2024

This fixes the immediate problem of cleaning up backups based on start time instead of modified time: fcb00b1

Note that this is not at all backwards compatible with wal-e, so don't merge this! ⚠️

However, when working on this I found this: wal-g/wal-g#816 which suggests that there is already built-in support for this. Also, using delete retain instead of delete before could also slim down the script even further as it currently tries to calculate how many backups to retain on it's own instead of relying on wal-g.

@simonklb simonklb force-pushed the simonklb/choose-backup-by-start-time branch from fcb00b1 to bb77d67 Compare June 20, 2024 12:54
@simonklb
Copy link
Author

simonklb commented Jun 20, 2024

Here's a new patch that makes use of wal-g delete retain --after: bb77d67
PTAL!

Diff it with ignore whitespace and it'll look less intimidating. :)

@simonklb simonklb changed the title Choose WAL-G backup by start time when cloning Choose WAL-G backup by start time when cloning and deleting Jun 20, 2024
@simonklb simonklb force-pushed the simonklb/choose-backup-by-start-time branch from bb77d67 to 30b4d13 Compare June 20, 2024 14:36
@simonklb
Copy link
Author

simonklb commented Jun 20, 2024

FWIW the Spilo tests passes with these changes cherry-picked on master and with USE_WALG=true.

However, without USE_WALG=true some tests fail logs similar to this:

2024-06-20 16:08:59,805 INFO: Trying s3://testbucket/spilo/demo/wal/12/ for clone
INFO: 2024/06/20 16:08:59.822165 List backups from storages: [default]
INFO: 2024/06/20 16:08:59.823516 No backups found
2024-06-20 16:08:59,825 INFO: Trying s3://testbucket/spilo/demo/wal/11/ for clone
INFO: 2024/06/20 16:08:59.840049 List backups from storages: [default]
ERROR: 2024/06/20 16:08:59.841804 object 'base_000000010000000000000005_00000208/metadata.json' not found in storage
2024-06-20 16:08:59,843 ERROR: Clone failed

This is expected because if the backups are created with wal-e the metadata.json file is never created, so if we want to try to get this merged upstream we'll probably have to work around this as well or query if wal-e support is planned to be dropped in the near future.

Note that this is should only be an issue if you are trying to migrate from wal-e to wal-g. Using wal-e exclusively should still work.

@OlleLarsson
Copy link

...
This is expected because if the backups are created with wal-e the metadata.json file is never created, so if we want to try to get this merged upstream we'll probably have to work around this as well or query if wal-e support is planned to be dropped in the near future.

Note that this is should only be an issue if you are trying to migrate from wal-e to wal-g. Using wal-e exclusively should still work.

I think it makes sense to ask if they are planning on removing wal-e support.

@OlleLarsson
Copy link

Ugh, running into some issues building the image from this branch, seem to fail on building wal-g 🤔

[100%] Built target brotli
+ bash link_libsodium.sh

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
The command '/bin/sh -c bash /builddeps/dependencies.sh' returned a non-zero code: 2

@simonklb
Copy link
Author

Ugh, running into some issues building the image from this branch, seem to fail on building wal-g 🤔

[100%] Built target brotli
+ bash link_libsodium.sh

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
The command '/bin/sh -c bash /builddeps/dependencies.sh' returned a non-zero code: 2

wal-g/wal-g#1738

@simonklb
Copy link
Author

Blocked by: wal-g/wal-g#1738

Temp fix to allow for testing while we wait for the fix to be released: 8416ba1

@davidumea
Copy link

Blocked by: wal-g/wal-g#1738

Temp fix to allow for testing while we wait for the fix to be released: 8416ba1

I saw that the fix on garry-t's fork was merged into upstream wal-g. Should we switch to using upstream repo, or sync and create a branch on our fork instead?

@simonklb
Copy link
Author

simonklb commented Jul 3, 2024

Blocked by: wal-g/wal-g#1738
Temp fix to allow for testing while we wait for the fix to be released: 8416ba1

I saw that the fix on garry-t's fork was merged into upstream wal-g. Should we switch to using upstream repo, or sync and create a branch on our fork instead?

Waiting for a release to materialize: wal-g/wal-g#1739 (comment)

@simonklb simonklb requested a review from OlleLarsson August 21, 2024 14:29
@simonklb
Copy link
Author

A wal-g release has been created with the fix. Upstream Spilo has brought it in.

I've made some changes to better conform with Spilo master. If you think the changes look good I'll give it a try to get it merged upstream as well.

@@ -70,20 +72,25 @@ def choose_backup(backup_list, recovery_target_time):

match_timestamp = match = None
for backup in backup_list:
last_modified = parse(backup['last_modified'])
last_modified = parse(backup['start_time' if os.getenv('USE_WALG_RESTORE') == 'true' else 'last_modified'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Simon, should this be finish_time instead of start_time of the backup to compare with the recovery_target_time? Postgres requires that the recovery target "must be after the ending time of the base backup" (https://www.postgresql.org/docs/13/continuous-archiving.html).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up!

The rational behind picking start_time over finish_time is that one backup might have been started first but finished second. I'm actually not sure how likely parallel backups are, if they are possible at all. Please let me know if you know more on this matter!

Either way, after having read the documentation you linked to I think you are absolutely correct. It does not make sense to pick a base backup that has not yet finished before the "recovery target".

272fbf1

@simonklb simonklb requested a review from annielzy August 27, 2024 07:12
@@ -57,8 +60,7 @@ def fix_output(output):
started = None
for line in output.decode('utf-8').splitlines():
if not started:
started = re.match(r'^name\s+last_modified\s+', line)
started = re.match(r'^name\s+last_modified\s+', line) or re.match(r'^name\s+modified\s+', line)
started = re.match(r'^(backup_)?name\s+(last_)?modified\s+', line)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this matches up with the fix upstream so that we don't reintroduce the bug: zalando#1026

@@ -43,7 +43,7 @@ ATTEMPT=0
server_version="-1"
while true; do
[[ -z $wal_segment_backup_start ]] && wal_segment_backup_start=$($WAL_E backup-list 2> /dev/null \
| sed '0,/^name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')
| sed '0,/^(backup_\)?name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this matches up with the fix upstream so that we don't reintroduce the bug: zalando#1026

@Zash Zash self-assigned this Nov 26, 2024
@@ -43,7 +43,7 @@ ATTEMPT=0
server_version="-1"
while true; do
[[ -z $wal_segment_backup_start ]] && wal_segment_backup_start=$($WAL_E backup-list 2> /dev/null \
| sed '0,/^name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')
| sed '0,/^(backup_\)?name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, missing \ ?

Suggested change
| sed '0,/^(backup_\)?name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')
| sed '0,/^\(backup_\)?name\s*\(last_\)\?modified\s*/d' | sort -bk2 | tail -n1 | awk '{print $3;}' | sed 's/_.*$//')

@Zash
Copy link

Zash commented Nov 26, 2024

Hey @elastisys/goto-postgresql is the target branch still correct? (as opposed to needing a rebase)

@OlleLarsson
Copy link

OlleLarsson commented Nov 26, 2024

Hey @elastisys/goto-postgresql is the target branch still correct? (as opposed to needing a rebase)

It should now be based in this branch

@Zash
Copy link

Zash commented Dec 9, 2024

New PR rebasing these changes (since it was opened from @simonklb 's private repo):

@Zash Zash closed this Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants