
Enable postgres checksums #1185 (Draft)

ololobus wants to merge 6 commits into main from pg-checksums
Conversation

@ololobus ololobus (Member) commented Jan 26, 2022

PR to postgres repo: neondatabase/postgres#122

Resolve neondatabase/cloud#536

@ololobus ololobus force-pushed the pg-checksums branch 4 times, most recently from 6b08b8b to c647e13 Compare February 18, 2022 14:30
@ololobus ololobus changed the title WIP: enable postgres checksums Enable postgres checksums Feb 18, 2022
@ololobus (Member Author)

This seems to be working fine. With Konstantin's help, all tests pass now :)

I added a few more checksum checks (suggested by Anton):

  • check WAL record checksum in the redo loop after receiving it from pageserver
  • check page checksum in the redo loop after receiving it from pageserver and before applying any WAL records on top of it

A couple of things that still bother me:

  1. Checksum checks are now just turned on everywhere. Should we have an option in the pageserver to turn them on / off? That would allow us to set up a new pageserver with checksums in the cloud while keeping old clients intact. But we won't be able to migrate them to a new pageserver with checksums anyway without writing some sophisticated migration scripts, IIUC.

  2. The Postgres redo process now errors out when it receives a damaged page or WAL record. For some reason ERROR is escalated to FATAL (because of single-user mode?), so for me the whole redo process crashes with:

2022-02-17T18:23:32.481142Z ERROR pagestream{timeline=9c5c2bdb41d49035ad1efc98e2da0d57 tenant=e1c80ec1d3164bc29ce225d5d15e07a9}:get_page{rel=1663/13134/16396 blkno=0 req_lsn=0/2378F88}: wal-redo-postgres: 2022-02-17 21:23:32.481 MSK [2988786] FATAL:  invalid page in block 0 of relation base/13134/16396

2022-02-17T18:23:32.481195Z ERROR pagestream{timeline=9c5c2bdb41d49035ad1efc98e2da0d57 tenant=e1c80ec1d3164bc29ce225d5d15e07a9}:get_page{rel=1663/13134/16396 blkno=0 req_lsn=0/2378F88}: wal-redo-postgres: ---------------------------------------
seccomp: bad syscall 11

Which is probably not that good, as a single damaged tenant crashes the whole redo process for all tenants. Should we introduce some basic protocol changes, so that the redo process can notify the pageserver that a particular WAL record or page is damaged and we shouldn't try to push it again?
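Purely for illustration, a hypothetical shape such a reply could take (nothing like this exists in the PR, the names are made up):

// Hypothetical reply type for the wal-redo <-> pageserver protocol discussed
// above: report corruption instead of dying with FATAL. Not part of this PR.
enum RedoReply {
    Page(Vec<u8>),                   // redo succeeded, here is the page image
    CorruptWalRecord { lsn: u64 },   // a received WAL record failed its CRC
    CorruptPageImage { blkno: u32 }, // the base page image failed its checksum
}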

P.S. My new test_cluster_stop_start in e2e seems to be a bit flaky, or this is a symptom of a bug. I now stop with a kill signal, and after that compute cannot start within 120 seconds, which is weird. I think we have to dump compute logs in CI.

@knizhnik (Contributor)

Also I want to note that checksums can be useful for ... storage manager architecture. Having checksums, we may implement in-place updates for image layers; in this case checksums can be used to enforce page consistency after a failure. It is still not quite clear where to get the previous (correct) page image if a page is corrupted, but in-place updates could significantly reduce write amplification.

@stepashka stepashka requested review from knizhnik and kelvich and removed request for arssher April 7, 2022 09:39
@knizhnik knizhnik (Contributor) left a comment

This branch has to be rebased.
But in principle I am voting for switching on checksums in Zenith,
and Alexey's patch is doing it the right way.

@stepashka (Member)

This branch has conflicts that must be resolved

@ololobus, did you plan to resolve them?
Alternatively, @lubennikovaav can pick it up; please discuss with her.

@ololobus (Member Author) commented Apr 21, 2022

This branch has conflicts that must be resolved

@ololobus, did you plan to resolve them?

@stepashka, I rebased it already, but new conflicts arrived. No problem with rebasing again, of course.

The only reason I haven't merged it yet is that @hlinnaka didn't like this approach, if I understood him correctly.

In this PR we do checksum checks in the redo process: for incoming WAL records and pages. It means that we don't have a feedback channel for the pageserver to figure out that some WAL record or page is damaged. IIRC, @hlinnaka's idea was that we trust the pageserver <-> redo Postgres communication channel anyway, so we can check checksums on the pageserver side / in the Rust code.

So I wanted to move the CRC checks into pageserver code, but haven't started yet. Or do we want to merge it as is and change it later? As far as I can see, the pageserver starts a redo process per tenant, so one damaged page won't affect others and we will be able to investigate.

Another option is that I can remove these checks from the redo process, but keep stamping the CRC in the pageserver and the redo process, so checks only happen on the compute side. Then we can merge this first part mostly as is.

@knizhnik (Contributor)

So I wanted to move CRC checks into pageserver code, but haven't started it yet.

Please note that we are already calculating checksums on the pageserver side (for FPIs), so this code is already implemented in Rust. So it doesn't seem difficult to add this checksum calculation to get_page_at_lsn...

@ololobus ololobus force-pushed the pg-checksums branch 3 times, most recently from 2dd4bcb to 8b90dea Compare April 29, 2022 19:16
@ololobus (Member Author)

I've removed checksum checking from the redo loop. Now we do:

  • stamp checksum on page in the WAL redo loop
  • stamp checksum on page when we grab it from FPI during WAL ingest in pageserver
  • compute verifies checksum of every page after GetPage@LSN (that's the vanilla path)
  • verify checksum of WAL records received from safekeeper (done already in the WalStreamDecoder::poll_decode)

The only missing part is checksum checking in the pageserver, when it reads layer files from local disk / S3 back into memory. There are two issues for that: #987 and #1083.
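For illustration, a hypothetical sketch of that missing check (pg_checksum_page is the function added in this PR; pd_checksum assumed at bytes 8..10 of the standard 8 KiB page header, little-endian):

// Hypothetical helper, not part of this PR: verify a page read back from a
// layer file against its stored pd_checksum before using it.
fn page_checksum_ok(page: &[u8], blkno: u32) -> bool {
    assert_eq!(page.len(), 8192);
    let stored = u16::from_le_bytes([page[8], page[9]]);
    // Pages written before checksums were enabled carry pd_checksum == 0.
    stored == 0 || stored == pg_checksum_page(page, blkno)
}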

So if no objections (cc @hlinnaka @kelvich @knizhnik), I'd merge this compute-related part of checksums now and we can add pageserver files checksum later.

IIUC, when we merge this, old tenants should continue to work, as the control file in their basebackup doesn't have data checksums turned on. Yet we do a fresh initdb on each redo process startup, so for old tenants the WAL redo process will stamp checksums, but compute just won't check them.

@knizhnik (Contributor) commented May 2, 2022

So if no objections (cc @hlinnaka @kelvich @knizhnik), I'd merge this compute-related part of checksums now and we can add pageserver files checksum later.

I do not have objections, but I want to note that enabling checksums actually implies wal_log_hints. Yes, we still have wal_log_hints=on by default, but we have discussed that it dramatically increases WAL size (~3x) and that it is safe to disable it.

But with checksums enabled there is no way to turn off WAL logging of hint bits. The only positive effect is that there is no need to discuss whether it is safe to disable wal_log_hints or not.

@ololobus (Member Author) commented May 9, 2022

Just tested locally that enabling checksums doesn't affect existing tenants. They still have checksums disabled and work as before.

So I'm going to merge this PR, we can iterate on it and add missing parts as mentioned above.

@kelvich (Contributor) commented May 9, 2022

So does this patch change the number of FPIs from compute nodes?

@ololobus (Member Author) commented May 9, 2022

So does this patch change the number of FPIs from compute nodes?

Enabling data checksums effectively sets wal_log_hints=on. In the cloud we have wal_log_hints=off, and this setting will be ignored. So yeah, this will increase the number of FPIs.

@kelvich (Contributor) commented May 9, 2022

Enabling data checksums effectively sets wal_log_hints=on

Can we change this behavior in postgres? IIUC wal_log_hints=on would dramatically affect performance and storage size. Is it correct that in our scenario we are safe without FPIs when CRC is enabled?

@ololobus ololobus force-pushed the pg-checksums branch 2 times, most recently from c3efca5 to 2127df1 Compare July 6, 2022 11:58
@ololobus (Member Author) commented Jul 6, 2022

I've modified the page checksum code in the pageserver in two ways:

  • Refactored pg_checksum_page to work the same way as the C Postgres implementation, i.e. a checksum that is already present on the page doesn't affect the computation. I clone only the 32-bit chunk where the checksum is set, so I assume performance shouldn't be affected. Also added some unit tests to check that it behaves as expected.
  • Added a per-tenant config option data_checksums so that we don't affect old tenants. It's false by default, so we don't check checksums for old tenants, where they are not present. It's set to true for all new tenants explicitly, though.

@knizhnik @lubennikovaav please have a look at the two latest commits:

@LizardWizzard you can have a look at the last one, where the per-tenant config is modified :)

I will check locally that it's really backward compatible with old tenants later, if the logic seems OK to everyone.

@LizardWizzard LizardWizzard (Contributor) left a comment

I've left a few comments. The concerning thing is the incorrect usage of unsafe.

Also, thinking about the name of the config param: the current one mimics Postgres, but typically booleans are named with something like is_ or _enabled. So leaving it up to you :)

 * excluded from the calculation and preserved.
 */
pub fn pg_checksum_page(data: &[u8], blkno: u32) -> u16 {
    let page = unsafe { std::mem::transmute::<&[u8], &[u32]>(data) };
Contributor

I'm against this kind of unsafe usage. The function receives just a u8 slice and transmutes it without any additional checks. This won't work for all possible inputs. There is no SAFETY: style comment.

Please check this playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f6c1aa72492e141e1b949bcaab86e797

To see the UB report please run Tools -> MIRI

For more info, see the transmute docs and follow the links to other documentation regarding unsafe.

Are there any benefits in avoiding existing implementations? If this is the FNV hash, there is an existing implementation of Rust's standard Hasher trait.
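For example, roughly like this with the fnv crate (a sketch; note it computes plain 64-bit FNV-1a, which, as noted below, is not the variant Postgres uses for page checksums):

use fnv::FnvHasher;
use std::hash::Hasher;

// Plain FNV-1a over the page bytes via the standard Hasher trait.
fn fnv1a_of_page(page: &[u8]) -> u64 {
    let mut hasher = FnvHasher::default();
    hasher.write(page);
    hasher.finish()
}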

Contributor

One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes, and either check that the chunks iterator's remainder is empty or assert on the buffer size before creating the iterator.
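A minimal sketch of that, assuming the fixed 8 KiB Postgres page size:

// Safe alternative to the transmute: reinterpret the page as native-endian
// u32 words with chunks_exact + from_ne_bytes.
fn page_as_u32_words(data: &[u8]) -> Vec<u32> {
    assert_eq!(data.len(), 8192, "expected a full 8 KiB Postgres page");
    data.chunks_exact(4)
        .map(|c| u32::from_ne_bytes(c.try_into().unwrap()))
        .collect()
}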

@ololobus ololobus (Member Author) Jul 6, 2022

Hm, the fact that the Postgres hashing algorithm is based on FNV doesn't mean that it's identical. The comment in Postgres says:

 * PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
 * high bits - high order bits in input data only affect high order bits in
 * output data. To resolve this we xor in the value prior to multiplication
 * shifted right by 17 bits. 

so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums. We can try this fnv crate, but I don't really think it will work out, do you?

This won't work for all possible inputs.

Yes, because the page size is fixed at 8 KB. We can add a size check, though, just to be on the safe side, but it won't change anything really, except slowing it down a bit, I suppose.
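For reference, the mixing step described in that Postgres comment boils down to something like this (a minimal sketch; FNV_PRIME is 16777619, as in Postgres):

const FNV_PRIME: u32 = 16777619;

// Postgres-modified FNV step: xor the shifted intermediate value back in
// after the multiply to compensate for FNV-1a's poor high-bit mixing.
fn checksum_comp(checksum: u32, value: u32) -> u32 {
    let tmp = checksum ^ value;
    tmp.wrapping_mul(FNV_PRIME) ^ (tmp >> 17)
}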

Member Author

One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes, and either check that the chunks iterator's remainder is empty or assert on the buffer size before creating the iterator.

u32::from_ne_bytes should work, but won't it be slower? It would involve copying the whole page chunk by chunk each time, which isn't desired behavior, as checksum calculation may happen quite often. Or am I missing something and it's optimized / we can optimize its usage?

Contributor

so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums.

Gotcha. I'm just trying to avoid writing our own copies of things that are already written and well maintained by the community, but I see that it is not possible here.

but it won't change anything really

My opinion here is that we cannot permit UB in our code. If this function is used improperly it can cause UB in safe code, which is a violation of a core Rust principle. Such functions should be marked as unsafe and clearly document what guarantees they expect. Even then, it just moves the responsibility to run these checks to the caller, so they can't be avoided.

Regarding performance, if we have the size guarantee the compiler may actually decide to optimize away some checks; not sure about the copy, I think it's worth experimenting. Actually, it would be a good idea to have a benchmark for this function using something like criterion.
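A minimal criterion benchmark sketch (assuming pg_checksum_page from this PR is in scope):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_pg_checksum_page(c: &mut Criterion) {
    // An arbitrary 8 KiB page filled with non-zero bytes.
    let page = vec![0xABu8; 8192];
    c.bench_function("pg_checksum_page", |b| {
        b.iter(|| pg_checksum_page(black_box(&page), black_box(0)))
    });
}

criterion_group!(benches, bench_pg_checksum_page);
criterion_main!(benches);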


@ololobus ololobus (Member Author) Jul 6, 2022

The current one mimics Postgres, but typically booleans are named with something like is_ or _enabled.

I also thought about it; adding _enabled makes sense to me.

@@ -412,6 +412,9 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
    tenant_conf.compaction_target_size = request_data.compaction_target_size;
    tenant_conf.compaction_threshold = request_data.compaction_threshold;

    // Turn on data checksums for all new tenants
    tenant_conf.data_checksums = Some(request_data.data_checksums.unwrap_or(true));
Contributor

This looks a bit hacky to me; it doesn't look like permanent code. Some time in the future we'll need to clean it up, but there is no plan for how to do it. Should we modify the deployment scripts for new pageserver setups so that it is enabled by default?

Another solution is to stamp every existing tenant with a proper config disabling the checksums, and enable them by default in the defaults module. There is the /v1/tenant/config API, so there is no need to manually change files on the filesystem.

Another option is to hardcode this option in the console for the tenant create API call. (This is the way I imagined it when you asked about it.)

What do you think?

Member Author

Another solution is to stamp every existing tenant with a proper config disabling the checksums, and enable them by default in the defaults module. There is the /v1/tenant/config API, so there is no need to manually change files on the filesystem.

This sounds really bug-prone to me, as anything can go wrong during this migration process. Also, new tenants could be created right during the new code deploy, so the migration will be tricky.

Another option is to hardcode this option in the console for the tenant create API call. (This is the way I imagined it when you asked about it.)

If this option is recommended to be turned on by default, why is it better to let the API caller do this? IMO, reversing it is better, i.e. you can turn it off if needed.

That said, I agree that it looks a bit ugly, and I'd remove it and set the default to true if we didn't need backward compatibility. I personally think that neither way is clearly better.

We can also make it configurable in the pageserver config, so that here we would just look it up and set the value specified there?

Contributor

Also, new tenants could be created right during the new code deploy,

Yeah, that needs to be combined either with the current solution or with the API setting it to true for every new tenant. One benefit of this approach is that we move to the desired state now and there is no need to remember to clean it up in the future. But I agree, the migration is a bit tricky.

If this option is recommended to be turned on by default, why is it better to let the API caller do this? IMO, reversing it is better, i.e. you can turn it off if needed.

Your arguments make sense. Using it through the API without enabling it by default just moves the hack away from the pageserver and adds it to the console. I agree that both ways are hacky enough, so I'm ok with any solution that won't keep this code in the pageserver for years.

We can also make it configurable in the pageserver config

It should already be possible. The default tenant config is part of the pageserver config.

@knizhnik (Contributor) commented Jul 6, 2022

As far as I remember, the main concern @hlinnaka had against this patch was that enabling Postgres checksums doesn't protect all data: both delta and image layers contain more data besides the page images themselves. For example, a delta layer contains a B-Tree. WAL records themselves are protected by CRC, but Value and ZenithWalRecord (which are serialized) are not. So it looks like we really protect only half of the data.

@ololobus (Member Author) commented Jul 6, 2022

So it looks like we really protect only half of the data.

With this patch we protect two important paths for data pages and WAL records:

  • network failures between pageserver and compute
  • and on-disk data corruption on the pageserver side

Having a single checksum per layer file doesn't seem to be a good idea, as only one page could be corrupted. So we need a more fine-grained checksum, and the remaining parts of the data could be protected separately.

let chunk_i = i * N_SUMS + j;
let chunk: u32;
if chunk_i == 2 {
    let mut chunk_copy = page[chunk_i].to_le_bytes();
Contributor

I think this deserves more comments. It differs from the original implementation and there is no explanation; it requires some deciphering.

Just an idea: we already generate a proper struct for PageHeaderData, so it is possible to use it here, or at least reference it.

Comment on lines 53 to 56
let mut chunk_copy = page[chunk_i].to_le_bytes();
chunk_copy[0] = 0;
chunk_copy[1] = 0;
chunk = u32::from_le_bytes(chunk_copy);
@funbringer funbringer (Contributor) Jul 6, 2022

This PR has seen a handful of comments that revolve around various nits and styling issues, but this particular piece of code is very broken in multiple ways.

First of all, why not just

value & 0xFFFF0000

instead of this verbosity? The original code never said anything about endianness or LE in particular, but this striking detail drew my attention to the broader question: is this even correct?

Speaking of the original postgres code, it's very different. It uses PageHeaderData for resetting the checksum, which is hacky but still correct. But here we're taking it for granted that the lower half (I omit the to-le/from-le part) of a certain u32 will contain pd_checksum, which is not the case for big-endian!

Okay, I must admit that we'll probably never want to run our code on a big-endian arch nowadays, but we really shouldn't let such things slip in our code base unnoticed, without even a tiny comment clarifying that only little-endian is supported.

Also there's only a hint that this code has been adapted from postgres (like, you could grep for pg_checksum_page, but you'll soon learn that it's very different, like I did). In cases like this I'd love to see a comment saying "this code has not been copied verbatim from <original function name OR github URL>".

So here's a proof that mixing field- and ptr-to-u32-based access patterns is a bad idea:

#include <stdint.h>
#include <stdio.h>

struct Data {
    // Desugared PageXLogRecPtr
    uint32_t a;
    uint32_t b;

    // pd_checksum
    uint16_t c;

    // pd_flags
    uint16_t d;
};

int main() {
    // Report target endianness (referenced in the output below)
    int one = 1;
    printf("is little endian: %d\n", *(char *)&one == 1);

    struct Data data = (struct Data) {
        .a = 1,
        .b = 2,
        .c = 3,
        .d = 4,
    };

    uint32_t *ptr = (uint32_t *) &data;
    ptr[2] &= 0xffff0000;

    printf("%d\n", ((volatile struct Data *)&data)->c);
    return 0;
}

I picked a big-endian mips for the demonstration. For the cross-toolchain I'm using the zig compiler because it's super-convenient (I highly recommend it). We'll also need to install a pack of qemu-user interpreters (e.g. https://packages.debian.org/en/sid/qemu-user-static) to run the actual binaries.

Now, check this out:

$ zig cc test.c && ./a.out 
is little endian: 1
0

$ zig cc test.c --target=mips-linux-musl && ./a.out
is little endian: 0
3

Oops!

@ololobus ololobus (Member Author) Jul 6, 2022

but this particular piece of code is very broken in multiple ways.

Thanks for looking at it and for such a detailed comment! Assuming that we are not running any of our service components on big-endian archs and, yes, are likely not going to, what are the multiple ways we are talking about? Sorry if I missed something in your comment. Because yes, it is certainly not intended to run on big-endian, and that's worth marking / commenting.

Contributor

It is not the only place in the Neon code where we assume that the target system is little-endian. Maybe it is not good, but I do not think it is critical at this moment.
It is possible to find a big-endian machine nowadays, but not in the cloud, and I do not see any reason why somebody would want to run Neon on such a system.
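One cheap way to make that assumption explicit would be a compile-time guard, e.g. (a sketch):

// Fail the build on big-endian targets instead of silently computing
// wrong checksums.
#[cfg(not(target_endian = "little"))]
compile_error!("page checksum code assumes a little-endian target");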

@knizhnik (Contributor) commented Jul 6, 2022

pg_checksum_page ignores the checksum anyway, and we override the current checksum one line below, so it's not needed, IIUC

It is not true. pg_checksum is applied to the whole page, including the checksum field.
This is the original Postgres code:

	save_checksum = cpage->phdr.pd_checksum;
	cpage->phdr.pd_checksum = 0;
	checksum = pg_checksum_block(cpage);
	cpage->phdr.pd_checksum = save_checksum;

@ololobus (Member Author) commented Jul 6, 2022

It is not true. pg_checksum is applied to the whole page, including the checksum field

I mean that I changed your Rust version of pg_checksum_page to do the same: it zeroes the checksum field and calculates the checksum, so it's not needed to do it twice in page_set_checksum. I also added a test which checks that having a checksum on the page doesn't affect the calculation, and it passes fine. Am I missing something?
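A minimal sketch of that kind of test (hypothetical page contents; pd_checksum assumed at bytes 8..10):

#[test]
fn stamped_checksum_does_not_affect_calculation() {
    let mut page = vec![0u8; 8192];
    page[100] = 0xAB; // some arbitrary payload bytes
    page[8191] = 0xCD;

    let before = pg_checksum_page(&page, 0);
    // Stamp the checksum into pd_checksum and recompute.
    page[8..10].copy_from_slice(&before.to_le_bytes());
    let after = pg_checksum_page(&page, 0);

    assert_eq!(before, after);
}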

@knizhnik (Contributor) commented Jul 6, 2022

So it looks like we really protect only half of the data.

With this patch we protect two important paths for data pages and WAL records:

* network failures between pageserver and compute

* and on-disk data corruption on the pageserver side

It does not completely protect data from disk failures. If corruption affects a B-Tree page, then Postgres page-level checksums will not help to detect such an error.

Having a single checksum per layer file doesn't seem to be a good idea, as only one page could be corrupted. So we need a more fine-grained checksum, and the remaining parts of the data could be protected separately.

File-level checksums are really unusable, and not only because calculating them would require loading the whole file into memory. So we need checksums for individual B-Tree pages.
Maybe we need our own checksum mechanism, but it is not clear where to store such a checksum. As the Postgres page header already contains a field for the checksum, it seems natural to also use it and implement a Postgres-compatible checksum calculation algorithm. But I am not sure.

knizhnik and others added 5 commits July 7, 2022 19:44
We need checksums to verify data integrity when we read data from an
untrusted place (e.g. local disk) or via an untrusted communication
channel (e.g. network). At the same time, we trust the pageserver <->
redo process communication channel, as it is just a pipe.

Here we enable calculation of data checksums in the WAL redo process and
when we extract FPIs during WAL ingestion. The compute node (Postgres)
will verify the checksum of every page after receiving it back from the
pageserver, so it is pretty similar to how vanilla Postgres checks them.

There are two other places where we should verify checksums to detect
data corruption earlier:
- when we receive WAL records from safekeepers (already implemented,
  see WalStreamDecoder::poll_decode)
- when we write layer files to disk and read them back into memory from
  local disk or S3
@ololobus (Member Author) commented Jul 7, 2022

Added safety notes, unsafe where needed, and a criterion benchmark. Confirmed @knizhnik's result that having an if in the loop slows it down by ~3x. With a bit of refactoring, got it back down to ~700ns.
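A sketch of the shape of that refactoring (not the exact PR code): clear the pd_checksum half-word once up front so the hot loop stays branch-free.

// Load the 8 KiB page as 2048 u32 words and zero out pd_checksum once,
// instead of branching on the word index inside the checksum loop.
// Little-endian layout is assumed: pd_checksum is the low half of word 2.
fn load_page_words(page: &[u8; 8192]) -> [u32; 2048] {
    let mut words = [0u32; 2048];
    for (i, chunk) in page.chunks_exact(4).enumerate() {
        words[i] = u32::from_ne_bytes(chunk.try_into().unwrap());
    }
    words[2] &= 0xFFFF_0000;
    words
}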

@LizardWizzard (Contributor)

As discussed, moving to draft.

@LizardWizzard LizardWizzard marked this pull request as draft March 21, 2023 15:19
@ololobus ololobus marked this pull request as ready for review January 9, 2025 18:15
@ololobus ololobus requested review from a team as code owners January 9, 2025 18:15
@ololobus ololobus requested a review from jcsp January 9, 2025 18:15
@ololobus ololobus marked this pull request as draft January 9, 2025 18:16
@ololobus ololobus removed the request for review from jcsp January 9, 2025 18:16