Enable postgres checksums #1185
base: main
Conversation
Force-pushed from 6b08b8b to c647e13
This seems to be working fine. With Konstantin's help, all tests pass now :) I added a few more checksum checks (suggested by Anton):
A couple of things that still bother me:
Which is probably not that good, as a single damaged tenant crashes the whole redo process for all tenants. Should we introduce some basic protocol changes, so that the redo process can notify the pageserver that a particular WAL record or page is damaged and we shouldn't try to push it again?
P.S. My new test_cluster_stop_start in e2e seems to be a bit flaky, or this is a symptom of a bug. I now do stop with a kill signal, and after that compute cannot start within 120 seconds, which is weird. I think we have to dump compute logs in CI.
Also I want to note that checksums can be useful for ... storage manager architecture. Having checksums, we may implement in-place updates for image layers. In this case checksums can be used to enforce page consistency after a failure. It is still not quite clear where to get the previous (correct) page image if a page is corrupted. But in-place updates can significantly reduce write amplification.
This branch has to be rebased.
But in principle I am voting for switching on checksums in Zenith.
And Alexey's patch is doing it the right way.
@ololobus, did you plan to resolve them?
@stepashka, I rebased it already, but new conflicts arrived. No problem with rebasing again, of course. The only reason why I still haven't merged it yet is that @hlinnaka didn't like this approach, if I understood him correctly.

In this PR we do checksum checks in the redo process: for incoming WAL records and pages. It means that we don't have a feedback channel for the pageserver to figure out that some particular WAL record or page is damaged. IIRC, @hlinnaka's idea was that we anyway trust the pageserver <-> redo process channel, so I wanted to move the CRC checks into pageserver code, but I haven't started that yet.

Or do we want to merge it as is and change it later? As far as I can see, the pageserver starts a redo process per tenant, so one damaged page won't affect others and we will be able to investigate. Another option is that I can remove these checks from the redo process, but keep stamping the CRC in the pageserver and the redo process, so checks only happen on the compute side. Then we can merge this first part mostly as is.
Please note that we are already calculating checksums on the pageserver side (for FPI), so this code is already implemented in Rust. So it should not be difficult to add this checksum calculation to get_page_at_lsn...
Force-pushed from 2dd4bcb to 8b90dea
I've removed checksum checking from the redo loop. Now we do:
The only missing part is checksum checking in the pageserver, when it reads layer files from local disk / S3 back into memory. There are two issues for doing that: #987 and #1083. So if there are no objections (cc @hlinnaka @kelvich @knizhnik), I'd merge this compute-related part of checksums now and we can add checksums for pageserver files later.

IIUC, when we merge this, old tenants should continue to work, as the control file in their basebackup doesn't have data checksums turned on. Yet, we do a fresh initdb per each redo process startup, so for old tenants their WAL redo process will stamp checksums, but compute just won't check them.
I do not have objections, but I want to note that enabling checksums actually implies wal_log_hints. Yes, we still have wal_log_hints=on by default, but we have discussed that it dramatically increases WAL size (~3 times) and that it is safe to disable it. With checksums enabled, however, there is no way to turn off WAL logging of hint bits. The only positive effect is that there is no need to discuss whether it is safe to disable wal_log_hints or not.
Just tested locally that enabling checksums doesn't affect existing tenants. They still have checksums disabled and work as before. So I'm going to merge this PR; we can iterate on it and add the missing parts mentioned above.
So does this patch change the number of FPIs from compute nodes?
Enabled data checksums effectively set wal_log_hints=on.
Can we change this behavior in postgres? IIUC ...
Force-pushed from c3efca5 to 2127df1
I've modified the page checksum code in the pageserver in two ways:
@knizhnik @lubennikovaav please have a look at the two latest commits. @LizardWizzard you can have a look at the last one, where the per-tenant config is modified :) I will check locally later that it's really backward compatible with old tenants, if the logic seems OK to everyone.
I've left a few comments. The concerning thing is the incorrect usage of unsafe.
I'm also thinking about the name of the config param. The current one mimics Postgres, but typically booleans are named with something like is_ or _enabled. So leaving that up to you :)
libs/utils/src/pg_checksum_page.rs
Outdated
 * excluded from the calculation and preserved.
 */
pub fn pg_checksum_page(data: &[u8], blkno: u32) -> u16 {
    let page = unsafe { std::mem::transmute::<&[u8], &[u32]>(data) };
I'm against this kind of unsafe usage. The function receives just a u8 slice and transmutes it without any additional checks. This won't work for all possible inputs. There is no SAFETY: style comment.
Please check this playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f6c1aa72492e141e1b949bcaab86e797
To see the UB report, run Tools -> MIRI there.
For more info see the transmute docs and follow the links to other documentation regarding unsafe.
Are there any benefits in avoiding existing implementations? If this is the FNV hash, there is an existing implementation of Rust's standard Hasher trait for it.
One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes, and either check that the chunks iterator's remainder is empty or assert on the buffer size before creating the iterator. See the sketch below.
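A minimal sketch of that approach, assuming the page size is the usual 8 KB Postgres BLCKSZ and keeping the native-endian view the transmute had; names here are illustrative, not the PR's actual code:

```rust
use std::convert::TryInto; // only needed on edition 2018

const BLCKSZ: usize = 8192; // assumed Postgres page size
const N_U32S: usize = BLCKSZ / 4;

/// Reinterpret an 8 KB page as u32 words without `unsafe`.
fn page_as_u32_words(data: &[u8]) -> [u32; N_U32S] {
    // Fail loudly instead of transmuting an arbitrary slice.
    assert_eq!(data.len(), BLCKSZ, "page must be exactly BLCKSZ bytes");

    let mut words = [0u32; N_U32S];
    for (w, chunk) in words.iter_mut().zip(data.chunks_exact(4)) {
        // chunks_exact(4) guarantees each chunk is 4 bytes, so the
        // conversion to a fixed-size array cannot fail.
        *w = u32::from_ne_bytes(chunk.try_into().unwrap());
    }
    words
}
```

This does copy the page once, which is exactly the concern raised below; a benchmark (also suggested below) would show whether that matters.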
Hm, the fact that the Postgres hashing algorithm is based on FNV doesn't mean that it's identical. The comment in Postgres says:
* PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
* high bits - high order bits in input data only affect high order bits in
* output data. To resolve this we xor in the value prior to multiplication
* shifted right by 17 bits.
so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums. We can try the fnv crate, but I don't really think that it will work out, do you?
This won't work for all possible inputs.
Yes, because the page size is fixed to 8 KB. We can add a size check though, just to be on the safe side, but it won't change anything really, except slowing it down a bit, I suppose.
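On the FNV point above, a sketch of the modified mixing step in Rust, mirroring the CHECKSUM_COMP step from Postgres' checksum_impl.h as I understand it (worth double-checking against the original):

```rust
const FNV_PRIME: u32 = 16777619;

/// One round of the Postgres page-checksum mix. Plain FNV-1a would be
/// `(checksum ^ value) * FNV_PRIME`; Postgres additionally xors in the
/// pre-multiplication value shifted right by 17 bits to improve mixing
/// of the high-order bits, so the fnv crate's output would not match.
fn checksum_comp(checksum: u32, value: u32) -> u32 {
    let tmp = checksum ^ value;
    tmp.wrapping_mul(FNV_PRIME) ^ (tmp >> 17)
}
```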
One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes.
u32::from_ne_bytes should work, but won't it be slower? It would involve copying the whole page chunk by chunk each time, which isn't desirable, as checksum calculation may happen quite often. Or am I missing something and it's already optimized / we can optimize its usage?
so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums.
Gotcha. I'm just trying to avoid writing our own copies of things that are already written and well maintained by the community, but I see that it is not possible here.
but it won't change anything really
My opinion here is that we cannot permit UB in our code. If this function is used improperly it can cause UB in safe code, which is a violation of Rust's core principle. Such functions should be marked as unsafe and should clearly document what guarantees they expect. Even then, that just moves the responsibility for running these checks to the caller, so they can't be avoided.
Regarding performance: if we have the size guarantee, the compiler may actually decide to optimize away some checks; I'm not sure about the copy, I think it's worth experimenting. It is actually a good idea to have a benchmark for this function using something like criterion, as sketched below.
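A minimal criterion sketch for such a benchmark; the bench file path and the `utils::pg_checksum_page` import path are assumptions, not the repo's actual layout:

```rust
// e.g. libs/utils/benches/pg_checksum_page.rs (hypothetical path)
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use utils::pg_checksum_page::pg_checksum_page; // assumed export path

fn bench_pg_checksum_page(c: &mut Criterion) {
    // An arbitrary 8 KB page filled with non-trivial data.
    let page: Vec<u8> = (0..8192u32).map(|i| (i % 251) as u8).collect();

    c.bench_function("pg_checksum_page/8k", |b| {
        // black_box prevents the compiler from folding the call away.
        b.iter(|| pg_checksum_page(black_box(&page), black_box(42)))
    });
}

criterion_group!(benches, bench_pg_checksum_page);
criterion_main!(benches);
```

Running both the transmute-based and the chunks_exact-based variants through the same benchmark would settle the copy question above.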
libs/utils/src/pg_checksum_page.rs
Outdated
    tmp.wrapping_mul(FNV_PRIME) ^ (tmp >> 17)
}

/*
Please use Rust-style doc comments: https://doc.rust-lang.org/rust-by-example/meta/doc.html#doc-comments
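For example (just a formatting sketch; the description text is taken from the existing block comment):

```rust
/// Compute the checksum for a Postgres page.
///
/// The checksum field itself (`pd_checksum`) is excluded from the
/// calculation and preserved, matching the C implementation.
pub fn pg_checksum_page(data: &[u8], blkno: u32) -> u16 {
    todo!()
}
```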
Current one mimics postgres but typically booleans are named with something like is_ or _enabled.
I also thought about it; adding _enabled makes sense to me.
pageserver/src/http/routes.rs
Outdated
@@ -412,6 +412,9 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
    tenant_conf.compaction_target_size = request_data.compaction_target_size;
    tenant_conf.compaction_threshold = request_data.compaction_threshold;

    // Turn on data checksums for all new tenants
    tenant_conf.data_checksums = Some(request_data.data_checksums.unwrap_or(true));
This looks a bit hacky to me; it doesn't look like permanent code. Some time in the future we'll need to clean it up, but there is no plan for how to do it. Should we modify the deployment scripts for new pageserver setups so that it is enabled by default?
Another solution is to stamp every existing tenant with a proper config disabling the checksums and enable them by default in the defaults module. There is a /v1/tenant/config API, so there is no need to manually change files on the filesystem.
Another option is to hardcode this option in the console for the tenant create API call. (This is the way I imagined it when you asked about it.)
What do you think?
Another solution is to stamp every existing tenant with a proper config disabling the checksums and enable them by default in the defaults module. There is a /v1/tenant/config API, so there is no need to manually change files on the filesystem.
This sounds really bug-prone to me, as anything can go wrong during this migration process. Also, new tenants could be created right during the new code deploy, so the migration would be tricky.
Another option is to hardcode this option in console for tenant create API call. (This is the way I imagined when you asked about it)
If this option is recommended to be turned on by default, why is it better to let the API caller do this? IMO, reversing it is better, i.e. you can turn it off if needed.
Still, I agree that it looks a bit ugly, and I'd remove it and set the default to true if we didn't need backward compatibility. I personally think that neither way is clearly better.
We can also make it configurable in the pageserver config, so here we would look it up and set the value specified there?
And also new tenants could be created right during the new code deploy,
Yeah, that needs to be combined either with the current solution or with the API setting it to true for every new tenant. One benefit of this approach is that we move to the desired state now and there is no need to remember to clean it up in the future. But I agree, the migration is a bit tricky.
If this option is recommended to be turned on by default, why is it better to let the API caller to do this? IMO, reversing it is better, i.e. you can turn it off if needed.
Your arguments make sense. Using it through the API without enabling it by default just moves the hack away from the pageserver, but adds it to the console. I agree that both ways are hacky enough, so I'm ok with any solution that won't keep this code in the pageserver for years.
We can also make it configurable in the pageserver config
It should already be possible. The default tenant config is part of the pageserver config.
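If it helps, a sketch of the "enable it by default in defaults" variant being discussed; the constant and function names here are hypothetical:

```rust
pub mod defaults {
    /// New tenants get data checksums unless the request disables them
    /// explicitly (hypothetical constant, alongside the other tenant-conf
    /// defaults).
    pub const DEFAULT_DATA_CHECKSUMS: bool = true;
}

/// What tenant_create_handler would do instead of hardcoding `true`.
fn data_checksums_for_new_tenant(requested: Option<bool>) -> Option<bool> {
    Some(requested.unwrap_or(defaults::DEFAULT_DATA_CHECKSUMS))
}
```

Existing tenants would keep whatever is (or isn't) recorded in their per-tenant config, so the backward-compatibility behaviour discussed above should stay the same.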
As far as I remember, @hlinnaka's main concern about this patch was that enabling Postgres checksums doesn't protect all data: both delta and image layers contain more data than just the page images themselves. For example, a delta layer contains a B-Tree. WAL records themselves are protected by CRC, but ...
With this patch we protect two important paths for data pages and WAL records:
Having a single checksum per layer file doesn't seem like a good idea, as only one page could be corrupted. So we need a more fine-grained checksum, and the remaining parts of the data could be protected separately.
libs/utils/src/pg_checksum_page.rs
Outdated
            let chunk_i = i * N_SUMS + j;
            let chunk: u32;
            if chunk_i == 2 {
                let mut chunk_copy = page[chunk_i].to_le_bytes();
I think it deserves more comments. This differs from the original implementation and there is no explanation; it requires some deciphering.
Just an idea: we already generate a proper struct for PageHeaderData, so it is possible to use it here, or at least reference it.
libs/utils/src/pg_checksum_page.rs
Outdated
                let mut chunk_copy = page[chunk_i].to_le_bytes();
                chunk_copy[0] = 0;
                chunk_copy[1] = 0;
                chunk = u32::from_le_bytes(chunk_copy);
This PR has seen a handful of comments that revolve around various nits and styling issues, but this particular piece of code is very broken in multiple ways.
First of all, why not just value & 0xFFFF0000 instead of this verbosity? The original code never said anything about endianness or LE in particular, but this striking detail drew my attention to the broader question: is this even correct?
Speaking of the original postgres code, it's very different. It uses PageHeaderData for resetting the checksum, which is hacky but still correct. But here we're taking it for granted that the lower half (I omit the to-le/from-le part) of a certain u32 will contain pd_checksum, which is not the case for big-endian!
Okay, I must admit that we'll probably never want to run our code on a big-endian arch nowadays, but we really shouldn't let such things slip in our code base unnoticed, without even a tiny comment clarifying that only little-endian is supported.
Also, there's only a hint that this code has been adapted from postgres (like, you could grep for pg_checksum_page, but you'll soon learn that it's very different, like I did). In cases like this I'd love to see a comment saying "this code has not been copied verbatim from <original function name OR github URL>".
So here's a proof that mixing field- and ptr-to-u32-based access patterns is a bad idea:
```c
#include <stdint.h>
#include <stdio.h>

struct Data {
    // Desugared PageXLogRecPtr
    uint32_t a;
    uint32_t b;
    // pd_checksum
    uint16_t c;
    // pd_flags
    uint16_t d;
};

int main() {
    struct Data data = (struct Data) {
        .a = 1,
        .b = 2,
        .c = 3,
        .d = 4,
    };

    // Report host endianness, matching the output shown below.
    int probe = 1;
    printf("is little endian: %d\n", *(char *) &probe == 1);

    // Clear the "lower half" of the third u32, like the Rust code does.
    uint32_t *ptr = (uint32_t *) &data;
    ptr[2] &= 0xffff0000;

    // On big-endian this leaves pd_checksum (c) untouched.
    printf("%d\n", ((volatile struct Data *) &data)->c);
    return 0;
}
```
I picked a big-endian mips for the demonstration. For the cross-toolchain I'm using the zig compiler because it's super-convenient (I highly recommend it). We'll also need to install a pack of qemu-user interpreters (e.g. https://packages.debian.org/en/sid/qemu-user-static) to run the actual binaries.
Now, check this out:
$ zig cc test.c && ./a.out
is little endian: 1
0
$ zig cc test.c --target=mips-linux-musl && ./a.out
is little endian: 0
3
Oops!
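For what it's worth, a sketch of an endianness-independent way to exclude the checksum, assuming the standard Postgres page header layout (the 8-byte pd_lsn first, then the 2-byte pd_checksum at byte offset 8):

```rust
const BLCKSZ: usize = 8192;
/// Byte offset of pd_checksum within PageHeaderData (right after pd_lsn).
const PD_CHECKSUM_OFFSET: usize = 8;

/// Return a copy of the page with pd_checksum zeroed, so the checksum
/// computation sees the same bytes regardless of host endianness.
fn page_without_checksum(page: &[u8]) -> [u8; BLCKSZ] {
    assert_eq!(page.len(), BLCKSZ);
    let mut copy = [0u8; BLCKSZ];
    copy.copy_from_slice(page);
    // Zero the two checksum bytes by byte offset instead of masking a u32,
    // which avoids assuming where those bytes land within a 32-bit word.
    copy[PD_CHECKSUM_OFFSET] = 0;
    copy[PD_CHECKSUM_OFFSET + 1] = 0;
    copy
}
```

This copies the page, so it trades the little-endian assumption for the copy cost discussed earlier in the thread.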
but this particular piece of code is very broken in multiple ways.
Thanks for looking at it, and for such a detailed comment! Assuming that we are not running any of our service components on big-endian archs, and yes, we are likely not going to, what are the multiple ways we are talking about? Sorry if I missed something in your comment. Because yes, it certainly isn't intended to be run on big-endian and it's worth marking / commenting on that.
This is not the only place in the Neon code where we assume that the target system is little-endian. Maybe it is not good, but I do not think it is critical at this moment.
It is possible to find a big-endian machine today, but not in the cloud, and I do not see any reason why somebody would want to run Neon on such a system.
It is not true: pg_checksum is applied to the whole page, including the checksum field.
I mean that I changed your ...
It does not completely protect data from disk failure. If corruption affects a B-Tree page, then Postgres page-level checksums will not help to detect such an error.
File-level checksums are really unusable, and not only because calculating one requires loading the whole file into memory. So we need checksums for individual B-Tree pages.
We need checksums to verify data integrity when we read it from an untrusted place (e.g. local disk) or via an untrusted communication channel (e.g. network). At the same time, we trust the pageserver <-> redo process communication channel, as it is just a pipe. Here we enable calculation of data checksums in the WAL redo process and when we extract FPIs during WAL ingestion. The compute node (Postgres) will verify the checksum of every page after receiving it back from the pageserver, so it is pretty similar to how vanilla Postgres checks them. There are two other places where we should verify checksums to detect data corruption earlier:
- when we receive WAL records from safekeepers (already implemented, see WalStreamDecoder::poll_decode)
- when we write layer files to disk and read them back into memory from local disk or S3
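A rough sketch of what the second point could look like when a page image is read back from a layer file, assuming `pg_checksum_page` from this PR is in scope; the helper name and error handling are illustrative only:

```rust
use anyhow::{ensure, Result};

const BLCKSZ: usize = 8192;
/// pd_checksum sits right after the 8-byte pd_lsn in PageHeaderData.
const PD_CHECKSUM_OFFSET: usize = 8;

/// Verify a page image read back from a layer file on local disk or S3.
/// `blkno` must be the block number the page was originally stamped with.
fn verify_page_checksum(page: &[u8], blkno: u32) -> Result<()> {
    ensure!(page.len() == BLCKSZ, "unexpected page size {}", page.len());

    let stored = u16::from_le_bytes([
        page[PD_CHECKSUM_OFFSET],
        page[PD_CHECKSUM_OFFSET + 1],
    ]);
    // A zero pd_checksum means the page was never stamped (checksums are
    // disabled for this tenant), so there is nothing to verify.
    if stored == 0 {
        return Ok(());
    }

    // pg_checksum_page skips pd_checksum itself, so recomputing it on a
    // stamped page yields the originally stored value.
    let computed = pg_checksum_page(page, blkno);
    ensure!(
        stored == computed,
        "page checksum mismatch at blkno {}: stored {:#x}, computed {:#x}",
        blkno,
        stored,
        computed
    );
    Ok(())
}
```

Reading `stored` as little-endian matches the little-endian-only assumption discussed above.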
Added safety notes.
As discussed, moving to draft.
PR to postgres repo: neondatabase/postgres#122
Resolve neondatabase/cloud#536