Enable postgres checksums #1185
base: main
Conversation
Force-pushed from 6b08b8b to c647e13
This seems to be working fine. With Konstantin's help, all tests pass now :) I added a few more checksum checks (suggested by Anton):
A couple of things that still bother me:
Which is probably not that good, as a single damaged tenant crashes the whole redo process for all tenants. Should we introduce some basic protocol changes, so that the redo process can notify the pageserver that a particular WAL record or page is damaged and we shouldn't try to push it again?
P.S. My new test_cluster_stop_start in e2e seems to be a bit flaky, or this is a symptom of a bug. I now do stop with a kill signal, and after that compute cannot start within 120 seconds, which is weird. I think we have to dump compute logs in CI.
Also I want to note that checksums can be useful for ... storage manager architecture. Having checksums, we may implement in-place updates for image layers. In this case checksums can be used to enforce page consistency after a failure. It is still not quite clear where to get the previous (correct) page image if a page is corrupted. But in-place updates can significantly reduce write amplification.
This branch has to be rebased.
But in principle I am voting for switching on checksums in Zenith.
And Alexey's patch is doing it the right way.
@ololobus, did you plan to resolve them?
@stepashka, I rebased it already, but new conflicts arrived. No problem with rebasing again, of course. The only reason why I still haven't merged it yet is that @hlinnaka didn't like this approach, if I understood him correctly.

In this PR we do checksum checks in the redo process: for incoming WAL records and pages. It means that we don't have a feedback channel for the pageserver to figure out that some particular WAL record or page is damaged. IIRC, @hlinnaka's idea was that we anyway trust the pageserver <-> redo process channel, so I wanted to move the CRC checks into pageserver code, but I haven't started that yet.

Or do we want to merge it as is and change it later? As far as I can see, the pageserver starts a redo process per tenant, so one damaged page won't affect others and we will be able to investigate. Another option is that I can remove these checks from the redo process, but keep stamping the CRC in the pageserver and the redo process, so checks only happen on the compute side. Then we can merge this first part mostly as is.
Please note that we are already calculating checksums on the pageserver side (for FPI), so this code is already implemented in Rust. So it should not be difficult to add this checksum calculation to get_page_at_lsn...
Force-pushed from 2dd4bcb to 8b90dea
I've removed checksum checking from the redo loop. Now we do:
The only missing part is checksum checking in the pageserver, when it reads layer files from local disk / S3 back into memory. There are two issues for doing that: #987 and #1083. So if there are no objections (cc @hlinnaka @kelvich @knizhnik), I'd merge this compute-related part of checksums now and we can add checksums for pageserver files later.

IIUC, when we merge this, old tenants should continue to work, as the control file in their basebackup doesn't have data checksums turned on. Yet, we do a fresh initdb per each redo process startup, so for old tenants their WAL redo process will stamp checksums, but compute just won't check them.
I do not have objections, but I want to note that enabling checksums actually implies wal_log_hints. Yes, we still have wal_log_hints=on by default, but we have discussed that it dramatically increases WAL size (~3 times) and that it is safe to disable it. With checksums enabled, however, there is no way to turn off WAL logging of hint bits. The only positive effect is that there is no need to discuss whether it is safe to disable wal_log_hints or not.
Just tested locally that enabling checksums doesn't affect existing tenants. They still have checksums disabled and work as before. So I'm going to merge this PR; we can iterate on it and add the missing parts mentioned above.
So does this patch change the number of FPIs from compute nodes?
Enabled data checksums effectively set wal_log_hints=on.
Can we change this behavior in postgres? IIUC ...
Force-pushed from c3efca5 to 2127df1
I've modified the page checksum code in the pageserver in two ways:
@knizhnik @lubennikovaav please have a look at the two latest commits. @LizardWizzard you can have a look at the last one, where the per-tenant config is modified :) I will check locally later that it's really backward compatible with old tenants, if the logic seems OK to everyone.
I've left a few comments. The concerning thing is the incorrect usage of unsafe.
I'm also thinking about the name of the config param. The current one mimics Postgres, but typically booleans are named with something like is_ or _enabled. So leaving that up to you :)
libs/utils/src/pg_checksum_page.rs
Outdated
 * excluded from the calculation and preserved.
 */
pub fn pg_checksum_page(data: &[u8], blkno: u32) -> u16 {
    let page = unsafe { std::mem::transmute::<&[u8], &[u32]>(data) };
I'm against this kind of unsafe usage. The function receives just a u8 slice and transmutes it without any additional checks. This won't work for all possible inputs. There is no SAFETY: style comment.
Please check this playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f6c1aa72492e141e1b949bcaab86e797
To see the UB report, run Tools -> MIRI there.
For more info see the transmute docs and follow the links to other documentation regarding unsafe.
Are there any benefits in avoiding existing implementations? If this is the FNV hash, there is an existing implementation of Rust's standard Hasher trait for it.
One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes, and either check that the chunks iterator's remainder is empty or assert on the buffer size before creating the iterator. See the sketch below.
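A minimal sketch of that approach, assuming the page size is the usual 8 KB Postgres BLCKSZ and keeping the native-endian view the transmute had; names here are illustrative, not the PR's actual code:

```rust
use std::convert::TryInto; // only needed on edition 2018

const BLCKSZ: usize = 8192; // assumed Postgres page size
const N_U32S: usize = BLCKSZ / 4;

/// Reinterpret an 8 KB page as u32 words without `unsafe`.
fn page_as_u32_words(data: &[u8]) -> [u32; N_U32S] {
    // Fail loudly instead of transmuting an arbitrary slice.
    assert_eq!(data.len(), BLCKSZ, "page must be exactly BLCKSZ bytes");

    let mut words = [0u32; N_U32S];
    for (w, chunk) in words.iter_mut().zip(data.chunks_exact(4)) {
        // chunks_exact(4) guarantees each chunk is 4 bytes, so the
        // conversion to a fixed-size array cannot fail.
        *w = u32::from_ne_bytes(chunk.try_into().unwrap());
    }
    words
}
```

This does copy the page once, which is exactly the concern raised below; a benchmark (also suggested below) would show whether that matters.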
Hm, the fact that the Postgres hashing algorithm is based on FNV doesn't mean that it's identical. The comment in Postgres says:
* PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
* high bits - high order bits in input data only affect high order bits in
* output data. To resolve this we xor in the value prior to multiplication
* shifted right by 17 bits.
so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums. We can try the fnv crate, but I don't really think that it will work out, do you?
This won't work for all possible inputs.
Yes, because the page size is fixed to 8 KB. We can add a size check though, just to be on the safe side, but it won't change anything really, except slowing it down a bit, I suppose.
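On the FNV point above, a sketch of the modified mixing step in Rust, mirroring the CHECKSUM_COMP step from Postgres' checksum_impl.h as I understand it (worth double-checking against the original):

```rust
const FNV_PRIME: u32 = 16777619;

/// One round of the Postgres page-checksum mix. Plain FNV-1a would be
/// `(checksum ^ value) * FNV_PRIME`; Postgres additionally xors in the
/// pre-multiplication value shifted right by 17 bits to improve mixing
/// of the high-order bits, so the fnv crate's output would not match.
fn checksum_comp(checksum: u32, value: u32) -> u32 {
    let tmp = checksum ^ value;
    tmp.wrapping_mul(FNV_PRIME) ^ (tmp >> 17)
}
```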
One possible way to fix the current code is to use chunks_exact(4) and map to u32::from_ne_bytes.
u32::from_ne_bytes should work, but won't it be slower? It would involve copying the whole page chunk by chunk each time, which isn't desirable, as checksum calculation may happen quite often. Or am I missing something and it's already optimized / we can optimize its usage?
so I'm afraid that it will just produce different hashes and Postgres won't accept these checksums.
Gotcha. I'm just trying to avoid writing our own copies of things that are already written and well maintained by the community, but I see that it is not possible here.
but it won't change anything really
My opinion here is that we cannot permit UB in our code. If this function is used improperly it can cause UB in safe code, which is a violation of Rust's core principle. Such functions should be marked as unsafe and should clearly document what guarantees they expect. Even then, that just moves the responsibility for running these checks to the caller, so they can't be avoided.
Regarding performance: if we have the size guarantee, the compiler may actually decide to optimize away some checks; I'm not sure about the copy, I think it's worth experimenting. It is actually a good idea to have a benchmark for this function using something like criterion, as sketched below.
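A minimal criterion sketch for such a benchmark; the bench file path and the `utils::pg_checksum_page` import path are assumptions, not the repo's actual layout:

```rust
// e.g. libs/utils/benches/pg_checksum_page.rs (hypothetical path)
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use utils::pg_checksum_page::pg_checksum_page; // assumed export path

fn bench_pg_checksum_page(c: &mut Criterion) {
    // An arbitrary 8 KB page filled with non-trivial data.
    let page: Vec<u8> = (0..8192u32).map(|i| (i % 251) as u8).collect();

    c.bench_function("pg_checksum_page/8k", |b| {
        // black_box prevents the compiler from folding the call away.
        b.iter(|| pg_checksum_page(black_box(&page), black_box(42)))
    });
}

criterion_group!(benches, bench_pg_checksum_page);
criterion_main!(benches);
```

Running both the transmute-based and the chunks_exact-based variants through the same benchmark would settle the copy question above.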
libs/utils/src/pg_checksum_page.rs
Outdated
    tmp.wrapping_mul(FNV_PRIME) ^ (tmp >> 17)
}

/*
Please use Rust-style doc comments: https://doc.rust-lang.org/rust-by-example/meta/doc.html#doc-comments
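For example (just a formatting sketch; the description text is taken from the existing block comment):

```rust
/// Compute the checksum for a Postgres page.
///
/// The checksum field itself (`pd_checksum`) is excluded from the
/// calculation and preserved, matching the C implementation.
pub fn pg_checksum_page(data: &[u8], blkno: u32) -> u16 {
    todo!()
}
```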
Current one mimics postgres but typically booleans are named with something like is_ or _enabled.
I also thought about it; adding _enabled makes sense to me.
pageserver/src/http/routes.rs
Outdated
@@ -412,6 +412,9 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
    tenant_conf.compaction_target_size = request_data.compaction_target_size;
    tenant_conf.compaction_threshold = request_data.compaction_threshold;

    // Turn on data checksums for all new tenants
    tenant_conf.data_checksums = Some(request_data.data_checksums.unwrap_or(true));
This looks a bit hacky to me; it doesn't look like permanent code. Some time in the future we'll need to clean it up, but there is no plan for how to do it. Should we modify the deployment scripts for new pageserver setups so that it is enabled by default?
Another solution is to stamp every existing tenant with a proper config disabling the checksums and enable them by default in the defaults module. There is a /v1/tenant/config API, so there is no need to manually change files on the filesystem.
Another option is to hardcode this option in the console for the tenant create API call. (This is the way I imagined it when you asked about it.)
What do you think?
Another solution is to stamp every existing tenant with a proper config disabling the checksums and enable them by default in the defaults module. There is a /v1/tenant/config API, so there is no need to manually change files on the filesystem.
This sounds really bug-prone to me, as anything can go wrong during this migration process. Also, new tenants could be created right during the new code deploy, so the migration would be tricky.
Another option is to hardcode this option in console for tenant create API call. (This is the way I imagined when you asked about it)
If this option is recommended to be turned on by default, why is it better to let the API caller do this? IMO, reversing it is better, i.e. you can turn it off if needed.
Still, I agree that it looks a bit ugly, and I'd remove it and set the default to true if we didn't need backward compatibility. I personally think that neither way is clearly better.
We can also make it configurable in the pageserver config, so here we would look it up and set the value specified there?
And also new tenants could be created right during the new code deploy,
Yeah, that needs to be combined either with the current solution or with the API setting it to true for every new tenant. One benefit of this approach is that we move to the desired state now and there is no need to remember to clean it up in the future. But I agree, the migration is a bit tricky.
If this option is recommended to be turned on by default, why is it better to let the API caller to do this? IMO, reversing it is better, i.e. you can turn it off if needed.
Your arguments make sense. Using it through the API without enabling it by default just moves the hack away from the pageserver, but adds it to the console. I agree that both ways are hacky enough, so I'm ok with any solution that won't keep this code in the pageserver for years.
We can also make it configurable in the pageserver config
It should already be possible. The default tenant config is part of the pageserver config.
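If it helps, a sketch of the "enable it by default in defaults" variant being discussed; the constant and function names here are hypothetical:

```rust
pub mod defaults {
    /// New tenants get data checksums unless the request disables them
    /// explicitly (hypothetical constant, alongside the other tenant-conf
    /// defaults).
    pub const DEFAULT_DATA_CHECKSUMS: bool = true;
}

/// What tenant_create_handler would do instead of hardcoding `true`.
fn data_checksums_for_new_tenant(requested: Option<bool>) -> Option<bool> {
    Some(requested.unwrap_or(defaults::DEFAULT_DATA_CHECKSUMS))
}
```

Existing tenants would keep whatever is (or isn't) recorded in their per-tenant config, so the backward-compatibility behaviour discussed above should stay the same.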
As far as I remember, @hlinnaka's main concern about this patch was that enabling Postgres checksums doesn't protect all data: both delta and image layers contain more data than just the page images themselves. For example, a delta layer contains a B-Tree. WAL records themselves are protected by CRC, but ...
With this patch we protect two important paths for data pages and WAL records:
Having a single checksum per layer file doesn't seem like a good idea, as only one page could be corrupted. So we need a more fine-grained checksum, and the remaining parts of the data could be protected separately.
libs/utils/src/pg_checksum_page.rs
Outdated
            let chunk_i = i * N_SUMS + j;
            let chunk: u32;
            if chunk_i == 2 {
                let mut chunk_copy = page[chunk_i].to_le_bytes();
I think it deserves more comments. This differs from the original implementation and there is no explanation; it requires some deciphering.
Just an idea: we already generate a proper struct for PageHeaderData, so it is possible to use it here, or at least reference it.
libs/utils/src/pg_checksum_page.rs
Outdated
                let mut chunk_copy = page[chunk_i].to_le_bytes();
                chunk_copy[0] = 0;
                chunk_copy[1] = 0;
                chunk = u32::from_le_bytes(chunk_copy);
This PR has seen a handful of comments that revolve around various nits and styling issues, but this particular piece of code is very broken in multiple ways.
First of all, why not just value & 0xFFFF0000 instead of this verbosity? The original code never said anything about endianness or LE in particular, but this striking detail drew my attention to the broader question: is this even correct?
Speaking of the original postgres code, it's very different. It uses PageHeaderData for resetting the checksum, which is hacky but still correct. But here we're taking it for granted that the lower half (I omit the to-le/from-le part) of a certain u32 will contain pd_checksum, which is not the case for big-endian!
Okay, I must admit that we'll probably never want to run our code on a big-endian arch nowadays, but we really shouldn't let such things slip in our code base unnoticed, without even a tiny comment clarifying that only little-endian is supported.
Also, there's only a hint that this code has been adapted from postgres (like, you could grep for pg_checksum_page, but you'll soon learn that it's very different, like I did). In cases like this I'd love to see a comment saying "this code has not been copied verbatim from <original function name OR github URL>".
So here's a proof that mixing field- and ptr-to-u32-based access patterns is a bad idea:
```c
#include <stdint.h>
#include <stdio.h>

struct Data {
    // Desugared PageXLogRecPtr
    uint32_t a;
    uint32_t b;
    // pd_checksum
    uint16_t c;
    // pd_flags
    uint16_t d;
};

int main() {
    struct Data data = (struct Data) {
        .a = 1,
        .b = 2,
        .c = 3,
        .d = 4,
    };

    // Report host endianness, matching the output shown below.
    int probe = 1;
    printf("is little endian: %d\n", *(char *) &probe == 1);

    // Clear the "lower half" of the third u32, like the Rust code does.
    uint32_t *ptr = (uint32_t *) &data;
    ptr[2] &= 0xffff0000;

    // On big-endian this leaves pd_checksum (c) untouched.
    printf("%d\n", ((volatile struct Data *) &data)->c);
    return 0;
}
```
I picked a big-endian mips for the demonstration. For the cross-toolchain I'm using the zig compiler because it's super-convenient (I highly recommend it). We'll also need to install a pack of qemu-user interpreters (e.g. https://packages.debian.org/en/sid/qemu-user-static) to run the actual binaries.
Now, check this out:
$ zig cc test.c && ./a.out
is little endian: 1
0
$ zig cc test.c --target=mips-linux-musl && ./a.out
is little endian: 0
3
Oops!
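For what it's worth, a sketch of an endianness-independent way to exclude the checksum, assuming the standard Postgres page header layout (the 8-byte pd_lsn first, then the 2-byte pd_checksum at byte offset 8):

```rust
const BLCKSZ: usize = 8192;
/// Byte offset of pd_checksum within PageHeaderData (right after pd_lsn).
const PD_CHECKSUM_OFFSET: usize = 8;

/// Return a copy of the page with pd_checksum zeroed, so the checksum
/// computation sees the same bytes regardless of host endianness.
fn page_without_checksum(page: &[u8]) -> [u8; BLCKSZ] {
    assert_eq!(page.len(), BLCKSZ);
    let mut copy = [0u8; BLCKSZ];
    copy.copy_from_slice(page);
    // Zero the two checksum bytes by byte offset instead of masking a u32,
    // which avoids assuming where those bytes land within a 32-bit word.
    copy[PD_CHECKSUM_OFFSET] = 0;
    copy[PD_CHECKSUM_OFFSET + 1] = 0;
    copy
}
```

This copies the page, so it trades the little-endian assumption for the copy cost discussed earlier in the thread.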
but this particular piece of code is very broken in multiple ways.
Thanks for looking at it, and for such a detailed comment! Assuming that we are not running any of our service components on big-endian archs, and yes, we are likely not going to, what are the multiple ways we are talking about? Sorry if I missed something in your comment. Because yes, it certainly isn't intended to be run on big-endian and it's worth marking / commenting on that.
This is not the only place in the Neon code where we assume that the target system is little-endian. Maybe it is not good, but I do not think it is critical at this moment.
It is possible to find a big-endian machine today, but not in the cloud, and I do not see any reason why somebody would want to run Neon on such a system.
It is not true: pg_checksum is applied to the whole page, including the checksum field.
I mean that I changed your ...
It does not completely protect data from disk failure. If corruption affects a B-Tree page, then Postgres page-level checksums will not help to detect such an error.
File-level checksums are really unusable, and not only because calculating one requires loading the whole file into memory. So we need checksums for individual B-Tree pages.
We need checksums to verify data integrity when we read it from an untrusted place (e.g. local disk) or via an untrusted communication channel (e.g. network). At the same time, we trust the pageserver <-> redo process communication channel, as it is just a pipe. Here we enable calculation of data checksums in the WAL redo process and when we extract FPIs during WAL ingestion. The compute node (Postgres) will verify the checksum of every page after receiving it back from the pageserver, so it is pretty similar to how vanilla Postgres checks them. There are two other places where we should verify checksums to detect data corruption earlier:
- when we receive WAL records from safekeepers (already implemented, see WalStreamDecoder::poll_decode)
- when we write layer files to disk and read them back into memory from local disk or S3
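A rough sketch of what the second point could look like when a page image is read back from a layer file, assuming `pg_checksum_page` from this PR is in scope; the helper name and error handling are illustrative only:

```rust
use anyhow::{ensure, Result};

const BLCKSZ: usize = 8192;
/// pd_checksum sits right after the 8-byte pd_lsn in PageHeaderData.
const PD_CHECKSUM_OFFSET: usize = 8;

/// Verify a page image read back from a layer file on local disk or S3.
/// `blkno` must be the block number the page was originally stamped with.
fn verify_page_checksum(page: &[u8], blkno: u32) -> Result<()> {
    ensure!(page.len() == BLCKSZ, "unexpected page size {}", page.len());

    let stored = u16::from_le_bytes([
        page[PD_CHECKSUM_OFFSET],
        page[PD_CHECKSUM_OFFSET + 1],
    ]);
    // A zero pd_checksum means the page was never stamped (checksums are
    // disabled for this tenant), so there is nothing to verify.
    if stored == 0 {
        return Ok(());
    }

    // pg_checksum_page skips pd_checksum itself, so recomputing it on a
    // stamped page yields the originally stored value.
    let computed = pg_checksum_page(page, blkno);
    ensure!(
        stored == computed,
        "page checksum mismatch at blkno {}: stored {:#x}, computed {:#x}",
        blkno,
        stored,
        computed
    );
    Ok(())
}
```

Reading `stored` as little-endian matches the little-endian-only assumption discussed above.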
Added safety notes.
As discussed, moving to draft.
PR to postgres repo: neondatabase/postgres#122
Resolve neondatabase/cloud#536