Explicitly log relation extension #140

antons-antons · 2022-03-03T19:40:40Z

This is not a final code (e.g. it doesn't address certain index build cases)

Issue a new transaction log record to explicitly persist relation size.
The log record will initialize a new page by placing a verifiable cookie
Buffer manager would have an ability to validate such cookie to ensure
correct block is in the cache.

Given that page is no longer a true zero page, page checksum can be
enabled on such pages.

Make smgr API pluggable. Add smgr_hook that can be used to define custom smgrs. Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() loads f_smgr implementation using smgr_hook. Also add smgr_init_hook and smgr_shutdown_hook. And a lot of mechanical changes in smgr.c functions. This patch is proposed to community: https://commitfest.postgresql.org/33/3216/ Author: anastasia <[email protected]>

Add contrib/zenith that handles interaction with remote pagestore. To use it add 'shared_preload_library = zenith' to postgresql.conf. It adds a protocol for network communications - see libpagestore.c; and implements smgr API. Also it adds several custom GUC variables: - zenith.page_server_connstring - zenith.callmemaybe_connstring - zenith.zenith_timeline - zenith.wal_redo Authors: Stas Kelvich <[email protected]> Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

Add WAL redo helper for zenith - alternative postgres operation mode to replay wal by pageserver request. To start postgres in wal-redo mode, run postgres with --wal-redo option It requires zenith shared library and zenith.wal_redo Author: Heikki Linnakangas <[email protected]>

Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver. Authors: Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

In the test_createdb test, we created a new database, and created a new branch after that. I was seeing the test fail with: PANIC: could not open critical system index 2662 The WAL contained records like this: rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/0163E8F0, prev 0/0163C8A0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 1 FPW rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn: 0/01640940, prev 0/0163E8F0, desc: FPI , blkref #0: rel 1663/12985/1249 fork fsm blk 2 FPW rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642990, prev 0/01640940, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/016429C8, prev 0/01642990, desc: CHECKPOINT_ONLINE redo 0/163C8A0; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online rmgr: Database len (rec/tot): 42/ 42, tx: 540, lsn: 0/01642A40, prev 0/016429C8, desc: CREATE copy dir 1663/1 to 1663/16390 rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 0/01642A70, prev 0/01642A40, desc: RUNNING_XACTS nextXid 541 latestCompletedXid 539 oldestRunningXid 540; 1 xacts: 540 rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642AA8, prev 0/01642A70, desc: CHECKPOINT_ONLINE redo 0/1642A70; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 540; online rmgr: Transaction len (rec/tot): 66/ 66, tx: 540, lsn: 0/01642B20, prev 0/01642AA8, desc: COMMIT 2021-05-21 15:55:46.363728 EEST; inval msgs: catcache 21; sync rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/01642B68, prev 0/01642B20, desc: CHECKPOINT_SHUTDOWN redo 0/1642B68; tli 1; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown The compute node had correctly replayed all the WAL up to the last record, and opened up. But when you tried to connect to the new database, the very first requests for the critical relations, like pg_class, were made with request LSN 0/01642990. That's the last record that's applicable to a particular block. Because the database CREATE record didn't bump up the "last written LSN", the getpage requests were made with too old LSN. I fixed this by adding a SetLastWrittenLSN() call to the redo of database CREATE record. It probably wouldn't hurt to also throw in a call at the end of WAL replay, but let's see if we bump into more cases like this first. This doesn't seem to be happening with page server as of 'main'; I was testing with a version where I had temporarily reverted all the recent changes to reconstruct control file, checkpoints, relmapper files etc. from the WAL records in the page server, so that the compute node was redoing all the WAL. I'm pretty sure we need this fix even with 'main', even though this test case wasn't failing there right now.

Some operations in PostgreSQL are not WAL-logged at all (i.e. hint bits) or delay wal-logging till the end of operation (i.e. index build). So if such page is evicted, we will lose the update. To fix it, we introduce PD_WAL_LOGGED bit to track whether the page was wal-logged. If the page is evicted before it has been wal-logged, then zenith smgr creates FPI for it. Authors: Konstantin Knizhnik <[email protected]> anastasia <[email protected]>

Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors Author: Konstantin Knizhnik <[email protected]>

Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <[email protected]>

Request relation size via smgr function, not just stat(filepath).

Author: Konstantin Knizhnik <[email protected]>

…mmon error. TODO: add a comment, why this is fine for zenith.

…d of WAL page header, then return it back to the page origin

…of WAL at compute node + Check for presence of replication slot

…t inside. WAL proposer (as bgw without BGWORKER_BACKEND_DATABASE_CONNECTION) previously ignored SetLatch, so once caught up it stuck inside WalProposerPoll infinitely. Futher, WaitEventSetWait didn't have timeout, so we didn't try to reconnect if all connections are dead as well. Fix that. Also move break on latch set to the end of the loop to attempt ReconnectWalKeepers even if latch is constantly set. Per test_race_conditions (Python version now).

…kpoint from WAL + Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL + Pass prev_record_ptr through zenith.signal file to postgres

@knizhnik

This patch aims to make our bespoke WAL redo machinery more robust in the presence of untrusted (in other words, possibly malicious) inputs. Pageserver delegates complex WAL decoding duties to postgres, which means that the latter might fall victim to carefully designed malicious WAL records and start doing harmful things to the system. To prevent this, it has been decided to limit possible interactions with the outside world using the Secure Computing BPF mode. We use this mode to disable all syscalls not in the allowlist. Please refer to src/backend/postmaster/seccomp.c to learn more about the pros & cons of the current approach. + Fix some bugs in seccomp bpf wrapper * Use SCMP_ACT_TRAP instead of SCMP_ACT_KILL_PROCESS to receive signals. * Add a missing variant of select() syscall (thx to @knizhnik). * Write error messages to an fd stderr's currently pointing to.

…ause it cause memory leak in wal-redo-postgres 2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR + Call smgr_init_standard from smgr_init_zenith

this patch adds support for zenith_tenant variable. it has similar format as zenith_timeline. It is used in callmemaybe query to pass tenant to pageserver and in ServerInfo structure passed to wal acceptor

…recovery. Rust's postgres_backend currently is too dummy to handle it properly: reading happens in separate thread which just ignores CopyDone. Instead, writer thread must get aware of termination and send CommandComplete. Also reading socket must be transferred back to postgres_backend (or connection terminated completely after COPY). Let's do that after more basic safkeeper refactoring and right now cover this up to make tests pass. ref #388

…ion position in wal_proppser to segment boundary

…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.

The constructed StringInfoData 'z' variable wasn't used for anything, we passed the original 's' StringInfo directly to ParseZenithFeedbackMessage. That's fine, but let's remove the dead code.

refer #1015

Co-authored-by: Heikki Linnakangas <[email protected]>

* Expose reading a relation page at a specific LSN * Addressing comments

Use function pointer to perform a cross-extension calls.

refer #1262

Issue a new transaction log record to explicitly persist relation size. The log record will initialize a new page by placing a verifiable cookie Buffer manager would have an ability to validate such cookie to ensure correct block is in the cache. Given that page is no longer a true zero page, page checksum can be enabled on such pages.

antons-antons · 2022-03-04T16:55:59Z

opening the code for discussion

knizhnik

I wonder if we really need to introduce new wal record type?
May be we can use XLOG_HEAP_TRUNCATE instead.
Right now it is used only for logical decoding so we can provide our own handler.
Or alternatively we can use FPI record. For empty page it is quite compact.
I think it will be better to avoid any changes in postgres core unless it is absolutely needed.

antons-antons · 2022-03-11T05:40:27Z

what advantages are you seeing in reuse of an existing record type?

knizhnik · 2022-03-11T05:54:00Z

what advantages are you seeing in reuse of an existing record type?

Do not change postgres core.
Any changes in postgres sources increase maintainability cost: we will have to port our changes for all new PG releases.
We have such experience of maintaining own fork in pgpro. And it is awful. Certainly, we can't avoid our fork at all. But the less changes in core we have, the easier will be porting.

Also I can ask opposite question: why do we need separate record type for relation extension. What are the advantages comparing with just wal lloging FPI for extended pages?

antons-antons · 2022-03-11T06:46:44Z

there's a point in keeping the core Postgres intact, but we also would benefit from not adding new logic implicitly (issues with implicit logic will be much harder to untangle later)

There're tricks we can play to reduce merge pain (we may need a write up on how to produce changes that minimize merge conflicts)

regarding the advantages over XLOG_FPI, while possible XLOG_FPI is meant to initialize valid pages, while the page type in zenith_extens is not initialized yet (thus pd_special and pd_upper can't be deterministically set).
Secondly XLOG_FPI replay is not controlled by redo logic but handled via XLogReadBufferForRedoExtended; Given that we do not want to init a page, but only put a marker about explicit extension, we need a redo that does exactly that. Marker is important for various reasons, one of them is a protection from reading wrong pages into the buffer cache (a 0 page doesn't have checksum in Postgres, and thus we can't tell if 0 page belongs to the correct relation-block, or a memory corruption)

knizhnik · 2022-03-11T09:18:54Z

while the page type in zenith_extens is not initialized yet

I am not sure that it is true. smgrextend (and so zenith_extend) is given page image which is most likey intialized by PageInit.

Secondly XLOG_FPI replay is not controlled by redo logic but handled via XLogReadBufferForRedoExtended

Not sure that I correctly understand you. But right now XLOG_FPI is handled by walingest.rs (it just stores page image).
We can add here any logic we need.

Marker is important for various reasons, one of them is a protection from reading wrong pages into the buffer cache

What is the difference with vanilla postgres where mdextend actually writes data to the disk?

MMeent · 2022-04-11T12:09:30Z

src/backend/access/Makefile

@@ -8,7 +8,7 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global

-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
+SUBDIRS	    = brin common gin gist hash heap index nbtree pagestore rmgrdesc spgist \


Why add a pagestore subdirectory for what is essentially a feature we want in core PostgreSQL?

I'm not sure that POPS would benefit from this.
From a community Postgres perspective, I would start it at the access methods (and it suddenly becomes a massive change) with a buffer cache change.

MMeent · 2022-04-11T12:11:23Z

src/backend/access/pagestore/pagestore.c

+		break;
+
+	default:
+		break;


Shouldn't this be elog(ERROR, "unexpected info mask %x", info) or something like that?

MMeent · 2022-04-11T12:12:01Z

src/backend/access/pagestore/pagestore.c

+	EmptyPageMarker *pm = (EmptyPageMarker *) (page + SizeOfPageHeaderData);
+	EmptyPageMarker marker = {
+			.rnode = rnode,
+		.forknum = forknum,


nit: inconsistent indentation

MMeent · 2022-04-11T12:16:36Z

src/backend/access/pagestore/pagestore.c

+	Buffer buf;
+
+	/* Should we support multi-page extend here? */
+	Assert(record->max_block_id == 1);


I think we should allow multi-page extension; as in, we shouldn't implicitly fill pages < the recorded page.

MMeent · 2022-04-11T12:17:36Z

src/backend/access/pagestore/pagestore.c

+}
+
+bool
+validate_marked_page(Page page, RelFileNode rnode, ForkNumber forknum, BlockNumber blknum)


Please add some documentation at the functions on why we need to mark and validate the page as empty.

src/include/access/rmgrlist.h

lubennikovaav and others added 30 commits February 9, 2022 19:34

lastWrittenPageLSN.patch

4783dfc

Save lastWrittenPageLSN in XLogCtlData to know what pages to request from remote pageserver. Authors: Konstantin Knizhnik <[email protected]> Heikki Linnakangas <[email protected]>

[walproposer] wal_proposer.patch

3ee8149

Add WalProposer background worker to broadcast WAL stream to Zenith WAL acceptors Author: Konstantin Knizhnik <[email protected]>

persist_unlogged_tables.patch

782eacf

Ignore unlogged table qualifier. Add respective changes to regression test outputs. Author: Konstantin Knizhnik <[email protected]>

fix_pg_table_size.patch

31a077b

Request relation size via smgr function, not just stat(filepath).

[walredo] fix_gin_redo.patch

917462d

Author: Konstantin Knizhnik <[email protected]>

[walredo] fix_brin_redo.patch

7276b7f

Author: Konstantin Knizhnik <[email protected]>

speculative_records_workaround.patch

6d7bafb

wallog_t_ctid.patch

87b826f

vacuumlazy_debug_stub.patch

9f457bf

[test] zenith_test_evict.patch

ac7bd68

fix_sequence_wallogging.patch

74ae59f

Bring back change that got lost in refactoring. silence ReadBuffer_co…

e27aaea

…mmon error. TODO: add a comment, why this is fine for zenith.

[contrib/zenith] [refer #225] if insert WAL position points at the en…

6ac8b6e

…d of WAL page header, then return it back to the page origin

[walproposer] Create replication slot for walproposer to avoid loose …

8360d74

…of WAL at compute node + Check for presence of replication slot

[walproposer] Skip absent WAL segment removed by pg_resetwal

b9e24d1

[walproposer] Make it possible to start postgres without reading chec…

45e9074

…kpoint from WAL + Check for presence of zenith.signal file to allow skip reading checkpoint record from WAL + Pass prev_record_ptr through zenith.signal file to postgres

[walproposer] Simplify WL_LATCH_SET testing in the walproposer

5085205

[smgr_api] [contrib/zenith] 1. Do not call mdinit from smgrinit() bec…

fb8c2a1

…ause it cause memory leak in wal-redo-postgres 2. Add check for local relations to make it possible to use DEBUG_COMPARE_LOCAL mode in SMGR + Call smgr_init_standard from smgr_init_zenith

[walproposer] [contrib/zenith] support zenith_tenant

15becb6

this patch adds support for zenith_tenant variable. it has similar format as zenith_timeline. It is used in callmemaybe query to pass tenant to pageserver and in ServerInfo structure passed to wal acceptor

[walproposer] [contrib/zenith] [refer #395] Do no align sart replicat…

4319a6f

…ion position in wal_proppser to segment boundary

[test] Add contrib/zenith_test_utils with helpers for testing and deb…

48fc26e

…ugging. Now it contains only one function test_consume_xids() for xid wraparound testing.

[walproposer] Change condition for triggering recovery

9dedb55

hlinnaka and others added 8 commits February 16, 2022 12:09

Remove dead code in handling ZenithFeedback part of an AppendResponse.

b426ffe

The constructed StringInfoData 'z' variable wasn't used for anything, we passed the original 's' StringInfo directly to ParseZenithFeedbackMessage. That's fine, but let's remove the dead code.

Prevent recursive call of XLogBeginInsert

ecbcefb

refer #1015

Update src/backend/access/gin/ginfast.c

d455371

Co-authored-by: Heikki Linnakangas <[email protected]>

Expose reading a relation page at a specific LSN (#131)

2c2b362

* Expose reading a relation page at a specific LSN * Addressing comments

Add backpressure_lsns() function. Fix zenith feedback processing

a3709cc

Fix zenith_test_utils linkage on macOS

1872ba6

Use function pointer to perform a cross-extension calls.

Add warning fr unrecgonized GUCs with zenith prefix

31dc24a

refer #1262

antons-antons requested review from knizhnik, hlinnaka and MMeent March 4, 2022 16:54

antons-antons marked this pull request as ready for review March 4, 2022 16:56

knizhnik reviewed Mar 11, 2022

View reviewed changes

MMeent reviewed Apr 11, 2022

View reviewed changes

MMeent force-pushed the main branch from 35ad142 to 9c99008 Compare July 8, 2022 12:51

MMeent force-pushed the main branch from 3f315a1 to a479855 Compare August 19, 2022 17:59

vadim2404 changed the base branch from main to REL_14_STABLE_neon November 10, 2022 10:18

lubennikovaav force-pushed the REL_14_STABLE_neon branch from 360ff1c to da50d99 Compare November 22, 2022 08:27

MMeent force-pushed the REL_14_STABLE_neon branch from 125d0bd to a9f5034 Compare February 23, 2023 10:14

MMeent force-pushed the REL_14_STABLE_neon branch from a2daebc to 1144aee Compare May 16, 2023 11:37

tristan957 force-pushed the REL_14_STABLE_neon branch from 28bf5cc to 5d5cfee Compare August 11, 2023 14:42

tristan957 force-pushed the REL_14_STABLE_neon branch from dd067cf to 0bb356a Compare December 11, 2023 16:12

tristan957 force-pushed the REL_14_STABLE_neon branch from be7a65f to 018fb05 Compare February 7, 2024 16:40

tristan957 force-pushed the REL_14_STABLE_neon branch 2 times, most recently from b8e5379 to 21ec61d Compare May 20, 2024 14:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explicitly log relation extension #140

Explicitly log relation extension #140

antons-antons commented Mar 3, 2022 •

edited

Loading

antons-antons commented Mar 4, 2022

knizhnik left a comment •

edited

Loading

antons-antons commented Mar 11, 2022

knizhnik commented Mar 11, 2022

antons-antons commented Mar 11, 2022

knizhnik commented Mar 11, 2022 •

edited

Loading

MMeent Apr 11, 2022

antons-antons Apr 11, 2022

MMeent Apr 11, 2022

antons-antons Apr 11, 2022

MMeent Apr 11, 2022

MMeent Apr 11, 2022

MMeent Apr 11, 2022

Explicitly log relation extension #140

Are you sure you want to change the base?

Explicitly log relation extension #140

Conversation

antons-antons commented Mar 3, 2022 • edited Loading

antons-antons commented Mar 4, 2022

knizhnik left a comment • edited Loading

Choose a reason for hiding this comment

antons-antons commented Mar 11, 2022

knizhnik commented Mar 11, 2022

antons-antons commented Mar 11, 2022

knizhnik commented Mar 11, 2022 • edited Loading

MMeent Apr 11, 2022

Choose a reason for hiding this comment

antons-antons Apr 11, 2022

Choose a reason for hiding this comment

MMeent Apr 11, 2022

Choose a reason for hiding this comment

antons-antons Apr 11, 2022

Choose a reason for hiding this comment

MMeent Apr 11, 2022

Choose a reason for hiding this comment

MMeent Apr 11, 2022

Choose a reason for hiding this comment

MMeent Apr 11, 2022

Choose a reason for hiding this comment

antons-antons commented Mar 3, 2022 •

edited

Loading

knizhnik left a comment •

edited

Loading

knizhnik commented Mar 11, 2022 •

edited

Loading