Reconsider trimming based on persistent lsn #2808
Comments
We could also print a warning on startup - or even periodically - if there are multiple nodes in the config but no snapshot repository is configured. Personally, I would prefer that we pause trimming if we detect that situation. We still have a potential window of vulnerability while nodes are still joining, but we could make sure we just don't trim at all for an initial window of LSNs to cover for that. Persisted-LSN-based trimming is only ever a good long-term strategy for single nodes. We can make an exception for "throwaway" clusters so that you don't need to set up snapshots for throwaway multi-node tests, but for those it should be okay, by definition, to stop working when the disks fill up.
Yeah, maybe we can start by disabling trimming once we detect that we are running in a multi-node setup, and check that the error we report on failing PPs helps users figure out what to do (configuring a snapshot repository). A potential next step could be the in-band mechanism to exchange snapshots.
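
For illustration, here is a minimal sketch of that check in Rust, assuming a simplified cluster configuration; the type and function names (`ClusterConfig`, `select_trim_mode`, etc.) are hypothetical and not the actual Restate API:

```rust
/// Simplified stand-ins for the cluster configuration; the names here are
/// hypothetical, not the actual Restate types.
struct ClusterConfig {
    node_count: usize,
    snapshot_repository: Option<String>, // e.g. an object-store URL
}

#[derive(Debug, PartialEq)]
enum TrimMode {
    /// Trim up to the LSN covered by the latest archived snapshot.
    ArchivedLsn,
    /// Trim up to the lowest LSN persisted by every node running the PP.
    PersistedLsn,
    /// Don't trim automatically; log a warning instead.
    Paused,
}

/// Decide which trimming strategy is safe for the configured topology.
fn select_trim_mode(config: &ClusterConfig) -> TrimMode {
    match (config.node_count, &config.snapshot_repository) {
        // Snapshots configured: archived-LSN trimming works for any topology.
        (_, Some(_)) => TrimMode::ArchivedLsn,
        // Single node, no snapshots: persisted-LSN trimming is acceptable,
        // because the PP placement never changes.
        (1, None) => TrimMode::PersistedLsn,
        // Multiple nodes, no snapshots: pause trimming and warn, since a PP
        // started on a new node would find neither a complete log nor a
        // snapshot to bootstrap from.
        (_, None) => {
            eprintln!(
                "warning: multiple nodes configured but no snapshot repository; \
                 pausing log trimming to avoid stranding new partition processors"
            );
            TrimMode::Paused
        }
    }
}

fn main() {
    let cluster = ClusterConfig { node_count: 3, snapshot_repository: None };
    // A 3-node cluster without a snapshot repository must not trim by persisted LSN.
    assert_eq!(select_trim_mode(&cluster), TrimMode::Paused);
}
```

Pausing rather than refusing to start keeps "throwaway" multi-node clusters usable without snapshots, at the cost of eventually filling the disk, which matches the trade-off discussed above.
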
Opened #2814 (disables trimming by persisted LSN in clusters) and restatedev/documentation#556 for this.
Currently, Restate supports trimming when a snapshot is uploaded to an object store (archived-LSN-based trimming). If snapshotting is not enabled, Restate also supports trimming by requiring that all nodes that run a PP for a given partition have persisted the log up to a given point before trimming the log at that point (persistent-LSN-based trimming). This strategy is potentially very dangerous because after the first trim operation we can't run partition processors on a node that wasn't running the PP before trimming (e.g. when adding new nodes or moving a PP from one node to another). The problem is that the log is no longer complete and there is no way for the newly started PP to fetch the latest partition store state snapshot.
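
To make the strategy concrete, here is a minimal sketch of how a persisted-LSN trim point could be derived, using hypothetical types that are not taken from the Restate codebase: the trim point is the minimum LSN persisted across the nodes that currently run a PP for the partition, which is exactly why a PP started on a new node after the trim has nothing left to catch up from.

```rust
use std::collections::HashMap;

/// Hypothetical identifiers, not taken from the Restate codebase.
type NodeId = u32;
type Lsn = u64;

/// Persisted-LSN-based trimming: the log may only be trimmed up to the lowest
/// LSN that every node currently running a PP for this partition has durably
/// persisted in its partition store. Returns None if nothing was reported.
fn safe_trim_point(persisted_lsns: &HashMap<NodeId, Lsn>) -> Option<Lsn> {
    persisted_lsns.values().copied().min()
}

fn main() {
    let mut persisted = HashMap::new();
    persisted.insert(1, 1_200); // node 1 has persisted up to LSN 1200
    persisted.insert(2, 950);   // node 2 lags behind at LSN 950
    persisted.insert(3, 1_100);

    // Trimming beyond LSN 950 would leave node 2 unable to replay the log.
    assert_eq!(safe_trim_point(&persisted), Some(950));

    // The hazard described above: a node that starts a PP *after* the trim is
    // not part of this map at all, so it finds neither the trimmed log prefix
    // nor a snapshot to bootstrap from.
}
```
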
The persistent-LSN-based trimming strategy is primarily intended for single-node Restate deployments where the placement of PPs won't change. However, it is not disabled in a multi-node setup (which is also hard to do because a single-node deployment can be turned into a multi-node one).
One way to mitigate the problem is an in-band mechanism to exchange partition store state snapshots. However, this requires that at least one of the nodes holding the latest partition store snapshot is still available. Alternatively, we can drop support for the persistent-LSN trimming strategy and require users to configure a snapshot repository if they want support for log trimming.
Until the problem is fixed, we should update the documentation to make people aware of the limitations of persistent-LSN-based log trimming.