
Only trim multi-node clusters when Archived LSN is reported #2814

Draft
wants to merge 2 commits into base: main

Conversation

@pcholakov (Contributor) commented Feb 28, 2025

This change makes a snapshot repository a requirement for auto-trimming of logs in Restate deployments with more than one node. Previously we would trim logs based on the reported persisted LSN, which isn't always safe since additional nodes may be added later.

Related docs PR: restatedev/documentation#556

Related: #2808
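
For illustration only, a minimal sketch of the trim-point decision this PR describes; the types, fields, and function below are hypothetical and do not mirror the actual Restate code:

```rust
/// Illustrative log sequence number; stands in for Restate's LSN type.
#[derive(Clone, Copy, Debug)]
struct Lsn(u64);

/// Hypothetical summary of what the controller knows about the cluster.
struct ClusterReports {
    node_count: usize,
    snapshot_repository_configured: bool,
    /// Lowest LSN archived to the snapshot repository across partitions, if any.
    min_archived_lsn: Option<Lsn>,
    /// Lowest LSN persisted by the local partition stores, if any.
    min_persisted_lsn: Option<Lsn>,
}

/// Returns the LSN up to which trimming is considered safe, or `None` to skip trimming.
fn safe_trim_point(reports: &ClusterReports) -> Option<Lsn> {
    if reports.node_count > 1 || reports.snapshot_repository_configured {
        // Multi-node (or snapshot-enabled) deployments: only trim what has been
        // archived; if no archived LSN is reported, don't trim at all.
        reports.min_archived_lsn
    } else {
        // A single-node deployment may still fall back to the reported persisted LSN.
        reports.min_persisted_lsn
    }
}
```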


github-actions bot commented Feb 28, 2025

Test Results

  7 files  ±0    7 suites  ±0   3m 17s ⏱️ -14s
 45 tests ±0   44 ✅ ±0  1 💤 ±0  0 ❌ ±0 
174 runs  ±0  171 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 0418b50. ± Comparison against base commit 1044219.

♻️ This comment has been updated with latest results.

@tillrohrmann (Contributor) left a comment


Thanks for hardening the log trim strategies @pcholakov. It's a bit of a pity that we can't fully guard against the case where trimming happens on a single node based on the persisted LSN and then the user wants to expand their cluster :-( Apart from this, the changes make sense to me.

@@ -1106,7 +1160,7 @@ mod tests {
let applied_lsn = Arc::new(AtomicU64::new(0));
let persisted_lsn = Arc::new(AtomicU64::new(0));

let (_node_env, bifrost, _) = create_test_env(config, |builder| {
let (_node_env, bifrost, _) = create_test_env_with_nodes(config, 2, |builder| {

The test name is a bit confusing. Shouldn't we always stop trimming when running with multiple nodes and no archived LSN is reported?

config: Configuration,
n: u8,

Maybe we can spare a few more letters to give it a more expressive name.

@@ -367,24 +353,19 @@ fn create_log_trim_check_interval(options: &AdminOptions) -> Option<Interval> {
}

enum TrimMode {
/// Trim logs by the reported persisted LSN. This strategy is only appropriate for
/// single-node deplyments.

Suggested change
/// single-node deplyments.
/// single-node deployments.

if snapshots_repository_configured
|| Self::any_processors_report_archived_lsn(cluster_state)
{
if snapshots_repository_configured || cluster_state.nodes.len() > 1 {

It would be great if we could make users aware of the fact that trimming won't work when running a multi-node deployment without a snapshot repository.
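
One possible way to surface this, sketched with the `tracing` crate in the spirit of the `warn!` calls elsewhere in this PR; the helper name, condition, and message are illustrative assumptions rather than the actual change:

```rust
use tracing::warn;

/// Hypothetical helper, invoked once when the trim mode for the cluster is chosen.
fn warn_if_auto_trimming_is_inert(node_count: usize, snapshot_repository_configured: bool) {
    if node_count > 1 && !snapshot_repository_configured {
        warn!(
            node_count,
            "Automatic log trimming is effectively disabled: this cluster has multiple nodes \
             but no snapshot repository is configured, so no archived LSN will be reported. \
             Configure a snapshot repository to enable safe trimming."
        );
    }
}
```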

Comment on lines +410 to +414
// if let Some(blocking_nodes) = self.get_trim_blockers() {
// return Err(TrimPointsUnavailable::UnknownPartitionStatus(
// blocking_nodes.clone(),
// ));
// }

This can probably be removed.

@@ -905,7 +907,7 @@ impl PartitionProcessorManager {
///
/// This allows multiple partition processors to be started concurrently without holding
/// and exclusive lock to [`Self`]
fn start_partition_processor_task(
fn crate_start_partition_processor_task(

Suggested change
fn crate_start_partition_processor_task(
fn create_start_partition_processor_task(

@@ -294,18 +294,32 @@ async fn create_or_recreate_store(
(maybe_snapshot, Some(fast_forward_lsn)) => {
// Play it safe and keep the partition store intact; we can't do much else at this
// point. We'll likely halt again as soon as the processor starts up.
let recovery_guide_msg =
"The partition's log is trimmed to a point from which this processor can not resume.
Visit https://docs.restate.dev/operate/troubleshooting#handling-missing-snapshots

Depending on where the troubleshooting guide will be placed, this probably needs updating.

Comment on lines 302 to 323
if let Some(snapshot) = maybe_snapshot {
warn!(
%partition_id,
%snapshot.min_applied_lsn,
%fast_forward_lsn,
"Latest available snapshot is older than the the fast-forward target LSN!",
"The latest available snapshot is from an LSN before the target LSN! {}",
recovery_guide_msg,
);
} else if snapshot_repository.is_none() {
warn!(
%partition_id,
%fast_forward_lsn,
"A log trim gap was encountered, but no snapshot repository is configured! {}",
recovery_guide_msg,
);
} else {
warn!(
%partition_id,
%fast_forward_lsn,
"A fast-forward target LSN is set, but no snapshot available for partition!",
"A log trim gap was encountered, but no snapshot is available for this partition! {}",
recovery_guide_msg,
);

That's helpful for our users :-)
