Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impr: propagate VirtualFile metrics via RequestContext #7202

Draft
wants to merge 44 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
5831573
draft impl
problame Feb 24, 2025
26da696
WIP: integrate (this scope() stuff is messy, too many places to get i…
problame Feb 24, 2025
a866402
some more WIP on impl
problame Feb 24, 2025
b688d1e
half-borked attempt at propagating to more parts of the implementatio…
problame Feb 24, 2025
d096da1
salvage the idea by upgrading the WeakHandle twice, plus wrapping the…
problame Feb 24, 2025
24c69f6
implement more propagation to see whether this has legs; method: alwa…
problame Feb 24, 2025
62ed544
first 15% of tests pass on my mac, let's see what CI says
problame Feb 24, 2025
9c2c728
ignore missing timeline scope in unit tests + clippy & cargo fmt fixes
problame Feb 24, 2025
6682e2e
Merge remote-tracking branch 'origin/main' into problame/virtual-file…
problame Feb 24, 2025
abcc319
don't keep the gate open / require two Handles
problame Feb 24, 2025
330e69b
propagate RequestContext to layer downloads
problame Feb 24, 2025
97e8515
context propagation for detach ancestor
problame Feb 24, 2025
9ffdad3
more missing timeline scope discovered by test suite
problame Feb 24, 2025
6e2690f
clippy
problame Feb 24, 2025
60c6178
fixup(propagate RequestContext to layer downloads)
problame Feb 24, 2025
8cbbc40
warning should print backtrace
problame Feb 24, 2025
f821017
secondary downloader
problame Feb 25, 2025
0ebdffa
fix more tests
problame Feb 25, 2025
982edfb
workaround for missing timeline context in on-demand download
problame Feb 25, 2025
2397c44
more(propagate RequestContext to layer downloads)
problame Feb 25, 2025
01c3f55
more(propagate RequestContext to layer downloads)
problame Feb 25, 2025
533fba7
more(propagate RequestContext to layer downloads)
problame Feb 25, 2025
a555f69
more(propagate RequestContext to layer downloads)
problame Feb 25, 2025
e30f9a5
Merge remote-tracking branch 'origin/main' into problame/virtual-file…
problame Feb 25, 2025
6f0c3d6
refactor(pageserver): propagate RequestContext to layer downloads
problame Feb 24, 2025
35388d3
Merge branch 'problame/layer-download-context-propagation' into probl…
problame Feb 25, 2025
c341b99
Merge remote-tracking branch 'origin/main' into problame/virtual-file…
problame Feb 26, 2025
d1e68ff
Merge remote-tracking branch 'origin/main' into problame/virtual-file…
problame Feb 27, 2025
5f83705
fix all the unit tests
problame Feb 27, 2025
37b70a9
https://github.com/neondatabase/neon/pull/7202#discussion_r1970708706
problame Feb 27, 2025
b6cc0c3
don't allocate for the pagestream usecase; avoid the Arc<> around the…
problame Feb 27, 2025
f25b7e5
fixup: forgot to rename one function
problame Feb 27, 2025
935bd87
refactor (extract before merge!): timeline handle KeepingTimelineGate…
problame Feb 27, 2025
42a20d4
store Arc<TimelineMetrics> instead of Arc<Timeline> to avoid risk of …
problame Feb 27, 2025
a0ba1d6
refactor(handle stuff): push gate business into the TenantManger impl
problame Feb 27, 2025
ab7271a
fixup(store Arc<TimelineMetrics> instead of Arc<Timeline> to avoid ri…
problame Feb 27, 2025
f536cf4
don't keep the gate open / require two Handles
problame Feb 24, 2025
839dd10
refactor (extract before merge!): timeline handle KeepingTimelineGate…
problame Feb 27, 2025
3aef56e
wip
problame Feb 27, 2025
c293bf1
store Arc<TimelineMetrics> instead of Arc<Timeline> to avoid risk of …
problame Feb 27, 2025
a1a8d0d
refactor(handle stuff): push gate business into the TenantManger impl
problame Feb 27, 2025
ecaacdb
we actually don't need another Arc around the GateGuard
problame Feb 27, 2025
1778a3e
review doc comment
problame Feb 27, 2025
5767947
Merge branch 'problame/timeline-handle-orthogonality' into problame/v…
problame Feb 27, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
191 changes: 188 additions & 3 deletions pageserver/src/context.rs
Original file line number Diff line number Diff line change
Expand Up @@ -89,16 +89,112 @@
//! [`RequestContext`] argument. Functions in the middle of the call chain
//! only need to pass it on.

use crate::task_mgr::TaskKind;
use std::sync::Arc;

use once_cell::sync::Lazy;
use tracing::warn;
use utils::{id::TimelineId, shard::TenantShardId};

use crate::{
metrics::{StorageIoSizeMetrics, TimelineMetrics},
task_mgr::TaskKind,
tenant::Timeline,
};

// The main structure of this module, see module-level comment.
#[derive(Debug)]
pub struct RequestContext {
task_kind: TaskKind,
download_behavior: DownloadBehavior,
access_stats_behavior: AccessStatsBehavior,
page_content_kind: PageContentKind,
read_path_debug: bool,
scope: Scope,
}

// We wrap the `Arc<StorageIoSizeMetrics>`s inside another Arc to avoid child
// context creation contending for the ref counters of the Arc<Timeline>,
// which are shared among all tasks that operate on the timeline, especially
// concurrent page_service connections.
#[derive(Clone)]
pub(crate) enum Scope {
Global {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
SecondaryTenant {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
SecondaryTimeline {
io_size_metrics: crate::metrics::StorageIoSizeMetrics,
},
Timeline {
#[allow(clippy::redundant_allocation)]
arc_arc: Arc<Arc<TimelineMetrics>>,
},
#[cfg(test)]
UnitTest {
io_size_metrics: &'static crate::metrics::StorageIoSizeMetrics,
},
}

static GLOBAL_IO_SIZE_METRICS: Lazy<crate::metrics::StorageIoSizeMetrics> =
Lazy::new(|| crate::metrics::StorageIoSizeMetrics::new("*", "*", "*"));

impl Scope {
pub(crate) fn new_global() -> Self {
Scope::Global {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
/// NB: this allocates, so, use only at relatively long-lived roots, e.g., at start
/// of a compaction iteration.
pub(crate) fn new_timeline(timeline: &Timeline) -> Self {
Scope::Timeline {
arc_arc: Arc::new(Arc::clone(&timeline.metrics)),
}
}
pub(crate) fn new_page_service_pagestream(
timeline_handle: &crate::tenant::timeline::handle::Handle<
crate::page_service::TenantManagerTypes,
>,
) -> Self {
Scope::Timeline {
arc_arc: Arc::clone(&timeline_handle.metrics),
}
}
pub(crate) fn new_secondary_timeline(
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
) -> Self {
// NB: Secondaries never took proper care of metrics lifecycle, and we don't start doing it now.

let tenant_id = tenant_shard_id.tenant_id.to_string();
let shard_id = tenant_shard_id.shard_slug().to_string();
let timeline_id = timeline_id.to_string();

let io_size_metrics =
crate::metrics::StorageIoSizeMetrics::new(&tenant_id, &shard_id, &timeline_id);
Scope::SecondaryTimeline { io_size_metrics }
}
pub(crate) fn new_secondary_tenant(_tenant_shard_id: &TenantShardId) -> Self {
// Before propagating metrics via RequestContext, the labels were inferred from file path.
// The only user of VirtualFile at tenant scope is the heatmap download & read.
// The inferred labels for the path of the heatmap file on local disk were that of the global metric (*,*,*).
// Thus, we do the same here, and extend that for anything secondary-tenant scoped.
//
// If we want to have (tenant_id, shard_id, '*') labels for secondary tenants in the future,
// we will need to think about the metric lifecycle, i.e., remove them during secondary tenant shutdown,
// like we do for attached timelines. (We don't have attached-tenant-scoped usage of VirtualFile
// at this point, so, we were able to completely side-step tenant-scoped stuff there).
Scope::SecondaryTenant {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
#[cfg(test)]
pub(crate) fn new_unit_test() -> Self {
Scope::UnitTest {
io_size_metrics: &GLOBAL_IO_SIZE_METRICS,
}
}
}

/// The kind of access to the page cache.
Expand Down Expand Up @@ -157,6 +253,7 @@ impl RequestContextBuilder {
access_stats_behavior: AccessStatsBehavior::Update,
page_content_kind: PageContentKind::Unknown,
read_path_debug: false,
scope: Scope::new_global(),
},
}
}
Expand All @@ -171,10 +268,16 @@ impl RequestContextBuilder {
access_stats_behavior: original.access_stats_behavior,
page_content_kind: original.page_content_kind,
read_path_debug: original.read_path_debug,
scope: original.scope.clone(),
},
}
}

pub fn task_kind(mut self, k: TaskKind) -> Self {
self.inner.task_kind = k;
self
}

/// Configure the DownloadBehavior of the context: whether to
/// download missing layers, and/or warn on the download.
pub fn download_behavior(mut self, b: DownloadBehavior) -> Self {
Expand All @@ -199,6 +302,11 @@ impl RequestContextBuilder {
self
}

pub(crate) fn scope(mut self, s: Scope) -> Self {
self.inner.scope = s;
self
}

pub fn build(self) -> RequestContext {
self.inner
}
Expand Down Expand Up @@ -281,7 +389,50 @@ impl RequestContext {
}

fn child_impl(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
Self::new(task_kind, download_behavior)
RequestContextBuilder::extend(self)
.task_kind(task_kind)
.download_behavior(download_behavior)
.build()
}

pub fn with_scope_timeline(&self, timeline: &Arc<Timeline>) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_timeline(timeline))
.build()
}

pub(crate) fn with_scope_page_service_pagestream(
&self,
timeline_handle: &crate::tenant::timeline::handle::Handle<
crate::page_service::TenantManagerTypes,
>,
) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_page_service_pagestream(timeline_handle))
.build()
}

pub fn with_scope_secondary_timeline(
&self,
tenant_shard_id: &TenantShardId,
timeline_id: &TimelineId,
) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_secondary_timeline(tenant_shard_id, timeline_id))
.build()
}

pub fn with_scope_secondary_tenant(&self, tenant_shard_id: &TenantShardId) -> Self {
RequestContextBuilder::extend(self)
.scope(Scope::new_secondary_tenant(tenant_shard_id))
.build()
}

#[cfg(test)]
pub fn with_scope_unit_test(&self) -> Self {
RequestContextBuilder::new(TaskKind::UnitTest)
.scope(Scope::new_unit_test())
.build()
}

pub fn task_kind(&self) -> TaskKind {
Expand All @@ -303,4 +454,38 @@ impl RequestContext {
pub(crate) fn read_path_debug(&self) -> bool {
self.read_path_debug
}

pub(crate) fn io_size_metrics(&self) -> &StorageIoSizeMetrics {
match &self.scope {
Scope::Global { io_size_metrics } => {
let is_unit_test = cfg!(test);
let is_regress_test_build = cfg!(feature = "testing");
if is_unit_test || is_regress_test_build {
panic!("all VirtualFile instances are timeline-scoped");
} else {
use once_cell::sync::Lazy;
use std::sync::Mutex;
use std::time::Duration;
use utils::rate_limit::RateLimit;
static LIMIT: Lazy<Mutex<RateLimit>> =
Lazy::new(|| Mutex::new(RateLimit::new(Duration::from_secs(1))));
let mut guard = LIMIT.lock().unwrap();
guard.call2(|rate_limit_stats| {
warn!(
%rate_limit_stats,
backtrace=%std::backtrace::Backtrace::force_capture(),
"all VirtualFile instances are timeline-scoped",
);
});

io_size_metrics
}
}
Scope::Timeline { arc_arc } => &arc_arc.storage_io_size,
Scope::SecondaryTimeline { io_size_metrics } => io_size_metrics,
Scope::SecondaryTenant { io_size_metrics } => io_size_metrics,
#[cfg(test)]
Scope::UnitTest { io_size_metrics } => io_size_metrics,
}
}
}
Loading
Loading