
enable pprof cpu profiling #5081

Merged: 3 commits merged into main from pprof_rs on Jun 5, 2024
Conversation

@PSeitz (Contributor) commented on Jun 5, 2024:

This adds two routes to control pprof:

- Enable CPU profiling (100 Hz): http://localhost:7280/pprof/start
- Get the profile as a flamegraph: http://localhost:7280/pprof/flamegraph

The routes are behind the `pprof` feature flag:

```sh
cargo install --locked --path . --features pprof
```
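For reference, a minimal self-contained sketch of what such routes can look like with warp and the pprof crate. This is an illustration, not the actual Quickwit code: error handling is simplified, and it assumes pprof is built with its `flamegraph` feature.

```rust
use std::sync::{Arc, Mutex};

use warp::{Filter, Reply};

// Shared profiler handle: Some(_) while profiling is running.
type SharedGuard = Arc<Mutex<Option<pprof::ProfilerGuard<'static>>>>;

#[tokio::main]
async fn main() {
    let guard: SharedGuard = Arc::new(Mutex::new(None));
    let start_guard = guard.clone();

    // GET /pprof/start: begin sampling the process at 100 Hz.
    let start = warp::path!("pprof" / "start").map(move || {
        let mut slot = start_guard.lock().unwrap();
        if slot.is_none() {
            *slot = pprof::ProfilerGuard::new(100).ok();
        }
        "profiling started"
    });

    // GET /pprof/flamegraph: stop sampling and render the profile as SVG.
    let flamegraph = warp::path!("pprof" / "flamegraph").map(move || {
        let report = guard
            .lock()
            .unwrap()
            .take()
            .and_then(|profiler| profiler.report().build().ok());
        match report {
            Some(report) => {
                let mut svg = Vec::new();
                report.flamegraph(&mut svg).ok();
                warp::http::Response::builder()
                    .header("content-type", "image/svg+xml")
                    .body(svg)
                    .unwrap()
                    .into_response()
            }
            None => warp::http::StatusCode::BAD_REQUEST.into_response(),
        }
    });

    warp::serve(start.or(flamegraph))
        .run(([127, 0, 0, 1], 7280))
        .await;
}
```

To try it: hit http://localhost:7280/pprof/start, let the workload run, then open http://localhost:7280/pprof/flamegraph in a browser.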

github-actions bot commented Jun 5, 2024

On SSD:

Average search latency is 1.0x that of the reference (lower is better).
Ref run id: 1870, ref commit: b57eb12

On GCS:

Average search latency is 1.2x that of the reference (lower is better).
Ref run id: 1871, ref commit: b57eb12

```rust
    })
};

fn get_flamegraph(profiler_guard: Arc<Mutex<Option<ProfilerGuard>>>) -> impl warp::Reply {
```
A reviewer (Contributor) commented:

Wouldn't the following have worked too?

Suggested change:

```diff
-fn get_flamegraph(profiler_guard: Arc<Mutex<Option<ProfilerGuard>>>) -> impl warp::Reply {
+fn get_flamegraph(profiler_guard: &Mutex<Option<ProfilerGuard>>) -> impl warp::Reply {
```

@PSeitz (Author) replied:

In the old version, yes; in the new version we need a `'static` lifetime.
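To illustrate the constraint with a hypothetical example (not from this PR): warp filters are typically stored and served for the lifetime of the process, so the closures they capture must be `'static` and cannot borrow a caller's `Mutex`. An `Arc` clone moved into the closure satisfies this.

```rust
use std::sync::{Arc, Mutex};

use warp::Filter;

// Compiles: the moved Arc gives the closure ownership of the state,
// so the returned filter is `'static` and can outlive the caller.
fn counter_route(
    state: Arc<Mutex<u64>>,
) -> impl Filter<Extract = (String,), Error = warp::Rejection> + Clone {
    warp::path!("count").map(move || {
        let mut count = state.lock().unwrap();
        *count += 1;
        count.to_string()
    })
}

// A `state: &Mutex<u64>` parameter would not compile once the filter
// must be `'static` (e.g. passed to `warp::serve` from another function):
// the returned filter would borrow from the caller's stack frame.
```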

```rust
if let Some(profiler) = guard.take() {
    if let Ok(report) = profiler.report().build() {
        let mut buffer = Vec::new();
        if report.flamegraph(&mut buffer).is_ok() {
```
A reviewer (Contributor) commented:

Do you know if generating the report is CPU-heavy? If so, we should probably run it on the thread pool for CPU-intensive work: `run_cpu_intensive` in `quickwit_common`.

@PSeitz (Author) replied:

Yes, they can be CPU-intensive; I moved it to a new thread.
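The pattern under discussion, as a hedged sketch (hypothetical helper name `render_flamegraph`; the actual diff further below uses `spawn_blocking` inside `save_flamegraph`): build the pprof report on a blocking thread so the async runtime's worker threads are not stalled.

```rust
use std::sync::{Arc, Mutex};

use tokio::task::spawn_blocking;

// Stop the profiler, build the report, and render the flamegraph SVG,
// all off the async runtime's worker threads.
async fn render_flamegraph(
    profiler_guard: Arc<Mutex<Option<pprof::ProfilerGuard<'static>>>>,
) -> Option<Vec<u8>> {
    spawn_blocking(move || {
        let profiler = profiler_guard.lock().unwrap().take()?;
        let report = profiler.report().build().ok()?;
        let mut svg = Vec::new();
        report.flamegraph(&mut svg).ok()?;
        Some(svg)
    })
    .await
    .ok()
    .flatten()
}
```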

@fulmicoton (Contributor) left a comment:

That looks good for experimentation, but this is quite the footgun. Are the gathered statistics really bounded?

What would happen if someone called start and never called stop? Can we guard against that?

I assume your point is that it feels silly to expect a GET request to run for over 30s. Could it be two endpoints like you did, but:

- pprof/run would start profiling for 30s, and
- pprof/flamegraph would return the last computed profile, or something like that?

In your test, was 100 Hz sufficient? Should we maybe make the frequency and the duration configurable?
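One possible shape for these safeguards, as a hypothetical sketch (not the PR's actual code: the parameter names, clamping limits, and caching structure are illustrative assumptions): profiling runs for a bounded duration with a configurable frequency, then stops itself and caches the flamegraph.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

use serde::Deserialize;
use warp::Filter;

#[derive(Deserialize)]
struct ProfilerParams {
    // Sampling frequency in Hz and profiling duration in seconds;
    // both optional, with illustrative defaults and upper bounds.
    frequency: Option<i32>,
    duration: Option<u64>,
}

type SharedGuard = Arc<Mutex<Option<pprof::ProfilerGuard<'static>>>>;
type LastFlamegraph = Arc<Mutex<Option<Vec<u8>>>>;

// GET /pprof/start?frequency=100&duration=30: profile for a bounded time,
// then stop automatically and cache the flamegraph, so a forgotten "stop"
// can never leave the profiler running indefinitely.
fn start_route(
    guard: SharedGuard,
    last_flamegraph: LastFlamegraph,
) -> impl Filter<Extract = (&'static str,), Error = warp::Rejection> + Clone {
    warp::path!("pprof" / "start")
        .and(warp::query::<ProfilerParams>())
        .map(move |params: ProfilerParams| {
            let frequency = params.frequency.unwrap_or(100).clamp(1, 1000);
            let duration = params.duration.unwrap_or(30).min(300);
            let mut slot = guard.lock().unwrap();
            if slot.is_some() {
                return "profiler already running";
            }
            *slot = pprof::ProfilerGuard::new(frequency).ok();
            drop(slot);

            // Safeguard: after the deadline, stop the profiler and cache
            // the rendered flamegraph for a later GET /pprof/flamegraph.
            let guard = guard.clone();
            let last_flamegraph = last_flamegraph.clone();
            tokio::spawn(async move {
                tokio::time::sleep(Duration::from_secs(duration)).await;
                if let Some(profiler) = guard.lock().unwrap().take() {
                    if let Ok(report) = profiler.report().build() {
                        let mut svg = Vec::new();
                        if report.flamegraph(&mut svg).is_ok() {
                            *last_flamegraph.lock().unwrap() = Some(svg);
                        }
                    }
                }
            });
            "profiling started"
        })
}
```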

Follow-up commits:

- add safeguards
- add query parameters
- keep and return last flamegraph
```rust
async fn save_flamegraph(
    profiler_state: Arc<Mutex<ProfilerState>>,
) -> tokio::task::JoinHandle<()> {
    spawn_blocking(move || {
```
A reviewer (Contributor) commented:

Technically, it is better to use `quickwit_common::run_cpu_intensive`, because `spawn_blocking` is backed by an unbounded thread pool.

That being said, I don't think we care much here. Fix if you want.

@PSeitz (Author) commented on Jun 5, 2024:

I didn't see memory issues while collecting, but there is usually a performance hit, so we should avoid running it indefinitely by accident.

100 Hz is usually fine, except for very short-running queries.

@PSeitz merged commit 01571db into main on Jun 5, 2024; 5 checks passed. @PSeitz then deleted the pprof_rs branch.
@guilload (Member) commented on Jun 5, 2024:

I'm going to move this endpoint to api/developer/pprof*.
