High caf thread utilization with 512 workers and 96 loggers #352

Open · awelzel opened this issue May 12, 2023 · 10 comments

awelzel commented May 12, 2023

On the zeekorg Slack, users reported seeing high logger CPU usage in a cluster of 512 Zeek workers and 96 Zeek loggers (distributed over multiple physical systems).

A perf top and flamegraph of a single logging process indicate that most time is spent in the caf layer, roughly 72% in on_consumed_data(). The stacks might not be fully valid.

(flamegraph: current_perf)

In contrast, Zeek's logging threads, librdkafka threads and lz4 compression represent ~10% of the profile to the right. This looks like a pathological case being triggered.


awelzel commented May 12, 2023

The following supervisor setup, starting 80 workers and 12 loggers, lowering Broker::log_batch_interval to 0.25 seconds and doing artificial logging to a single stream, does provoke some overhead in the caf thread, again in on_consumed_data(). I didn't have easy access to a system where I could run a 512-worker experiment, but maybe this serves as a starting point.

(flamegraph: broker-caf-on-consumed-data)

# Run with zeek -j this-script.zeek

@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = T;


redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

event do_create_node(sn: Supervisor::NodeConfig)
	{
	local res = Supervisor::create(sn);
	print "starting",  sn$name;

	if ( res != "" )
		print fmt("supervisor failed to create node '%s': %s", sn, res);
	}

event zeek_init()
	{
	# print Cluster::local_node_type();
	if ( ! Supervisor::is_supervisor() )
		return;

	Broker::listen("127.0.0.1", 9999/tcp);

	local cluster: table[string] of Supervisor::ClusterEndpoint;
	cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
	cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];

	local loggers = 12;
	local i = 0;
	while ( i < loggers )
		{
		++i;
		local lp = count_to_port(10010 + i, tcp);
		local logger_name = fmt("l-%d", i);
		cluster[logger_name] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=lp];
		}

	local workers = 80;
	local worker_port_offset = 10100;
	i = 0;
	while ( i < workers )
		{
		++i;
		local name = fmt("w-%d", i);
		# local p = count_to_port(worker_port_offset + i, tcp);
		cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
		}

	i = 0;
	for ( n, ep in cluster )
		{
		++i;
		local sn = Supervisor::NodeConfig($name=n);
		sn$cluster = cluster;
		sn$directory = n;
		sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "16");

		if ( ep?$interface )
			sn$interface = ep$interface;

		schedule double_to_interval(i * 0.1) { do_create_node(sn) };
		}
	}

@if ( ! Supervisor::is_supervisor() )

redef Broker::log_batch_interval = 0.25sec;

module TestLog;

redef enum Log::ID += { LOG };

type Info: record {
	ts: time &log &default=current_time();
	node: string &log &default=Cluster::node;
	code: count &log;
	status: string &log;
	rand: count &log &default=rand(10000);
};

event log_write()
	{
	Log::write(LOG, [$code=200, $status="Hello"]);
	schedule 0.01sec { log_write() };
	}

event zeek_init()
	{
	Log::create_stream(LOG, [$path="my-http", $columns=Info]);
	if ( Cluster::local_node_type() == Cluster::WORKER )
		schedule 0.01sec { log_write() };
	}
@endif


awelzel commented May 13, 2023

On Slack, the user provided a new flamegraph that looks promising: ~55% of the time is spent in std::max_element() within mcast::max_buffered().

https://github.com/zeek/actor-framework/blob/b50e2249761c2db5d28f94b64a6df92d0c3d0cc2/libcaf_core/caf/flow/op/mcast.hpp#L137-L147

It can probably be assumed that _states grew large and that the linear search for any of the min/max buffered/demand values ends up taking most of the CPU time.

(flamegraph: perf_20230513102803)
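
To illustrate the suspected hot path, here is a minimal, self-contained C++ sketch of that pattern. It is not the actual CAF code; the type and member names are invented, and it only mirrors the shape of the linked max_buffered():

// Simplified illustration, not the actual CAF implementation.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <memory>
#include <vector>

// One entry per observer/subscriber (hypothetical name).
struct observer_state {
  std::deque<int> buf; // items buffered for this observer
  size_t demand = 0;   // outstanding demand signaled by this observer
};

using state_ptr = std::shared_ptr<observer_state>;

// Hypothetical stand-in for the multicast operator state.
class mcast_sketch {
public:
  void add_observer() {
    states_.push_back(std::make_shared<observer_state>());
  }

  // Linear scan over all observer states, mirroring the linked max_buffered():
  // every call walks the whole vector, i.e. O(N) in the number of observers.
  size_t max_buffered() const {
    if (states_.empty())
      return 0;
    auto cmp = [](const state_ptr& x, const state_ptr& y) {
      return x->buf.size() < y->buf.size();
    };
    return (*std::max_element(states_.begin(), states_.end(), cmp))->buf.size();
  }

private:
  std::vector<state_ptr> states_;
};

int main() {
  mcast_sketch m;
  for (int i = 0; i < 512; ++i) // e.g. one state per connected worker
    m.add_observer();
  std::cout << "max buffered: " << m.max_buffered() << '\n';
}

If the logger holds one such state per connected peer (e.g. 512 workers), every call performs an O(N) scan, and per the flamegraphs this is reached from the on_consumed_data() path for every consumed batch.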

Neverlord (Member) commented:

Nice, thanks for doing all that digging. 👍


awelzel commented May 15, 2023

I can somewhat reproduce this on a bigger machine with the original reproducer, but I also trigger a high percentage of native_queued_spin_lock_slowpath on that system and don't have great symbols for libc. If a fix that improves the op::mcast parts becomes available, it would still be reasonable to test.

(flamegraph: 8-loggers-196-workers)

Neverlord (Member) commented:

I can somewhat reproduce this on a bigger machine with the original reproducer

Thanks, Arne. That's good to know! I'll ping you when I have a patch ready to try.

Neverlord (Member) commented:

@awelzel can you try topic/neverlord/gh-352? The branch contains the Broker-side fix for #366 plus a CAF-side patch to address frequent max_element calls.

I've written a small benchmark that pushes 1k messages (integers) to up to 1024 observers. With the patch, the runtime drops significantly:

(benchmark results screenshot: Screenshot 2023-06-18 at 14 14 03)

Hopefully the synthetic benchmark also translates to real results in Broker.
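
For illustration only, here is a generic, stand-alone sketch of the kind of optimization that avoids calling std::max_element on every event: cache the maximum and rescan lazily only when the previous maximum may have shrunk. This is not the actual code on the branch; the class and member names are made up:

// Simplified stand-alone sketch of avoiding a full std::max_element scan on
// every consumed-data event. NOT the actual patch on topic/neverlord/gh-352.
#include <algorithm>
#include <cstddef>
#include <vector>

class buffered_max {
public:
  explicit buffered_max(size_t num_observers) : buffered_(num_observers, 0) {}

  // Buffer grows: the cached maximum can be updated in O(1).
  void on_push(size_t observer, size_t n) {
    buffered_[observer] += n;
    cached_max_ = std::max(cached_max_, buffered_[observer]);
  }

  // Buffer shrinks: only if this observer held the maximum can the overall
  // maximum have decreased, so mark the cache dirty and rescan lazily.
  void on_consumed(size_t observer, size_t n) {
    bool was_max = buffered_[observer] == cached_max_;
    buffered_[observer] -= n;
    if (was_max)
      dirty_ = true;
  }

  size_t max_buffered() {
    if (dirty_) {
      cached_max_ = *std::max_element(buffered_.begin(), buffered_.end());
      dirty_ = false;
    }
    return cached_max_;
  }

private:
  std::vector<size_t> buffered_; // buffered item count per observer
  size_t cached_max_ = 0;
  bool dirty_ = false;
};

int main() {
  buffered_max m(1024);
  m.on_push(3, 10);
  m.on_consumed(3, 10);
  return m.max_buffered() == 0 ? 0 : 1;
}

The actual patch may work quite differently; the point is only that the per-event bookkeeping no longer scales with the number of observers in the common case.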


awelzel commented Jun 22, 2023

I've re-tested master with 196 workers, 16 loggers and a log_batch_interval of 0.10sec.

The workers script was tweaked to write nlogger messages every log_batch_interval to two log streams.

This results in ~65% CPU usage per logger process with the following breakdown / flamegraph; most overhead is in the finalize() -> on_consumed_some() path.
(flamegraph: master)

The same reproducer with branch topic/neverlord/gh-352 reduced the CPU usage of the logger process to ~35%. Flamegraph as follows; the overhead in on_consumed_some() is no longer apparent. Nice job!

(flamegraph: 196-16-gh-352)

awelzel added a commit to zeek/zeek that referenced this issue Jun 24, 2023

awelzel commented Jun 28, 2023

An external user has reported that logger CPU usage decreased significantly:

Single logger (16c64g) CPU usage decreased from 76% to 24%.

However, they have observed two occurrences where worker memory increased rapidly from 16 GB to 64 GB after running seemingly stable for ~20 hours. I have requested more telemetry/metrics of the involved logger processes.

A flamegraph of one of their loggers after deploying the patch does not show max_element().
(flamegraph: perf)


awelzel commented Jun 28, 2023

From Slack, the suspicion is that loggers stop accepting log messages, causing workers to run OOM. The user reports this is not happening with their vanilla 5.2.0 deployment.

Thanks for the details. The logger's CPU usage plummets around the time when worker memory increases rapidly, so one suspicion is that loggers stop processing log messages and workers queue them on their end. We will need to see if this can be reproduced in a testing setup.

This triggered in the user's environment only after ~20 hours with a pretty large cluster under production load, so I'm not sure what the chances are of reproducing this in a testing environment, but we could give it a shot.

(https://zeekorg.slack.com/archives/CSZBXF6TH/p1687920539949679?thread_ts=1683797785.775629&cid=CSZBXF6TH)

Neverlord (Member) commented:

Thank you for taking care of playing this back to the reporter. 🙂

Let me try to reproduce this bug with a stress test. I'll ping you if I have a new version to test.

Neverlord moved this to In Progress in Zeek 6.1 on Jul 6, 2023
ckreibich removed this from Zeek 6.1 on Oct 24, 2023