High caf thread utilization with 512 workers and 96 loggers #352

Open · awelzel opened this issue May 12, 2023 · 10 comments

awelzel commented May 12, 2023

On the zeekorg Slack, users reported seeing high logger CPU usage in a cluster of 512 Zeek workers and 96 Zeek loggers (distributed over multiple physical systems).

A perf top and flamegraph of a single logging process indicate that most time is spent in the caf layer, roughly 72% in on_consumed_data(). The stacks might not be fully valid.

(flamegraph: current_perf)

In contrast, Zeek's logging threads, librdkafka threads and lz4 compression represent ~10% of the profile to the right. This looks like a pathological case being triggered.


awelzel commented May 12, 2023

The following supervisor setup, starting 80 workers and 12 loggers, lowering Broker::log_batch_interval to 0.25 seconds and doing artificial logging to a single stream, does provoke some overhead in the caf thread, again in on_consumed_data(). I didn't have easy access to a system where I could run a 512-worker experiment, but maybe this serves as a starting point.

(flamegraph: broker-caf-on-consumed-data)

# Run with zeek -j this-script.zeek

@load base/frameworks/cluster
@load base/frameworks/reporter

redef LogAscii::use_json = T;
redef Broker::disable_ssl = T;


redef Reporter::info_to_stderr = T;
redef Reporter::warnings_to_stderr = T;
redef Reporter::errors_to_stderr = T;

event do_create_node(sn: Supervisor::NodeConfig)
	{
	local res = Supervisor::create(sn);
	print "starting",  sn$name;

	if ( res != "" )
		print fmt("supervisor failed to create node '%s': %s", sn, res);
	}

event zeek_init()
	{
	# print Cluster::local_node_type();
	if ( ! Supervisor::is_supervisor() )
		return;

	Broker::listen("127.0.0.1", 9999/tcp);

	local cluster: table[string] of Supervisor::ClusterEndpoint;
	cluster["manager"] = [$role=Supervisor::MANAGER, $host=127.0.0.1, $p=10000/tcp];
	cluster["proxy"] = [$role=Supervisor::PROXY, $host=127.0.0.1, $p=10001/tcp];

	local loggers = 12;
	local i = 0;
	while ( i < loggers )
		{
		++i;
		local lp = count_to_port(10010 + i, tcp);
		local logger_name = fmt("l-%d", i);
		cluster[logger_name] = [$role=Supervisor::LOGGER, $host=127.0.0.1, $p=lp];
		}

	local workers = 80;
	local worker_port_offset = 10100;
	i = 0;
	while ( i < workers )
		{
		++i;
		local name = fmt("w-%d", i);
		# local p = count_to_port(worker_port_offset + i, tcp);
		cluster[name] = [$role=Supervisor::WORKER, $host=127.0.0.1, $p=0/tcp, $interface="lo"];
		}

	i = 0;
	for ( n, ep in cluster )
		{
		++i;
		local sn = Supervisor::NodeConfig($name=n);
		sn$cluster = cluster;
		sn$directory = n;
		sn$env = table(["ZEEK_DEFAULT_CONNECT_RETRY"] = "16");

		if ( ep?$interface )
			sn$interface = ep$interface;

		schedule double_to_interval(i * 0.1) { do_create_node(sn) };
		}
	}

@if ( ! Supervisor::is_supervisor() )

redef Broker::log_batch_interval = 0.25sec;

module TestLog;

redef enum Log::ID += { LOG };

type Info: record {
	ts: time &log &default=current_time();
	node: string &log &default=Cluster::node;
	code: count &log;
	status: string &log;
	rand: count &log &default=rand(10000);
};

event log_write()
	{
	Log::write(LOG, [$code=200, $status="Hello"]);
	schedule 0.01sec { log_write() };
	}

event zeek_init()
	{
	Log::create_stream(LOG, [$path="my-http", $columns=Info]);
	if ( Cluster::local_node_type() == Cluster::WORKER )
		schedule 0.01sec { log_write() };
	}
@endif


awelzel commented May 13, 2023

On Slack, the user provided a new flamegraph that looks promising: ~55% of the time is spent in std::max_element() within mcast::max_buffered().

https://github.com/zeek/actor-framework/blob/b50e2249761c2db5d28f94b64a6df92d0c3d0cc2/libcaf_core/caf/flow/op/mcast.hpp#L137-L147

It can probably be assumed that _states grew large and that the linear search for any of the min/max buffered/demand values ends up taking most of the CPU time.

(flamegraph: perf_20230513102803)
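
To illustrate the suspected hot path, here is a minimal, self-contained C++ sketch of that pattern. It is not the actual CAF code; the type and member names are invented, and it only mirrors the shape of the linked max_buffered():

// Simplified illustration, not the actual CAF implementation.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <memory>
#include <vector>

// One entry per observer/subscriber (hypothetical name).
struct observer_state {
  std::deque<int> buf; // items buffered for this observer
  size_t demand = 0;   // outstanding demand signaled by this observer
};

using state_ptr = std::shared_ptr<observer_state>;

// Hypothetical stand-in for the multicast operator state.
class mcast_sketch {
public:
  void add_observer() {
    states_.push_back(std::make_shared<observer_state>());
  }

  // Linear scan over all observer states, mirroring the linked max_buffered():
  // every call walks the whole vector, i.e. O(N) in the number of observers.
  size_t max_buffered() const {
    if (states_.empty())
      return 0;
    auto cmp = [](const state_ptr& x, const state_ptr& y) {
      return x->buf.size() < y->buf.size();
    };
    return (*std::max_element(states_.begin(), states_.end(), cmp))->buf.size();
  }

private:
  std::vector<state_ptr> states_;
};

int main() {
  mcast_sketch m;
  for (int i = 0; i < 512; ++i) // e.g. one state per connected worker
    m.add_observer();
  std::cout << "max buffered: " << m.max_buffered() << '\n';
}

If the logger holds one such state per connected peer (e.g. 512 workers), every call performs an O(N) scan, and per the flamegraphs this is reached from the on_consumed_data() path for every consumed batch.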

Neverlord (Member) commented:

Nice, thanks for doing all that digging. 👍


awelzel commented May 15, 2023

I can somewhat reproduce this on a bigger machine with the original reproducer, but I also trigger a high percentage of native_queued_spin_lock_slowpath on that system and don't have great symbols for libc. If a fix that improves the op::mcast parts becomes available, it would still be reasonable to test.

(flamegraph: 8-loggers-196-workers)

Neverlord (Member) commented:

I can somewhat reproduce this on a bigger machine with the original reproducer

Thanks, Arne. That's good to know! I'll ping you when I have a patch ready to try.

Neverlord (Member) commented:

@awelzel can you try topic/neverlord/gh-352? The branch contains the Broker-side fix for #366 plus a CAF-side patch to address frequent max_element calls.

I've written a small benchmark that pushes 1k messages (integers) to up to 1024 observers. With the patch, the runtime drops significantly:

(benchmark results screenshot: Screenshot 2023-06-18 at 14 14 03)

Hopefully the synthetic benchmark also translates to real results in Broker.
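
For illustration only, here is a generic, stand-alone sketch of the kind of optimization that avoids calling std::max_element on every event: cache the maximum and rescan lazily only when the previous maximum may have shrunk. This is not the actual code on the branch; the class and member names are made up:

// Simplified stand-alone sketch of avoiding a full std::max_element scan on
// every consumed-data event. NOT the actual patch on topic/neverlord/gh-352.
#include <algorithm>
#include <cstddef>
#include <vector>

class buffered_max {
public:
  explicit buffered_max(size_t num_observers) : buffered_(num_observers, 0) {}

  // Buffer grows: the cached maximum can be updated in O(1).
  void on_push(size_t observer, size_t n) {
    buffered_[observer] += n;
    cached_max_ = std::max(cached_max_, buffered_[observer]);
  }

  // Buffer shrinks: only if this observer held the maximum can the overall
  // maximum have decreased, so mark the cache dirty and rescan lazily.
  void on_consumed(size_t observer, size_t n) {
    bool was_max = buffered_[observer] == cached_max_;
    buffered_[observer] -= n;
    if (was_max)
      dirty_ = true;
  }

  size_t max_buffered() {
    if (dirty_) {
      cached_max_ = *std::max_element(buffered_.begin(), buffered_.end());
      dirty_ = false;
    }
    return cached_max_;
  }

private:
  std::vector<size_t> buffered_; // buffered item count per observer
  size_t cached_max_ = 0;
  bool dirty_ = false;
};

int main() {
  buffered_max m(1024);
  m.on_push(3, 10);
  m.on_consumed(3, 10);
  return m.max_buffered() == 0 ? 0 : 1;
}

The actual patch may work quite differently; the point is only that the per-event bookkeeping no longer scales with the number of observers in the common case.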


awelzel commented Jun 22, 2023

I've re-tested master with 196 workers, 16 loggers and a log_batch_interval of 0.10sec.

The workers script was tweaked to write nlogger messages every log_batch_interval to two log streams.

This results in ~65% CPU usage per logger process with the following breakdown / flamegraph; most overhead is in the finalize() -> on_consumed_some() path.
(flamegraph: master)

The same reproducer with branch topic/neverlord/gh-352 reduced the CPU usage of the logger process to ~35%. Flamegraph as follows; the overhead in on_consumed_some() is no longer apparent. Nice job!

(flamegraph: 196-16-gh-352)

awelzel added a commit to zeek/zeek that referenced this issue Jun 24, 2023

awelzel commented Jun 28, 2023

An external user has reported that logger CPU usage decreased significantly:

Single logger (16c64g) CPU usage decreased from 76% to 24%.

However, they have observed two occurrences where worker memory increased rapidly from 16 GB to 64 GB after running seemingly stable for ~20 hours. I have requested more telemetry/metrics of the involved logger processes.

A flamegraph of one of their loggers after deploying the patch does not show max_element().
(flamegraph: perf)


awelzel commented Jun 28, 2023

From Slack, the suspicion is that loggers stop accepting log messages, causing workers to run OOM. The user reports this is not happening with their vanilla 5.2.0 deployment.

Thanks for the details. The logger's CPU usage plummets around the time when worker memory increases rapidly, so one suspicion is that loggers stop processing log messages and workers queue them on their end. We will need to see if this can be reproduced in a testing setup.

This triggered in the user's environment only after ~20 hours with a pretty large cluster under production load, so I'm not sure what the chances are of reproducing this in a testing environment, but we could give it a shot.

(https://zeekorg.slack.com/archives/CSZBXF6TH/p1687920539949679?thread_ts=1683797785.775629&cid=CSZBXF6TH)

Neverlord (Member) commented:

Thank you for taking care of playing this back to the reporter. 🙂

Let me try to reproduce this bug with a stress test. I'll ping you if I have a new version to test.

Neverlord moved this to In Progress in Zeek 6.1 on Jul 6, 2023
ckreibich removed this from Zeek 6.1 on Oct 24, 2023