
Implement new variant type and builder API #368

Closed
wants to merge 3 commits

Conversation

Neverlord
Member

This is a huge "code dump", so let me first explain the big picture here. 🙂

The performance in Broker is held back by broker::data. Performance problems related to this type keep coming up. In #318, I've built a barebones alternative to broker::data (named data_view at the time) that achieved a ~3x speedup over broker::data when deserializing a simple (Zeek) event-like structure.

It's not really a "view", though. After a bit of thinking, I've settled on broker::variant for "the new thing". The reasoning is simply that broker::variant is a tagged union type just like std::variant, except that it hard-codes the set of allowed types. How can it be so much faster than broker::data? Custom allocation strategy. broker::data uses "regular" STL types that allocate stuff on the heap. As a result, a Zeek event allocates multiple memory regions and scatters a single object all over the heap.

A broker::variant always uses a monotonic buffer resource (albeit a Broker replacement type that we use since the STL type isn't available everywhere yet) to build the nested data structure. The workflow is now always that we start from a serialized version of the value, iterate it, and build the nested structure as we go. Once a variant is assembled, it is immutable and can be passed around freely in a thread-safe manner.

The interaction between types is as follows:

  • variant_data simply wraps the std::variant that holds the actual data. It is very similar to broker::data, except that we only store pointers to the container types. Since everything is allocated from the same allocator, we can safely pass around raw pointers. Further, we don't need to call any destructors: everything lives in the same memory region, and if that region gets discarded, all of the objects in it conveniently disappear with it.
  • envelope holds a shared pointer to the buffer resource alongside the variant_data that lives in the buffer.
  • variant is a high-level wrapper type that gives us a nice and clean interface without having to care about the std::variant and the "trivial" types. It holds a (raw) pointer to the actual variant_data plus a (shared) pointer to the envelope that owns the memory of the data.
  • variant_list is a high-level wrapper to replace broker::vector.
  • variant_set is a high-level wrapper to replace broker::set.
  • variant_table is a high-level wrapper to replace broker::table.

Because we now need to build an envelope before we can construct the actual broker::variant, we also add a couple of builder types. The builders generate the serialized version of the object. Side note: for string data, the variant simply points back to the buffer it was deserialized from. Ultimately, this means we can have this workflow in Broker:

  1. Copy bytes from the network into a buffer.
  2. Hand the buffer to an envelope.
  3. Deserialize the variant from the buffer "in-place".

Because the envelope holds on to the serialized version of the object, we can trivially serialize it when sending it over to the network: just copy the bytes from the buffer to the network.

This first PR just adds all of the new tools to the code base. Ultimately, we want to get rid of the old types. The data_message will be replaced by envelope and data will be replaced by variant. Of course we'll do this in small increments. We will also keep the data type around for a bit to not break APIs immediately and for Python bindings.

By the way, the envelope also holds the topic. However, it's accessible as a string_view to give us more freedom in how and where to store it. Our broker::topic will also get the axe eventually, because it forces us to commit to a std::string.

The critical part of all of this is of course performance. I've re-used the serialization micro benchmark. Here are the results:

$ ./build/broker/release/bin/broker-serialization-benchmark 
2023-06-29T08:14:35+02:00
Running ./build/broker/release/bin/broker-serialization-benchmark
Run on (20 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x10)
  L3 Unified 20480 KiB
Load Average: 15.28, 5.60, 2.80
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
serialization/event_1_data              327 ns          326 ns      1884933
serialization/table_1_data              226 ns          226 ns      3443956
serialization/event_1_builder           137 ns          137 ns      4898016
serialization/table_1_builder           130 ns          130 ns      5298855
deserialization/event_1_data            561 ns          561 ns      1243052
deserialization/table_1_data            220 ns          220 ns      3162898
deserialization/event_1_variant         116 ns          116 ns      6054054
deserialization/table_1_variant         103 ns          103 ns      6742244
deserialization/event_1_envelope        200 ns          200 ns      3534515
deserialization/table_1_envelope        186 ns          186 ns      3686908

The event_1 benchmark uses a nested representation that models a simple Zeek event: (1, 1, ("event_1", (42, "test"))). This is basically what a broker::zeek::Event looks like.

The table_1 benchmark uses a simple table with two key/value pairs: {"first-name": "John", "last-name": "Doe"}.

Serializing a broker::data of this type takes ~330ns. Directly serializing a broker::variant wouldn't be a good comparison here: in the new design, the builders have the job of producing the binary format, and they do so with a speedup of around 2x.

The real deal is deserializing, though. For real code, we would be interested in how long it takes to deserialize an envelope vs. a data; the values for deserializing a "naked" variant are just there for the sake of completeness. So here, we look at 560ns vs. 200ns for event_1 (almost a 3x speedup) and just a small gain (around 20%) for table_1. Of course the gap would widen quickly if the table had more keys or more complex key/value entries.

In practice, Broker will mostly operate on Zeek events and Zeek log entries. Both use the nesting implemented in broker/zeek.hh, so this will give us a nice boost: a ~3x speedup for the deserialization path just from switching to a different data structure. We will combine this with the additional gains from putting radix trees to work for the filtering.

@Neverlord force-pushed the topic/neverlord/variant branch 7 times, most recently from 95b535f to 036fc9a on July 11, 2023
@Neverlord
Member Author

FreeBSD and Windows are unrelated failures. We'll have to sort that out separately.

@ckreibich, @awelzel, @timwoj: anyone up for reviewing this? 🙂

I know it looks like a lot, but it's actually mostly boilerplate code that doesn't require much attention in a review. It's there to make the API easy and safe to use. Classes like variant_data, variant_list and variant_table are basically wrappers with repeating patterns, and the builders are similar. Also, around 700 of the new lines are just tests, plus another ~200 lines of benchmarking code. There are a few key spots in there. For those, I'm happy to schedule a video call where we can go over the key pieces to show where the actual meat is (and where a review should probably focus).

@awelzel
Contributor

awelzel commented Jul 13, 2023

Very general high-level question and possibly not suitable for the PR, but anyhow:

  • Does this change the wire format or is it compatible? There are new low-level parsing and serializing functions added, so I'm wondering.
  • If it were incompatible (but even if not), have there been prior thoughts around using a tool/library with an IDL and code generated (protobuf, Cap'n Proto)? Does that add unwanted external dependencies or has too much overhead?

I've dabbled a bit with writing up a schema in Cap'n Proto. To me, it would seem much less intimidating if we had a broker.capnp and used that. Reading a bit of the description, the ideas are similar in that the in-memory and on-wire formats are identical. This could also open up the possibility to use that format over WebSocket as an alternative to JSON and have other languages generate the parsing code based on the schema file.

Disclaimer: I have no experience with Cap'n Proto otherwise; I've worked with Protobuf previously.

@Neverlord
Member Author

  • Does this change the wire format or is it compatible? There are new low-level parsing and serializing functions added, so I'm wondering.

Sorry for not mentioning this. 🙂

Currently, Broker just uses CAF classes to serialize and deserialize its data. To parse binary data into views, we need to implement the format in Broker. This is 100% the exact same binary representation as before, just implemented in Broker directly. You can serialize a broker::data and read it back as a broker::variant and vice versa. This also means that we no longer need to use the CAF classes after fully migrating to broker::variant.

  • If it were incompatible (but even if not), have there been prior thoughts around using a tool/library with an IDL and code generated (protobuf, Cap'n Proto)? Does that add unwanted external dependencies or has too much overhead?

I don't think there were prior thoughts on doing this. But it would require us to break both the API and the network protocol at the same time. What I wanted to do with this approach was to keep everything on the wire (including the handshake) stable, then migrate Broker internally to the new broker::variant and fade out broker::data at the "edges" later.

Reading a bit of the description, the ideas are similar in that the in-memory and on-wire formats are identical.

The in-memory format for broker::variant is actually different. We parse the binary protocol and build a nested structure out of it, so it's not just a wrapper around a memory block. We keep the original memory block to be able to do shallow parsing and to get serialization for free (because we kept the binary representation around).

Interoperability with broker::data was an important design goal. If we used a completely different approach, we'd basically have to start the API all over and probably have no way to offer a slow migration path, since we'd break compatibility on all fronts: network, data stores (they store serialized broker::data objects) and API. If I read your schema correctly, there's no distinct set type, so we couldn't even provide a compatibility layer.

I don't have much of an opinion on which would ultimately be better, but if we decide to go with a schema-based approach, we'll have a different discussion and a different set of decisions to make.

This could also open the possibility to use that format over websocket as an alternative to JSON and have other languages generate the parsing code based on the schema file.

To do that, I don't think we would need to use Protobuf/msgpack/etc. internally. As long as there is a 1:1 mapping between the two formats, we can translate the binary representation on the fly, just as we convert to and from JSON on the fly. It does incur overhead, of course, since we cannot just write out the in-memory representation. Just pointing out that this is not black & white. 🙂

@ckreibich
Member

A few comments on this:

  • Dominik and I had conversations prior to the arrival of this PR. I did not realize it would turn into this large a changeset, but I am in support of the general goal here, because Broker obviously remains key to cluster performance, so substantial performance gains are worth exploring.
  • That said, I consider all of this experimental until we have Zeek-level (not Broker-level) measurements of the resulting performance impact.
  • There is a growing reality that Broker is the new CAF for us — a black box that only one person on the team can meaningfully maintain. There are many reasons we have gotten here and it is not a good development. It is questionable whether we can (or even should) collectively review a 4K SLOCs Broker changeset in depth given available cycles and project priorities. Let's discuss this further in one of our upcoming team calls.

For now, it looks to me like the next step is to understand how to complete this changeset so we can explore it from Zeek. Could you sketch what remains to be done for this, Dominik, or if it's already there, how to toggle use of the variant vs current code?

@Neverlord
Member Author

It is questionable whether we can (or even should) collectively review a 4K SLOCs Broker changeset in depth given available cycles and project priorities. Let's discuss this further in one of our upcoming team calls.

Not to downplay the patch, but the actual bulk of this PR is new tests, benchmarks and wrapper types that exist to make the API safe and easy to use. Most of this can be skimmed in a review, since it has low complexity and is often quite repetitive. The code that does add complexity lives mostly under src/variant_data.cc, include/broker/format/*, src/broker/format/* and include/broker/enum_value.hh. So it's probably ~1k SLOC that actually requires attention in a review.

For now, it looks to me like the next step is to understand how to complete this changeset so we can explore it from Zeek. Could you sketch what remains to be done for this, Dominik, or if it's already there, how to toggle use of the variant vs current code?

As mentioned earlier, this only adds the new abstractions to Broker. My idea was to have one PR only adding the new pieces to the code base before we start integrating them. So there's no switch to toggle old vs. new yet.

But of course you're right. If there isn't sufficient gain for Zeek, it would be pointless to integrate this. I'll prepare a separate branch with a quick & dirty integration of the new variant to measure the performance impact we can get out of this. I'll get back once I have new numbers, ideally from a Zeek cluster. 🙂

@Neverlord
Member Author

I have prepared a "quick & dirty" integration branch that switches from the old data-based messages to the new envelopes with variant.

Here is the current master as a point of reference:

$ broker-benchmark --verbose --server :3009
rate: 92596 events/s
rate: 456198 events/s
rate: 453580 events/s
rate: 456796 events/s
rate: 453562 events/s
rate: 453822 events/s
rate: 452204 events/s
rate: 436038 events/s
rate: 448977 events/s
rate: 451120 events/s
rate: 443763 events/s
rate: 430116 events/s
rate: 360788 events/s
rate: 449060 events/s
rate: 448680 events/s

The current version can do ~450k events/s with one server and one client process.

Now the same setup with the new message envelopes and variant:

$ broker-benchmark --verbose --server :3009
rate: 401195 events/s
rate: 655294 events/s
rate: 646988 events/s
rate: 663163 events/s
rate: 664540 events/s
rate: 653369 events/s
rate: 634064 events/s
rate: 633869 events/s
rate: 653853 events/s
rate: 621516 events/s
rate: 571799 events/s
rate: 642486 events/s
rate: 644301 events/s
rate: 655255 events/s
rate: 625987 events/s

So for this benchmark, we get an increase of ~40% in events/s (an additional ~200k in absolute numbers).

That's actually much more than I had anticipated for full end-to-end performance.

Of course, that's just a Broker benchmark. I'll port Zeek to the new API next and then see how the full system behaves. We can't really do a maximum-throughput measurement with Zeek, because Zeek doesn't use backpressure. So I'll probably need some help putting together a meaningful benchmark with Zeek. Maybe we can throw a fixed rate at Broker and just compare CPU/memory load?

Fun facts:

  • Using the new server with the old client gives ~500k events/s.
  • Using the old server with the new client gives ~450k events/s again.

@Neverlord force-pushed the topic/neverlord/variant branch from 036fc9a to ab98502 on September 24, 2023
@Neverlord
Member Author

I'll re-integrate this after the other PRs are merged. Closing for now since it'll need rebasing and a Zeek-side PR once it's ready.
