diff --git a/content/blog/2023-04-01-key-expressions-1.md b/content/blog/2023-04-01-key-expressions-1.md
new file mode 100644
index 00000000..94155f4e
--- /dev/null
+++ b/content/blog/2023-04-01-key-expressions-1.md
@@ -0,0 +1,159 @@
+---
+title: "Key Expression: Addressing infinite resources at (Zetta)scale"
+date: 2023-04-03
+menu: "blog"
+weight: 20230403
+description: "03 April 2023 -- Paris."
+draft: false
+---
+One of the major changes introduced by Zenoh 0.6 Bahamut was a new definition of the [Key Expression Language](https://github.com/eclipse-zenoh/roadmap/blob/main/rfcs/ALL/Key%20Expressions.md).
+With Zenoh 0.7.1, we introduce new data structures that let you interact with this language more easily than ever before.
+
+Come with me on a journey into Zenoh's vision of a Named Data Address Space: Key Expressions (KE). The major parts of this post are all rather independent, so feel free to skip past one if it's not to your taste.
+
+# What are Key Expressions?
+
+Before we talk about _expressions_, let's talk about _keys_.
+Much like telephone networks have numbers and the Internet has IP addresses, Named Data protocols all have their address space, and elements of that space have been known by many names: "demon slayer" (probably not), "topic", "path", "name". We call them "keys", in reference to the fact that one of Zenoh's goals is to allow the easy setup of distributed Key-Value stores.
+
+Early in Zenoh's design, we found that the ability to act not just on a single key, but on a whole set of keys in one operation was a very powerful concept; be it to reduce network usage, or to address keys that we don't _specifically_ know of, but that belong to a known set.
+
+To do so, a "wildcard" syntax similar to that of glob patterns was introduced, and the Key Expression Language (KEL) was born.
+
+## Specifying the KEL.
+
+But there still was a major caveat: the language was largely underspecified, leaving the behaviour of certain patterns up to interpretation.
+To remedy this, we went the way most languages tend to, and stopped considering any text string as a valid KE, by defining a proper language for KEs.
+
+This was done in a classic 3-step program: make KEs better, ???, profit. Want the actual steps?
+1. Redefine KEs as a `/`-separated list of non-empty UTF-8 strings called "chunks". With this, the ambiguities (and many internal debates) about the expected behaviours of `a/b/` and `a//b` in relation to `a/b` were finally laid to rest, since only the latter is a valid KE.
+2. Make the wildcards special chunks, and not special characters. That way `*` now means "any chunk", and `**` means "any amount of any chunks". By raising them above the character level, we make the syntax easier and the parser (which has to be used _a lot_ for routing) faster. `a/*` will intersect with `a/b` and `*/c`, but not with `b/c` nor `a/b/c`. `a/**/c` will intersect with `a/b/c`, but also `a/b/d/c` and `a/c`.
+3. To keep the ability to have sub-chunk wildcards, we define `$*` as the sub-chunk equivalent of `*`: it matches any amount of any characters, but cannot expand across chunks: `a$*/c` will intersect with `ab/c`, but not with `ba/c`. Actually, let's also reserve `$` as the marker for future sub-languages that will allow more precise sub-chunk expressions.
+4. (I lied about the program having 3 steps) Make KEs bijective: by introducing a set of substitution rules to convert certain wildcard combinations into semantically identical combinations, and enforcing that these rules be applied until the expression is stabilized before considering it a valid KE, we can ensure that any KE is the only one that describes its exact set of keys.
+   For example, `*/**/*`, `*/**/**/*` and `**/*/*/**` would mean the same thing (the set of all keys that are made of at least 2 chunks), but by repeatedly applying the `**/* -> */**` and `**/** -> **` rules, they all come down to the same `*/*/**`.
+
+With steps 1 and 2, ambiguities in the language disappear.
+With step 3, we gain extensibility for future features, which will one day allow us to express more precise sets than currently possible. But step 4 is the true hero of the story: thanks to bijectivity, there's no longer a need to worry about different strings meaning the same thing, which might have trapped many people who haven't spent the last year obsessing over KEs. Bijectivity also greatly simplifies the implementation of data structures tailor-made for KEs, such as the one we'll explore in the next part.
+
+# The KE Tree: a Zenoh flavoured data structure.
+
+One thing you might have noticed is that with all these slashes, KEs definitely take after paths. One other aspect they take from paths is their intrinsic hierarchical nature: like most good address spaces, KEs are hierarchical.
+
+And what data structure rhymes with hierarchy? The section title spoiled it, it's the _tree_. But the KeTree isn't just any old tree: it's a tree that's made to help you treat KEs as the sets they represent, complete with _intersection_ and _inclusion_ comparisons.
+
+## Wait, what does it mean for KEs to _intersect_ and _include_?
+
+Glad you asked, inner monologue of my future reader, forcing you to read this question in your mind was definitely helpful!
+
+As alluded to repeatedly in this post, KEs define sets of keys, and operations in Zenoh are addressed by KE, so they affect sets of things. In fact, there are four ways an operation on a KE can affect data associated with a given KE:
+1. If the KEs define disjoint sets (where there doesn't exist a key that belongs to both), they just don't interact.
+2. If the operation KE and the existing data's KE define sets that have keys in common, as `a/*` and `*/b` do since both contain the `a/b` key, they are said to _intersect_. That means the operation will affect at least a subset of the existing data. Intersection is always symmetric: A intersects with B implies that B intersects with A.
+3. If the operation KE's set contains all keys defined by the existing data's KE, the operation KE is said to _include_ the existing data KE. This means that the operation will affect _all_ of the existing data. Note that inclusion is generally asymmetric, but that either "A includes B" or "B includes A" implies that "A and B intersect".
+4. If the operation KE and data KE define the same set, they are equal. This is the only situation where inclusion is symmetrical, and thanks to our previously discussed [3-step program](#specifying-the-kel), this is equivalent to string equality.
+
+Intersection is often the most important comparison, since it means that things addressed by two KEs have to interact in some way, because they share a region of interest. This is the criterion that Zenoh uses to route samples to subscribers, and you'll likely use this criterion too, applied to your business logic, when working with queryables.
+
+Inclusion is a bit more "optimizy": if someone writes to A, which includes B and C, the records for B and C may be erased, since A has now taken over both of their regions of interest.
+
+Equality and disjointness are generally not very useful: equality because inclusion is generally sufficient for most optimizations, and disjointness because it usually just means that two things do not care about each other, which we in turn don't care about.
+
+## Can we go back to the subject at hand now?
+
+Ah, yes. Well, KE trees are just a data structure that lets you efficiently insert and fetch values associated with a KE by equality, but also lets you iterate over KE-value pairs that have KEs that are either intersecting with or included by your query in the fastest manner available. While most of their performance at scale can be attributed to their tree-like structure, they also employ various techniques to reduce the linear factors in CPU-time consumption.
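
To make the intersection semantics described above a bit more tangible, here is a minimal sketch of a store that answers the same kind of query a KE tree does, only slowly. It is deliberately *not* how the KeTree works: it scans every entry of a `BTreeMap`, it only understands the chunk-level `*` and `**` wildcards (no `$*`), and `ke_intersects` and `NaiveKeStore` are names made up for this illustration, not Zenoh APIs.

```rust
use std::collections::BTreeMap;

/// Chunk-wise intersection test: do the two chunk lists share at least one key?
fn chunks_intersect(a: &[&str], b: &[&str]) -> bool {
    match (a.first(), b.first()) {
        // Both expressions are exhausted at the same time: they share a key.
        (None, None) => true,
        // `**` on the left matches nothing, or swallows one chunk of `b` and stays.
        (Some(&"**"), _) => {
            chunks_intersect(&a[1..], b) || (!b.is_empty() && chunks_intersect(a, &b[1..]))
        }
        // Same reasoning with `**` on the right.
        (_, Some(&"**")) => {
            chunks_intersect(a, &b[1..]) || (!a.is_empty() && chunks_intersect(&a[1..], b))
        }
        // Plain chunks must be equal, unless one of them is `*`.
        (Some(&x), Some(&y)) => {
            (x == "*" || y == "*" || x == y) && chunks_intersect(&a[1..], &b[1..])
        }
        // One side ran out of chunks while the other still requires some.
        _ => false,
    }
}

/// Hypothetical helper: splits both KEs on `/` and compares them chunk by chunk.
pub fn ke_intersects(a: &str, b: &str) -> bool {
    let a: Vec<&str> = a.split('/').collect();
    let b: Vec<&str> = b.split('/').collect();
    chunks_intersect(&a, &b)
}

/// Hypothetical linear-scan store, keyed by KE string.
pub struct NaiveKeStore<T> {
    entries: BTreeMap<String, T>,
}

impl<T> NaiveKeStore<T> {
    pub fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    /// Insert by equality: an existing value under the same KE is replaced.
    pub fn insert(&mut self, ke: &str, value: T) {
        self.entries.insert(ke.to_string(), value);
    }

    /// Every stored pair whose KE intersects `query`, in KE order.
    /// This visits all entries, which is the cost a tree is meant to avoid.
    pub fn intersecting(&self, query: &str) -> Vec<(&str, &T)> {
        self.entries
            .iter()
            .filter(|(ke, _)| ke_intersects(ke.as_str(), query))
            .map(|(ke, value)| (ke.as_str(), value))
            .collect()
    }
}
```

With `a/b`, `a/c` and `b/c` inserted, querying `a/**` returns the first two entries and skips `b/c`. Every query costs one comparison per stored KE, which is fine for a handful of entries and is the linear behaviour a tree-shaped structure is there to cut down.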
+
+While there are some interesting things to say about its implementation, its main interest for this current post is that it exists, and will help you handle sets of KE-value pairs in a KE-compliant way much more efficiently and easily than rolling your own implementation (if you disagree, feel free to make your own data structure that does that better, and don't forget to send it to us in a PR, Zenoh is always open to contributions).
+
+Its current implementations make a few design choices that may warrant more Rust-centric blog posts, in which we'll also delve deeper into how to use them. If you have questions about it, be they Rust-centric or about the abstract concept of a KE Tree, feel free to [join our Discord](discord) and ask.
+
+For now, there exist two categories of KeTrees:
+- Fully owned trees (`KeBoxTree`) own all of their nodes. This means you won't be able to safely keep references to their nodes, but they're simpler to use and generally fit well where you would have normally used a `HashMap`.
+- Shared ownership trees, such as `KeArcTree`, allow you to keep references to their nodes outside the tree. `KeArcTree` leverages [`token_cell`](https://crates.io/crates/token-cell) to allow sharing ownership of its nodes and safely mutating them, without needing distinct mutexes on each node. We have plans to experiment with a `petgraph`-based KeTree, which will likely offer the exact same API, using the graph itself as the token, and the node indices as nodes.
+
+Here's what it looks like to use a KeTree:
+```rust
+let mut tree = KeBoxTree::new();
+tree.insert(keyexpr::new("a/b").unwrap(), 1);
+tree.insert(keyexpr::new("a/c").unwrap(), 2);
+tree.insert(keyexpr::new("b/c").unwrap(), 3);
+let intersection = tree.intersection(keyexpr::new("a/**").unwrap());
+// intersection is an iterator which will yield the nodes for "a",
+// "a/b" and "a/c", which will have weights `None`, `Some(1)` and `Some(2)`
+// respectively. The order isn't guaranteed.
+```
+
+# Good addressing with Key Expressions...
+
+Now that you're all caught up on KEs and all of their wonderful properties, let's finally talk about some guidelines that you can follow to design easy-to-use APIs around your KEs, while lightening the load on our infrastructure to keep getting the best performance.
+
+- `declare` operations imply that you are creating state on the infrastructure. No need to panic, that's what your infrastructure is for. One nice property of KE Trees is that as long as they aren't storing any wild KEs (KEs that contain wildcards), they can shortcut intersection and inclusion into a simple fetch, which is generally much faster than iterating. Note that this of course depends on your specific needs: if you end up using only wild KEs in your puts and queries to compensate for not declaring wild KEs, none of these puts and queries will be able to take the shortcut either, and the iteration through intersection will have many more steps.
+- By using `**`, or other fully wild KEs, you're technically consuming your entire address space, which means there won't be any left when you want to add new features, or you'll have to resort to receiver-side filtering. Much like the creation of the universe, occupying the entire address space will make a lot of people very angry and will be widely regarded as a bad move.
+- By turning `org/${org:*}/factory/${factory:*}/path/${path:**}` into `org/factor/path/-/${org:*}/${factory:*}/${path:**}`, you gain the ability to later add `org/factor/path/extension/-/${org:*}/${factory:*}/${path:**}/${extension:*}` to your address space without disrupting the pre-existing one, whereas the former pattern's naive extension would have been `org/${org:*}/factory/${factory:*}/path/${path:**}/extension/${extension:*}`, which is included in the first `org/${org:*}/factory/${factory:*}/path/${path:**}`, and may therefore cause conflicts in your system's semantics.
+- Non-routing parameters should probably stay in the `query`'s parameters. The same goes for publication: non-routing parameters should probably stay in the `value`.
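
The inclusion claim in the address-space guideline above can be checked by hand, but a small sketch makes it easier to play with your own layouts. The patterns are written here with their `${name:...}` wrappers stripped down to bare wildcards. `ke_includes` is a hypothetical helper written for this post, not a Zenoh API; it only handles the chunk-level `*` and `**` wildcards (no `$*`), and it assumes both KEs are in canonical form.

```rust
/// Chunk-wise inclusion test: is every key described by `b` also described by `a`?
fn chunks_include(a: &[&str], b: &[&str]) -> bool {
    match (a.first(), b.first()) {
        // Both expressions are exhausted at the same time.
        (None, None) => true,
        // `**` in `a` matches nothing, or absorbs the next chunk of `b`
        // (including a wildcard chunk, since `**` covers anything).
        (Some(&"**"), _) => {
            chunks_include(&a[1..], b) || (!b.is_empty() && chunks_include(a, &b[1..]))
        }
        // `**` in `b` facing a non-`**` chunk in `a`: `b` describes both
        // "no chunk here" and "one chunk, then `**` again"; `a` must cover both.
        (Some(_), Some(&"**")) => {
            let mut expanded = vec!["*"];
            expanded.extend_from_slice(b);
            chunks_include(a, &b[1..]) && chunks_include(a, &expanded)
        }
        // Plain chunks: `a` must be `*` or the very same chunk as `b`.
        (Some(&x), Some(&y)) => (x == "*" || x == y) && chunks_include(&a[1..], &b[1..]),
        // One side ran out of chunks while the other still requires some.
        _ => false,
    }
}

/// Hypothetical helper: does the KE `a` include the KE `b`?
pub fn ke_includes(a: &str, b: &str) -> bool {
    let a: Vec<&str> = a.split('/').collect();
    let b: Vec<&str> = b.split('/').collect();
    chunks_include(&a, &b)
}

// Interleaved layout from the guideline, and its naive extension.
pub const INTERLEAVED: &str = "org/*/factory/*/path/**";
pub const INTERLEAVED_EXT: &str = "org/*/factory/*/path/**/extension/*";

// Literal-prefix layout, and its extension
// (the trailing `**/*` is written in its canonical form, `*/**`).
pub const PREFIXED: &str = "org/factor/path/-/*/*/**";
pub const PREFIXED_EXT: &str = "org/factor/path/extension/-/*/*/*/**";
```

`ke_includes(INTERLEAVED, INTERLEAVED_EXT)` is true, because the trailing `**` swallows `extension/*`: anything addressed to the extension is also addressed to the original pattern. `ke_includes(PREFIXED, PREFIXED_EXT)` and its reverse are both false, because the two expressions already disagree on a literal chunk (`-` versus `extension`) before any wildcard gets a say.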