
Succinct Tree support? #15

Open
faassen opened this issue Dec 13, 2024 · 31 comments
Labels
enhancement New feature or request

Comments

@faassen

faassen commented Dec 13, 2024

I recently became interested in succinct trees, in particular the balanced parentheses (BP) version with an eye on fully-functional succinct trees. I've been staring at various ways to implement fwd-search, bwd-search and rmq which appear fundamental to support various operations efficiently. I am daunted as I cannot find Rust implementations of this.

I saw an old reddit discussion where you mentioned wanting to look into succinct trees. This is basically just an inquiry about where your thinking about this is now, in the hopes that there's been secret progress on this. :) Thank you!

A brief literature reference:

  • "Fully-Functional Succinct Trees" - Sadakane and Navarro. This is the big one on BP trees. It proposes the range min-max tree to support the operations.
  • "Succinct Trees in Practice" - Arroyuelo et al. This concludes that the fully-functional representation offers good practical trade-offs too.
  • "Simple and Efficient Fully-Functional Succinct Trees" - Cordova and Navarro. Contains a refinement of the range min-max tree which may be easier to implement.
@Cydhra
Owner

Cydhra commented Dec 13, 2024

I haven't made much progress since then, instead focusing on improving the overall API and adding WaveletMatrices. There is an unpushed branch where I began a BP implementation, but that hasn't gone very far yet. There is also the dev_dfuds branch, which contains another attempt at a Depth-First Unary Degree Sequence (DFUDS) encoding of a succinct tree, but I am not happy with that codebase and its performance, which is why I switched to BP (it's simpler and sufficient for most use cases).

Maybe I will continue with it in the future. Technically the dfuds branch already implements fwd_search and bwd_search, but the implementation contains bugs, IIRC.

Thanks for the paper references, I will have a look, especially at the last one because I am very unhappy with my current min-max tree attempts.

@faassen
Author

faassen commented Dec 13, 2024

Thanks for your reply! I'll take a look at the branches. I'm out of my depth here as I only learned about the existence of the whole field of succinct data structures two days ago. :)

I ended up here because I was looking at optimizing xpath queries over XML of all things. BP trees seem to offer good trade-offs in that domain.

Note that the code referenced by the last paper is unfortunately not available; I asked Navarro about that yesterday. (If you don't know anything about succinct data structures yet, you don't know he's super important, and you just drop him an email.)

@Cydhra Cydhra added the enhancement New feature or request label Dec 13, 2024

@faassen
Author

faassen commented Dec 16, 2024

I'm going to give some feedback based on various papers I read. This is not criticism of the code, but a bit of analysis to see whether we can learn from these papers for future improvements, or if we deviate from it, do so consciously. I'm sure you're already aware of much of this, but it is also for my own greater understanding.

First the "Succinct Trees in Practice" paper.

In section 2.3, it mentions the block size is "b = w / 2", where "w" is the bit length of a machine pointer (w is defined in section 1). So on a 64-bit machine this would mean a block size of 32 bits.

It also mentions a "superblock": a group of k blocks forms a superblock. I'm having trouble figuring out what k is, but it does say "k = Θ(w / log w)". With w = 64 bits, w / log w is a bit over 15, so perhaps a superblock size of about k = 16 would make sense. That would make a superblock 1024 bits. But in the implementation section it actually says:

Storing 32-bit numbers and using, say, k = 32 and b = 512, the space is 2 × 2 · 32 · (2n)/(bk) ≈ 0.78% overhead.

So is b 32 or is b 512? I'm confused. I'm curious to explore the motivations for shorter and longer block sizes with this algorithm.

Now to go to paper "Simple and Efficient Fully-Functional Succinct Trees":

In section 2.3, "range min-max trees", it describes the simple binary tree with a base block as the practical solution implemented for this structure (referring to "Succinct Trees in Practice" by Arroyuelo et al.), so I think that's what the current vers implementation does as well, right?

In section 2.2 it mentions primitives:

fwdsearch, bwdsearch, rmq, rMq, mincount, and minselect

We currently implement fwdsearch and bwdsearch, but not rmq, rMq, mincount, and minselect.

mincount is used to implement degree(i) (number of children) and childrank(i) (number of children to the left of node i).

minselect is used to implement child(i, q), the qth child of i.

rmq is used to implement lca(i, j) (lowest common ancestor of i and j).

rMq is used to implement deepestnode(i) (first deepest node in subtree i) and height(i) (the height of i, distance to deepest node).

It also mentions that required are:

rank and select on 0, 1, and 10

I don't think vers has these over 10. It's used to implement leafrank(i) (number of leaves preceding node i), leafselect(k) (kth leaf of the tree), numleaves(i) (number of leaves in subtree i), and leftmostleaf(i) and rightmostleaf(i) (leftmost/rightmost leaf of node i).
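For intuition on why "10" support matters: in BP a leaf is exactly the pair "()", so counting leaves inside a machine word reduces to counting occurrences of the bit pattern 10. A minimal sketch (not vers's actual API; it assumes an LSB-first bit order where 1 is an open parenthesis):

```rust
/// Count occurrences of the pattern "10" (an open parenthesis
/// immediately followed by a close, i.e. a leaf "()") inside one
/// 64-bit word of a BP sequence. Bit i holds sequence position i
/// (LSB-first); `next_bit` is position 64 from the following word,
/// needed when a pattern straddles the word boundary.
fn count_leaf_pattern(word: u64, next_bit: u64) -> u32 {
    // Position p starts "10" iff bit p is 1 and bit p+1 is 0.
    let shifted = (word >> 1) | (next_bit << 63);
    (word & !shifted).count_ones()
}

fn main() {
    // LSB-first sequence 1 1 0 1 0 0 is the tree (()()) - two leaves.
    let word = 0b001011u64;
    assert_eq!(count_leaf_pattern(word, 0), 2);
    println!("leaves: {}", count_leaf_pattern(word, 0));
}
```

rank and select over "10" would build the usual rank/select superstructure on top of exactly this per-word count.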

I'll note that, at present, I don't think any of these additional operations vers doesn't have yet are particularly useful for my XML purposes, and they may be more trouble to implement than they're worth. The possible exceptions are degree(i) and child(i, q).

One interesting section in 2.3 is:

By using small precomputed tables that allow us to scan any block of c = (lg n)/2 bits in constant time (i.e., computing the analogous to fields e, m, M, and n for any chunk of c bits), the total time of the operations is O (b/c + lg n) bits.

It doesn't specify in detail how this could be done, and at first sight it would imply quite small chunk sizes in order for the precomputed tables to stay small. But it suggests a possibly more compact and/or efficient way to arrive at this information. An 8-bit lookup table looks doable. I'm also curious whether we couldn't calculate this information using bit twiddling - count_zeros() and count_ones() would seem useful for calculating excess, for instance, together with rsvector's get_bits().
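To make the lookup-table idea concrete, here is a sketch of an 8-bit table that stores, per byte, the total excess and the minimum prefix excess, so a block can be scanned bytewise. The names (ByteExcess, build_table) are hypothetical and the LSB-first bit order is my assumption, not vers's actual layout:

```rust
/// Per-byte summary for scanning BP blocks bytewise: the byte's total
/// excess contribution and the minimum prefix excess reached inside
/// it (a 1-bit counts +1, a 0-bit counts -1, LSB-first).
#[derive(Clone, Copy)]
struct ByteExcess {
    total: i8,
    min: i8,
}

/// Build the 256-entry lookup table once; scanning a block then costs
/// one table lookup per byte instead of bit-by-bit work. (The total
/// could also come from count_ones: total = 2 * ones - 8.)
fn build_table() -> [ByteExcess; 256] {
    let mut table = [ByteExcess { total: 0, min: 0 }; 256];
    for byte in 0..256usize {
        let (mut excess, mut min) = (0i8, 0i8);
        for bit in 0..8 {
            excess += if (byte >> bit) & 1 == 1 { 1 } else { -1 };
            min = min.min(excess);
        }
        table[byte] = ByteExcess { total: excess, min };
    }
    table
}

fn main() {
    let table = build_table();
    // 0b0000_1111 (LSB-first): four opens then four closes.
    let e = table[0b0000_1111];
    assert_eq!((e.total, e.min), (0, 0));
    // 0b1111_0000: four closes then four opens - min prefix excess -4.
    let e = table[0b1111_0000];
    assert_eq!((e.total, e.min), (0, -4));
}
```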

In section 2.3 it also has:

v.n is the number of times the minimum excess occurs in the area

In section 4.1 it has:

or 4 bytes if the field v.n is not required, as it is used only in the more complicated operations

I see v.n is used to support mincount and minselect, so storing it supports the operations degree, childrank, and child.

In section 3 it then goes into a O(lg lg n) time solution. It immediately says in section 4 that:

Still, we further simplify some parts to speed them up in practice. As a result, our implementation does not fully guarantee O(lg lg n) time complexity, but it turns out to be faster than the state-of-the-art implementation that uses O(lg n) time. As this latter implementation essentially uses one binary rmM-tree for the whole sequence, our experiments show that our new way to handle inter-bucket queries is useful in practice, reducing both space and time.

My understanding of section 3 is that multiple mM-trees are maintained, one per bucket. The block size b is different again. Now it's

b = lg n · lg lg n

which by my calculations comes down to very small block sizes compared to 512, like 36 bits for a 1 million node tree.

Then in 3.1 it describes how to do inter-bucket forward and backward searching, the details of which I haven't digested yet.

In 4.1 it says the bucket size is 4 kilobytes, which allows 16-bit integers to store the excess in the nodes of each tree. v.e (total excess) is not stored; instead v.m and v.M (min and max excess) are stored in absolute form relative to their current bucket. So this would allow the tree to shrink a bit.
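A sketch of what such a bucket-relative node could look like in Rust (hypothetical names; the actual field layout in vers or in the paper's code may differ):

```rust
/// Sketch of an rmM-tree node when the BP sequence is split into
/// 4 KiB (2^15-bit) buckets: within one bucket the excess stays small
/// enough that bucket-relative minimum and maximum excess fit in
/// 16-bit integers, and the total-excess field can be dropped.
#[derive(Clone, Copy)]
struct BucketNode {
    min_excess: i16, // minimum excess in this node's range, relative to bucket start
    max_excess: i16, // maximum excess in this node's range, relative to bucket start
}

fn main() {
    use std::mem::size_of;
    // Versus a node with 64-bit total/min/max fields (24 bytes), the
    // bucket-relative node is 4 bytes - better cache density.
    assert_eq!(size_of::<BucketNode>(), 4);
    println!("node size: {} bytes", size_of::<BucketNode>());
}
```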

Ah, more information about b: 512 and 1024 are practical, and finally another hint about lookup table size:

Preliminary tests yielded the following values to be reasonable for b: 512 bits (with lookup tables of 8 bits) and 1024/2048 bits (with lookup tables of 16 bits). In particular, for b = 1024 our rmM-trees have height h = lg(β/b) = 5 and a sequential scan of a block requires up to 64 table lookups.

There are also interesting details on how to implement section 3.1.

The measurement sections demonstrate that the bucket approach is usually faster than the simple block-based approach, but that also depends a lot on the specifics of implementation, I think. What do you think?

The lowest hanging fruit I think is in the lookup tables or other methods to determine excess, maximum and minimum excess for a block.

@Cydhra
Owner

Cydhra commented Dec 16, 2024

b = w / 2

Not sure what they mean by that; it doesn't make a lot of sense to make blocks this small in practice (unless you're building a k-ary tree without optimizing the excess search in leaf nodes, I suppose?).

So is b 32 or is b 512?

The current implementation allows the user to choose that value anyway, so I can postpone this decision until I can benchmark.

superblocks

I need to think about that idea, because currently it isn't worth it to store values in the tree with fewer than 64 bits, since the first two nodes of the min-max tree span half the tree and so in theory could have excess of 2^63 and -2^63 respectively. If I use a multilevel approach, I may be able to save bits in lower blocks, where the block spans fewer bits in the bitvector and thus the excess must be smaller.
Storing fewer bits per tree node saves space, but more importantly, it increases cache efficiency. So this is potentially an important optimization.

fwdsearch, bwdsearch, rmq, rMq, mincount, and minselect

The missing primitives are what I mentioned above with "There are still a bunch of operations that I need to implement". It specifically needs the rmq values in the min-max tree, and I have to see whether it's worth storing them, and how.

I don't think vers has [rank, select] over 10

That is true; however, support for this is planned for a future 2.0 release for other reasons anyway. But I am not actively working on that right now.

It doesn't specify how in detail it could be done

I have an optimization planned that sounds a bit like this, so maybe this resolves itself once I get to that optimization. When I tried to implement dfuds, I already implemented that optimization, but it got me into a lot of trouble with off-by-one errors, so I skipped it for now and plan to add it once I have a working implementation.

My understanding of section 3 is that multiple mM-trees are maintained, one per bucket. The block size b is different again. Now it's b = lg n lg lg n

If that isn't a misunderstanding, I think this gets into theoretical territory and is probably not worth it. Ideally the benchmarks find a fixed value for the block size that is a good trade-off for all tree sizes, or we leave it to the user to decide, with some good defaults or heuristics as fallback. I don't think something this involved is worth implementing: the bitvector has a block size of 512 for its rank/select structures because in practice that is way more worth it than adjusting it by the vector length (even though the vector is technically no longer succinct with a fixed block size).

The lowest hanging fruit I think is in the lookup tables or other methods to determine excess, maximum and minimum excess for a block.

Yes, that is also already planned to be the first optimization that I will implement.

@faassen
Author

faassen commented Dec 16, 2024

As to implementing additional operations, it depends on whether they are of much practical use. I can see practical use for degree and child myself, but the current set is already pretty good. Of course it depends on your use case, but there's something to be said for simplicity.

The note in "Succinct Trees in Practice" is interesting concerning rmq:

The most practical implementation we know of for rmq [9] requires 62.5% extra space on top of P [1, 2n], and this would put the representation out of competition in terms of space requirement. A theoretical, not implemented, alternative solution to lca(x, y) is given by operation double-enclose [19]. We opt, therefore, by taking parent on the less deep node until it becomes an ancestor of the deeper one, then this is the lca.

It could be that simpler ways to implement rmq exist now, but it may still not be worth it. I found "Practical Range Minimum Queries Revisited" and "Faster range minimum queries", but I haven't gone into this stuff yet.
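The rmq-free lca fallback the quote describes can be sketched as climbing from the deeper node until the two positions meet. This toy version uses a plain parent array and precomputed depths standing in for the BP primitives parent(i) and depth(i):

```rust
/// Sketch of the rmq-free lca fallback from "Succinct Trees in
/// Practice": repeatedly take the parent of the deeper of the two
/// nodes until they coincide. `parent` and `depth` are assumed
/// primitives, shown here on a toy parent-array tree rather than a
/// real BP structure.
fn lca(parent: &[usize], depth: &[usize], mut x: usize, mut y: usize) -> usize {
    while x != y {
        if depth[x] >= depth[y] {
            x = parent[x]; // x is at least as deep - climb it
        } else {
            y = parent[y];
        }
    }
    x
}

fn main() {
    // Tree: 0 is the root; 1 and 2 are children of 0; 3 and 4 are
    // children of 1.
    let parent = [0, 0, 0, 1, 1];
    let depth = [0, 1, 1, 2, 2];
    assert_eq!(lca(&parent, &depth, 3, 4), 1);
    assert_eq!(lca(&parent, &depth, 3, 2), 0);
}
```

The cost is proportional to the depth difference plus the distance to the lca, which is exactly the trade the paper accepts to avoid the 62.5% extra space.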

The bitvector has a block size of 512 for its rank/select structures because in practise that is way more worth it than adjusting it by the vector length (even though the vector is technically no longer succinct with a fixed block size).

Yes, in the implementation section it immediately goes back to using 512 and 1024 anyway! So while the bucket approach may potentially be worthwhile, we can ignore the theoretical block sizes, which are really all over the place.


@faassen
Author

faassen commented Dec 16, 2024

A terminology note: what the paper calls a "superblock" is not entirely clear to me; I think they're the same as a block in practice. What "Simple and Efficient Fully-Functional Succinct Trees" calls a bucket is something else, though: trees are stored in separate buckets, which then allows the representation of the min-max tree to be smaller - they skip the total excess altogether and store an absolute min and max as i16. If you're interested in cache locality, this may be worthwhile?

@Cydhra
Owner

Cydhra commented Dec 16, 2024

The latter approach seems viable; the superblocks are just the higher tree levels, yes.

@Cydhra
Owner

Cydhra commented Dec 21, 2024

Plan of action:

Currently, I am quite sure I can support the standard navigation primitives (except child(i, q)), level-order access, DFS-order access (pre- and post-order), and some of the simple extra queries (like depth(i)).
I would like to support child(i, q) since that is a pretty important query, but it requires minselect support, and I don't want to add that just now.

So here is what I am planning to do:

I will begin benchmarking the current implementation, especially normal navigation queries. I don't think I will focus on the level-order operations, since using BP and then traversing the tree in level order sounds dumb; just use LOUDS at that point. So I am fine if level-order traversal isn't a focus for optimization.

I will then implement the lookup table optimization, because that should be relatively independent of any other optimizations and provide me with an idea of how big the block size can be.

Next, I will try to implement the bucket approach of the "Simple and Efficient Fully-Functional Succinct Trees" paper. They explain how they chose bucket sizes, as you mentioned earlier, but then later fix the bucket size to 2^15 in practice, which is obviously the best choice because it lets you fix the variable sizes in the mM-tree to 16 bits, so I will be doing that as well.

After that, I will see how I can support minselect. I want to keep the memory footprint in mind, though, so maybe I will arrive at the conclusion that supporting child(i, q) isn't worth it. Do you need it for your work, or is it just a nice thing to have?

@faassen
Author

faassen commented Dec 21, 2024

On level order: it may make sense to support this in a general DOM library; I had a need for it and added it to my Xot XML library for instance, even though its main axes are preorder. But it's definitely not a performance concern and I don't use it for XPath so that's fine.

On getting access to the nth child of a node and the total number of children of a node: sounds good to defer this to later. I need to change my XPath bytecode engine to use iterators when going through tree axes, for other reasons too.

There is a use in xpath to get a count of descendants or children quickly, but since filters are in play anyway (queries like "all nodes with this node name") the tree level support is of limited utility I think.

All this is a long way of saying: your plan sounds great! Focusing on optimizing what you have over adding features sounds like the right way forward.

PS: fm-index is now released using vers, so I will retire my fork. I made progress in my thinking about how to best use it too. But I am only going to be active in a limited way over the holidays, and will pick it all up again in January. Happy to chat here, though!

@faassen
Author

faassen commented Dec 24, 2024

I am reading through commits, enjoying seeing the progress made.

One question I have is about the benchmarks: you mention benchmarking all the important operations, and I see parent, previous sibling, and next sibling, but to my surprise I saw last child rather than first child.

I am used to implementing forward axes (children, descendants, following) using first child and next sibling, and backwards axes (ancestors, preceding) using parent and previous sibling.
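The forward-axis pattern described here can be sketched as a children traversal driven only by first_child and next_sibling. The Node struct below is a toy pointer-based stand-in, not the succinct representation:

```rust
/// Toy pointer-based node: in a BP representation, first_child(i) is
/// i + 1 and next_sibling jumps past the matching close parenthesis,
/// but the traversal logic is identical.
struct Node {
    first_child: Option<usize>,
    next_sibling: Option<usize>,
}

/// Collect the children of node i by following first_child once and
/// then next_sibling repeatedly - the access pattern used for the
/// forward axes (children, descendants, following).
fn children(nodes: &[Node], i: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut cur = nodes[i].first_child;
    while let Some(c) = cur {
        out.push(c);
        cur = nodes[c].next_sibling;
    }
    out
}

fn main() {
    // Root 0 with children 1, 2, 3.
    let nodes = vec![
        Node { first_child: Some(1), next_sibling: None },
        Node { first_child: None, next_sibling: Some(2) },
        Node { first_child: None, next_sibling: Some(3) },
        Node { first_child: None, next_sibling: None },
    ];
    assert_eq!(children(&nodes, 0), vec![1, 2, 3]);
}
```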

Maybe I read it wrong, but it made me curious - why is last child important rather than first child?

@Cydhra
Owner

Cydhra commented Dec 24, 2024

First child is trivial, because in BP the nodes are ordered depth-first, so the first child of node i is always at i + 1 - unless the bit at i + 1 is 0 (a closing parenthesis), in which case i is a leaf. It's important, but not important to benchmark.
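A minimal sketch of that observation, using a plain bool slice as a stand-in for the BP bit vector (true for an open parenthesis):

```rust
/// First child in a BP sequence: a node's children immediately follow
/// its open parenthesis, so the first child of the node at position i
/// is i + 1 - unless the bit there is a close parenthesis, meaning i
/// is a leaf.
fn first_child(bp: &[bool], i: usize) -> Option<usize> {
    // true = '(' (open), false = ')' (close)
    if bp.get(i + 1) == Some(&true) {
        Some(i + 1)
    } else {
        None // i is a leaf (or the sequence ends)
    }
}

fn main() {
    // ( ( ) ( ) )  -> root at 0 with two leaf children at 1 and 3.
    let bp = [true, true, false, true, false, false];
    assert_eq!(first_child(&bp, 0), Some(1));
    assert_eq!(first_child(&bp, 1), None); // node at 1 is a leaf
}
```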

@Cydhra
Owner

Cydhra commented Dec 26, 2024

I implemented 8-bit lookup tables, which yields a 20-40% speedup depending on tree size and operation. Larger trees get less speedup because the query runtime depends less on scanning blocks in the vector and more on the search in the mM tree.


@Cydhra
Owner

Cydhra commented Dec 29, 2024

Well, I have implemented an 8-bit lookup scheme that can look up all possible excess queries in an 8-bit block, as well as a 16-bit scheme that can only look up whether a 16-bit block contains the query result, and then has to search the block. The 8-bit scheme requires 4 KiB for the table; the 16-bit scheme requires 128 KiB.

The 16-bit lookup is faster and may allow for larger block sizes, which would reduce the size of the mM trees; however, this reduction doesn't seem to affect runtime itself, so this is a memory-footprint consideration.

I have gated the 16-bit scheme behind a crate feature and intend to keep it this way, to reduce binary size when the feature isn't required. At the moment there is also code to support the 16-bit scheme for 8-bit blocks, but that is slower than the current 8-bit scheme and cannot be activated with crate features; the code just exists (it's just a few constants that are currently disabled by the 16-bit crate feature).
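As a sanity check on the quoted table sizes: the footprints work out if the 8-bit table stores 16 bytes per entry and the 16-bit table stores 2 bytes per entry. Those per-entry sizes are my assumption; the actual entry layout isn't stated in the thread.

```rust
fn main() {
    // Hypothetical entry sizes that reproduce the quoted footprints;
    // the real per-entry layout in vers may differ.
    let table_8bit = (1usize << 8) * 16; // 256 entries x 16 bytes
    let table_16bit = (1usize << 16) * 2; // 65,536 entries x 2 bytes
    assert_eq!(table_8bit, 4 * 1024); // 4 KiB
    assert_eq!(table_16bit, 128 * 1024); // 128 KiB
}
```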

@faassen
Author

faassen commented Dec 29, 2024

Cool! I saw various commits in this direction.

I am going to sit down and try to build the XML tree library with it next week; I have all the pieces to do so now, I think. The XPath accelerations that brought me here are going to be more involved and require a significant rework of the XPath engine, but using the new tree implementation for the read-only requirements should be a smaller step.

I am really curious about the implications of that: compared to the non-succinct approach, where each node has a previous sibling, next sibling, parent, and first child reference (plus a hashmap for preorder info), some core operations will be slower. But more will fit in the CPU cache, plus I can get preorder without additional storage.

The real benefits have to do with the additional indexes I can provide on node name and text but as I said using those is more involved.

@Cydhra
Owner

Cydhra commented Dec 29, 2024

I will work on a bucket approach for the MMTs next, maybe optimize the MMT struct size, and then look into rmq/mincount.

