Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RRB Trees: Efficient Immutable Vectors #72

Open
rmunn opened this issue Apr 18, 2017 · 15 comments
Open

RRB Trees: Efficient Immutable Vectors #72

rmunn opened this issue Apr 18, 2017 · 15 comments

Comments

@rmunn
Copy link
Contributor

rmunn commented Apr 18, 2017

I propose to add an implementation of Bagwell and Rompf's RRB-Tree Vectors (Relaxed Radix-Balanced Trees) to FSharpx.Collections. They are similar to PersistentVector, but allow for efficient slicing and concatenation. Slicing an RRB vector is effectively constant-time since it takes O(log32 N) time. Concatenating two RRB Vectors together is also O(log32 N) but with a large constant multiplier, as much as 1024 in the worst case, so it should really be considered O(log N) for all practical purposes.

Clojure implementation of RRB vectors: https://github.com/clojure/core.rrb-vector
Scala implementation: https://github.com/nicolasstucki/scala-rrb-vector

Papers about RRB vectors:

@rmunn
Copy link
Contributor Author

rmunn commented Apr 18, 2017

I've been working on an implementation of RRB Trees (I call them RRBVectors in my implementation) in a private Bitbucket repo for a while now, and it's getting close to completion. I've started cleaning it up, removing duplicate code, and fleshing out the API to get it ready to add to the FSharpx.Collections. Once it's fully ready (which might take another couple months since I only have sporadic free time to work on it), I'll submit it as a PR. There are a few implementation details, though, that are worth discussing before the code is ready because they would require a few changes to PersistentVector:

  1. Since RRBVectors are a tweak of the PersistentVector implementation, I've implemented RRBVector nodes as inheriting from PersistentVector nodes. This will allow O(1) conversion of any PersistentVector to an RRBVector, but the "cost" is that the PersistentVector class needs to acquire some internal properties for its root, shift, tail and tailOffset properties so that the Persistent->RRB conversion code can access them. Without said access, the RRBVector code could only convert a PersistentVector in O(N) time, losing pretty much all benefit.

  2. As I've been developing the RRBVector implementation, I've kept it in the FSharpx.Collections.Experimental namespace. However, that required me to add an InternalsVisibleTo in FSharpx.Collections, making its internals visible to FSharpx.Collections.Experimental in order to have access to the PersistentVector's root, tail, shift and tailOffset properties. In the long run, it will be better to have RRBVector in the FSharpx.Collections namespace, so I've been doing extensive unit testing with FsCheck (and Expecto) to make sure it has no flaws. I am now turning up flaws at the rate of about one per eight-hour stress test (Expecto is great!), so it's close to ready, but not quite yet. However, there will need to be a decision about whether the RRBVector data structure can go into FSharpx.Collections directly (in which case no additional InternalsVisibleTo attribute will be needed) or if it will need to spend some time in FSharpx.Collections.Experimental first to get hammered on by real code. If the decision is to keep it in FSharpx.Collections.Experimental for a little while, then the additional InternalsVisibleTo attribute will need to be added in FSharpx.Collections.

@rmunn
Copy link
Contributor Author

rmunn commented Oct 9, 2017

I want to clean up the code some more before I submit a PR, but anyone interested in seeing my first pass at an implementation can start looking at https://github.com/rmunn/FSharpx.Collections/tree/rrb-vector. A couple of concerns I need to discuss with other people before merging this in:

  • The question of whether the RRBVector code should be in a separate file, or folded into the PersistentVector.fs file, is still open. I'm leaning towards "Make it part of PersistentVector.fs", because then I won't have to expose some of the internals of PersistentVector that I exposed.

  • I wrote a lot of Expecto+FsCheck tests while developing this, but the current FSharpx.Collections tests only use FsCheck. Current versions of Expecto depend on .NET Standard 1.6, whereas the FSharpx.Collections code still only requires .NET 4.0 or later. If the FSharpx.Collections code is revised to be based on .NET Standard (see .NET Standard support #77 as well), then I may be able to keep the Expecto tests, otherwise I'll need to rewrite them to be used with just FsCheck + NUnit. I'd rather not start that rewrite unless I know it will be necessary, so I'd appreciate some comments on whether the FSharpx.Collections project will be moving to .NET Standard or not.

But I'm ready for other people to look through my code so far and review it. The main code is found in https://github.com/rmunn/FSharpx.Collections/blob/rrb-vector/src/FSharpx.Collections.Experimental/RRBVector.fs and the main tests are found in https://github.com/rmunn/FSharpx.Collections/blob/rrb-vector/tests/ExpectoRRB/RRBVectorExpectoTest.fs.

@gsomix
Copy link

gsomix commented Mar 8, 2018

@rmunn Is there any way I can help you? Would be great to merge it someday. :)

@rmunn
Copy link
Contributor Author

rmunn commented Mar 9, 2018

@gsomix Pinging me was a good reminder. :-) But also, if you can do anything to get #87 merged that would be a big help. I need master to build before I can make any more progress on getting the RRB Vector merged.

@forki
Copy link
Member

forki commented Mar 9, 2018

merged #87

@rmunn
Copy link
Contributor Author

rmunn commented Mar 9, 2018

Thanks @forki! I'll see what I can do about getting the RRB vector code into a mergeable state now.

@gsomix
Copy link

gsomix commented May 21, 2018

Just a friendly reminder. ;)

@jackfoxy
Copy link
Contributor

jackfoxy commented Jul 7, 2018

@rmunn are you still interested in submitting a PR to Collections.Experimental? I'm hoping to get a 2.0 release out in the next few weeks.

@rmunn
Copy link
Contributor Author

rmunn commented Jul 8, 2018

@jackfoxy I am still interested, but I've had practically zero time to work on coding the past couple of months. I do have the code into a better state than it was back in March, but I think it needs more work before it's mergeable. OTOH, the right answer to "I have zero time" is probably to submit a PR in its current state, and let other people do any remaining work that's lacking. The two biggest issues with the code as I have it right now are:

  1. There's almost no XML documentation, and I think putting that in would be a good idea before a release.

  2. The Expecto tests I wrote require FParsec as a dependency: I wrote a custom DSL for expressing the shape of a given RRB tree succinctly, and it's proven quite difficult to refactor my tests to get rid of the FParsec dependency. OTOH, it's only a build-time dependency and there's no FParsec dependency in the library code itself, so perhaps adding an FParsec build-time dependency to FSharpx.Collections is acceptable?

If an FParsec build-time dependency is acceptable, then my code is already mergeable and I can get a PR in the next few days.

@jackfoxy
Copy link
Contributor

jackfoxy commented Jul 8, 2018

@rmunn I think this is all fine for submission to Collections.Experimental, which is after all "experimental".

@rmunn
Copy link
Contributor Author

rmunn commented Jul 25, 2018

@jackfoxy I've had a little bit more time to work on RRBVector, and discovered that my code isn't quite ready to release yet: my most recent change (implementing transients so that things like RRBVector.ofSeq will be efficient) still has a couple of bugs that I have to hammer down. (Which I would never have found if it weren't for FsCheck: one bug requires you to create a vector with at least 1057 items in it via RRBVector.ofSeq, then pop the list down to 1024 items or fewer, then remove an item from somewhere in the front of the list, at which point it throws an exception. I would never have found that buggy sequence of operations without FsCheck's model-based testing). I don't want to release this code until I know it won't lose people's data or throw exceptions on perfectly valid operations. I also don't want to delay the FSharpx.Collections release waiting on me, so here's what I plan to do:

  • I'll make a new repo for my work on RRBVector.
  • I'll iterate that repo quickly, and release pre-alphas on NuGet (releasing early and often) so that people can start testing it even though there are known bugs, with a big "DO NOT USE IN PRODUCTION" warning. That will get me useful feedback.
  • Once I'm confident that there are no bugs, I'll submit a PR to FSharpx.Collections.

Please don't hold up the FSharpx.Collections release on my account, since it might be another month or two before I truly have the code ready to go. (Without going into details of my personal life, free time is a vanishingly rare commodity for me right now, and my estimate of "a PR in the next few days" was wildly optimistic). But for the sake of @gsomix and anyone else waiting to see the code, I'll definitely get it up in its own repo and start releasing NuGet packages of the work-in-progress.

@jackfoxy
Copy link
Contributor

@rmunn No worries. At a minimum I want to document all the main Collections data structures and fix know issues in the main Collection before releasing, and I haven't touched it in a couple of weeks.

@realvictorprm
Copy link

Friendly ping here :)

@rmunn
Copy link
Contributor Author

rmunn commented Sep 29, 2019

I'm getting rather close to having something releasable. Right now I'm polishing up my tests for 100% code coverage, to make sure I don't have any lurking bugs in this rather complex piece of code. (I thought I had that licked last month, but just last week I found and fixed more bugs). I'll post another comment in this thread when it's ready for alpha testing.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 14, 2020

After far more bugfixing than I had anticipated, I've finally got the last bug (that I know of) licked, so I'm ready to make my RRBVector implementation public! To enable me to more easily respond quickly to bug reports and feature requests, I've created a separate GitHub project named Ficus and created a v0.0.1 NuGet package for my implementation. Anyone interested in alpha-testing my 0.0.1 release, please submit bug reports and/or suggestions for API improvements in the Ficus issue tracker.

Once I'm confident that the code is not only bug-free, but also has an API that's useful to people, I intend to submit it as a PR to FSharpx.Collections. My plan moving forward is to use Ficus as a sort of "staging" repo for my work on the RRBVector implementation, and submit any major improvements to FSharpx.Collections. Basically, SemVer patch releases will stay in the Ficus project, but any releases where I bump the SemVer minor version number (and certainly any releases where I bump the major version number) I'll also submit as a PR to FSharpx.Collections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants