better `Leaf` and `Ensemble` structs #90
@Eight1911 How about this? We change the structure in such a way that storing which leaf the labels end up in is optional. The advantage of doing this is that users now have a choice, without the disadvantage of having to use awkward APIs.

```julia
struct Leaf{T}
    label :: T
    # the location in tree.labels
    # where the labels in this leaf are stored
    region :: UnitRange{Int}
    # one or two summary statistics for fast retrieval
    # - probability of inaccuracy for classification
    # - variance of leaf for regression
    deviation :: Float64
end
```
```julia
struct Node{S, T}
    featid  :: Int
    featval :: S
    left    :: Union{Leaf{T}, Node{S, T}}
    right   :: Union{Leaf{T}, Node{S, T}}
end
```
```julia
struct Tree{S, T}
    root :: Union{Leaf{T}, Node{S, T}}
    # nothing if we choose not to store the labels;
    # otherwise, a vector of labels
    labels :: Union{Nothing, Vector{T}}
    # ... other helpful metadata, like how
    # the tree was made, etc.
end
```
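For what it's worth, here is a minimal sketch of how the optional `labels` vector plus the `region` range could be used to recover a leaf's labels on demand (the helper names `leaf_labels` and `leaf_proba` are hypothetical, not part of the proposal):

```julia
# Recover a leaf's labels on demand instead of storing them inside every leaf.
function leaf_labels(tree::Tree, leaf::Leaf)
    tree.labels === nothing && error("this tree was built without stored labels")
    return view(tree.labels, leaf.region)   # zero-copy slice into tree.labels
end

# class frequencies for a classification leaf, in the spirit of apply_tree_proba
function leaf_proba(tree::Tree, leaf::Leaf, classes)
    labels = leaf_labels(tree, leaf)
    return [count(==(c), labels) / length(labels) for c in classes]
end
```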
A little unrelated, but if you want something even more adventurous, you might also change how nodes refer to their children: instead of linking them directly, have the tree store the nodes in an array and refer to children by index. That is,

```julia
struct Node{S, T}
    featid  :: Int
    featval :: S
    left    :: Int  # index of the left child in `tree.nodes`
    right   :: Int  # index of the right child in `tree.nodes`
end
```
```julia
struct Tree{S, T}
    nodes :: Vector{Union{Leaf{T}, Node{S, T}}}
    # nothing if we choose not to store the labels;
    # otherwise, a vector of labels
    labels :: Union{Nothing, Vector{T}}
    # ... other helpful metadata, like how
    # the tree was made, the purity criterion used, etc.
end
```
The advantage is that it should be easier this way to store the tree in a linear format, which is helpful for porting to other languages, and `length(tree)` now falls out directly from the nodes vector (see the traversal sketch below). On another note, how important is preserving backwards compatibility?
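A minimal traversal sketch under this index-based layout; it assumes the root is stored at index 1 and a `x[featid] < featval` split rule, both of which are assumptions here:

```julia
# Predict by walking the flat `tree.nodes` vector instead of following pointers.
function apply_flat_tree(tree::Tree, x::AbstractVector)
    i = 1                                   # assumes the root is stored at index 1
    while true
        node = tree.nodes[i]
        node isa Leaf && return node.label
        i = x[node.featid] < node.featval ? node.left : node.right
    end
end
```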
I'm open to both approaches for the new structs. It'd be good to experiment a bit and see which is more efficient in terms of model training execution time and memory allocation, and also how well the models write to disk.
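One possible way to run such a comparison, where `build_linked_tree` and `build_flat_tree` are hypothetical stand-ins for constructors of the two layouts:

```julia
using BenchmarkTools, Serialization

X, y = rand(10_000, 10), rand(10_000)

for build in (build_linked_tree, build_flat_tree)   # hypothetical constructors
    @btime $build($X, $y)                           # training time and allocations
    tree = build(X, y)
    println("in-memory size:  ", Base.summarysize(tree), " bytes")
    io = IOBuffer()
    serialize(io, tree)                             # how compactly the model writes out
    println("serialized size: ", length(take!(io)), " bytes")
end
```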
Hi. Any updates on this? Would love to see the structs get more lightweight and efficient...

Bump. Any updates on these changes?
I've implemented GradientBoost and a more efficient version of AdaBoost that interface with `treeregressor` and `treeclassifier`, and am now looking to port them to the lightweight front-facing structs we use, i.e., `LeafOrNode` and `Ensemble`.

Trying to do so, I realized that one thing I feel somewhat against is the storage of all the labels in the `Leaf` struct. So I want to discuss whether it would be possible to change it to something more lightweight. These values are usually unnecessary, take up too much space, and can be recomputed if needed. For a lot of purposes, a single confidence parameter should suffice (e.g., the impurity of that leaf). If that is not enough, we might add, for example, another function that takes in a set of samples and returns an array containing the indices of the leaves they end up in (sketched below). This would generalize the current implementation to any dataset, not just the training set, and from that we can very easily rewrite `apply_tree_proba`.

This may feel like an inconvenience, but for boosting and random forest ensembles, which usually use shallow trees, most of the memory cost is in this storage of the samples, and the real inconvenience is dealing with models that are 4 GB large or not being able to train your model because you ran out of memory.
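As a rough sketch of that suggestion (the function name `apply_tree_leaves` is hypothetical, and it assumes a flat, index-based node layout with the root at index 1):

```julia
# Map each sample (row of X) to the index of the leaf it falls into, for any
# dataset, so per-leaf label statistics can be recomputed on demand.
function apply_tree_leaves(tree, X::AbstractMatrix)
    leaf_ids = Vector{Int}(undef, size(X, 1))
    for s in 1:size(X, 1)
        i = 1
        while !(tree.nodes[i] isa Leaf)
            node = tree.nodes[i]
            i = X[s, node.featid] < node.featval ? node.left : node.right
        end
        leaf_ids[s] = i
    end
    return leaf_ids
end
```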
One other thing is that the current `Ensemble` struct doesn't add very much to the current implementation, because it's just a list of trees. This means we have to deal with something like AdaBoost returning both an `Ensemble` and a list of coefficients that should have been part of that ensemble in the first place. So I'm also proposing that we encode these coefficients into the model.

tl;dr: I'm proposing that we use the following structs instead.
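Purely as an illustration of the coefficient-carrying ensemble idea, one might sketch something like this (the field names are assumptions, not the structs actually proposed here):

```julia
# One possible shape for an ensemble that carries per-tree coefficients
# (e.g. AdaBoost weights) inside the model itself.
struct Ensemble{S, T}
    trees  :: Vector{Tree{S, T}}
    coeffs :: Vector{Float64}   # one weight per tree; all ones for a plain forest
end

# weighted prediction, assuming an `apply_tree(tree, x)` predictor exists
predict(e::Ensemble, x) =
    sum(c * apply_tree(t, x) for (t, c) in zip(e.trees, e.coeffs))
```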