Obscenely slow prediction #159

tecosaur · 2022-05-11T11:01:36Z

Hello,

I'd love to use DecisionTree.jl for a project I'm currently working on, as it's great in a lot of ways: speedy to train, plays nicely with AbstractTrees, etc.

Unfortunately, saying the prediction performance is "not good" is putting things mildly. I did a test run with a simplified version of one of the data sets I'm working with, and recorded the training and prediction times of DecisionTree.jl as well as a number of other common random forest implementations.

| Tool | Train time | Predict time | Ratio (predict/train) |
|---|---|---|---|
| DecisionTree.jl | 0.6s | 175s | 292 |
| randomForest | 24.4s | 4.2s | 0.17 |
| ranger | 1.9s | 0.5s | 0.26 |
| sklearn | 63s | 1.7s | 0.03 |

The competitiveness of the training time gives me hope that DecisionTree.jl should be able to be competitive on prediction performance too 🙂.

ablaom · 2022-05-11T22:02:42Z

Thanks for reporting. That high prediction time is intriguing.

Could you please provide enough detail here for others to reproduce the results, using publicly available data, or synthetic data?

tecosaur · 2022-05-12T01:46:30Z

In that example above, I've run 1000 trees on a 1-dimensional binary classification data set with ~70,000 entries. It should be pretty easy to generate something like this.
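For anyone wanting to reproduce this without the original data, here is a minimal sketch of a generator for a data set of that shape (one feature, binary labels, 70,000 rows). The function name `synthetic_binary`, the 0.5 threshold, and the 10% label noise are assumptions made for illustration; this is not the data behind the numbers above.

```julia
using Random

# Hypothetical generator: n samples, a single feature, binary labels.
# The label is 1 when the feature exceeds 0.5, flipped with probability `noise`.
function synthetic_binary(n::Integer; seed::Integer = 1, noise::Real = 0.1)
    rng = MersenneTwister(seed)
    features = rand(rng, n, 1)              # n×1 matrix with values in [0, 1)
    labels = Vector{Int}(undef, n)
    for i in 1:n
        clean = features[i, 1] > 0.5 ? 1 : 0
        labels[i] = rand(rng) < noise ? 1 - clean : clean
    end
    return features, labels
end

features, labels = synthetic_binary(70_000)
```

With DecisionTree.jl loaded, wrapping `build_forest(labels, features, 1, 1000)` and `apply_forest(model, features)` in `@elapsed` should give a train/predict pair comparable to the table, though the absolute times will differ from those reported.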

If you have a look at https://github.com/tecosaur/TreeComparison and run julia bulkreport.jl, a number of reports will be generated. See forest.jl for the code running each implementation. While the results aren't as extreme, the disparity in the predict/train time ratio is still quite apparent. For example, with the Iris data:

| Tool | Train time | Predict time | Ratio (predict/train) |
|---|---|---|---|
| DecisionTree.jl | 0.01s | 0.32s | 32 |
| randomForest | 1.28s | 0.6s | 0.47 |
| ranger | 0.08s | 0.03s | 0.38 |
| sklearn | 1.12s | 0.1s | 0.09 |

tecosaur · 2022-05-12T09:21:40Z

Ok, I've started looking into this, and I've identified at least two major sub-optimalities in the design. One is the implementation of tree evaluation/prediction, the other is the design of the Leaf struct.

I'm currently trying to replace Leaf with this structure:

struct Leaf{T, N}
    features :: NTuple{N, T}
    majority :: Int
    values   :: NTuple{N, Int}
    total    :: Int
end

This should make a prediction with probability on a leaf O(1) instead of O(n). Unfortunately, I am finding a few parts of the code base hard to work with, such as src/classification/tree.jl: functions with 18 positional arguments should be outlawed!
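To illustrate the O(1) versus O(n) point, here is a minimal sketch comparing the two leaf layouts. `SampleLeaf` is a hypothetical stand-in for a leaf that keeps every training label it received (an assumption about the existing design, not copied from the package), `CountLeaf` mirrors the struct proposed above, and `proba` is an illustrative helper name rather than part of the DecisionTree.jl API.

```julia
# Hypothetical stand-in for a leaf that stores every training label that reached it.
struct SampleLeaf{T}
    majority :: T
    values   :: Vector{T}
end

# Sketch of the proposed layout: one count per class, computed once at training time.
struct CountLeaf{T, N}
    features :: NTuple{N, T}    # the class labels
    majority :: Int             # index into `features` of the most common class
    values   :: NTuple{N, Int}  # number of training samples per class
    total    :: Int             # sum of `values`
end

# Build a CountLeaf from raw labels; this scan happens once, not per prediction.
function CountLeaf(classes::NTuple{N, T}, labels::AbstractVector{T}) where {T, N}
    counts = ntuple(i -> count(l -> l == classes[i], labels), N)
    return CountLeaf{T, N}(classes, argmax(collect(counts)), counts, length(labels))
end

# O(n) in the number of stored samples: the labels are rescanned on every call.
proba(leaf::SampleLeaf, classes) =
    [count(l -> l == c, leaf.values) / length(leaf.values) for c in classes]

# Independent of the number of samples: only one division per class.
proba(leaf::CountLeaf) = leaf.values ./ leaf.total
```

Here "O(1)" is meant with respect to the number of training samples in the leaf; the cost still scales with the number of classes N, which is fixed for a given model.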

tecosaur · 2022-05-13T06:05:38Z

I've just done a bit more than the bare minimum, and so far the improvement in prediction-with-probability performance is 2-10x on a sample iris dataset and a large-ish unidimensional data set. See: https://tecosaur.com/public/treeperf.html

ablaom · 2022-05-16T01:25:27Z

This looks like progress to me. Do you think we could get away with marking the proposed change to the Leaf struct as non-breaking? As far as I can tell, this is non-public API, and in any case we are preserving the existing properties and first parameter.

Happy to review a PR.

tecosaur · 2022-06-24T02:29:04Z

Ok, I was hoping to make further improvements, but it would probably be worth PRing the basic tree improvements I've made.

tecosaur mentioned this issue May 11, 2022

Some minor differences in random forest implementations #160

Open

tecosaur mentioned this issue Jun 24, 2022

Make prediction with probability free #180

Open
